Line breaks and spaces being ignored in PDF->HTML

SteveZesty · June 5, 2024, 1:34pm

We’re using Aspose Words 24.5 to convert DOC and PDF files to HTML, but some line breaks and some spaces aren’t being recognised.

It’s happening across many different document templates we’re working with, but here’s a single example. The original PDF looks like this:

And the resulting HTML looks like this - note the “BY EMAIL” text isn’t on its own line, and the space between the dates has disappeared:

Here is the PDF:
letter-minister-for-health-and-social-services-to-assisted-dying-review-panel-re-questions-on-ad-2-april-2024.pdf (199.6 KB)

Can anyone offer any guidance on fixing that?

Thanks,
Steve.

alexey.noskov · June 5, 2024, 1:57pm

@SteveZesty You should note that PDF and HTML documents are different by their concept. PDF is fixed page format, while HTML is flow format. If it is required to preserve original PDF document layout, you should use FixedHtml Format. Please see the following code:

Aspose.Words.Pdf2Word.FixedFormats.PdfFixedRenderer pdfRenderer = new Aspose.Words.Pdf2Word.FixedFormats.PdfFixedRenderer();
Aspose.Words.Pdf2Word.FixedFormats.PdfFixedOptions opt = new Aspose.Words.Pdf2Word.FixedFormats.PdfFixedOptions();
using (FileStream pdfStream = File.OpenRead(@"C:\Temp\in.pdf"))
using (Stream htmlStream = pdfRenderer.SavePdfAsHtml(pdfStream, opt))
using (FileStream htmlFileStream = File.Create(@"C:\Temp\out.html"))
{
    htmlStream.CopyTo(htmlFileStream);
}

Here is the produced output: out.zip (146.8 KB)

SteveZesty · June 5, 2024, 2:29pm

Thanks for the reply, Alexey.

The output will be used on a modern responsive website, and that version has every letter absolutely positioned on the page:

So that wouldn’t work for us. The vast majority of the spaces and line breaks work fine the way we’re converting it currently - do you know why a small minority of them don’t?

What we have is (bearing in mind the documents can be DOC or PDF):

Aspose.Words.Saving.HtmlSaveOptions saveOptions = new Aspose.Words.Saving.HtmlSaveOptions(Aspose.Words.SaveFormat.Html);
saveOptions.CssStyleSheetType = Aspose.Words.Saving.CssStyleSheetType.Embedded;
saveOptions.ExportImagesAsBase64 = true;
saveOptions.ExportHeadersFootersMode = Aspose.Words.Saving.ExportHeadersFootersMode.None;
var doc = new Aspose.Words.Document(url); // url is the URL of the doc in the CMS
var html = doc.ToString(saveOptions);

Thanks again,
Steve.

alexey.noskov · June 5, 2024, 6:58pm

@SteveZesty Please note, Aspose.Words is designed to work with MS Word documents. MS Word documents are flow documents and they have structure very similar to Aspose.Words Document Object Model. On the other hand PDF documents are fixed page format documents . While loading PDF document, Aspose.Words converts Fixed Page Document structure into the Flow Document Object Model. Unfortunately, such conversion does not guaranty 100% fidelity.

The same applies to HTML format, which also has it’s object model very different from MS Word document object model. So in your code you are conversion one “not-native” format to another “not-native” format. I am afraid such conversion will never give 100% fidelity.