Issue Restoring Text and Images Position When Converting PDF to Word and Back

gabriel.vega · December 3, 2024, 3:04pm

Hello team,

I am facing an issue with text alignment and positioning during a PDF-to-Word-to-HTML and back-to-PDF conversion process. Below is the workflow I am following:

Initial Step:

I save the margins and dimensions of each page in the original PDF document.

Conversion to Word and HTML:

The PDF is converted to Word, then to HTML for editing.

Reconversion to Word and Back to PDF:

After editing, I convert the HTML back to Word, and then to PDF.
During this step, I restore the original page sizes (width and height) using the dimensions saved initially.

The problem arises when I try to restore the text alignment and position, particularly the MarginLeft. I attempt to adjust the position of the text fragments using the following code:

double deltaX = pageSetup.MarginLeft - page.MediaBox.LLX;
TextFragmentAbsorber textAbsorber = new TextFragmentAbsorber();
page.Accept(textAbsorber);

foreach (TextFragment textFragment in textAbsorber.TextFragments)
{
    foreach (TextSegment segment in textFragment.Segments)
    {
        segment.Position = new Position(deltaX, segment.Position.YIndent);
    }
}

Issue:

The text is moved to a position close to the original but not exactly aligned.
In addition, some text ends up hidden under the margins, as shown in the attached image.
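
One possible contributor, offered as an assumption rather than a confirmed diagnosis: the snippet above assigns deltaX as the absolute X coordinate of every segment, so all segments on the page end up at the same X regardless of their original indentation. A minimal sketch of a relative shift is shown below. TextShiftSketch and ShiftTextToLeft are hypothetical names, and targetLeft stands for whatever left edge (in PDF points) was saved from the original document. Whether this reproduces the original layout exactly has not been verified.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

static class TextShiftSketch
{
    // Hypothetical helper: moves all text on a page horizontally so that the
    // leftmost segment starts at targetLeft, keeping relative indentation.
    public static void ShiftTextToLeft(Page page, double targetLeft)
    {
        TextFragmentAbsorber absorber = new TextFragmentAbsorber();
        page.Accept(absorber);

        // First pass: find the current leftmost X among all segments.
        double minX = double.MaxValue;
        foreach (TextFragment fragment in absorber.TextFragments)
        {
            foreach (TextSegment segment in fragment.Segments)
            {
                minX = Math.Min(minX, segment.Position.XIndent);
            }
        }

        // Nothing to move if the page contains no text.
        if (minX == double.MaxValue)
        {
            return;
        }

        // Second pass: apply the same offset to every segment.
        double shift = targetLeft - minX;
        foreach (TextFragment fragment in absorber.TextFragments)
        {
            foreach (TextSegment segment in fragment.Segments)
            {
                segment.Position = new Position(
                    segment.Position.XIndent + shift,
                    segment.Position.YIndent);
            }
        }
    }
}
```

This only translates text horizontally; it does not move images or re-wrap lines that no longer fit between the margins.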

My questions are:

Is there a more accurate way to restore the original positions of text and images in the PDF document?
Could there be an alternative approach to align content relative to the original margins and dimensions?

It is critical that the margins remain unchanged and consistent with the original PDF.

Thank you in advance for your assistance!

asad.ali · December 3, 2024, 8:56pm

@gabriel.vega

Once the PDF has been converted to Word and then to HTML, the original margins and dimensions are already lost, because the content now lives in a different file format (DOC/DOCX or HTML). Converting the HTML or Word document back to PDF takes its page settings from that source document, and these can differ from what the original PDF had.

Nevertheless, we need some more details to carry out the investigation and check if we can achieve your requirements or not. Can you please provide a minimal code sample that can show the basics of your process along with sample source and output files at every step? We will test the scenario in our environment and address it accordingly.

gabriel.vega · December 10, 2024, 6:37am

@asad.ali,

Thank you for your response. As per your request, I have prepared a minimal sample project that demonstrates the complete workflow I am using:

PDF to DOCX
DOCX to HTML
HTML back to DOCX
DOCX to PDF

In the attached project, you will find:

The source PDF file used for testing.
Intermediate streams generated during the process (DOCX, HTML).
The final output PDF file.

Observations

Even though the HTML content is not edited during the process, the final PDF exhibits the following issues:

Margins and text alignment are different from the original PDF.
Images from the original PDF are missing in the final output.

DemoAsposeConverter.zip (423.4 KB)

If there are inherent limitations in the current functionality, kindly let me know if there are any configurations or additional steps I can take to address this.

Please let me know if any further information or adjustments are required for the provided example.

Thank you for your assistance.

asad.ali · December 10, 2024, 5:46pm

@gabriel.vega

It looks like you are using Aspose.PDF only in the first step, i.e. converting PDF to DOCX. For the rest of the operations, you are using Aspose.Words. Therefore, we are moving this query to the Aspose.Words category, where you will be assisted shortly.

alexey.noskov · December 11, 2024, 6:21am

@gabriel.vega I am afraid it is technically impossible to preserve the original document formatting after a PDF->DOCX->HTML->PDF roundtrip.
Please note that Aspose.Words is designed to work with MS Word documents. MS Word documents are flow documents, and their structure is very similar to the Aspose.Words Document Object Model. PDF documents, on the other hand, are fixed-page format documents. When a PDF document is loaded, its fixed-page structure is converted into the flow Document Object Model. Unfortunately, such a conversion does not guarantee 100% fidelity and can be quite resource-consuming.
The same applies to HTML. The object models of HTML documents and MS Word documents are quite different, and it is not always possible to provide 100% fidelity when converting one model to the other. In most cases Aspose.Words mimics MS Word behavior when working with HTML.

So I would suggest reconsidering your workflow and avoiding conversion of your document to intermediate formats.

gabriel.vega · December 12, 2024, 6:48pm

@alexey.noskov

Thank you for your detailed explanation; I really appreciate your response. The reason I attempted the PDF->DOCX->HTML->PDF roundtrip is that I couldn't find a way to convert a PDF directly to HTML while ensuring that the images in the PDF are converted to Base64 format. This is essential for me because I need the images embedded as Base64 to display them properly in the HTML editor I am implementing.

Is there a way to directly convert the images from the PDF to Base64 within the HTML during the conversion process? This would allow me to skip the intermediate Word format and potentially achieve higher fidelity.

Thank you in advance for your advice!

alexey.noskov · December 13, 2024, 5:47am

@gabriel.vega You can directly convert PDF to HTML using Aspose.Words without saving to intermediate DOCX:

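using Aspose.Words;
using Aspose.Words.Saving;

Document doc = new Document(@"C:\Temp\in.pdf");
HtmlSaveOptions opt = new HtmlSaveOptions();
// Embed images in the HTML as Base64 data instead of writing separate image files.
opt.ExportImagesAsBase64 = true;
doc.Save(@"C:\Temp\out.html", opt);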

But this will not improve fidelity, because the document is still loaded into the Aspose.Words DOM, which is designed to work with MS Word documents.
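
If the HTML editing step has to stay, the page size and margins saved from the original PDF can be reapplied on the Aspose.Words side before saving to PDF, instead of moving text in the finished PDF. The following is a minimal sketch under stated assumptions: SavedPageLayout and ApplyLayout are hypothetical names, the saved values are in points, and the same geometry applies to every section. It restores page geometry only and does not recover the fixed positions of text or images.

```csharp
using Aspose.Words;

// Hypothetical container for values captured from the original PDF, in points.
class SavedPageLayout
{
    public double Width;
    public double Height;
    public double Left;
    public double Right;
    public double Top;
    public double Bottom;
}

static class PageLayoutSketch
{
    // Hypothetical helper: applies the saved page size and margins to every
    // section of the document loaded from the edited HTML.
    public static void ApplyLayout(Document doc, SavedPageLayout layout)
    {
        foreach (Section section in doc.Sections)
        {
            PageSetup pageSetup = section.PageSetup;
            pageSetup.PageWidth = layout.Width;
            pageSetup.PageHeight = layout.Height;
            pageSetup.LeftMargin = layout.Left;
            pageSetup.RightMargin = layout.Right;
            pageSetup.TopMargin = layout.Top;
            pageSetup.BottomMargin = layout.Bottom;
        }
    }
}

// Usage (paths are placeholders):
//   Document doc = new Document(@"C:\Temp\edited.html");
//   PageLayoutSketch.ApplyLayout(doc, layout);
//   doc.Save(@"C:\Temp\out.pdf");
```

Because the content is reflowed inside the restored margins, text should not end up outside them, but line and page breaks may still differ from the original PDF.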