Loading PDF and saving to DOCX: text is positioned incorrectly

See the attached sample: it just opens a PDF file and saves it as DOCX. In the result, many text items that were placed at specific locations over an image end up in completely wrong positions.

Here is a screenshot of the resulting Word document:

And this is the sample. It also contains the original PDF file.
PdfToWord.zip (965.4 KB)
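
For readers without the attachment, the sample boils down to roughly the following. This is only a minimal sketch: the class name and file names are placeholders and are not taken from the attached project.

```csharp
using Aspose.Words;

class PdfToWordSample
{
    static void Main()
    {
        // Placeholder paths; the real sample uses the PDF from the attachment.
        Document doc = new Document("SeatPlan.pdf");

        // The save format is inferred from the ".docx" extension.
        doc.Save("SeatPlan.docx");
    }
}
```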

The PDF file shows the seat plan for an open-air theater, and the numbers are the seat numbers, which seem to be absolutely positioned text. I received this file from a customer, so I don’t know how it was generated.

I hope this is something that you can improve, as we have to use this feature to add existing PDF files to the end of a report.

@wknauf
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): WORDSNET-27271

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

Please note that Aspose.Words is designed to work with MS Word documents. MS Word documents are flow documents, and their structure is very similar to the Aspose.Words Document Object Model. PDF documents, on the other hand, are fixed-page format documents. When a PDF document is loaded, its fixed-page structure is converted into the flow Document Object Model. Unfortunately, such a conversion does not guarantee 100% fidelity.

Well, when I open the PDF with Word, the layout seems to be fine. So I hope this is something that you can fix.

@wknauf We will keep you updated and let you know once the issue is resolved or we have more information for you.

Do you have any updates for me? Is there a chance that you will improve this?

@wknauf I am afraid there is no news regarding the issue yet. Please accept our apologies for the inconvenience.

I see that the linked issue has the state “analysis complete”. Do you have any updates for me? Is there hope that this can/will be improved by Aspose.Words ;-)?

@wknauf Here is the result of the analysis:

It is a very complicated task to change the current document recognition logic to support such files.
Maybe we can add an option to PdfLoadOptions to create a fixed layout in the DOCX file.

Thanks for the feedback. This sounds rather complicated. Is there a schedule for an implementation?

Our use case is that we want to import a PDF file and append it to an existing Word document. I use this code:

Document docPdf = new Document(datei.DateiPfad);

docPdf.FirstSection.PageSetup.SectionStart = SectionStart.NewPage;
docPdf.FirstSection.PageSetup.RestartPageNumbering = true;

// Unlink headers/footers so that the headers and footers of the preceding section are not continued:
docPdf.FirstSection.HeadersFooters.LinkToPrevious(HeaderFooterType.HeaderFirst, false);
docPdf.FirstSection.HeadersFooters.LinkToPrevious(HeaderFooterType.HeaderPrimary, false);
docPdf.FirstSection.HeadersFooters.LinkToPrevious(HeaderFooterType.HeaderEven, false);

docPdf.FirstSection.HeadersFooters.LinkToPrevious(HeaderFooterType.FooterFirst, false);
docPdf.FirstSection.HeadersFooters.LinkToPrevious(HeaderFooterType.FooterPrimary, false);
docPdf.FirstSection.HeadersFooters.LinkToPrevious(HeaderFooterType.FooterEven, false);

this.Document.AppendDocument(docPdf, ImportFormatMode.KeepSourceFormatting);

Would this still work if the PDF is loaded with a fixed layout?

@wknauf Unfortunately, there are no estimates yet.

Yes, there should not be any problems with appending the document to an existing one. But there might be difficulties with editing the document, since MS Word documents are flow documents by nature and fixed content is hard to edit in them.

In our use case, there is no reason to edit the content of the appended PDF. Only the first part of the document (before the SectionStart.NewPage) might be edited. So, this might work.

Well, I have to continue waiting and will ask for an update every few weeks :wink: