Aspose.Pdf - PDF to Docx missing text and other conversion issues

Muzna_Tariq · August 7, 2017, 12:54pm

Hi,

I am using Aspose.PDF for PDF to Docx conversion with the following code.

Code Snippet:

// Requires: using System.IO; using Aspose.Pdf;
Document pdfDocument = new Document(new FileStream(dataDir + "input.pdf", FileMode.Open));
DocSaveOptions saveOptions = new DocSaveOptions();
saveOptions.Format = DocSaveOptions.DocFormat.DocX;
saveOptions.RecognizeBullets = true;
saveOptions.Mode = DocSaveOptions.RecognitionMode.Textbox;
saveOptions.RelativeHorizontalProximity = 2.5f;
// Backslashes must be escaped inside a regular C# string literal.
pdfDocument.Save(dataDir + "\\OutputDocx\\" + "output.docx", saveOptions);

The following issues were identified while converting PDF files to Docx:

Bullet points are converted to special symbols.
Some textual content is converted to an image.
Slide colors are changed and text is missing in some PDF files that contain PowerPoint slides.
For PDF files with mixed content (text and scanned), the converted files have incomplete text.

I have attached the PDF files and the converted RTF files for two of the cases mentioned above. Please help me resolve these issues.

Bullets.zip (115.2 KB)
SlideColors.zip (2.9 MB)

Thanks.

imran.rafique · August 7, 2017, 11:56pm

@Muzna_Tariq,
Thank you for the details. We are investigating your scenarios in our environment and will get back to you soon.

Best Regards,
Imran Rafique

imran.rafique · August 8, 2017, 8:36am

@Muzna_Tariq,
We have converted your PDF documents with the latest version 17.8 of Aspose.Pdf for .NET API and managed to replicate the issues as below:

File Name: SlideColors/Input.pdf
Snapshot: snapshot.png (113.0 KB)

PDFNET-43168 - PDF to DOCX - the color of slides is changed
PDFNET-43169 - PDF to DOCX - the duplicate text is added

We have linked your post to these tickets and will keep you informed regarding any available updates.

For the other issues you reported, kindly highlight the problem areas with the help of snapshots. These are the output Word documents: BulletsOutputDOCX.zip (91.8 KB) and SlideColorsOutputDOCX.zip (2.2 MB)

Best Regards,
Imran Rafique

Muzna_Tariq · August 8, 2017, 12:23pm

Please find the attached snapshots related to:

For PDF files with mixed content (text and scanned), the converted files have incomplete text.

snapshot_docx.png (7.6 KB)
snapshot_pdf.png (11.0 KB)

imran.rafique · August 9, 2017, 12:55am

@Muzna_Tariq,
We could not find the problematic area shown in your snapshots in either of your source PDF documents. Kindly let us know the PDF file name and page number so that we can investigate and share our findings with you. We await your response.

Best Regards,
Imran Rafique

Muzna_Tariq · August 10, 2017, 9:41am

Please download the zipped folder from this link. It relates to:

Some textual content is converted to an image.
For PDF files with mixed content (text and scanned), the converted files have incomplete text.

I have attached snapshots of the converted Docx files showing the issues mentioned above.

missing_text.png (7.6 KB)
text_image.jpg (76.1 KB)

imran.rafique · August 12, 2017, 8:15am

@Muzna_Tariq,
We are sorry for the delay. We managed to replicate the problem of textual content being converted to an image. It has been logged under the ticket ID in our bug tracking system. We have linked your post to this ticket and will keep you informed regarding any available updates. We are sorry for the inconvenience.

We have converted your source PDF to DOCX with the latest version 17.8 of Aspose.Pdf for .NET API and were unable to reproduce the problem of missing text (as shown in the snapshot: docx_snapshot.png). This is the output DOCX file: output17.8_DOCX.zip. Kindly let us know in case of any confusion or questions.

Best Regards,
Imran Rafique