Can't extract text from PDF with image

egreaves · July 19, 2017, 3:00pm

I am using Aspose.Pdf 9.9.0.0 to extract the text from a PDF. The code I use works fine with a similar PDF, but not with the attached one:

TESTING NEW REPORT.pdf (1.2 MB)

For page 1, only the header, footer, and heading are extracted. The difference between the two PDFs is that the problem one has a header, a footer, and a logo image, whereas the one that works is a page of text within boxes. How do I get all of the text on this first page (and on all of the other pages)?

Here is the extracted text:

Patient Name : LASTNAME, FIRSTNAME Page 1 of 18

                                  Brampton Civic Hospital - William Osler Health Centre
                                                  2100 Bovaird Drive E
                                                      Brampton, ON
                                                      (905) 494-2120
                                                PATIENT DEMOGRAPHICS


                                                      Serial # : 15961

egreaves · July 19, 2017, 6:16pm

I also tried Aspose.Pdf 17.7.0.0. It compiled and ran (after I removed the code that loads the license), but now the extraction returns even less text:

Evaluation Only. Created with Aspose.Pdf. Copyright 2002-2017 Aspose Pty Ltd.
Patient Name : LASTN

egreaves · July 19, 2017, 6:53pm

I requested an evaluation license and tried again with Aspose.Pdf 17.7.0.0. Now I’m seeing the same behavior as 9.9.0.0.

imran.rafique · July 19, 2017, 9:14pm

@egreaves,
We were able to reproduce the problem: not all text items are extracted from the PDF. It has been logged under the ticket ID PDFNET-43064 in our bug tracking system. We have linked your post to this ticket and will keep you informed of any updates.

Best Regards,
Imran Rafique

egreaves · July 31, 2017, 6:32pm

Hello,

Is there an update on this issue? I see the bug is still open. I'm asking because we are approaching a software release and I need to know whether we have to upgrade the Aspose PDF library to fix this issue. We also use the Doc library.

-Ed Greaves

imran.rafique · August 1, 2017, 12:46am

@egreaves,
The ticket PDFNET-43064 is pending analysis and has not been resolved yet. We have logged an ETA under the same ticket ID in our issue tracking system. We will let you know once significant progress has been made.

Best Regards,
Imran Rafique

egreaves · August 11, 2017, 8:26pm

Now that this issue is resolved, can you tell me when the fix will be released? That is, what version do I need to buy to get it? Can I use a trial to check whether it fixes my issue?

imran.rafique · August 12, 2017, 8:30am

@egreaves,
The ticket PDFNET-43064 has been resolved. If no issue comes up in the quality assurance phase, the fix will be included in the next version, Aspose.Pdf for .NET 17.9, which is expected to be released in mid-September 2017. We will notify you once version 17.9 is published.

Best Regards,
Imran Rafique

naeem.akram · October 12, 2017, 4:29pm

@egreaves,

Thanks for your patience.

We are pleased to share that the issue PDFNET-43064 reported earlier has been resolved in the recent release, Aspose.Pdf for .NET 17.9.

Please try the latest release, and if you run into any issues, feel free to contact us.