PdfExtractor GetText not retaining document order

Thread77 · June 24, 2011, 4:27pm

Hello,

I am using PdfExtractor to get the text from a PDF file. The problem is that the order of the text isn’t preserved. For example, the PDF contains the following table-like structure:

Col1 Col2 Col3 Col4
----------------------------------
Val1 Val2 Val3 Val4
Val5 Val6 Val7 Val8

When I do GetText and save the text to file, it comes out as:

Col1 Col2
----------------
Val1 Val2
Val5 Val6
Val3
Val4
Val7
Val8

I looked at the raw PDF file, and the extracted order matches the order of the raw PDF data, but I want the text saved in the same order as it appears on screen. Is this possible? I tried changing the value of ExtractTextMode, but this had no effect. I updated to the most recent version of Aspose today. We also have a license, so this isn’t an evaluation copy.
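
To make "screen order" concrete, here is a minimal sketch of the reordering being asked for. It is not Aspose code and uses no Aspose API: it assumes each piece of text is available as a hypothetical (text, x, y) tuple in PDF coordinates, where y grows upward, and the coordinates and the Col3/Col4 entries below are made up for illustration. Fragments are grouped into rows by similar y, and each row is read left to right.

```python
from typing import List, Tuple

# Hypothetical fragment shape: (text, x, y) in PDF points, y increasing upward.
Fragment = Tuple[str, float, float]


def reading_order(fragments: List[Fragment], y_tolerance: float = 2.0) -> List[str]:
    """Return one string per visual row, top to bottom, left to right."""
    # Sort top to bottom first (larger y is higher on the page), then by x.
    ordered = sorted(fragments, key=lambda f: (-f[2], f[1]))

    rows: List[List[Fragment]] = []
    for frag in ordered:
        # A fragment joins the current row if its y is close to the row's first y.
        if rows and abs(rows[-1][0][2] - frag[2]) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])

    # Within each row, read left to right.
    return [
        " ".join(text for text, _, _ in sorted(row, key=lambda f: f[1]))
        for row in rows
    ]


# Fragments listed in a content-stream-like order: left columns first,
# right columns afterwards (made-up coordinates).
raw_fragments: List[Fragment] = [
    ("Col1", 50, 700), ("Col2", 150, 700),
    ("Val1", 50, 680), ("Val2", 150, 680),
    ("Val5", 50, 660), ("Val6", 150, 660),
    ("Col3", 250, 700), ("Col4", 350, 700),
    ("Val3", 250, 680), ("Val4", 350, 680),
    ("Val7", 250, 660), ("Val8", 350, 660),
]

lines = reading_order(raw_fragments)
for line in lines:
    print(line)
```

The y_tolerance parameter is an assumption of this sketch; it lets text on slightly different baselines still count as the same row.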

Any help would be greatly appreciated!
Richard

shahzadlatif · June 25, 2011, 4:56am

Hi Richard,

Please share the input PDF file with us so that we can investigate the issue at our end and guide you accordingly.

We’re sorry for the inconvenience and look forward to helping you.
Regards,

Thread77 · June 25, 2011, 4:55pm

Hi Shahzad,

I sent a copy of the document I am having issues with to your email. Please let me know if you didn't receive it.

Thanks for your assistance!

Richard

shahzadlatif · June 27, 2011, 4:02am

Hi Richard,

Thank you very much for sharing the PDF file. I have reproduced this problem at my end and logged it as PDFKITNET-28723 in our issue tracking system. Our team will look into this issue and you’ll be updated via this forum thread once it is resolved.

We’re sorry for the inconvenience.
Regards,