Aspose.PDF - Text Extract Options?

jeanjunker1 · June 19, 2019, 11:47am

When using the TextAbsorber, what options can be set to change how the text is extracted or output? What is Raw mode, and what are the alternate modes?
Are there other methods to extract text that have options to control how text is extracted or output?
I see there are docs about the scaleFactor, but are there other settings?
I see an ExtractionOptions property but no documentation on what the settings are or what they do…
Regards, Jean.

Farhan.Raza · June 19, 2019, 8:05pm

@jeanjunker1

Thank you for contacting support.

The TextExtractionOptions class exposes the TextFormattingMode enumeration, which includes the Pure, Raw and MemorySaving modes. These are described in the API references.

We hope this will be helpful. Please feel free to contact us if you need any further assistance.

jeanjunker1 · June 20, 2019, 6:50am

Thanks!
I found the TextExtractionOptions settings and tried them.

BUT, I think your Docs ARE INCORRECT, as the MemorySaving mode seems to produce output exactly like the PURE mode, NOT the RAW mode…
Let me know if I am misunderstanding something in how that works…

NOTE: It would be very nice to see actual User's Guide style Docs that explain each class, how it works, and how it is called, with additional examples in VB. The current Docs don’t do a very good job of explaining the classes, how to use them, and what functionality they provide. The Docs are too technical.

However, The Code is producing very nice Text output for us so far, Good Job. Still Testing…
Jean.

Farhan.Raza · June 20, 2019, 6:58pm

@jeanjunker1

Thank you for your kind feedback.

We have tested text extraction using the code from Extract Text from Pages using Text Device and generated three TXT files, one for each extraction mode. We noticed that MemorySaving mode produces results almost identical to RAW mode, as per the description in the API references. Would you please share your PDF document along with the three TXT files containing the extracted text so that we may investigate further?
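A minimal sketch of how such a three-way comparison could be checked mechanically, assuming the three outputs are already available as strings (in practice each would be read from its TXT file). The sample texts and the function names `similarity` and `closest_mode` are hypothetical and are not part of the Aspose.PDF API; only Python's standard `difflib` is used.

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Return a 0..1 ratio of matching lines between two extracted texts."""
    return difflib.SequenceMatcher(None, a.splitlines(), b.splitlines()).ratio()

def closest_mode(outputs: dict, target: str) -> str:
    """Name the other mode whose output is most similar to outputs[target]."""
    others = [name for name in outputs if name != target]
    return max(others, key=lambda name: similarity(outputs[target], outputs[name]))

# Hypothetical sample outputs, standing in for the three TXT files.
outputs = {
    "pure": "Name      Qty\nWidget      3\n",
    "raw": "Name Qty\nWidget 3\n",
    "memory_saving": "Name Qty\nWidget 3\n",
}

if __name__ == "__main__":
    for mode in ("pure", "raw"):
        print(mode, similarity(outputs["memory_saving"], outputs[mode]))
    print("closest to memory_saving:", closest_mode(outputs, "memory_saving"))
```

The comparison is line based, so a single changed space makes a whole line count as different. That is deliberate here, because the question is whether spacing is preserved.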

Regarding a User's Guide, it would be difficult for us to cover every possible use case, so the general scenarios have been documented and we are working on improving the documentation further. As for VB, we have discontinued providing VB code examples. However, you may use an online C# to VB code converter, as per your convenience.

jeanjunker1 · May 2, 2021, 11:57am

I now have a problem with PURE mode .txt output formatting in some .pdf files with data.

ReportGW_DEVON_2020-09-Pure.zip (166.8 KB)

Attached is the PDF and the TXT file; notice the irregular spacing on the LAST DATA line. This happens randomly and sometimes at the beginning of the document. This is a big problem as we utilize the positions of the columns to process the resulting txt data. I have recently used the Scaling Technique that I saw in another post but the results are the same or similar.

mudassir.fayyaz · May 3, 2021, 1:37pm

@jeanjunker1

Can you please share a snapshot and the page number of the PDF file where you are noticing irregular spacing, so that we may help you further in this regard?

jeanjunker1 · September 17, 2021, 2:09pm

Syntergy_PLAINS-APACHE-Aspose_PURE_Text_SampleFiles.zip (197.4 KB)

In the previously submitted DEVON data files, on Page 22 (last page) of the Text file, the data are skewed and the columns are spaced irregularly. The original Pdf (provided) renders fine.

Attached in this response are TWO additional Pdf Documents of a similar style which exhibit the Irregular spacing in the TEXT output on Page 1.

We have tried some of the scaling options suggested in another post but cannot fully solve this issue. Sometimes, if we REMOVE some of the Column Heading Lines and Other Informational Text, the results are better, but not always, so we don’t have a good way to get the TEXT to render in a consistent format with consistent spacing. We cannot use the RAW mode for this because the “Columns” of data often have Blank values and there is no way to know what is missing; we can only do this with PURE mode text output, by examining the Column Locations.
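To illustrate why the column positions matter, here is a minimal sketch of positional parsing. It assumes left-aligned columns whose data starts at the same character offset as the heading. The column names, sample rows, and the helpers `column_spans` and `split_fixed_width` are hypothetical, not taken from the attached files or from the Aspose.PDF API.

```python
import re
from typing import Optional

def column_spans(header: str) -> list:
    """Derive (name, start, end) spans from a header line.

    Each heading is assumed to start its column; the column runs until the
    next heading starts, and the last column runs to the end of the line.
    """
    starts = [(m.group(), m.start()) for m in re.finditer(r"\S+", header)]
    spans = []
    for i, (name, start) in enumerate(starts):
        end: Optional[int] = starts[i + 1][1] if i + 1 < len(starts) else None
        spans.append((name, start, end))
    return spans

def split_fixed_width(line: str, spans: list) -> dict:
    """Slice a data line at the given offsets; blank cells come back as ''."""
    return {name: line[start:end].strip() for name, start, end in spans}

# Hypothetical spacing-preserved output: the OIL cell of this row is blank.
header = f"{'WELL':<10}{'OIL':<8}{'GAS'}"
row = f"{'A-1':<10}{'':<8}{'42'}"

if __name__ == "__main__":
    print(split_fixed_width(row, column_spans(header)))
    # The same row with spacing collapsed: two tokens for three columns,
    # with nothing to indicate which column is the empty one.
    print(" ".join(row.split()).split())
```

With spacing preserved, the blank OIL cell is recovered as an empty string. Once the spacing is collapsed, the row yields only two tokens and the missing column cannot be identified. The same slicing also shows why a line whose spacing is irregular breaks the processing: its values fall outside the expected offsets.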

Please advise or let us know if there is an update or fix for this.

mudassir.fayyaz · September 17, 2021, 9:13pm

@jeanjunker1

A ticket with ID PDFNET-50590 has been created in our issue tracking system to further investigate the issue on our end. This thread has been linked with the issue so that you will be notified once it is fixed.