Free Support Forum - aspose.com

Aspose.PDF - Text Extract Options?

When using the TextAbsorber, which options can be set to change how the text is extracted or output? What is Raw mode, and what are the alternative modes?
Are there other methods to extract text that have options to control how text is extracted or output?
I see there are docs about the scaleFactor, but are there other settings?
I see an ExtractionOptions property, but no documentation on what the settings are or what they do…
Regards, Jean. :slight_smile:

@jeanjunker1

Thank you for contacting support.

The TextExtractionOptions class exposes the TextFormattingMode enumeration, which includes the Pure, Raw and MemorySaving modes. These are described in the API references.

We hope this will be helpful. Please feel free to contact us if you need any further assistance.

Thanks!
I found the TextExtractionOptions settings and tried them.

BUT, I think your docs ARE INCORRECT, as the MemorySaving mode seems to produce output exactly like the PURE mode, NOT the RAW mode…
Let me know if I am misunderstanding something in how that works…

NOTE: It would be very nice to see actual User's Guide type docs that explain each class, how it works, and how it is called, with additional examples in VB. The current docs don't do a very good job of explaining the classes, how to use them, and what functionality they provide. The docs are too technical.

However, The Code is producing very nice Text output for us so far, Good Job. Still Testing…
Jean. :slight_smile:

@jeanjunker1

Thank you for your kind feedback.

We have tested text extraction using the code from Extract Text from Pages using Text Device and generated three TXT files, one for each extraction mode. We noticed that MemorySaving mode produces results almost identical to RAW mode, consistent with the description in the API references. Would you please share your PDF document along with the three TXT files containing the extracted text so that we may investigate further?
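To check which modes really produce matching output, the three TXT files can be compared directly instead of by eye. The following is only a minimal sketch in Python using the standard library. The function name `compare_outputs` and the mode labels are illustrative assumptions and not part of the Aspose.PDF API; it expects the extracted text to have already been read into strings.

```python
import difflib
from itertools import combinations


def compare_outputs(outputs):
    """Return a similarity ratio (0.0 to 1.0) for every pair of outputs.

    `outputs` maps a label such as "Pure" to the text extracted in that mode.
    The labels are only dictionary keys chosen for this illustration.
    """
    results = {}
    # Sort the labels so each pair always appears in a predictable order.
    for first, second in combinations(sorted(outputs), 2):
        matcher = difflib.SequenceMatcher(
            None, outputs[first], outputs[second], autojunk=False
        )
        results[(first, second)] = matcher.ratio()
    return results
```

A ratio of exactly 1.0 for a pair means the two files are character-for-character identical; anything lower means they differ somewhere, including in whitespace.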

Regarding a User's Guide, it would be difficult for us to cover every possible use case, so the general scenarios have been documented and we are working on improving the documentation further. As for VB, we have discontinued providing VB code examples. However, you may use an online C# to VB code converter at your convenience.

I now have a problem with PURE mode .txt output formatting in some .pdf files with data.
ReportGW_DEVON_2020-09-Pure.zip (166.8 KB)
Attached are the PDF and the TXT file; notice the irregular spacing on the LAST DATA line. This happens randomly, and sometimes at the beginning of the document. This is a big problem, as we use the positions of the columns to process the resulting txt data. I have recently tried the scaling technique that I saw in another post, but the results are the same or similar.
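To pinpoint which lines are affected, the column start positions of each line can be compared against the most common layout in the file. This is only a rough sketch in Python. It assumes left-aligned columns whose values contain no internal spaces, and the function names `column_starts` and `find_misaligned_lines` are hypothetical helpers, not part of Aspose.PDF.

```python
import re
from collections import Counter


def column_starts(line):
    """Return the character offsets at which each whitespace-separated field begins."""
    return tuple(match.start() for match in re.finditer(r"\S+", line))


def find_misaligned_lines(text):
    """Return (line_number, offsets) for lines whose offsets differ from the usual layout.

    Line numbers are 1-based. Blank lines are ignored. The "usual" layout is
    simply the most frequent tuple of offsets, which is an assumption that
    only holds when most rows share the same columns.
    """
    rows = [
        (number, column_starts(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if line.strip()
    ]
    if not rows:
        return []
    expected, _count = Counter(starts for _, starts in rows).most_common(1)[0]
    return [(number, starts) for number, starts in rows if starts != expected]
```

Header or title rows will also be reported because their layout differs from the data rows, so the result is a list of candidates to inspect rather than a definitive list of defects.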

@jeanjunker1

Can you please share a snapshot and the page number of the PDF file where you are noticing the irregular spacing, so that we may help you further in this regard?