I am currently using Aspose.PDF for .NET version 17.11. I am trying to convert PDF files to text using the TextAbsorber class. This works for most files, but for some files the extracted text looks shuffled and has odd spacing. Are there any specific options I should use?
The conversion code is:
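using Aspose.Pdf.Text; // namespace that contains TextAbsorber

// path is the location of the PDF file to convert
var pdf = new Aspose.Pdf.Document(path);
var absorber = new TextAbsorber();
// Run the absorber over every page, then read the collected text
pdf.Pages.Accept(absorber);
var text = absorber.Text;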
The problem can be recreated with this file: BIB-DE000DGQ4425.pdf (71.4 KB)
The conversion of the attached file results in this: extraction_result.zip (6.5 KB)
What can I do to solve this problem?
@Charybdis
Thank you for contacting support.
Please try the code snippet below with Aspose.PDF for .NET 18.3 in your environment. We have attached the generated file for your reference: BIB-DE000DGQ4425_18.3.zip
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Open document (dataDir is the folder that contains the PDF, including the trailing separator)
Document pdfDocument = new Document(dataDir + "BIB-DE000DGQ4425.pdf");
// Create TextAbsorber object to extract text
// Previous attempt with default options, kept for comparison:
//TextAbsorber textAbsorber = new TextAbsorber();
TextAbsorber textAbsorber = new TextAbsorber(new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.MemorySaving));
// Accept the absorber for all the pages
pdfDocument.Pages.Accept(textAbsorber);
// Get the extracted text
string extractedText = textAbsorber.Text;
// Create a UTF-8 writer that overwrites the target file; the using block closes it even if writing fails
using (TextWriter tw = new StreamWriter(dataDir + "BIB-DE000DGQ4425_18.3.txt", false, System.Text.Encoding.UTF8))
{
    // Write the extracted text followed by a line terminator
    tw.WriteLine(extractedText);
}
If you notice any problem, please share a screenshot highlighting the issue so that we can try to reproduce and investigate it in our environment.
Setting the TextExtractionOptions as you described solves the problem, even with the older version (17.11).
However, I lose the layout of the input, which is a big deal when trying to parse tables. Is there any other way?
@Charybdis
Thank you for your kind feedback.
We are glad to hear that the problem has been resolved. The suggested approach does not extract the text in exactly the same format as it appears in the PDF file. It ensures that all text content is extracted from the PDF file, so we are afraid that the layout of the input may not be maintained.
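For the table-parsing case raised above, one possible workaround is to rebuild rows from positioned text instead of relying on the layout of the extracted string. The following is only a minimal sketch under stated assumptions, not a confirmed Aspose.PDF approach. Fragment, RowBuilder, BuildRows and yTolerance are hypothetical names introduced here for illustration. It is assumed that each piece of text is available together with its page coordinates (for example from a fragment-level absorber, which you would need to verify against your Aspose.PDF version) and that the origin is at the bottom-left of the page, as is usual in PDF.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a piece of text with its position on the page.
// In a real application these values would be copied from whatever
// positioned-text objects the PDF library exposes (an assumption, not shown here).
public sealed class Fragment
{
    public double X { get; }
    public double Y { get; }
    public string Text { get; }

    public Fragment(double x, double y, string text)
    {
        X = x;
        Y = y;
        Text = text;
    }
}

public static class RowBuilder
{
    // Groups fragments into rows by similar Y coordinate, then orders each
    // row from left to right and joins the cell texts with a separator.
    public static List<string> BuildRows(
        IEnumerable<Fragment> fragments,
        double yTolerance = 2.0,
        string separator = "\t")
    {
        var rows = new List<List<Fragment>>();

        // Larger Y is higher on the page (bottom-left origin), so sort top to bottom.
        foreach (var fragment in fragments.OrderByDescending(f => f.Y))
        {
            var current = rows.Count > 0 ? rows[rows.Count - 1] : null;

            // A fragment joins the current row when its baseline is within
            // yTolerance of the first fragment placed in that row.
            if (current != null && Math.Abs(current[0].Y - fragment.Y) <= yTolerance)
            {
                current.Add(fragment);
            }
            else
            {
                rows.Add(new List<Fragment> { fragment });
            }
        }

        return rows
            .Select(row => string.Join(separator, row.OrderBy(f => f.X).Select(f => f.Text)))
            .ToList();
    }
}
```

With the default tolerance, fragments "Name" at (50, 700), "Value" at (200, 700.5), "A" at (50, 680) and "1" at (200, 680) come back as two tab-separated rows: "Name", "Value" and "A", "1". The tolerance is a tuning knob: too small and slightly misaligned cells split into separate rows, too large and adjacent rows merge.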