Issue in extracting text from blocks in a single line

sudesh · August 22, 2019, 7:26am

When we extract text from a PDF file that contains two columns, or text laid out in blocks, the text of a block is not extracted as one continuous run. Please refer to the attachment.

The expected output is something like ‘A black hole is a region of spacetime exhibiting gravitational acceleration so strong that nothing…’ but instead the extraction combines text from both columns and returns it line by line.

This is from the sample code we use:
TextAbsorber textAbsorber = new TextAbsorber();
pdfDocument.Pages.Accept(textAbsorber);
string extractedText = textAbsorber.Text;

Please suggest any changes that would give us the right output.

2.pdf (6.1 MB)

asad.ali · August 22, 2019, 5:34pm

@sudesh

Please check the attached screenshot of the extracted text, which we obtained while testing the scenario with Aspose.PDF for .NET 19.8 and the following code snippet:

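// Configure the absorber to extract text in Pure formatting mode
TextAbsorber ta = new TextAbsorber();
ta.ExtractionOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Pure);
// Load the document and run the absorber over all pages
Document pdfDocument = new Document(dataDir + "2.pdf");
pdfDocument.Pages.Accept(ta);
string text = ta.Text;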

ExtractedText.png (20.8 KB)

Would you please make sure that you are using Aspose.PDF for .NET 19.8? In case you still face any issue, please share a screenshot of the results that you are getting on your side. We will test the scenario again in our environment and address it accordingly.

sudesh · August 23, 2019, 9:35am

The output is exactly the same as what we get here. Now please try to select the first line. The text in a column is continuous, like a paragraph. So the expected output would be

A black hole is a region of spacetime exhib-
iting gravitational acceleration so strong

instead of

A black hole is a region of spacetime exhib- The idea of a body so massive that even
iting gravitational acceleration so strong

The paragraph loses its continuity if the text is read from both columns side by side. We want a way for Aspose to recognize each column as a block of text and give us continuous output.

asad.ali · August 23, 2019, 8:51pm

@sudesh

Thanks for elaborating further.

Would you please also share where and how you are using the extracted text? In other words, are you generating a .txt file from the extracted text, or some other file format? This would help us investigate the scenario further.

sudesh · August 26, 2019, 6:44am

We actually do not save the text to any file. We run a search directly on the extracted text and use the search results in our application. That is why we want continuity in the text of paragraphs (blocks): if we look for two strings that are near each other in a paragraph, they may no longer be near each other when the text is extracted in the above way (line by line across both columns).

asad.ali · August 26, 2019, 7:29pm

@sudesh

Thanks for further elaboration.

We have logged an issue as PDFNET-46899 in our issue tracking system for further investigation. We will look into the details of the issue and keep you posted on the status of its rectification. Please be patient and give us a little time.

We are sorry for the inconvenience.
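
While PDFNET-46899 is under investigation, one possible workaround is to restrict extraction to one column region at a time, so that each column is read top to bottom before the next. The following is only a minimal sketch, not a confirmed fix for the logged issue. It assumes every page has exactly two columns separated at the horizontal midpoint of the page, which may not hold for your documents; ColumnTextExtractor, ExtractTwoColumns and ExtractRegion are hypothetical names used for illustration.

```csharp
using System.Text;
using Aspose.Pdf;
using Aspose.Pdf.Text;

public static class ColumnTextExtractor
{
    // Extracts the left column, then the right column, of every page.
    // ASSUMPTION: two columns split at the horizontal midpoint of the page.
    public static string ExtractTwoColumns(string pdfPath)
    {
        var result = new StringBuilder();
        using (var pdfDocument = new Document(pdfPath))
        {
            foreach (Page page in pdfDocument.Pages)
            {
                Rectangle bounds = page.Rect;
                double middleX = (bounds.LLX + bounds.URX) / 2; // assumed column gutter

                var left = new Rectangle(bounds.LLX, bounds.LLY, middleX, bounds.URY);
                var right = new Rectangle(middleX, bounds.LLY, bounds.URX, bounds.URY);

                result.AppendLine(ExtractRegion(page, left));
                result.AppendLine(ExtractRegion(page, right));
            }
        }
        return result.ToString();
    }

    // Runs a TextAbsorber limited to the given rectangle of the page.
    private static string ExtractRegion(Page page, Rectangle region)
    {
        var absorber = new TextAbsorber();
        absorber.TextSearchOptions.LimitToPageBounds = true;
        absorber.TextSearchOptions.Rectangle = region;
        page.Accept(absorber);
        return absorber.Text;
    }
}
```

Calling ColumnTextExtractor.ExtractTwoColumns(dataDir + "2.pdf") would return the columns one after another. Words hyphenated at line ends (such as "exhib-" / "iting") are still split, so a proximity search may need to rejoin them first. Pages with a different layout would need their own rectangles.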