Extracting text from a page of a PDF

jhillflorida · June 22, 2015, 12:44pm

We have found that the Aspose PDF TextAbsorber object does a remarkable job of extracting text from a page of a PDF and forming it into a string that is broken into lines with carriage return/line feeds (using the Pure option). This is a very difficult undertaking and the product is performing very well. The ability to have text broken into readable lines is extremely compelling and not something that a lot of other products can do (and certainly not without expensive server runtime licenses). Thank you!

I have a question, though. We can break the text that comes from the TextAbsorber into lines based on the carriage return/line feed characters. When using the TextFragmentAbsorber, can you suggest a technique for determining which line number a text fragment belongs to? We are experimenting with a linear walk through the text fragments, comparing each one to the page text lines produced by the TextAbsorber. Can you suggest an easier method? Obviously a lot of intelligence went into deciding when the TextAbsorber breaks the text and starts a new line. I assume you are analyzing the baseline positions of the text and making a guess. Is there any correlation that we could use to trace back? Or are we simply on the right track by walking through the text fragments one at a time while comparing them to the entire page text from the TextAbsorber?
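For reference, the linear walk described above can be sketched roughly as follows. This is only an illustration in plain Python, not Aspose code: `page_text` stands in for the string returned by the TextAbsorber, `fragment_texts` stands in for the text of each fragment in reading order, and the function name `assign_line_numbers` is made up for the example. It assumes each fragment's text appears verbatim inside a single line of the page text, which may not hold if the extraction pads or merges whitespace.

```python
def assign_line_numbers(page_text, fragment_texts):
    """Map each fragment (given in reading order) to the index of the
    page-text line that contains it, or None if no match is found."""
    lines = page_text.splitlines()  # handles \r\n, \n and \r
    result = []
    line_no, col = 0, 0  # cursor: current line and column within it

    for text in fragment_texts:
        needle = text.strip()
        if not needle:
            result.append(None)  # whitespace-only fragment, nothing to match
            continue

        found = None
        i, start = line_no, col
        # Search forward only, so repeated words match in document order.
        while i < len(lines):
            pos = lines[i].find(needle, start)
            if pos != -1:
                found = i
                line_no, col = i, pos + len(needle)
                break
            i += 1
            start = 0
        # On a miss the cursor is left where it was.
        result.append(found)

    return result


page_text = "Hello world\r\nSecond line here\r\nHello again"
fragments = ["Hello", "world", "Second line", "here", "Hello", "again"]
print(assign_line_numbers(page_text, fragments))
```

Because the cursor only moves forward, a fragment that repeats earlier text (the second "Hello" here) is matched to the later line rather than the first occurrence.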

Thanks in advance.

jhillflorida · June 22, 2015, 7:17pm

The reason I ask is that we would like to find the position of every character that makes up a line of text.

codewarior · June 23, 2015, 4:36pm

Hi Joseph,

Thanks for contacting support.

Determining line breaks so that the line number of each TextFragment can be identified is currently not supported. However, we have logged this requirement as PDFNEWNET-37251 in our issue tracking system. The development team will look into the details of this requirement, and we will keep you posted on its status. Please be patient and spare us a little time.

We are sorry for this inconvenience.

jhillflorida · June 24, 2015, 12:51pm

Thanks for considering this. We can piece the line text back together ourselves. But if we knew, for each text block, which line you considered it to be part of, that would be huge. Failing that, some way to trace a line back to the text components that were used to build it would let us look up their position information.
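In the meantime, a rough client-side approximation is to group fragments by baseline ourselves. The sketch below is a guess at the general idea, not how the TextAbsorber actually decides line breaks. `Frag` is a hypothetical stand-in for whatever position data the fragments expose (left edge `x`, baseline `y` in PDF coordinates with y growing upward), and the `tolerance` value is an arbitrary assumption that would need tuning per document.

```python
from dataclasses import dataclass


@dataclass
class Frag:
    text: str
    x: float  # left edge of the fragment
    y: float  # baseline; PDF coordinates, larger y is higher on the page


def group_by_baseline(frags, tolerance=2.0):
    """Group fragments into lines: top to bottom, then left to right.

    A fragment joins the current line when its baseline is within
    `tolerance` units of that line's first (topmost) fragment.
    """
    lines = []
    for frag in sorted(frags, key=lambda f: -f.y):
        if lines and abs(lines[-1][0].y - frag.y) <= tolerance:
            lines[-1].append(frag)
        else:
            lines.append([frag])
    for line in lines:
        line.sort(key=lambda f: f.x)
    return lines


frags = [
    Frag("world", 60.0, 700.5),
    Frag("Hello", 10.0, 700.0),
    Frag("Next", 10.0, 686.0),
]
grouped = group_by_baseline(frags)
for number, line in enumerate(grouped):
    print(number, [f.text for f in line])
```

Each fragment keeps its own coordinates, so once the line index is known the position information is still available. Superscripts, rotated text, and multi-column layouts would likely need more than a single tolerance check.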

codewarior · June 26, 2015, 4:48am

Hi Joseph,

Thanks for sharing the details and sorry for the delayed response.

The requirement to determine the position information of text can be addressed during the investigation/implementation of the earlier logged requirement. These details have been associated with the requirement ID, and the development team will consider these points when implementing the feature.

Please note that currently we do not have any option to manipulate text lines and determine the position of text elements.

aspose.notifier · April 13, 2018, 6:53pm

The issues you have found earlier (filed as PDFNET-37251) have been fixed in Aspose.PDF for .NET 18.4. This message was posted using BugNotificationTool from Downloads module by asad.ali