Extract text & structure information from PDF files

t.crecelius · December 11, 2017, 12:21pm

Is it possible with Aspose.PDF for .NET to get structure information from a PDF file, e.g. paragraphs, tables, headers, footers, etc.?

Moreover, is it possible to get the text from multi-column layouts in the correct reading order, as a single block of text from those columns?

I checked out the example code provided with the SDK to extract text. However, the output is always a text file that tries to mimic the layout of the PDF file. When extracting text from a PDF, what matters to me is not the layout but getting the text in the correct reading order.
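To make the requirement concrete, here is a minimal, language-neutral sketch in Python of the ordering meant by "correct reading order". It is not Aspose code: the `Fragment` type, the `reading_order` function, and the `column_gap` threshold are all hypothetical names introduced for illustration, and it assumes each text fragment is already available with the position of its left and top edge (with `y` measured downward from the top of the page).

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical positioned text fragment; not an Aspose type."""
    text: str
    x: float  # left edge of the fragment
    y: float  # top edge, measured downward from the top of the page (assumption)


def reading_order(fragments, column_gap=50.0):
    """Join fragments column by column (left to right), top to bottom within each."""
    columns = []  # each column is a list of fragments with similar left edges
    for frag in sorted(fragments, key=lambda f: f.x):
        # Stay in the current column while the left edge moves by at most column_gap;
        # a larger jump is treated as the start of the next column.
        if columns and frag.x - columns[-1][-1].x <= column_gap:
            columns[-1].append(frag)
        else:
            columns.append([frag])
    lines = []
    for column in columns:
        # Within a column, read from the top of the page downward.
        lines.extend(f.text for f in sorted(column, key=lambda f: f.y))
    return "\n".join(lines)


# A two-column page whose fragments arrive interleaved line by line.
page = [
    Fragment("Left column, line 1", x=50, y=100),
    Fragment("Right column, line 1", x=320, y=100),
    Fragment("Left column, line 2", x=50, y=115),
    Fragment("Right column, line 2", x=320, y=115),
]
print(reading_order(page))
```

A layout-mimicking extraction would emit these four fragments interleaved line by line, whereas the sketch returns the whole left column followed by the whole right column. Real pages (indented lines, headings spanning both columns, tables) would need a more robust column detection than a single gap threshold.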

I had hoped to have access to the DOM structure of a PDF file, as explained in the “technical articles” section of the Aspose.PDF for .NET Documentation.

However, the “Aspose.Pdf for .NET help” file (filename: ‘Aspose.Pdf.chm’) shows a dead link for the namespace “Aspose.Pdf.DOM” and there seems to be no such namespace at all.

Kind regards,
Tom

asad.ali · December 11, 2017, 4:29pm

@t.crecelius

Thanks for contacting support.

Would you please share some more details regarding this requirement, so that we can check further on our side and share our feedback accordingly? However, in case you want to extract the content of a PDF document in a way that lets you determine whether it is a paragraph, table, footer, or header, please note that headers and footers are defined at the PDF generation stage; once the PDF is generated, they become part of the PDF content and cannot be differentiated.

Furthermore, you may extract text as well as tables from a PDF by following the instructions given in the following article(s):

Would you please share a sample PDF document, so that we can test the scenario in our environment and address it accordingly?

Please note that we are going to stop providing the *.chm file along with the API package and will refer our users to the online API Reference section instead, which is why the *.chm file was not being maintained. For the latest and updated API namespaces and classes, please visit the Aspose.Pdf for .NET API References page.