I need to extract individual articles from PDF pages of the newspaper.
sample.jpg (331.5 KB)
In the example above, there are 3 articles on the page and I need to extract them individually, so a simple “pdf to text” conversion doesn’t work.
I saw some solutions like pdftron working here:
Do you have a solution for this? I need an API or SDK that runs on-premise and extracts each article inside the PDF as XML or JSON.
There are multiple ways to achieve that. The documentation describes how to do it: Extract Text from PDF using C# | Aspose.PDF for .NET
That is just text extraction and paragraph extraction.
In other cases I would need to know the dimensions and coordinates.
But consider a printed newspaper page with 6 articles on it, with photos, titles, subtitles, body text… How can I extract each article individually and understand the relation between the titles, subtitles and bodies?
The absorbers have a rectangle. That way you can limit a specific absorber to a particular article. It cannot be done automatically: you have to give the rectangle coordinates yourself.
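To illustrate the rectangle idea, here is a minimal, library-agnostic sketch in Python. It is not the Aspose.PDF API. It assumes the text fragments and their page coordinates have already been extracted by some tool, and that you supply one rectangle per article by hand. The names `Rect`, `Fragment` and `group_by_article`, and all coordinates, are hypothetical and only serve the example. Coordinates are assumed to follow the usual PDF convention (origin at the bottom-left, so a larger y is higher on the page).

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Hypothetical article region: lower-left and upper-right corners."""
    llx: float
    lly: float
    urx: float
    ury: float

    def contains(self, x: float, y: float) -> bool:
        return self.llx <= x <= self.urx and self.lly <= y <= self.ury


@dataclass(frozen=True)
class Fragment:
    """Hypothetical text fragment with the position of its anchor point."""
    text: str
    x: float
    y: float


def group_by_article(fragments, article_rects):
    """Assign each fragment to the first rectangle that contains it.

    Fragments outside every rectangle (for example page footers) are dropped.
    """
    buckets = {name: [] for name in article_rects}
    for frag in fragments:
        for name, rect in article_rects.items():
            if rect.contains(frag.x, frag.y):
                buckets[name].append(frag)
                break

    result = []
    for name, frags in buckets.items():
        # Reading order: top of the page first, then left to right.
        frags.sort(key=lambda f: (-f.y, f.x))
        result.append({"article": name, "text": " ".join(f.text for f in frags)})
    return result


# Made-up rectangles and fragments, for illustration only.
article_rects = {
    "article-1": Rect(0, 400, 300, 800),
    "article-2": Rect(300, 400, 600, 800),
}
fragments = [
    Fragment("body of the first article.", 20, 700),
    Fragment("Second headline", 320, 780),
    Fragment("First headline", 20, 780),
    Fragment("Page footer", 20, 30),
]

print(json.dumps(group_by_article(fragments, article_rects), indent=2))
```

This only groups text by region. Telling titles, subtitles and body text apart would need extra signals such as font size, which this sketch does not attempt.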