Hi,
I need to extract individual articles from the PDF pages of a newspaper.
sample.jpg (331.5 KB)
In the example above, there are 3 articles on the page and I need to extract them individually, so a simple “PDF to text” conversion doesn’t work.
I have seen solutions, such as PDFTron, working here:
pressreader.com/brazil/folha-de-s-paulo/20220720/281479280146591/textview
Do you have a solution for this? I need an API or SDK that runs on-premise and extracts each article inside the PDF as XML or JSON.
@marisonsouza,
There are multiple ways to achieve that. Here is the documentation that describes how to do it: Extract Text from PDF using C# | Aspose.PDF for .NET
That is just text extraction and paragraph extraction.
In the other cases I would need to know the dimensions and coordinates.
But consider a printed newspaper with 6 articles on a page, with photos, titles, subtitles, body… How can I extract each article individually and understand the relation between the titles, subtitles and bodies?
@marisonsouza,
The absorbers have a rectangle, so you can limit a specific absorber to a particular article. It cannot be done automatically; you have to provide the rectangle coordinates yourself.
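To illustrate the idea of limiting extraction to one rectangle per article, here is a minimal sketch in plain Python. It does not call the Aspose.PDF API: the `Rect` class, the `group_by_article` helper, the region coordinates and the text fragments are all hypothetical, and the fragments stand in for positioned text that a PDF library would return. It only shows how hand-supplied rectangles could be used to split a page into per-article JSON.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in PDF points, origin at the bottom-left (assumed)."""
    llx: float
    lly: float
    urx: float
    ury: float

    def contains(self, x: float, y: float) -> bool:
        return self.llx <= x <= self.urx and self.lly <= y <= self.ury


def group_by_article(fragments, regions):
    """Assign each (x, y, text) fragment to the first region that contains it."""
    articles = {name: [] for name in regions}
    for x, y, text in fragments:
        for name, rect in regions.items():
            if rect.contains(x, y):
                articles[name].append((x, y, text))
                break
    result = []
    for name, items in articles.items():
        # Reading order: top of the page first (larger y), then left to right.
        items.sort(key=lambda f: (-f[1], f[0]))
        result.append({"article": name, "text": " ".join(t for _, _, t in items)})
    return result


# Hypothetical rectangles, supplied by hand for one page layout.
regions = {
    "article-1": Rect(0, 400, 300, 800),
    "article-2": Rect(300, 400, 600, 800),
}

# Hypothetical positioned text fragments, as a PDF library might report them.
fragments = [
    (20, 700, "body of the first article."),
    (320, 780, "Second headline"),
    (20, 780, "First headline:"),
    (320, 700, "and its body."),
]

articles = group_by_article(fragments, regions)
print(json.dumps(articles, indent=2))
```

Each entry in `articles` holds the text that fell inside one rectangle, in top-to-bottom order. The rectangles still have to come from you (or from a separate layout-analysis step), which matches the limitation described above.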