I need to extract individual articles from newspaper PDF page

marisonsouza · April 18, 2023, 7:41pm

Hi,

I need to extract individual articles from PDF pages of the newspaper.

In the example above, there are 3 articles on the page and I need to extract them individually, so a simple “pdf to text” doesn’t work.

I saw some solutions like pdftron working here:

pressreader.com/brazil/folha-de-s-paulo/20220720/281479280146591/textview

Do you have any solution for it? I need an API or SDK that runs on-premise and outputs each article in the PDF as XML or JSON.

carlos.molina · April 18, 2023, 8:18pm

@marisonsouza,

There are multiple ways to achieve that. Here is the link to the documentation that describes how to achieve it: Extract Text from PDF using C#|Aspose.PDF for .NET

marisonsouza · April 19, 2023, 5:49pm

That’s just text extraction and paragraph extraction.
In other cases I would need to know the dimensions and coordinates.

But consider a newspaper (print) with 6 articles inside it, with photos, titles, subtitles, body… how can I extract each article individually and understand the relation between the titles, subtitles and bodies?

carlos.molina · April 19, 2023, 6:02pm

@marisonsouza,

The absorbers have a rectangle. That way you can limit a specific absorber to a particular article. It cannot be done automatically. You have to give the Rectangle coordinates.
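
To illustrate the rectangle idea, here is a minimal, library-agnostic sketch in Python. It is not Aspose.PDF code: it assumes you already have each word with its position on the page (from whatever extractor you use) and that you supply one rectangle per article by hand. The names `Rect`, `Word`, `extract_articles`, the coordinates and the sample words are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned rectangle in PDF points, origin at the bottom-left of the page."""
    llx: float  # lower-left x
    lly: float  # lower-left y
    urx: float  # upper-right x
    ury: float  # upper-right y

    def contains(self, x: float, y: float) -> bool:
        return self.llx <= x <= self.urx and self.lly <= y <= self.ury


@dataclass(frozen=True)
class Word:
    text: str
    x: float  # centre of the word's bounding box
    y: float


def extract_articles(words, regions):
    """Assign each word to the first region that contains its centre point."""
    buckets = {name: [] for name in regions}
    for word in words:
        for name, rect in regions.items():
            if rect.contains(word.x, word.y):
                buckets[name].append(word)
                break  # a word belongs to at most one article
    result = []
    for name, members in buckets.items():
        # Reading order: top to bottom (descending y), then left to right.
        ordered = sorted(members, key=lambda w: (-w.y, w.x))
        result.append({"article": name, "text": " ".join(w.text for w in ordered)})
    return result


# Hypothetical page: two side-by-side articles, rectangles given manually.
regions = {
    "article_1": Rect(0, 400, 300, 800),
    "article_2": Rect(300, 400, 600, 800),
}
# Hypothetical extractor output, deliberately out of reading order.
words = [
    Word("rises", 120, 700), Word("Inflation", 40, 700),
    Word("wins", 420, 700), Word("Team", 340, 700),
    Word("again", 40, 680),
]

articles = extract_articles(words, regions)
print(json.dumps(articles, indent=2))
```

Each rectangle plays the role of one absorber limited to one article, and the result is one JSON object per article. Relating titles, subtitles and bodies would still need extra rules (for example font size), which this sketch does not attempt.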