Processing PDF content for shapes

MichaelP · March 29, 2021, 7:38am

Hello,
Using C#, I’m analyzing PDF documents to check whether any text, images, or other “objects” fall outside a specified margin.
Using Word, I created a document containing a shape (a basic arrow), images, and text.
I can easily find the images and text, but I have not managed to retrieve the shape.

Is there a way to retrieve all graphical objects included in a page?
Many thanks
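
The margin test described above can be sketched independently of any PDF library. This is a minimal illustration, assuming each object's bounding box is already known in PDF points with the origin at the bottom-left of the page; `MarginCheck` and `IsInsideMargins` are hypothetical names, not part of Aspose.PDF.

```csharp
using System;

public static class MarginCheck
{
    // All values are in PDF points, with the origin at the bottom-left of the page.
    // (llx, lly) is the lower-left corner of the box, (urx, ury) the upper-right corner.
    // Returns true when the box lies entirely inside the area left by a uniform margin.
    public static bool IsInsideMargins(
        double llx, double lly, double urx, double ury,
        double pageWidth, double pageHeight, double margin)
    {
        return llx >= margin
            && lly >= margin
            && urx <= pageWidth - margin
            && ury <= pageHeight - margin;
    }
}
```

For example, on a 595 x 842 point page with a 36 point margin, a box from (50, 50) to (100, 100) passes, while a box starting at x = 10 does not.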

asad.ali · March 29, 2021, 6:54pm

@MichaelP

Can you please provide your sample PDF document, along with a description of the output you expect from the API? Do you want to extract the coordinates of the objects? We will test the scenario in our environment and address it accordingly.

MichaelP · March 30, 2021, 6:51am

@asad.ali

Many thanks for your reply.

Here are two example files:
Doc-00.pdf (77.3 KB)
Doc-11.pdf (39.0 KB)

What I’m expecting as output is a list of “graphical” objects with their location and size. I don’t care about their content.

Using the example from “Extract Table from Existing PDF Document”, I’ve been able to get some of the missing information about table decoration.

For Doc-00.pdf, I’m able to ‘extract’ the rectangles from both pages with their position and size. I still don’t know whether they all belong to graphics.
For Doc-11.pdf, I extract many rectangles at the same place…
On this page there are, visually, three images. One of them (the second shower picture in the middle of the page) is a Word shape. I’m not able to collect any information about this one.

Many thanks for any help on this question.
Regards

ADDENDUM:
On Doc-11.pdf, I retrieve many X coordinates with a value near 0 (8.871E-06). If I remove these strange values, I get my 3 rectangles. I still have to check whether they correspond to what I want.
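
The clean-up described in the addendum can be sketched as follows. This is only an illustration under stated assumptions: the extracted rectangles have already been copied into a simple `Box` record (a hypothetical type, not an Aspose.PDF class), near-zero X values are treated as artifacts, and rectangles "at the same place" are treated as duplicates once rounded. The `epsilon` and `decimals` defaults are arbitrary.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical value type holding a rectangle's position and size in PDF points.
public sealed record Box(double X, double Y, double Width, double Height);

public static class BoxFilter
{
    // Drops boxes whose X is within epsilon of zero, then collapses boxes that
    // are equal once every field is rounded to the given number of decimals.
    public static List<Box> Clean(IEnumerable<Box> boxes, double epsilon = 1e-3, int decimals = 1)
    {
        return boxes
            .Where(b => Math.Abs(b.X) > epsilon)
            .Select(b => new Box(
                Math.Round(b.X, decimals),
                Math.Round(b.Y, decimals),
                Math.Round(b.Width, decimals),
                Math.Round(b.Height, decimals)))
            .Distinct() // records compare by value, so equal boxes collapse
            .ToList();
    }
}
```

Two caveats apply: a graphic that genuinely starts at the left edge of the page (X = 0) would be dropped as well, and rounding-based de-duplication can miss two near-identical boxes that happen to fall on either side of a rounding boundary.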

asad.ali · March 30, 2021, 8:20pm

@MichaelP

Thanks for sharing the sample files.

We have checked the shared PDF files and noticed that they contain text, images, and drawn graphics. You can extract the text and images in a PDF along with their position and other properties. However, we have some logged tickets to investigate whether extraction of drawn graphics is possible. Could you also share the code snippet you used to extract the rectangles from Doc-00.pdf? We will check the related information on our end and share our feedback with you.

MichaelP · March 31, 2021, 12:25pm

Here is the C# class I use for extracting drawn graphics.
This class lists these graphics per page and saves a PNG file with the graphics drawn on it for visual checking.
The PNG file is located in the PDF source folder.

For the file Doc-11.pdf, which is generated using Word, the output seems pretty near what I’m looking for.
For the file Doc-00.pdf, which is generated by code using Aspose.Pdf, you can see that the Y coordinate of at least one drawn graphic is inverted. I don’t know why.

Many thanks for your help.

PdfAsposeTest.zip (1.6 KB)
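
The cause of the inverted Y coordinate is not established in this thread. As a defensive measure, a rectangle reported as an origin plus a width and height can be normalized so that the lower-left corner is always below and to the left of the upper-right corner, whatever the sign of the size. PDF user space has its origin at the bottom-left by default, while raster images usually have theirs at the top-left, so a second helper converts a PDF Y value for drawing on the PNG. `RectNormalizer` and its methods are hypothetical names, not Aspose.PDF API.

```csharp
using System;

public static class RectNormalizer
{
    // Accepts an origin and a size that may be negative, and returns the corners
    // ordered so that Llx <= Urx and Lly <= Ury.
    public static (double Llx, double Lly, double Urx, double Ury) Normalize(
        double x, double y, double width, double height)
    {
        double x2 = x + width;
        double y2 = y + height;
        return (Math.Min(x, x2), Math.Min(y, y2), Math.Max(x, x2), Math.Max(y, y2));
    }

    // Converts a Y value measured from the bottom of the page (PDF convention)
    // to a Y value measured from the top (typical image convention).
    public static double ToTopDownY(double pdfY, double pageHeight)
    {
        return pageHeight - pdfY;
    }
}
```

For example, `Normalize(10, 200, 50, -80)` yields the corners (10, 120) and (60, 200). When drawing that box on an image of an 842 point high page, its top edge is at `ToTopDownY(200, 842)`, which is 642.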

asad.ali · March 31, 2021, 7:54pm

@MichaelP

We were able to reproduce the issue in our environment while testing the scenario using Aspose.PDF for .NET 21.3 and your code snippet. We have logged an issue as PDFNET-49691 in our issue tracking system for further investigation. We will look into the details and keep you posted on the status of the fix. Please be patient and give us some time.

We are sorry for the inconvenience.

MichaelP · April 1, 2021, 8:48am

Many thanks for your help.
I’ll wait for the fix.

In the meantime, I’m implementing another approach: exporting the page to a PNG image and checking whether it contains any pixels that are neither white nor transparent.
It is not an accurate method, since it depends on the PDF quality, but for now it is the only way I can think of.

Many thanks again.
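
The pixel-based fallback can be sketched as below. This is a minimal illustration rather than the code actually used: it assumes the rendered page has already been loaded into a `uint[,]` array of 0xAARRGGBB values indexed as `[row, column]`, that the margin is uniform on all four sides, and that "white" means every colour channel is at or above a threshold. `MarginScan` and its members are hypothetical names.

```csharp
using System;

public static class MarginScan
{
    // Converts a length in PDF points (1/72 inch) to pixels at the given resolution.
    public static int PointsToPixels(double points, double dpi)
    {
        return (int)Math.Round(points * dpi / 72.0);
    }

    // Returns true if any pixel in the margin band is neither fully transparent nor white.
    public static bool HasInkInMargin(uint[,] pixels, int marginPx, byte whiteThreshold = 250)
    {
        int rows = pixels.GetLength(0);
        int cols = pixels.GetLength(1);
        for (int r = 0; r < rows; r++)
        {
            for (int c = 0; c < cols; c++)
            {
                bool insideContentArea =
                    r >= marginPx && r < rows - marginPx &&
                    c >= marginPx && c < cols - marginPx;
                if (!insideContentArea && IsInk(pixels[r, c], whiteThreshold))
                    return true;
            }
        }
        return false;
    }

    // A pixel counts as ink when it is not fully transparent and at least one
    // colour channel is darker than the white threshold.
    private static bool IsInk(uint argb, byte whiteThreshold)
    {
        byte a = (byte)((argb >> 24) & 0xFF);
        if (a == 0) return false;
        byte r = (byte)((argb >> 16) & 0xFF);
        byte g = (byte)((argb >> 8) & 0xFF);
        byte b = (byte)(argb & 0xFF);
        return r < whiteThreshold || g < whiteThreshold || b < whiteThreshold;
    }
}
```

A threshold slightly below 255 tolerates anti-aliasing and compression noise near white, at the cost of missing very faint marks. Rendering at a higher resolution narrows the error when converting the margin from points to pixels.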

asad.ali · April 1, 2021, 7:43pm

@MichaelP

Sure, please go ahead and try the other method as your requirements dictate. We will work on resolving the ticket on a first-come, first-served basis and will inform you once it is resolved.