Text extraction from shape in PDF

choiys · January 17, 2023, 8:05am

6_partofpart.pdf (44.2 KB)
6_partofpart_textboxremove.pdf (2.8 KB)

Hi,

When I extract text from the 6_partofpart.pdf file, “BP2-08” is extracted, but “BP2-05” is not. When I checked in a PDF editor, the characters “B”, “P”, “2”, “-”, “0”, and “5” turned out to be shapes rather than text. Is there a way to extract them as text using Aspose?

For reference, all characters in the 6_partofpart_textboxremove.pdf file are shapes.

andrey.nekrasov · January 17, 2023, 4:32pm

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

  Issue ID(s): PDFNET-53482

You can obtain Paid Support services if you need support on a priority basis, along with direct access to our Paid Support management team.

choiys · January 17, 2023, 10:38pm

What does PDFNET-53482 mean?
I can’t tell whether the ticket is for developing a function that extracts text from shapes, or only for checking whether such a function could be developed.

andrey.nekrasov · January 18, 2023, 10:17am

@choiys
This ticket is for investigating the problem and looking for a workaround, if one is possible. For now, we have no plans to implement this feature.

choiys · January 18, 2023, 10:56am

Are you an employee here? This is the first time I have received such a strange answer, saying that there is no plan.

andrey.nekrasov · January 18, 2023, 2:19pm

@choiys

We are sorry if any of our previous replies caused confusion. Please note that Aspose.PDF specializes in working with PDF documents and converting them to other file formats. Recognizing text from an image or shape is outside the scope of the API.

Aspose.PDF does offer a feature for converting scanned PDF documents into searchable PDF documents, for which it uses third-party OCR.

The PDF you provided has mixed content, i.e. text and images, and the images in it are actually drawn graphics. We therefore logged an investigation ticket to determine whether there is any workaround for extracting text from it. Apparently, this could be achieved by converting the whole page to an image and then performing OCR on that image, because at the moment drawn objects/shapes cannot be extracted from a PDF. A separate ticket dedicated to extracting drawn shapes is already logged in our issue tracking system as PDFNET-51913.

Therefore, we are afraid that your requirements cannot be met using Aspose.PDF alone. You can, however, convert the PDF pages to images and then perform OCR on them using Aspose.OCR. If you have further concerns, please feel free to share them.
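
To illustrate why one label can be extracted and the other cannot, here is a minimal Python sketch. It is not Aspose code and it does not read the attached files. The content stream below is a hypothetical, heavily simplified example, and the helper names extract_text and count_path_segments are made up for illustration. It only shows the general idea that text shown with a text operator (Tj inside BT ... ET) carries a string, while glyphs drawn as paths (m, l, c operators) carry only coordinates. Real PDFs also involve font encodings, TJ arrays, and compressed streams, which this sketch ignores.

```python
import re
from typing import List

# Hypothetical, simplified page content stream (NOT taken from 6_partofpart.pdf).
# The first line shows a string with a text operator; the next two lines merely
# stroke outlines, the way letters drawn as shapes would appear.
CONTENT_STREAM = """
BT /F1 12 Tf 100 700 Td (BP2-08) Tj ET
100 650 m 100 662 l 106 662 l 108 660 l 108 658 l 106 656 l 100 656 l S
112 650 m 112 662 l 118 662 l 120 659 l 118 656 l 112 656 l S
"""


def extract_text(stream: str) -> List[str]:
    """Return the literal strings shown with Tj inside BT ... ET blocks."""
    texts: List[str] = []
    for block in re.findall(r"BT(.*?)ET", stream, flags=re.S):
        texts.extend(re.findall(r"\((.*?)\)\s*Tj", block))
    return texts


def count_path_segments(stream: str) -> int:
    """Count moveto/lineto/curveto operators that appear outside text blocks."""
    outside_text = re.sub(r"BT.*?ET", "", stream, flags=re.S)
    return len(re.findall(r"(?<=\s)[mlc](?=\s)", outside_text))


if __name__ == "__main__":
    # Only the string from the text block is recoverable as text.
    print("Extractable text:", extract_text(CONTENT_STREAM))
    # The outlines exist, but they are geometry with no character codes attached.
    print("Path segments with no text:", count_path_segments(CONTENT_STREAM))
```

In this sketch the drawn outlines contribute nothing to the extracted text, which matches the behaviour described above: recovering characters from such shapes requires rendering them and running OCR, or some other form of shape recognition.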

choiys · January 20, 2023, 4:25am

Thank you for the detailed reply.
In fact, I also have the “Aspose.Total” solution, so I tried running OCR after converting the shape objects into an image, but the OCR did not work well because many lines and other shapes cross the characters. That is why I need a function that extracts characters directly from shape objects.

It looks like issue PDFNET-51913 will need to be resolved to achieve that, so I’ll wait.

Thank you.

andrey.nekrasov · January 24, 2023, 3:43pm

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-51913

You can obtain Paid Support services if you need support on a priority basis, along with direct access to our Paid Support management team.

andrey.nekrasov · January 24, 2023, 3:48pm

We will keep you posted here on further updates and let you know when this issue is resolved.

Once PDFNET-51913 is fixed, we will be able to provide a code snippet showing how to extract text from shapes using third-party OCR. We will inform you when PDFNET-51913 and PDFNET-53482 are fixed.

aspose.notifier · August 17, 2023, 11:47pm

The issues you have found earlier (filed as PDFNET-51913) have been fixed in Aspose.PDF for .NET 23.8.