Different characters in extracted text from pdf file

huseyincandan · August 12, 2015, 8:57am

Hi,

When I extract text from the attached PDF file, part of the extracted text contains incorrect characters, mainly in the bottom-right part of the document.

I think it may be a font issue. Is that right? Could you please look into the problem? Is there a way to extract the text correctly? Do you have any suggestions?

Thank you.

tilal.ahmad · August 12, 2015, 11:44pm

Hi Huseyin,

Thanks for your inquiry. I have tested the scenario and noticed the text extraction issue, so I have logged ticket PDFNEWNET-39188 in our issue tracking system for further investigation and resolution. As soon as the investigation is complete, we will be in a position to share the cause of the issue. We will keep you updated on the resolution progress in this forum thread.

We are sorry for the inconvenience caused.

Best Regards,

huseyincandan · March 28, 2016, 9:12am

Hi,

I was wondering if there’s any progress on the issue.

Thanks & regards,

tilal.ahmad · March 28, 2016, 12:08pm

Hi Huseyin,

Thanks for your inquiry. I am afraid the issue you reported is still not resolved, as the product team is busy resolving other issues that were reported earlier and are ahead of it in the queue. However, we have raised the priority of your issue and asked our team to share an ETA at their earliest. We will notify you as soon as we make significant progress towards resolving it.

Thanks for your patience and cooperation…

Best Regards,

asad.ali · October 10, 2017, 10:19am

@huseyincandan

Thanks for your patience.

Our product team has investigated the issue reported earlier. According to their findings, some fonts in the document contain no character map (CMap), which is why full extraction of the text was impossible. Specifically, this concerns the fonts with keys ‘F2’ and ‘F4’, which are Type0 fonts with Identity-H encoding and CIDFontType2 descendant fonts.

Furthermore, the Registry and Ordering values in their CIDSystemInfo dictionaries are not standard, which is why Adobe Acrobat is also unable to extract the document text completely (see: Acro_convert_error.jpg (12.1 KB)). Unfortunately, that part of the text cannot be extracted correctly from the document; this is not a bug in Aspose.Pdf but a flaw in the source document.
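
For anyone who wants to check their own documents for the same condition, the diagnosis above can be sketched as a small check. This is only an illustration and not Aspose.Pdf code: it assumes the font dictionaries have already been read into plain Python dicts by some PDF library, it assumes the missing map is the font's ToUnicode entry, and the function name, the list of known character collections, and the sample font values are all hypothetical.

```python
# Assumed list of character collections for which a Unicode mapping is known.
# Illustrative only; the PDF specification is the authoritative source.
KNOWN_COLLECTIONS = {
    ("Adobe", "GB1"),
    ("Adobe", "CNS1"),
    ("Adobe", "Japan1"),
    ("Adobe", "Korea1"),
}


def extraction_risks(font):
    """Return reasons why text set in this font may not map back to Unicode.

    `font` is a plain dict that mirrors the keys of a PDF font dictionary
    (a hypothetical representation, not any library's object model).
    """
    reasons = []
    # Only composite (Type0) fonts with an Identity encoding are checked here.
    if font.get("Subtype") != "Type0":
        return reasons
    if font.get("Encoding") not in ("Identity-H", "Identity-V"):
        return reasons
    # An explicit ToUnicode CMap is enough to recover the text.
    if "ToUnicode" in font:
        return reasons
    reasons.append("no ToUnicode CMap")
    # Without ToUnicode, a known Registry/Ordering pair is the remaining hint.
    descendants = font.get("DescendantFonts") or [{}]
    info = descendants[0].get("CIDSystemInfo", {})
    if (info.get("Registry"), info.get("Ordering")) not in KNOWN_COLLECTIONS:
        reasons.append("non-standard Registry/Ordering")
    return reasons


# Made-up sample data; the real values in the reported document are not known.
fonts = {
    "F1": {
        "Subtype": "Type0",
        "Encoding": "Identity-H",
        "ToUnicode": "<cmap stream>",
        "DescendantFonts": [{"Subtype": "CIDFontType2"}],
    },
    "F2": {
        "Subtype": "Type0",
        "Encoding": "Identity-H",
        "DescendantFonts": [{
            "Subtype": "CIDFontType2",
            "CIDSystemInfo": {"Registry": "Custom", "Ordering": "Custom"},
        }],
    },
}

for key, font in fonts.items():
    print(key, extraction_risks(font) or "ok")
```

A font that is flagged for both reasons gives an extractor no reliable way to turn its character codes back into Unicode text, which matches the behaviour described above for ‘F2’ and ‘F4’.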

If you need any further assistance, please feel free to let us know.