Free Support Forum - aspose.com

Different characters in extracted text from pdf file

Hi,


When I extract text from the attached PDF file, part of the extracted text contains incorrect characters, mainly in the bottom-right part of the document.

I think it may be a font issue. Am I right? Could you please look into the problem? Is there a way to extract the text correctly? Do you have any suggestions?

Thank you.

Hi Huseyin,


Thanks for your inquiry. I have tested the scenario and reproduced the text extraction issue, so I have logged a ticket, PDFNEWNET-39188, in our issue tracking system for further investigation and resolution. As soon as the investigation is complete, we will be in a position to share the cause of the issue. We will keep you updated on the resolution progress in this forum thread.

We are sorry for the inconvenience caused.

Best Regards,

Hi,


I was wondering if there’s any progress on the issue.

Thanks & regards,

Hi Huseyin,


Thanks for your inquiry. I am afraid the issue you reported above is still not resolved, as the product team is busy resolving other issues in the queue that were reported earlier. However, we have raised the priority of your issue and asked our team to share an ETA as soon as possible. We will notify you as soon as we have made significant progress towards resolving your issue.

Thanks for your patience and cooperation…

Best Regards,

@huseyincandan

Thanks for your patience.

Our product team has investigated the issue reported earlier. According to their findings, some fonts in the document contain no character map (CMap), which is why full extraction of the text was impossible. Specifically, this concerns the fonts with keys ‘F2’ and ‘F4’, which are Type0 fonts with Identity-H encoding and CIDFontType2 descendant fonts.

Furthermore, the registry and ordering values in their CIDSystemInfo dictionaries aren’t standard, which is why Adobe Acrobat is also unable to extract the document text completely (see Acro_convert_error.jpg (12.1 KB)). Unfortunately, that part of the text cannot be extracted correctly from the document; this is not a bug in Aspose.Pdf but a flaw in the source document.

If you need any further assistance, please feel free to let us know.