Please check the attached PDF file. I’m trying to save a PDF as text, and some characters are not being converted correctly:
the superscript in 10^2 in the PDF is converted to a dash (-), and the apostrophe is converted to a dash as well. Please advise.
Hi Akram,
Thanks for your inquiry. I have tested the scenario with the latest version, Aspose.Pdf for .NET 10.4.0, and was unable to reproduce the reported superscript issue. Please download and try the latest version of Aspose.Pdf for .NET; it should resolve the issue.
Please feel free to contact us for any further assistance.
Best Regards,
Please check the attached files. I am saving the PDF to text with Aspose.Pdf using the following snippet.
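Dim extractedText As String = String.Empty
For Each page As Aspose.Pdf.Page In pdfDocument.Pages
    ' Use a fresh absorber for each page and append that page's text
    Dim textAbsorber As New Aspose.Pdf.Text.TextAbsorber()
    page.Accept(textAbsorber)
    extractedText &= textAbsorber.Text & vbCrLf
Next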
The output file I’m getting has some issues with superscripts: the first line with superscripts is fine, but in the second one the superscript number comes out as a regular number.
PS: I generated the PDF file from a Word document using Aspose.Words.
Please advise.
Hi Akram,
Thanks for your inquiry. I have noticed the superscript extraction issue in your shared PDF document and logged a ticket PDFNEWNET-38797 for further investigation and rectification. We will notify you as soon as it is resolved.
We are sorry for the inconvenience caused.
Best Regards,
Hi Tilal,
Any luck with this issue? We have a client waiting on this fix, and it’s been a good while now.
One thing I noticed: the superscript is a problem when it is formatted as a superscript in the original Word file. The case that works is when the superscript-shaped 2 and 3 characters are used (U+00B2 and U+00B3).
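To make the distinction concrete, here is a small Python sketch (not Aspose code; the helper name `describe` is made up for illustration) that prints the code point and Unicode name of each character. It shows that the dedicated superscript characters are different characters from the plain digit, whereas a digit that is merely formatted as superscript is still U+0033 in the text.

```python
import unicodedata

def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# Dedicated superscript characters versus the plain digit
for ch in ("\u00b2", "\u00b3", "3"):
    print(describe(ch))

# Compatibility normalization (NFKC) folds a superscript character
# back to the plain digit; the reverse mapping needs formatting
# information that plain text does not carry.
print(unicodedata.normalize("NFKC", "x10\u00b3"))
```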
Hi Nayyer,
Is there any way to expedite this request?
Any luck on the ETA?
Thanks
Hi Akram,
Thanks for your inquiry. Our product team has scheduled the investigation of your issue. I am coordinating with the team, and as soon as they complete the investigation, I will let you know the ETA.
Thanks for your patience and cooperation.
Best Regards,
Hi Akram,
We are sorry for the inconvenience. I am afraid the product team is still busy resolving other issues that were reported earlier and are ahead in the queue. However, we have again requested our team to complete the investigation and share an ETA. We will update you as soon as we get feedback.
Thanks for your patience and cooperation.
Best Regards,
Hi Akram,
Thanks for your patience.
We have looked further into the earlier reported issue PDFNEWNET-38797 and, as per our observations, it does not appear to be a text extraction issue. Please note that the PDF line ‘LTs 191 x10³/uL’ contains the superscript as the U+00B3 character, and we found no problems while extracting it. The next line, ‘WBC 1 x 103/uL, RBC 1 x 103/uL, Hct 1%.t’, contains no superscript characters; it contains plain ‘3’ (U+0033) characters. The superscript is emulated by a smaller font size and an upward shift of the ‘3’. It is therefore extracted as ‘3’, which is the normal behavior of text extraction. Text extraction has no special mode for handling font size and small vertical shifts.
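As a rough illustration of what such special handling could look like on the caller's side, the following Python sketch marks a run as an emulated superscript when it is a digit run that is noticeably smaller and raised relative to the preceding normal run. It is not Aspose code: the `Run` type, the function name, and the thresholds are all assumptions for illustration, and the font size and baseline of each run would have to come from whatever positional text data your extraction step exposes.

```python
from dataclasses import dataclass

@dataclass
class Run:
    """Hypothetical text run: text plus the geometry needed for the heuristic."""
    text: str
    font_size: float   # in points
    baseline_y: float  # page coordinate; a larger value is higher on the page

# Plain digits mapped to their Unicode superscript counterparts
SUPERSCRIPT_DIGITS = str.maketrans(
    "0123456789",
    "\u2070\u00b9\u00b2\u00b3\u2074\u2075\u2076\u2077\u2078\u2079",
)

def mark_emulated_superscripts(runs, size_ratio=0.8, min_raise=1.0):
    """Join runs, converting small raised digit runs to superscript characters.

    size_ratio and min_raise are assumed thresholds; tune them per document.
    """
    out = []
    prev = None  # last run treated as normal text (reference size and baseline)
    for run in runs:
        is_super = (
            prev is not None
            and run.text.isdigit()
            and run.font_size <= prev.font_size * size_ratio
            and run.baseline_y - prev.baseline_y >= min_raise
        )
        if is_super:
            out.append(run.text.translate(SUPERSCRIPT_DIGITS))
        else:
            out.append(run.text)
            prev = run
    return "".join(out)

runs = [Run("WBC 1 x 10", 12.0, 100.0), Run("3", 7.0, 104.0), Run("/uL", 12.0, 100.0)]
print(mark_emulated_superscripts(runs))
```

A heuristic like this can misfire on footnote markers or on small raised digits that are not exponents, so the output should be checked on representative documents.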
Please take into account that text extraction has limited handling of text formatting. Adobe Acrobat also extracts identical text from the document. If you still require special handling of superscripts while extracting text from this kind of document, we may consider logging this requirement as an enhancement in our issue tracking system. Please confirm, so we can respond accordingly.