Superscript and apostrophe

Please check the attached PDF file. I’m trying to save a PDF as text, and some characters are not being converted correctly.

The superscript in 10^2 in the PDF is converted to a dash (-), and the apostrophe is converted to a dash as well. Please advise.

Hi Akram,


Thanks for your inquiry. I have tested the scenario with the latest version of Aspose.Pdf for .NET, 10.4.0, and was unable to reproduce the reported superscript issue. Please download and try the latest version of Aspose.Pdf for .NET; it should resolve the issue.

Please feel free to contact us for any further assistance.

Best Regards,

Please check the attached files. I am saving the PDF to text with Aspose.Pdf using the following snippet.

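    ' Walk every page and append its extracted text to the result
    For Each page As Aspose.Pdf.Page In pdfDocument.Pages
        ' Create a new absorber for each page
        textAbsorber = New Aspose.Pdf.Text.TextAbsorber
        page.Accept(textAbsorber)
        extractedText = extractedText & textAbsorber.Text & vbCrLf
    Next
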
The output file I’m getting has some issues with the superscripts: the first line with superscripts is fine, but in the second one the superscript number comes out as a regular number.



PS: I generated the PDF file from a Word document using Aspose.Words.

Please advise

Hi Akram,


Thanks for your inquiry. I have reproduced the superscript extraction issue with your shared PDF document and logged it as ticket PDFNEWNET-38797 for further investigation and rectification. We will notify you as soon as it is resolved.

We are sorry for the inconvenience caused.

Best Regards,

Hi Tilal,
Any luck with this issue? We have a client waiting on this fix, and it’s been a good while now.

One thing I noticed: the superscript is a problem when it is formatted as a superscript in the original Word file. The case where it works is when the superscript 2 and 3 characters themselves are used (U+00B2 and U+00B3).


Hi Akram,


As we have only recently been able to reproduce this issue, we will not be able to share a timeline for a fix until we have investigated it and identified the actual cause of the problem.


However, I have also shared your feedback with the development team, and they will consider this information while resolving the issue.

Hi Nayyer,
Is there any way to expedite this request?

Hi Akram,


Thanks for your inquiry. I am afraid your issue is still pending investigation in the queue. The product team is currently busy resolving other issues that were reported earlier. However, we have asked the product team to complete the investigation and share an ETA at their earliest convenience. We will notify you as soon as we have made significant progress towards resolving the issue.

Regarding escalation, we schedule issue investigation and resolution on a first come, first served basis. We feel this is the fairest and most appropriate way to satisfy the needs of the majority of our customers. However, if you have subscribed to our priority support service, you can escalate the issue through it, as paid support has a separate queue.


We are sorry for the inconvenience caused.


Best Regards,

Any luck on the ETA?
Thanks

Hi Akram,


Thanks for your inquiry. Our product team has scheduled the investigation of your issue. I am coordinating with the team, and as soon as they complete the investigation I will let you know the ETA.

Thanks for your patience and cooperation.

Best Regards,

Any progress on this?

Hi Akram,


We are sorry for the inconvenience. I am afraid the product team is still busy resolving other issues in the queue that were reported earlier. However, we have again asked the team to complete the investigation and share an ETA. We will update you as soon as we receive feedback.

Thanks for your patience and cooperation.

Best Regards,

Hi Akram,


Thanks for your patience.

We have looked further into the earlier reported issue PDFNEWNET-38797, and as per our observations, it does not seem to be a problem with text extraction. Please note that the PDF line ‘LTs 191 x10³/uL’ contains the superscript as the U+00B3 character, and we found no problems while extracting it. The next line, ‘WBC 1 x 103/uL, RBC 1 x 103/uL, Hct 1%.t’, contains no superscript characters. It contains ordinary ‘3’ (U+0033) characters. The superscript is emulated by drawing the ‘3’ in a smaller font size and shifting it upwards. It is extracted as ‘3’, which is the normal behavior of text extraction. Text extraction has no special mode for handling font size and small vertical shifts.
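
The difference between the two lines can be checked on the extracted text itself. The following is only a minimal sketch in plain VB.NET, without any Aspose API; the module name SuperscriptDemo and the helper DescribeCodePoints are hypothetical names introduced for illustration, and the two sample strings are shortened from the lines quoted above. It lists the code point of every character, so a true superscript character (U+00B3) can be told apart from an ordinary digit (U+0033) that was only drawn smaller and higher in the PDF.

```vbnet
Imports System
Imports System.Text

Module SuperscriptDemo
    ' Hypothetical helper (not part of Aspose.Pdf): returns every character
    ' of the input followed by its Unicode code point, e.g. "3=U+0033".
    Function DescribeCodePoints(text As String) As String
        Dim sb As New StringBuilder()
        For Each ch As Char In text
            sb.AppendFormat("{0}=U+{1:X4} ", ch, AscW(ch))
        Next
        Return sb.ToString().TrimEnd()
    End Function

    Sub Main()
        ' First line: the superscript is the real character U+00B3.
        Console.WriteLine(DescribeCodePoints("x10" & ChrW(&HB3)))
        ' Second line: the "superscript" is a plain digit 3 (U+0033);
        ' its raised position exists only as formatting in the PDF.
        Console.WriteLine(DescribeCodePoints("x 103"))
    End Sub
End Module
```

Running the same helper over the text returned by the extraction would show which of the two cases a given document falls into.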

Please take into account that text extraction has limited handling of text formatting. Note that Adobe Acrobat extracts identical text from the document. If you still require special handling of superscripts when extracting text from this kind of document, we may consider logging this requirement as an enhancement in our issue tracking system. Please confirm, so that we can proceed accordingly.