Strange results when extracting highlighted text from a PDF

cjhsu99 · July 26, 2019, 12:48am

Hi,

I am trying to extract the highlighted text from a PDF file, but I am getting some strange results.

1. Superscripts end up in the wrong place in the extracted text.
Expected: which allows imaging and quantitative analysis of individual cells in tissues in situ,19
Result: which allows imagingand quantitative analysis of individual,19cells in tissues in situ,

2. The sentence order is wrong when the highlighted text spans two columns.
Expected: new structures and assays. Such lab-on-a-chip devices further allow the analysis with improved performance, throughput,
Result: further allow the analysis with improvedperformance, throughputnew structuresand assays. Such lab-on-a-chip devices

3. Space characters are missing from the extracted text.
Expected: Regardless of the chemistry being performed, the high-throughput capability and increased sensitivity of these devices, as well as the ability to construct cell-friendly microenvironments,
Result: Regardless ofthe chemistry being performed, thehigh-throughputcapabilityandincreased sensitivity of these devices, aswell as the ability to construct cell-friendly microenvironments

Here is the code:

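// "annotation" and "outtext" are declared elsewhere in the calling code.
Aspose.Pdf.Annotations.TextMarkupAnnotation textMarkup = annotation as Aspose.Pdf.Annotations.TextMarkupAnnotation;
if (textMarkup != null)
{
    // Collect the text fragments covered by the highlight.
    Aspose.Pdf.Text.TextFragmentCollection collection = textMarkup.GetMarkedTextFragments();
    string gettext = "";
    foreach (Aspose.Pdf.Text.TextFragment tf in collection)
    {
        // Fragments are concatenated in collection order, with no separator.
        gettext += tf.Text;
    }
    if (annotation.Contents == null)
    {
        outtext += "" + gettext + "";
    }
    else if (annotation.Contents != null && gettext != "")
    {
        // Prefix the highlighted text with the annotation's comment.
        outtext += "[" + annotation.Contents + "]<br />" + gettext + "";
    }
}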
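
A possible stopgap for the missing-space symptom only, assuming the spaces are lost at fragment boundaries (this thread does not confirm that): instead of concatenating `tf.Text` values directly, join them with a single space wherever neither side already has whitespace. `FragmentText` and `JoinWithSpaces` are hypothetical names used for illustration. They are not part of Aspose.PDF, and the helper works on plain strings only.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration; not part of Aspose.PDF.
static class FragmentText
{
    // Joins text pieces in the order given, inserting one space between
    // two pieces when neither side of the boundary is already whitespace.
    public static string JoinWithSpaces(IEnumerable<string> pieces)
    {
        var sb = new StringBuilder();
        foreach (string piece in pieces)
        {
            // Skip empty pieces so they do not produce doubled spaces.
            if (string.IsNullOrEmpty(piece))
            {
                continue;
            }

            bool needsSpace = sb.Length > 0
                && !char.IsWhiteSpace(sb[sb.Length - 1])
                && !char.IsWhiteSpace(piece[0]);
            if (needsSpace)
            {
                sb.Append(' ');
            }
            sb.Append(piece);
        }
        return sb.ToString();
    }
}
```

To try it, the `tf.Text` values from the loop above could be added to a `List<string>` and passed to `FragmentText.JoinWithSpaces`. This does not change the order of the fragments, so it would not help with the superscript or two-column problems, and it would add an unwanted space if a fragment boundary falls in the middle of a word.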

I have attached an example file. Thank you for your kind support.

test_001.pdf (99.9 KB)

Farhan.Raza · July 26, 2019, 12:05pm

@cjhsu99

Thank you for contacting support.

We have worked with the data shared by you and have been able to reproduce the issue in our environment. A ticket with ID PDFNET-46744 has been logged in our issue management system for further investigation and resolution. The ticket ID has been linked with this thread so that you will receive notification as soon as the ticket is resolved.

We are sorry for the inconvenience.

cjhsu99 · August 6, 2019, 12:53am

Hi Farhan,
Just to let you know: if I copy and paste (Ctrl-C, Ctrl-V) the highlighted text from the PDF into Word, all the strange things I mentioned above disappear. Everything looks normal.

Farhan.Raza · August 6, 2019, 9:40am

@cjhsu99

Thank you for the update.

We have recorded your comments and will get back to you as soon as any significant update is available in this regard.