Hi,
I am trying to extract the text from a PDF file, but I am getting some strange results.
- Superscripts end up in the wrong location in the result text.
expected: which allows imaging and quantitative analysis of individual cells in tissues in situ,19
result: which allows imagingand quantitative analysis of individual,19cells in tissues in situ,
- The sentence order is wrong when the highlighted text spans two columns.
expected: new structures and assays. Such lab-on-a-chip devices further allow the analysis with improved performance, throughput,
result: further allow the analysis with improvedperformance, throughputnew structuresand assays. Such lab-on-a-chip devices
- Space characters are missing in the result text.
expected: Regardless of the chemistry being performed, the high-throughput capability and increased sensitivity of these devices, as well as the ability to construct cell-friendly microenvironments,
result: Regardless ofthe chemistry being performed, thehigh-throughputcapabilityandincreased sensitivity of these devices, aswell as the ability to construct cell-friendly microenvironments
Here is the code:
// Process only text markup annotations.
Aspose.Pdf.Annotations.TextMarkupAnnotation textMarkup = annotation as Aspose.Pdf.Annotations.TextMarkupAnnotation;
if (textMarkup != null)
{
    Aspose.Pdf.Text.TextFragmentCollection collection = textMarkup.GetMarkedTextFragments();
    // Concatenate the fragments in the order the collection returns them.
    string gettext = "";
    foreach (Aspose.Pdf.Text.TextFragment tf in collection)
    {
        gettext += tf.Text;
    }
    if (annotation.Contents == null)
    {
        outtext += "" + gettext + "";
    }
    else if (annotation.Contents != null && gettext != "")
    {
        // Prefix the marked text with the annotation's comment.
        outtext += "[" + annotation.Contents + "]<br />" + gettext + "";
    }
}
I have attached an example file. Thank you for your support.
test_001.pdf (99.9 KB)