Text Segment contains incorrect text

I am extracting the text from a PDF by looping through the segments in the text fragment collection using C#. Most of the text is extracted correctly, but one piece of text that reads NAV0324NP in the PDF comes out of the text segment as NAV0324NF.

Does anyone know why this could be happening? A P is turning into an F!
Thanks.

Hi Meyrick,


Thanks for using our products.

Can you please share the source PDF file causing this problem so that we can test the scenario at our end? We are really sorry for this inconvenience.

Hi, thank you for getting back to me. I have attached a URL for the file as it is quite large.

The text in question is on page 7 (57): NAV0324NP in the PDF comes out as NAV0324NF.

The code I was using was similar to this:

// Go through the PDF pages and save data.
// curPage is the page currently being processed (the enclosing page loop is not shown).
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
// Accept the absorber for the current page
curPage.Accept(textFragmentAbsorber);

// Get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

// Loop through the fragments and the segments inside each fragment
foreach (TextFragment textFragment in textFragmentCollection)
{
    foreach (TextSegment textSegment in textFragment.Segments)
    {
        // Skip segments that are empty or whitespace only
        if (!String.IsNullOrEmpty(textSegment.Text.Trim()))
        {
            // This is where NAV0324NP shows up as NAV0324NF
            var value = textSegment.Text.Trim();
        }
    }
}

Regards,
Meyrick

Hi Meyrick,


Thanks for providing the additional information. I’m afraid we’re unable to access the shared link. Can you please re-confirm the document link or share only the problematic page here? We will then test the scenario and provide you with more information.

Sorry for the inconvenience.

Best Regards,

Hi,

Please try this link:

http://digidev.blob.core.windows.net/mezmeric-concise-p/51-67%20paper.pdf

I have managed to open it in Firefox and IE though it did not open in Chrome.

Regards,
Meyrick

Hi Meyrick,


Thanks for sharing the details. I was able to open the document in IE; earlier I had been trying with Chrome.

While testing the scenario with Aspose.Pdf for .NET 8.2, I was unable to reproduce the reported issue. Please find the enclosed screenshot. Please download and try the latest version of the Aspose.Pdf for .NET API; hopefully that will resolve your issue.

Please feel free to contact us for any further assistance.

Best Regards,

Hi,

Before I posted the issue, I upgraded to 8.2 to check whether that would help, but it didn’t.
Did you use the text absorber to access the text that you extracted?
Regards,
Meyrick

Hi Meyrick,


Thanks for your feedback. Yes, I used TextAbsorber for the extraction from the specific page. After further investigation, I have managed to replicate the issue by extracting text from the complete document, and I have logged it as PDFNEWNET-35626 in our issue tracking system for resolution. We will update you via this forum thread as soon as it is resolved.
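
For anyone following along, the difference between the two extraction paths mentioned above (a single page versus the complete document) can be sketched roughly as follows. This is only a minimal sketch, not the exact test code used: the local file name is an assumption, and it assumes that the page called "page 7" in this thread is index 7 in the Pages collection.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class ExtractionComparison
{
    static void Main()
    {
        // Assumption: a local copy of the shared PDF; the file name is illustrative.
        Document pdfDocument = new Document("51-67 paper.pdf");

        // Path 1: extract text from a single page (the Pages collection is 1-based).
        TextAbsorber pageAbsorber = new TextAbsorber();
        pdfDocument.Pages[7].Accept(pageAbsorber);

        // Path 2: extract text from the complete document.
        TextAbsorber documentAbsorber = new TextAbsorber();
        pdfDocument.Pages.Accept(documentAbsorber);

        // Check whether the expected string survives each extraction path.
        const string expected = "NAV0324NP";
        Console.WriteLine("Single page contains expected text: "
            + pageAbsorber.Text.Contains(expected));
        Console.WriteLine("Whole document contains expected text: "
            + documentAbsorber.Text.Contains(expected));
    }
}
```

If the two lines printed differ, the problem depends on how much of the document is processed, which matches the behaviour described in this thread.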

Sorry for the inconvenience.

Best Regards,

Thank you for having the foresight to run through the problem again, much appreciated.

Hi Meyrick,


Thanks for your patience.

I am pleased to share that the issue reported earlier has been resolved. The fix will be included in the next release, Aspose.Pdf for .NET 8.4.0, which is planned for September 2013. Please be patient and wait for the new release.

The issues you have reported earlier (filed as PDFNEWNET-35626) have been fixed in Aspose.Pdf for .NET 8.4.0.

