Problem getting positions for individual characters

I’m extracting text from PDFs together with its coordinates, and ideally I’d like to get the coordinates of each character individually. I’ve tried using a TextFragmentAbsorber with a regular expression that matches a single character. This almost works, but there appears to be a bug: when a word contains a repeated character, the coordinates returned for the second occurrence are those of the first occurrence.


Here’s a code fragment and a sample PDF (attached) which illustrate the problem. The X value for a repeated character will ‘jump backwards’.

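// Match every single character by treating "." as a regular expression.
TextFragmentAbsorber tfa = new TextFragmentAbsorber(".") { TextSearchOptions = new TextSearchOptions(true) };
pdf.Pages[page].Accept(tfa);
// Print each character with the X coordinate of its lower-left corner.
foreach (TextFragment tf in tfa.TextFragments)
{
    Console.WriteLine("{0}: {1}", tf.Text, tf.Rectangle.LLX);
}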

Thanks for any help or workaround you can provide.

George

Hi George,

Thank you for sharing the details.

I was able to reproduce the issue you mentioned in an initial test. So that it can be fixed, it has been logged in our issue tracking system with issue ID PDFNEWNET-34371. We will notify you via this forum thread of any updates on the issue.

Sorry for the inconvenience,

Hi,

Is there any update on this issue?

Thanks!

George

Hi George,

I am afraid we don’t have any update on your reported issue at the moment. However, I have asked the development team to share an ETA for its resolution. Once I get feedback from them, I will update you via this forum thread. Please allow us some time and we will get back to you with details.

Sorry for the inconvenience,

Hi George,

I have received feedback from the development team. As per the current plan, the fix for your issue will be part of Aspose.Pdf for .NET v7.7, to be released in January 2012. However, please keep in mind that this is an ETA, not a promise; if the development team changes the plan (due to other priority issues), we will update you via this forum thread.

Sorry for the inconvenience,

Many thanks for the feedback.

George

The issue you reported earlier (filed as PDFNEWNET-34371) has been fixed in Aspose.Pdf for .NET 7.7.0.

