Problem getting positions for individual characters

GeorgeH · October 10, 2012, 4:20am

I’m extracting text from PDFs together with their coordinates, and ideally would like to get the coordinates of each character individually. I’ve tried using a TextFragmentAbsorber with a regular expression that matches a single character, and this almost works, but there appears to be a bug which means that when a word has a repeated character, on the second occurrence the returned coordinates are those of the first occurrence.

Here’s a code fragment and a sample PDF (attached) which illustrate the problem. The X value for a repeated character will ‘jump backwards’.

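// Match every single character; 'true' enables regular-expression search.
TextFragmentAbsorber tfa = new TextFragmentAbsorber(".") { TextSearchOptions = new TextSearchOptions(true) };

// Run the absorber over one page of the document.
pdf.Pages[page].Accept(tfa);

// Print each matched character with the left edge (lower-left X) of its rectangle.
foreach (TextFragment tf in tfa.TextFragments)
{
    Console.WriteLine("{0}: {1}", tf.Text, tf.Rectangle.LLX);
}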
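
To make the 'jump backwards' easier to spot in the output, a small check along these lines could be run over the LLX values collected from one line of left-to-right text. This is only a sketch: BackwardJumpCheck and FindNonAdvancing are hypothetical names that are not part of Aspose.Pdf, and the coordinates in the usage note are made up for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Aspose.Pdf.
public static class BackwardJumpCheck
{
    // Returns the indices whose X value fails to advance past the previous one.
    // Assumes the values come from a single line of left-to-right text, where
    // each character should start further right than the one before it.
    public static List<int> FindNonAdvancing(IReadOnlyList<double> llxValues)
    {
        var suspects = new List<int>();
        for (int i = 1; i < llxValues.Count; i++)
        {
            if (llxValues[i] <= llxValues[i - 1])
            {
                suspects.Add(i);
            }
        }
        return suspects;
    }
}
```

For a word such as "level", made-up values like 10, 16, 22, 16, 10 would flag indices 3 and 4, that is, the second 'e' and the second 'l'. A genuine line break also moves X backwards, so the check only makes sense within one line.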

Thanks for any help or workaround you can provide.

George

nausherwan.aslam · October 10, 2012, 5:24am

Hi George,

Thank you for sharing the details.

I was able to reproduce the issue you mentioned in an initial test. To get it fixed, the issue has been registered in our issue tracking system with issue ID PDFNEWNET-34371. We will notify you via this forum thread of any updates on the issue.

Sorry for the inconvenience,

GeorgeH · November 9, 2012, 10:57am

Hi,

Any update available on this issue?

Thanks!

George

nausherwan.aslam · November 11, 2012, 6:02am

Hi George,

I am afraid we don’t have any update for you on your reported issue at the moment. However, I have asked the development team to share an ETA for the resolution of the issue. Once I get feedback from them, I will update you via this forum thread. Please allow us some time and we will get back to you with details.

Sorry for the inconvenience,

nausherwan.aslam · November 14, 2012, 3:54pm

Hi George,

I have received feedback from the development team. As per the current plan, the fix for your issue will be part of Aspose.Pdf for .NET v7.7, to be released in January 2013. However, please keep in mind that this is an ETA (not a promise); if the development team changes the plan (due to other priority issues), we will update you via this forum thread.

Sorry for the inconvenience,

GeorgeH · November 15, 2012, 3:42am

Many thanks for the feedback.

George

aspose.notifier · February 7, 2013, 9:35am

The issue you found earlier (filed as PDFNEWNET-34371) has been fixed in Aspose.Pdf for .NET 7.7.0.
