I’m extracting text from PDFs together with their coordinates, and ideally I would like to get the coordinates of each character individually. I’ve tried using a TextFragmentAbsorber with a regular expression that matches a single character, and this almost works. However, there appears to be a bug: when a word contains a repeated character, the coordinates returned for the second occurrence are those of the first occurrence.
Here’s a code fragment and a sample PDF (attached) which illustrate the problem. The X value for a repeated character will ‘jump backwards’.
TextFragmentAbsorber tfa = new TextFragmentAbsorber(".") { TextSearchOptions = new TextSearchOptions(true) };
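For readers following along, here is a minimal sketch of how the fragment above might be used to list per-character coordinates and flag the backward jump. It is not the original poster's code: the file name sample.pdf is a placeholder, the namespace of TextSearchOptions is an assumption that may differ between Aspose.Pdf releases, and the sketch assumes fragments are returned in reading order.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;
// Assumption: some older releases keep TextSearchOptions in
// Aspose.Pdf.Text.TextOptions; add that using directive if the type is not found.

class CharCoordinates
{
    static void Main()
    {
        // "sample.pdf" is a placeholder path, not the name of the attached file.
        Document doc = new Document("sample.pdf");

        // The regular expression "." yields one fragment per matched character.
        TextFragmentAbsorber tfa = new TextFragmentAbsorber(".")
        {
            TextSearchOptions = new TextSearchOptions(true)
        };
        doc.Pages.Accept(tfa);

        double previousX = double.MinValue;
        double previousY = double.NaN; // NaN never compares equal, so the first character is not flagged

        foreach (TextFragment tf in tfa.TextFragments)
        {
            double x = tf.Position.XIndent;
            double y = tf.Position.YIndent;

            // Heuristic: on the same baseline, X should not decrease in left-to-right text.
            bool jumpedBack = (y == previousY) && (x < previousX);

            Console.WriteLine("{0}\t{1:F2}\t{2:F2}{3}",
                tf.Text, x, y, jumpedBack ? "\t<-- X jumped backwards" : "");

            previousX = x;
            previousY = y;
        }
    }
}
```

The flag is only a heuristic: right-to-left text or multi-column layouts could trigger it on a correct result, so it is meant for spotting the repeated-character symptom in a simple test document.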
Hi George,
Thank you for sharing the details.
I was able to reproduce the issue you mentioned in an initial test. It has been logged in our issue tracking system as PDFNEWNET-34371 so that it can be fixed. We will notify you via this forum thread of any updates on the issue.
Hi George,
I am afraid we don’t have any update on your reported issue at the moment. However, I have asked the development team to share an ETA for its resolution. Once I get feedback from them, I will update you via this forum thread. Please allow us some time, and we will get back to you with details.
Hi George,
I have received feedback from the development team. As per the current plan, the fix for your issue will be part of Aspose.Pdf for .NET v7.7, to be released in January 2012. However, please keep in mind that this is an ETA (not a promise); if the development team changes the plan (due to other priority issues), we will update you via this forum thread.