Problem
We are trying to retrieve text and its bounding rectangles from Aspose, but the results are not granular enough for us to interpret and process. Text comes back in an unexpected word order, and unrelated strings are combined into overly large rectangles. We checked the source PDF and compared against other libraries: the problem does not exist in the original field definitions, only in the Aspose output.
We use TextFragmentAbsorber to get text plus location rectangles. Within a returned region, text “to the right” is prepended to text “to the left”, apparently because the right text sits 1 unit higher (as seen when inspecting the raw PDF field definitions in a different tool). This prevents us from using Aspose to extract text from the relevant PDF documents and interpret it by word position, the way we do with OCR or other PDF libraries, even though we’d prefer to use Aspose for business reasons.
Desired Outcome
We need the text and the location rectangle around it, but we don’t need to combine text across whitespace like this.
The biggest problem is that text on the right side of the page comes before text on the left. (Details below: the right text is 1 point unit higher, so it “is before”.)
The second biggest problem is that unrelated text separated by a lot of whitespace is improperly combined.
Would it be possible to get a bug-fix or different API to get the text ordered left->right or not combined in these cases? Or perhaps we’re using existing APIs incorrectly?
The raw PDF field data would be just fine, no need to be fancy. We can combine fragments ourselves, but it’s hard to split text that has already been combined.
The PDFs we are reading contain text fragments that come from the PDF printer as logical units, and we lose that structure once Aspose processes them in the TextFragment API.
Issue Details
Aspose Results:
X | Y | Width | Height | Text
94 | 50 | 351 | 10 | Organisasjonsnr:TOLLREGION OSLO OG AKERSHUS
377 | 193 | 145 | 10 | 981641205Kundenummer:
111 | 355 | 439 | 10 | 49 832Sum deklarasjoner
Screenshot of the PDF data (pdf file attached)
Comments
These text items shouldn’t be pushed into a single string.
If they are, they should be in left -> right order, even though the right item is 1 point unit higher than the left item.
But perhaps the TextFragment API is the wrong API? We need text + location rectangles, but are happy with the unprocessed raw PDF data… We don’t need to go through regex filtering or layout processing.
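To make the left -> right ordering we expect concrete, here is a minimal C# sketch. It is not Aspose code: `TextBox` and `ReadingOrder` are hypothetical names made up for illustration, the coordinates in the demo are invented rather than read from the attached PDF, and it assumes PDF-style coordinates where a larger Y is higher on the page. The idea is to treat boxes whose Y values differ by no more than a small tolerance as one line, and only then sort by X.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for one raw text run and its rectangle (not an Aspose type).
public record TextBox(string Text, double X, double Y, double Width, double Height);

public static class ReadingOrder
{
    // Orders boxes top-to-bottom, then left-to-right. Boxes whose Y differs
    // from the first box of the current line by at most yTolerance are kept
    // on that line. Assumes a larger Y means higher on the page.
    public static List<TextBox> Sort(IEnumerable<TextBox> boxes, double yTolerance = 2.0)
    {
        var lines = new List<List<TextBox>>();
        foreach (var box in boxes.OrderByDescending(b => b.Y))
        {
            var current = lines.LastOrDefault();
            if (current != null && Math.Abs(current[0].Y - box.Y) <= yTolerance)
                current.Add(box);                      // same visual line
            else
                lines.Add(new List<TextBox> { box });  // start a new line
        }
        // Within each line, order strictly by X so left text comes first.
        return lines.SelectMany(line => line.OrderBy(b => b.X)).ToList();
    }
}

public static class Demo
{
    public static void Main()
    {
        // Illustrative coordinates only: the right item is 1 unit higher.
        var boxes = new[]
        {
            new TextBox("49 832", 500, 356, 50, 10),
            new TextBox("Sum deklarasjoner", 111, 355, 90, 10),
        };
        foreach (var b in ReadingOrder.Sort(boxes))
            Console.WriteLine($"{b.X} {b.Y} {b.Text}");
    }
}
```

With these two sample boxes, a plain sort by Y would list “49 832” first; with the tolerance, “Sum deklarasjoner” is listed first and the two items stay separate strings with their own rectangles. The tolerance of 2.0 is an arbitrary choice for this sketch.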
Current Code (C#)