Extract position of text in PDF using Aspose.PDF for .NET

bjornar.elgetun · December 8, 2015, 3:11pm

Problem

We are trying to retrieve text and location rectangle boundaries from Aspose, but the results we are getting aren’t detailed enough for us to interpret and process. Strange word ordering and text combinations are being returned to us in overly large rectangles. We’ve checked the source PDF and compared against other libraries, and the problem doesn’t exist in the original field definitions, just when using Aspose.

Using the TextFragment absorber to get text + location rectangles, we are having a problem where text “to the right” is prepended in a region to text “to the left”, apparently because the right text is 1 unit higher (when looking at the raw PDF field definitions in a different tool). This prevents us from using Aspose to extract text from the relevant PDF documents and translate it by word position, the way we do with OCR or other PDF libraries, even though we’d prefer to use Aspose for business reasons.

Desired Outcome

We need the text and the location rectangle around it, but we don’t need to combine text across whitespace like this.
- The biggest problem is text on the right side of the page is coming before text on the left. (details below – right text is 1 point unit higher so “is before”.)
- The second biggest problem is unrelated text separated by lots of whitespace is being improperly combined.
Would it be possible to get a bug-fix or different API to get the text ordered left->right or not combined in these cases? Or perhaps we’re using existing APIs incorrectly?
The raw PDF field data would be just fine, no need to be fancy – we can combine fragments, but it’s hard to split the existing combined text.
- The PDFs we are reading contain logical text fragments as emitted by the PDF printer; we lose that structure once Aspose processes them through the TextFragment API.

Issue Details

Aspose Results:

X	Y	Width	Height	Text
94	50	351	10	Organisasjonsnr:TOLLREGION OSLO OG AKERSHUS
377	193	145	10	981641205Kundenummer:
111	355	439	10	49 832Sum deklarasjoner

Screenshot of the PDF data (pdf file attached)

Comments

These text items shouldn’t be pushed into a single string.
If they are, they should be in left -> right order even though the right is 1 point unit higher than the left item
But perhaps the TextFragment API is the wrong API? We need text + location rectangles, but are happy with the unprocessed raw PDF data… We don’t need to go through regex filtering or layout processing.

Current Code (C#)

codewarior · December 10, 2015, 2:40am

Hi Bjornar,

Thanks for using our APIs.

Could you please share the code snippet you are using, so that we can test the scenario in our environment? We are sorry for the inconvenience.

JoelMcIntyre · December 28, 2015, 9:41am

Here's the code we're using. (We convert PDF page coords to top left before printing them out so the above coordinates are from a top left origin, but the below is still in bottom left PDF units.)

---

Document doc = new Document(Filename);
TextFragmentAbsorber text = new TextFragmentAbsorber(@".+", new TextSearchOptions(true));
doc.Pages.Accept(text);

foreach (TextFragment fragment in text.TextFragments)
{
    // Extract text and create TextRow
    TextRow row = new TextRow();
    row.Text = fragment.Text;
    row.Page = fragment.Page.Number - 1;

    Rectangle rect = fragment.Rectangle;
    row.X = (int)Math.Round(rect.LLX);
    row.Y = (int)Math.Round(rect.LLY);
    row.Width = (int)Math.Round(rect.Width);
    row.Height = (int)Math.Round(rect.Height);
}
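
The conversion from bottom-left PDF units to a top-left origin mentioned above could look roughly like the sketch below. It does not depend on Aspose; `PdfCoords`, `ToTopLeft`, and the `pageHeight` parameter are hypothetical names introduced here for illustration, and the page height is assumed to be in the same units (points) as the rectangle.

```csharp
using System;

public static class PdfCoords
{
    // Converts a rectangle from PDF user space (origin at the bottom-left,
    // Y grows upward) to a top-left origin (Y grows downward).
    // Assumption: pageHeight is expressed in the same units as the rectangle.
    public static (int X, int Y, int Width, int Height) ToTopLeft(
        double llx, double lly, double urx, double ury, double pageHeight)
    {
        int x = (int)Math.Round(llx);
        // Distance from the top of the page down to the rectangle's upper edge.
        int y = (int)Math.Round(pageHeight - ury);
        int width = (int)Math.Round(urx - llx);
        int height = (int)Math.Round(ury - lly);
        return (x, y, width, height);
    }
}
```

With the loop above, the arguments would be `rect.LLX`, `rect.LLY`, `rect.URX`, and `rect.URY`; where the page height comes from depends on the library version, so treat that part as an assumption to verify.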

codewarior · December 29, 2015, 3:21am

Hi Bjornar,

Thanks for sharing the details.

I have tried executing the code, but I am afraid the TextRow object is not defined. Could you please share a sample project that would help us replicate the issue in our environment?

JoelMcIntyre · December 29, 2015, 10:11am

Ah, sorry, TextRow was just our local object to catch the data we were interested in for later processing.

Below is a simpler version without TextRow that dumps the info to the Output window in a format that is easy to read or copy into Excel.

As a reminder, the major problem is that words displayed left to right as two separate fragments in the PDF file (Kundenummer: 981641205) are returned right to left as one Aspose text fragment (981641205Kundenummer:) because the right-hand item sits 1 point higher.

---

Document doc = new Document(Filename);
TextFragmentAbsorber text = new TextFragmentAbsorber(@".+", new TextSearchOptions(true));
doc.Pages.Accept(text);

foreach (TextFragment fragment in text.TextFragments)
{
    // Bounding rectangle of the fragment, in bottom-left PDF units
    Rectangle rect = fragment.Rectangle;

    int X = (int)Math.Round(rect.LLX);
    int Y = (int)Math.Round(rect.URY);
    int Width = (int)Math.Round(rect.Width);
    int Height = (int)Math.Round(rect.Height);
    string Text = fragment.Text;

    // Tab-separated columns so the output can be pasted into Excel
    System.Diagnostics.Debug.WriteLine(string.Format("{0,4} \t{1,4} \t{2,4} \t{3,4} \t{4}", X, Y, Width, Height, Text));
}
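
For comparison, the left-to-right ordering described above can be sketched on plain data, independent of any PDF library. This is only an illustration of the expected behavior, not Aspose's algorithm: `TextBox`, `ReadingOrder.Sort`, and the 2-point `tolerance` default are hypothetical, and coordinates are assumed to be top-left based. Boxes whose Y values differ by no more than the tolerance are treated as one line and then ordered by X.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical container for one piece of extracted text (top-left origin).
public sealed class TextBox
{
    public double X { get; }
    public double Y { get; }
    public string Text { get; }

    public TextBox(double x, double y, string text)
    {
        X = x;
        Y = y;
        Text = text;
    }
}

public static class ReadingOrder
{
    // Groups boxes into lines when their Y differs from the line's first box
    // by at most `tolerance`, then orders each line left to right.
    public static List<TextBox> Sort(IEnumerable<TextBox> boxes, double tolerance = 2.0)
    {
        var lines = new List<List<TextBox>>();
        foreach (var box in boxes.OrderBy(b => b.Y))
        {
            var current = lines.Count > 0 ? lines[lines.Count - 1] : null;
            // Compare with the first box of the line so a line cannot drift downward.
            if (current != null && Math.Abs(box.Y - current[0].Y) <= tolerance)
                current.Add(box);
            else
                lines.Add(new List<TextBox> { box });
        }
        return lines.SelectMany(line => line.OrderBy(b => b.X)).ToList();
    }
}
```

Given two boxes on nearly the same line, for example "981641205" one point above "Kundenummer:" but further to the right, `ReadingOrder.Sort` places "Kundenummer:" first. This only helps when the two items arrive as separate boxes; it cannot split text that has already been merged into one fragment.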

codewarior · December 30, 2015, 9:08am

Hi Bjornar,

Thanks for sharing the details.

I have tested the scenario and can reproduce the problem. I have logged it as PDFNEWNET-40068 in our issue tracking system. We will look further into the details and keep you updated on the status of the fix. Please bear with us and allow us a little time. We are sorry for the inconvenience.

aspose.notifier · April 5, 2019, 9:39pm

The issues you have found earlier (filed as PDFNET-40068) have been fixed in Aspose.PDF for .NET 19.4.