Expected one TextFragment but got two

jhested · May 16, 2018, 12:26pm

Hi,

I'm using your C# PDF library to read invoices from customers, and I have stumbled on a strange issue.

When I extract text fragments from a location, I get two where I would expect one. The weird thing is that the second text fragment also contains the content of the first.

Let's say the text fragment I want to extract is ‘1061900175’. When I extract the fragment from that location, I get two fragments:
106
1061900175

There is no other text in that area that contains ‘106’.

I extract the text using TextFragmentAbsorber with TextSearchOptions.Rectangle defined.

I have attached a screenshot from the PDF (only the area of interest).

Hope you can help

Udklip.jpg (13.6 KB)

asad.ali · May 16, 2018, 2:57pm

@jhested

Thanks for contacting support.

Would you please share your sample PDF document along with the code snippet you are using to extract the text? We will test the scenario in our environment and address it accordingly.

jhested · May 16, 2018, 3:37pm

35525707.pdf (171.1 KB)
Udklip.jpg (314.5 KB)
Hi,

Thanks for your fast response. Attached is the PDF, plus a screenshot of the text I'm trying to read.

I have also attached the code to reproduce the behavior in the .zip file.

sample.zip (587 Bytes)

asad.ali · May 16, 2018, 7:39pm

@jhested

Thanks for sharing the sample document and code snippet.

We have tested the scenario using your document and code snippet with Aspose.PDF for .NET 18.5. We did not get more than one extracted text fragment for the rectangle passed to TextFragmentAbsorber. Here is the complete code snippet used for testing:

// Assumes "using Aspose.Pdf;" and "using Aspose.Pdf.Text;", and that dataDir is the folder containing the PDF
Document doc = new Document(dataDir + "35525707.pdf");
Aspose.Pdf.Rectangle rectangle = new Aspose.Pdf.Rectangle(65.179, 484.73199999999997, 313.179, 504.63199999999995);
// Create a TextFragmentAbsorber object and limit the search to the rectangle
TextFragmentAbsorber absorber = new TextFragmentAbsorber();
absorber.TextSearchOptions.LimitToPageBounds = true;
absorber.TextSearchOptions.Rectangle = rectangle;

// Accept the absorber for the first page (page numbering starts at 1)
doc.Pages[1].Accept(absorber);
var textFragments = absorber.TextFragments;
// Print the number of text fragments found inside the rectangle
Console.WriteLine(textFragments.Count);

Would you please try your scenario with the latest version of the API? If you still face a similar issue, please share a sample console application that reproduces the error in any environment. We will then test the scenario again and address it accordingly.

jhested · May 16, 2018, 9:42pm

Hi,

Thanks for your response.

After moving some of the code to a console test app, I ended up finding the bug.
In another part of the code, I was accidentally changing fragment.Text, which caused the bug to appear later.

I have now changed my code to Clone() the TextFragments wherever I need to alter them, so the original document is left unchanged.

Thank you for your support.

Regards.
Jimmi
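
For readers who hit the same problem, here is a minimal sketch of the fix described above. It assumes, as the post states, that assigning to fragment.Text alters the loaded document and that TextFragment.Clone() returns a copy that can be edited safely. The Trim() call merely stands in for whatever change the real code makes, and the file is assumed to sit in the working directory.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class CloneBeforeEditExample
{
    static void Main()
    {
        // Assumption: the sample PDF from this thread is in the working directory.
        Document doc = new Document("35525707.pdf");

        TextFragmentAbsorber absorber = new TextFragmentAbsorber();
        doc.Pages[1].Accept(absorber);

        foreach (TextFragment fragment in absorber.TextFragments)
        {
            // Option 1: copy the text into a plain string and edit the string only.
            string normalized = fragment.Text.Trim();

            // Option 2 (from the post): edit a clone instead of the original fragment.
            // Assumption: the clone is detached from the page content.
            TextFragment copy = (TextFragment)fragment.Clone();
            copy.Text = copy.Text.Trim();

            Console.WriteLine(normalized + " | " + copy.Text);
        }
    }
}
```

If only the text value is needed, the plain-string option is the simpler of the two, since it never touches a fragment setter at all.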

asad.ali · May 17, 2018, 5:56pm

@jhested

It is good to know that you managed to find the bug in your existing code and resolve it. Please keep using our API, and if you face any other issue, please feel free to create a new topic in our support forums. We will be happy to assist you accordingly.

jhested · May 18, 2018, 8:49am

Hi again,

I stumbled upon a similar issue, but this time I managed to isolate it in a test console project. I’m not sure if I should post it here or in a new topic.

I’m trying to read what I would expect is one text fragment, but I get two.

I have attached the sample console project, along with a screenshot of the value I’m trying to read.

EDIT:
For some reason I cannot attach the .zip file. The upload goes to 100%, but the file does not get attached.

Google Drive link: https://drive.google.com/open?id=10ODiiheJ-8u7Fec6x32J6BsZvJ5jyrhE

Regards
Jimmi

asad.ali · May 18, 2018, 4:09pm

@jhested

Thanks for sharing a sample project.

Please note that the maximum upload size allowed by the forums is 3MB, which is why you were unable to attach your sample project.

The TextFragmentAbsorber extracts text fragments from the PDF document, and the text is extracted in the form in which it was added to the PDF. It seems that the text (i.e. 4868901-Jh-hellebjerg /JUELSMINDE) was added as two separate text fragments, so it is retrieved the same way.

Nevertheless, we have logged an investigation ticket as PDFNET-44723 in our issue tracking system. We will further investigate this behavior of the API and keep you informed about the status of the ticket resolution. Please be patient and allow us some time.
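
In the meantime, when only the text of a region is needed and not the individual fragments, one possible workaround is to read the region with TextAbsorber, which returns the text of the area as a single string. The sketch below is an illustration under assumptions, not a confirmed fix for PDFNET-44723: the file name and rectangle coordinates are hypothetical placeholders, and whether the two fragments come back on one line depends on how the text is laid out in the PDF.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

class RegionTextExample
{
    static void Main()
    {
        // Hypothetical file name and coordinates; substitute your own values.
        Document doc = new Document("invoice.pdf");
        Aspose.Pdf.Rectangle region = new Aspose.Pdf.Rectangle(65, 484, 313, 505);

        // TextAbsorber collects the text of the region as one string
        // instead of a collection of separate fragments.
        TextAbsorber absorber = new TextAbsorber();
        absorber.TextSearchOptions.LimitToPageBounds = true;
        absorber.TextSearchOptions.Rectangle = region;

        // Page numbering starts at 1.
        doc.Pages[1].Accept(absorber);

        Console.WriteLine(absorber.Text.Trim());
    }
}
```

This way the calling code does not depend on how many fragments the PDF producer used to write the value.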