Possible bugs in TextFragmentAbsorber

Hello,


I’m using Aspose.Pdf 6.1.0.0 to extract text fragments from documents with TextFragmentAbsorber. In the attached document I found a couple of odd TextFragments:
1. Fragments at the start of the document have TextState.FontSize = 0. I don't think this is a valid value.

2. The table on the page has several columns, but Aspose parses the first row of the table as a single TextFragment with just one TextSegment. This segment contains the text “KódMísto ur” (the first column together with part of the second one), and the Rectangle of the TextFragment has both its Height and Width properties set to zero. I suspect the text is being parsed incorrectly.

3. The fragments with the text “určení liší.” and “V návodu se pro označení” have exactly the same position even though they are on different lines.

Thank you for your response,
Prokop

jan@valenta.cz:
1. Fragments at the start of the document have TextState.FontSize = 0. I don't think this is a valid value.


Hello Jan,

Thanks for using our products.

I have tested the scenario and was able to reproduce the same problem. I have logged it as PDFNEWNET-30639 in our issue tracking system so that it can be corrected. We will look further into the details of this issue and keep you updated on the status of the correction. Please be patient and spare us a little time. We apologize for the inconvenience.


jan@valenta.cz:
2. The table on the page has several columns, but Aspose parses the first row of the table as a single TextFragment with just one TextSegment. This segment contains the text "KódMísto ur" (the first column together with part of the second one), and the Rectangle of the TextFragment has both its Height and Width properties set to zero. I suspect the text is being parsed incorrectly.


Can you please share a code snippet that reproduces this issue?

jan@valenta.cz:
3. The fragments with the text "určení liší." and "V návodu se pro označení" have exactly the same position even though they are on different lines.


I have managed to reproduce this problem and have logged it as PDFNEWNET-30640 in our issue tracking system. As soon as the issue is resolved, we would be more than happy to update you with the status of correction.

Hello Nayyer,


Thanks for your reply. Here is my code snippet to help you replicate the second problem as well. The problem is that both columns in the first table row are merged into one segment.

using System.IO;
using System.Text;
using Aspose.Pdf.Text;

Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(path);

// collect the output for all pages in a single log
StringBuilder builder = new StringBuilder();

for (int p = 1; p <= pdfDocument.Pages.Count; p++)
{
    // create a TextFragmentAbsorber without a search phrase, so it picks up every fragment
    TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
    // run the absorber on the current page
    pdfDocument.Pages[p].Accept(textFragmentAbsorber);

    // get the extracted text fragments
    TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;

    foreach (TextFragment fragment in textFragmentCollection)
    {
        builder.AppendLine("T: " + fragment.Text + "; R: " + fragment.Rectangle.ToString());

        foreach (TextSegment segment in fragment.Segments)
        {
            builder.AppendLine("----T:" + segment.Text + "; P: " + segment.Position.ToString());
        }
    }
}

// write the log once, after all pages have been processed,
// so that later pages do not overwrite earlier ones
File.WriteAllText(@"C:\logAspose.txt", builder.ToString(), Encoding.UTF8);


The issues you have found earlier (filed as PDFNEWNET-30639) have been fixed in this update.


This message was posted using Notification2Forum from Downloads module by aspose.notifier.

The issues you have found earlier (filed as PDFNEWNET-30640) have been fixed in Aspose.Pdf for .NET 9.8.0.

