Text incorrectly split into multiple fragments

We are using the latest version of Aspose.Pdf for .NET, 6.9.0.0.

Execute the code below on the attached PDF. This will dump the PDF operators and text fragments as reported by Aspose.Pdf to the screen. You will notice quickly that there are a number of cases in which Aspose.Pdf is incorrectly splitting up text that should go together. For instance, there should be one fragment with text of “CASE INFORMATION” as opposed to two fragments with text of “CASE INFORM” and “ATION” respectively.

The cause appears to be that if there are multiple consecutive Tj/TJ operators, Aspose.Pdf assigns each to its own fragment instead of combining them. For example:

(CASE INFORM) Tj
[(A) 74 (TION)] TJ

becomes

fragment at 261,610: 'CASE INFORM'
fragment at 322,610: 'ATION'

This behavior is at variance with Adobe’s PDF reference, 3rd edition, which states on page 312:

The grouping of glyphs into strings has no significance; showing multiple glyphs with one invocation of a text-showing operator such as Tj produces the same results as showing them with a separate invocation for each glyph.

This problem is currently breaking our production application. If you could provide a timely patch for this issue (either by automatically grouping consecutive text display operators into one text fragment, or providing a flag to do this under TextSearchOptions) that would be greatly appreciated.

Thanks,

David Pecora

-----

using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Apply the license
Aspose.Pdf.License license = new Aspose.Pdf.License();
license.SetLicense("Aspose.Pdf.lic");

Document pdfDocument = new Document(@"\path\to\attached\pdf");

// Dump the content stream operators of page 1 (the collection is 1-based)
var coll = pdfDocument.Pages[1].Contents;
for (int i = 1; i <= coll.Count; i++) {
    Console.WriteLine(coll[i].ToString());
}

// Extract the text fragments of page 1
TextFragmentAbsorber absorber = new TextFragmentAbsorber();
absorber.TextSearchOptions = new TextSearchOptions(true);
pdfDocument.Pages[1].Accept(absorber);

// Print the lower-left corner and text of each fragment
foreach (TextFragment fragment in absorber.TextFragments) {
    Console.WriteLine("fragment at {0},{1}: '{2}'",
        Math.Round(fragment.Rectangle.LLX),
        Math.Round(fragment.Rectangle.LLY),
        fragment.Text);
}

Hello David,


Thanks for using our products.

I have tested the scenario and am able to reproduce the problem. I have logged it in our issue tracking system as PDFNEWNET-33568. We will investigate this issue in detail and keep you updated on the status of a fix.

We apologize for the inconvenience.

Hi David,


Thanks for your patience. Our product team has investigated the issue and would like to point out that the citation from the PDF reference applies only to the display of text on a page of a PDF document.

The complete citation is:
"The grouping of glyphs into strings has no significance for the display of text. Showing multiple glyphs with one invocation of a text-showing operator such as Tj produces the same results as showing them with a separate invocation for each glyph."
Adobe’s PDF reference, 6th edition, page 409.
Adobe’s PDF reference, 1st edition, page 251.

It establishes no rules about text fragment extraction. TextFragmentAbsorber with no parameters extracts text segments as fragments (according to the operators). This is a feature of TextFragmentAbsorber, not a bug, and it is not at variance with the PDF reference. If you want a different distribution of text between fragments, you may use regex patterns.

Please use TextFragmentAbsorber to find and highlight whole words as follows; this should help you accomplish the task.


// open document (myDir is assumed to hold the folder path, including the trailing separator)
Document pdfDocument = new Document(myDir + "MJ-23101-NT-0000136-2012.pdf");

// get page
Page page = pdfDocument.Pages[1];

// create TextFragmentAbsorber object to find all words
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber(@"[\S]+");
textFragmentAbsorber.TextSearchOptions.IsRegularExpressionUsed = true;
page.Accept(textFragmentAbsorber);

// loop through the fragments and outline each one with a free text annotation
foreach (TextFragment textFragment in textFragmentAbsorber.TextFragments)
{
    DefaultAppearance defaultAppearance = new DefaultAppearance("Arial", 8, System.Drawing.Color.Red);
    Aspose.Pdf.Rectangle rect = (Aspose.Pdf.Rectangle)textFragment.Rectangle.Clone();
    FreeTextAnnotation freeText = new FreeTextAnnotation(page, rect, defaultAppearance);
    freeText.Border = new Border(freeText);
    freeText.Color = Aspose.Pdf.Color.FromRgb(System.Drawing.Color.Transparent);
    page.Annotations.Add(freeText);
}

pdfDocument.Save(myDir + "33568_highlighted.pdf");
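
If the regex search does not fit the use case, the fragments returned by a parameterless TextFragmentAbsorber could also be joined on the client side. The following is only a rough sketch of that idea and not part of the Aspose.Pdf API: the Piece record and the FragmentMerger class are hypothetical names, the thresholds are guesses, and the values are assumed to be copied from TextFragment.Text and TextFragment.Rectangle (LLX, LLY, URX). It also assumes that fragments on the same line report the same LLY after rounding, which may not hold when fonts or sizes change within a line. It requires C# 9 or later.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for the values read from TextFragment.Text and TextFragment.Rectangle.
public record Piece(string Text, double Llx, double Lly, double Urx);

public static class FragmentMerger
{
    // Joins pieces whose rounded LLY is equal and whose horizontal gap
    // is at most maxGap points (an assumed threshold; tune per document).
    public static List<Piece> Merge(IEnumerable<Piece> pieces, double maxGap = 1.0)
    {
        var merged = new List<Piece>();

        // PDF y coordinates grow upward, so sort top to bottom, then left to right.
        var ordered = pieces
            .OrderByDescending(p => Math.Round(p.Lly))
            .ThenBy(p => p.Llx);

        foreach (var p in ordered)
        {
            if (merged.Count > 0)
            {
                var last = merged[^1];
                bool sameLine = Math.Round(last.Lly) == Math.Round(p.Lly);
                bool adjacent = p.Llx - last.Urx <= maxGap;
                if (sameLine && adjacent)
                {
                    // Extend the previous piece instead of starting a new one.
                    merged[^1] = last with
                    {
                        Text = last.Text + p.Text,
                        Urx = Math.Max(last.Urx, p.Urx)
                    };
                    continue;
                }
            }
            merged.Add(p);
        }
        return merged;
    }
}
```

To use it, build one Piece per TextFragment inside the foreach loop over absorber.TextFragments and pass the resulting list to FragmentMerger.Merge. A small maxGap keeps separate words apart, since a real space between words is normally wider than the kerning adjustment between two text-showing operators.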

Please feel free to contact us for any further assistance.


Best Regards,