TextFragmentAbsorber questions

Hi Aspose.

I am planning to use Aspose.Pdf to detect whether an image hides any text. I have a few questions about text absorbers.

Question 1: Z-index of text elements

For my algorithm, I need to know Z-index. I use TextFragmentAbsorber to extract text from a PDF.

The TextFragment.ZIndex property always equals 0, and there is no property that holds the corresponding text-showing operator (for example, the Tj or TJ operator).

For images, I can get Z-index from the ImagePlacement.Operator.Index property.

  • Why is TextFragment.ZIndex always equal to 0?
  • Can I get the text-showing operator and its index (as is possible for image placements)?

Question 2: Bounding box of the text elements

I also need to know the position and size of the text elements. The TextFragment class has the Rectangle property, but it seems that Aspose.Pdf does not take the text transformation matrix into account, because the returned rectangle describes the text before it was rotated.

  • Is this a bug?
  • If not, does Aspose provide a way to calculate the bounding box for text elements that were transformed (rotated, etc.)?

Thanks,
Oleksii

@oleksii.diachok,

Sadly there is no way to know the index of the text.

You can refer to the documentation for more information about what you can do.

For question 2, can you provide a document and the code you are using?

Sadly there is no way to know the index of the text.

Please register an enhancement request in the system.


For question 2, can you provide a document and the code you are using?

Attaching a PDF with rotated text to illustrate that Aspose.PDF does not take rotation into account.
Bounding box of the text.pdf (9.2 KB)

var textFragmentAbsorber = new TextFragmentAbsorber();
page.Accept(textFragmentAbsorber);
foreach (var fragment in textFragmentAbsorber.TextFragments)
{
    foreach (var segment in fragment.Segments)
    {
        // fragment.Rectangle: does not reflect the rotation
        // segment.Rectangle: does not reflect the rotation
    }
}

@oleksii.diachok,

The Free support forum for Aspose is not for requesting enhancements or new functionality. You can do that in the Paid Support forums.

If I see that existing functionality is broken, I can create a ticket for the Dev team so they can review and fix it. I am currently checking the second question.

@oleksii.diachok,

You are right, the rectangle does not represent the current state of the text.

To get a proper visual representation, I drew a redaction annotation over each text segment, so the shape of the reported rectangle is clearly visible.

Here is the code I used:

private void Logic()
{
    var doc = new Document($"{PartialPath}_input.pdf");
    
    var page = doc.Pages[1];            

    MarginInfo marginInfo = new MarginInfo();
    marginInfo.Left = 0;
    marginInfo.Right = 0;
    marginInfo.Top = 0;
    marginInfo.Bottom = 0;

    page.PageInfo.Margin = marginInfo;

    var textFragmentAbsorber = new TextFragmentAbsorber();
    page.Accept(textFragmentAbsorber);

    foreach (var fragment in textFragmentAbsorber.TextFragments)
    {
        foreach (var segment in fragment.Segments)
        {
            // Aspose.Pdf.Rectangle takes (llx, lly, urx, ury), so pass the upper-right corner rather than width and height.
            var annotSegment = new RedactionAnnotation(page, new Aspose.Pdf.Rectangle(segment.Rectangle.LLX, segment.Rectangle.LLY, segment.Rectangle.URX, segment.Rectangle.URY));
            annotSegment.FillColor = Color.Red;
            annotSegment.Color = Color.Red;
            page.Annotations.Add(annotSegment);
            annotSegment.Redact();
        }
    }

    doc.Save($"{PartialPath}_output.pdf");
}

And here are the input and output:
RectangleAroundText_input.pdf (9.2 KB)
RectangleAroundText_output.pdf (8.0 KB)

I will be creating a ticket for the dev team.
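
Until the ticket is resolved, one possible workaround is to apply the transformation yourself. The following is only a minimal sketch, not Aspose functionality: TransformedBounds, Apply and BoundingBox are hypothetical names, and it assumes you can obtain the six matrix values [a b c d e f] that position the text by some other means (for example, by reading the page content stream) and that the rectangle you pass in is expressed in the untransformed space. It maps the four corners through the matrix and returns the axis-aligned box that encloses them.

```csharp
using System;

// Hypothetical helper for illustration; it is not part of Aspose.PDF.
public static class TransformedBounds
{
    // Maps a point through a PDF-style matrix [a b c d e f]:
    // x' = a*x + c*y + e, y' = b*x + d*y + f.
    public static (double X, double Y) Apply(double[] m, double x, double y) =>
        (m[0] * x + m[2] * y + m[4], m[1] * x + m[3] * y + m[5]);

    // Returns the axis-aligned box (llx, lly, urx, ury) that encloses
    // the rectangle after all four corners have been transformed.
    public static (double Llx, double Lly, double Urx, double Ury) BoundingBox(
        double[] m, double llx, double lly, double urx, double ury)
    {
        var corners = new[]
        {
            Apply(m, llx, lly), Apply(m, urx, lly),
            Apply(m, urx, ury), Apply(m, llx, ury)
        };

        double minX = double.PositiveInfinity, minY = double.PositiveInfinity;
        double maxX = double.NegativeInfinity, maxY = double.NegativeInfinity;
        foreach (var (x, y) in corners)
        {
            minX = Math.Min(minX, x);
            minY = Math.Min(minY, y);
            maxX = Math.Max(maxX, x);
            maxY = Math.Max(maxY, y);
        }
        return (minX, minY, maxX, maxY);
    }
}
```

For example, with the 90-degree counterclockwise rotation matrix { 0, 1, -1, 0, 0, 0 }, the box (0, 0, 100, 12) becomes (-12, 0, 0, 100). The result is an axis-aligned envelope, so for rotations that are not multiples of 90 degrees it will be larger than the rotated text itself.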

@oleksii.diachok
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-54100

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

As I understand it, PDFNET-54100 covers the wrong values in the text Rectangle. Correct?

Please also create a ticket for the ZIndex bug. As I mentioned earlier, it is always 0 for text fragments/segments.

@oleksii.diachok,

I was able to reproduce the ZIndex issue with the following code:

private void Logic()
{            
    Document doc = new Document();

    TextFragment textFragment = new TextFragment("Hello World!");
    textFragment.TextState.Font = FontRepository.FindFont("Arial");
    textFragment.TextState.FontSize = 12;
    textFragment.TextState.ForegroundColor = Color.Black;
    textFragment.Position = new Position(100, 100);
    textFragment.ZIndex = 1;

    var page = doc.Pages.Add();
    page.Paragraphs.Add(textFragment);

    doc.Save($"{PartialPath}_output.pdf");

    var tfa = new TextFragmentAbsorber();
    page.Accept(tfa);

    int count = 0;
    foreach (var fragment in tfa.TextFragments)
    {
        count++;
        Console.WriteLine($"Frag {count}: {fragment.Text} - ZIndex:{fragment.ZIndex}");
    }

    Console.ReadKey();            
}

I will be creating a bug for the dev team.

@oleksii.diachok
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-54265

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.