Problems with TextAbsorber on PDFs generated by Amyuni PDF Converter

StefanoR · March 16, 2023, 9:25am

We use Aspose.PDF for .NET with TextAbsorber to extract text from specific regions of a PDF page. We have problems with a customer who produces PDFs from an AS/400 using software (Laser 400) based on Amyuni PDF Converter. The failing files fall into two categories. In the first (Gel_Bolla_Ven.pdf), TextAbsorber seems to find no text at all (at some points in the debugger we see only a series of \0 characters extracted). In the second (Gel_Ord_For.pdf), we find the fixed text (these are invoices generated from a template whose fields are filled with specific values) but not the field values.
We use the absorber with a rectangle, and we also tried a rectangle that covers the whole page:

                var absorber = new TextAbsorber
                {
                    TextSearchOptions =
                    {
                        LimitToPageBounds = true,
                        Rectangle = new Aspose.Pdf.Rectangle(left, pdf.PageInfo.Height - top, right, pdf.PageInfo.Height - bottom)
                    }
                };
                // Accept the absorber for page (1-based)
                pdf.Pages[nPage + 1].Accept(absorber);
We also tried setting the options
                        SearchForTextRelatedGraphics,
                        UseFontEngineEncoding

but the results did not change.
Gel_Bolla_Ven.pdf (544.4 KB)
Gel_Ord_For.pdf (517.6 KB)
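
A note on the rectangle in the snippet above: `Aspose.Pdf.Rectangle` takes its arguments as the lower-left corner followed by the upper-right corner (llx, lly, urx, ury), in PDF user space where y grows upward from the bottom of the page. The following is a minimal sketch of the conversion, assuming `left`, `top`, `right` and `bottom` are measured from the top-left corner of the page (the post does not say how they are defined). `RegionMath` and `ToPdfCorners` are hypothetical names used for illustration, not part of Aspose.

```csharp
using System;

public static class RegionMath
{
    // Converts a region measured from the top-left corner of the page
    // into PDF user-space corners (origin at the bottom-left corner).
    // Assumption: pageHeight is in the same units as the coordinates.
    public static (double Llx, double Lly, double Urx, double Ury) ToPdfCorners(
        double left, double top, double right, double bottom, double pageHeight)
    {
        double y1 = pageHeight - top;
        double y2 = pageHeight - bottom;

        // Min/Max keeps the result ordered as lower-left, upper-right
        // even if the caller swaps the edges.
        return (Math.Min(left, right), Math.Min(y1, y2),
                Math.Max(left, right), Math.Max(y1, y2));
    }
}
```

The four values can then be passed in order to `new Aspose.Pdf.Rectangle(llx, lly, urx, ury)`, which rules out an inverted search region as a cause before looking at how the file encodes its text.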

carlos.molina · March 16, 2023, 2:35pm

@StefanoR,

Can you give me a code snippet that I can run, please?

You are missing some lines in order for me to run your code and replicate the issue.

StefanoR · March 16, 2023, 3:04pm

Here it is, thanks

Program.zip (590 Bytes)

carlos.molina · March 16, 2023, 5:14pm

@StefanoR,

I tried both TextAbsorber and TextFragmentAbsorber, but neither worked properly on this PDF document. I will create a ticket for the dev team.

This is the code I used to read it. I drew a rectangle on top of the text just to check whether the coordinates were the correct ones.

private void Logic()
{
    Document doc = new Document($"{PartialPath}_input.pdf");

    var page = doc.Pages[1];


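    // Aspose.Pdf.Rectangle takes corners (llx, lly, urx, ury), so this is
    // the same 210 x 55 box at (310, 630) that is drawn further down.
    var ta = new TextAbsorber();
    ta.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(310, 630, 520, 685);
    ta.TextSearchOptions.LimitToPageBounds = true;
    ta.TextSearchOptions.SearchForTextRelatedGraphics = false;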

    page.Accept(ta);
    Console.WriteLine($"Text: {ta.Text}");

    // Same search region, expressed as corners (llx, lly, urx, ury).
    var tfa = new TextFragmentAbsorber();
    tfa.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(310, 630, 520, 685);
    tfa.TextSearchOptions.LimitToPageBounds = true;
    tfa.TextSearchOptions.SearchForTextRelatedGraphics = false;
    page.Accept(tfa);

    int count = 0;
    foreach (var fragment in tfa.TextFragments)
    {
        count++;
        Console.WriteLine($"Frag {count}: {fragment.Text}");
    }

    var pageInfo = page.PageInfo;
    var marginInfo = page.PageInfo.Margin;
    var graph = new Graph((float)pageInfo.Width, (float)pageInfo.Height);
    graph.Left = marginInfo.Left * -1;
    graph.Top = marginInfo.Top * -1;
    page.Paragraphs.Add(graph);
    var rectangle = new Aspose.Pdf.Drawing.Rectangle(310, 630, 210, 55);
    rectangle.GraphInfo.FillColor = Aspose.Pdf.Color.Red;
    rectangle.GraphInfo.Color = Aspose.Pdf.Color.Black;
    graph.Shapes.Add(rectangle);

    // Save output PDF document
    doc.Save($"{PartialPath}_output.pdf");
}
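
The two rectangle types used above take different arguments: `Aspose.Pdf.Drawing.Rectangle` is built from (left, bottom, width, height), while `Aspose.Pdf.Rectangle` is built from two corners (llx, lly, urx, ury). A small sketch of the conversion between the two follows. `RectConvert` and `FromLeftBottomSize` are hypothetical helper names, not Aspose APIs.

```csharp
using System;

public static class RectConvert
{
    // Turns a (left, bottom, width, height) box, as used by the drawing
    // shape, into the (llx, lly, urx, ury) corners used for text search.
    public static (double Llx, double Lly, double Urx, double Ury) FromLeftBottomSize(
        double left, double bottom, double width, double height)
    {
        return (left, bottom, left + width, bottom + height);
    }
}
```

For the box drawn in the snippet, `FromLeftBottomSize(310, 630, 210, 55)` gives the corners (310, 630, 520, 685), which is the region a search rectangle needs in order to cover the red box.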

carlos.molina · March 16, 2023, 5:18pm

@StefanoR
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-53952

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.