Capturing and extracting content from a particular zone in PDF using Aspose.PDF C#

santoshp1989 · August 4, 2020, 4:49pm

Hi Aspose Team,

We are facing a problem while indexing (capturing text from) a particular zone of a PDF file.

Below is the code where you can find the zone values:-

RedactionAnnotation annot = new RedactionAnnotation(pdfDocument.Pages[i], new Aspose.Pdf.Rectangle(92.16, 509.76, 391.68, 601.92));

Problem:-

-> When we try to extract the text of this zone, it is not captured completely. If there are three lines in the zone, only two are returned; the first line is missing.

-> The first line sits on the zone boundary, and Aspose.PDF is unable to read it. We want all content inside the coordinates, as well as content lying on the boundary itself.

-> We capture the text for the zone and save it to a CSV file. The CSV file contains only two lines, although the PDF file actually has three lines in the given zone.

-> Please find the attached zip file, which contains input.pdf and output.pdf.
input.zip (723.8 KB)

-> In output.pdf the zone is highlighted in yellow.

Code:-

    public void getText()
    {
        Aspose.Pdf.License licencepd = new Aspose.Pdf.License();
        licencepd.SetLicense(Convert.ToString(ConfigurationManager.AppSettings["AsposeLic"]));

        string directory = @"D:\";
        string filename = "input.pdf";

        using (Document pdfDocument = new Document(directory + filename))
        {
            int count = pdfDocument.Pages.Count;

            for (int i = 1; i <= count; i++)
            {
                //Create TextAbsorber object to extract text
                Aspose.Pdf.Text.TextAbsorber absorber = new Aspose.Pdf.Text.TextAbsorber();
                absorber.TextSearchOptions.LimitToPageBounds = false;

                absorber.TextSearchOptions.Rectangle = new Aspose.Pdf.Rectangle(92.16, 509.76, 391.68, 601.92);

                // Accept the absorber for the current page
                pdfDocument.Pages[i].Accept(absorber);
                // pdfDocument.Pages[1].Accept(absorber);

                //  Get the extracted text
                string extractedText = absorber.Text;

                // Create RedactionAnnotation instance for specific page region
                RedactionAnnotation annot = new RedactionAnnotation(pdfDocument.Pages[i], new Aspose.Pdf.Rectangle(92.16, 509.76, 391.68, 601.92));
                annot.FillColor = Aspose.Pdf.Color.LightYellow;
                annot.BorderColor = Aspose.Pdf.Color.Green;
                annot.Color = Aspose.Pdf.Color.Blue;
                Border border = new Border(annot);
                border.Width = 5;
                border.Dash = new Dash(1, 1);

                annot.Border = border;
                annot.Rect = new Aspose.Pdf.Rectangle(92.16, 509.76, 391.68, 601.92);

                // Text to be printed on redact annotation
                annot.OverlayText = "REDACTED";
                annot.TextAlignment = Aspose.Pdf.HorizontalAlignment.Center;
                // Do not repeat the overlay text across the redaction annotation
                annot.Repeat = false;
                // Add the annotation to the current page's annotations collection
                pdfDocument.Pages[i].Annotations.Add(annot);
                // Flatten the annotation and redact the page contents
                // (i.e. remove text and images under the redaction annotation)
                annot.Redact();
            }

            // directory already ends with a backslash, so append only the file name
            pdfDocument.Save(directory + "output.pdf");

        }
    }

Note:-

-> We are using Aspose.Total 18.5 for .NET.

Please help us on the above issue. We are in a critical condition.

Thanks & Regards,
Santosh Kumar Panigrahi

asad.ali · August 4, 2020, 10:19pm

@santoshp1989

We were able to reproduce the problem: the API does not extract the text completely. We have logged it as PDFNET-48618 in our issue tracking system. We will look further into the reasons behind this issue and keep you posted on the status of its rectification. Please be patient and spare us some time.

We are sorry for the inconvenience.

PS: We tested the scenario with Aspose.PDF for .NET 20.7

santoshp1989 · August 6, 2020, 3:29pm

Hi Aspose Team,

Any update on the above issue?

Could you please help us ASAP. We are in a critical situation.

Thanks & Regards,
Santosh Kumar Panigrahi

asad.ali · August 6, 2020, 9:09pm

@santoshp1989

The issue was logged in our issue management system only recently and is pending investigation. It will be analysed and resolved on a first-come, first-served basis. We will inform you as soon as we make definite progress towards its resolution. Please have patience and spare us some time.

We are sorry for the inconvenience.

santoshp1989 · August 12, 2020, 3:53pm

Hi Aspose Team,

Any update on the above issue?

Could you please help us ASAP. We are in a critical situation.

Thanks & Regards,
Santosh Kumar Panigrahi

asad.ali · August 12, 2020, 7:53pm

@santoshp1989

Regretfully, there is no update yet. We have recorded your concerns and will consider them during the investigation. We will inform you as soon as we have definite news about the ticket's resolution. We greatly appreciate your patience in this matter.

PS: You may also check our priority support option in case the issue is a showstopper for you.

asad.ali · September 15, 2022, 9:18pm

@santoshp1989

The earlier logged ticket has been investigated, and it is not a bug. The search rectangle covers only part of the “Testing from” text rectangle. See: Initial_rectangle.png

RedactionAnnotation removes text that is only partially overlapped, because it is designed to hide text content reliably.

TextAbsorber, however, requires a more accurate rectangle.

The search rectangle should be expanded so that it includes the whole text line (the top edge moves from 601.92 to 607.68):
new Aspose.Pdf.Rectangle(92.16, 509.76, 391.68, 607.68)
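
To illustrate the difference between "partially overlapped" and "wholly included", here is a minimal sketch in plain C# that does not use Aspose.PDF at all. The Box type and the bounds of the top text line are hypothetical and chosen only for illustration; the real line bounds are in the attached screenshot, and the sketch assumes that the absorber effectively needs the entire line inside the search rectangle. The two zone rectangles are the ones from this thread; their top edges differ by 5.76 points.

```csharp
using System;

// Hypothetical stand-in for a PDF rectangle (lower-left / upper-right corners, in points).
// This is NOT Aspose.Pdf.Rectangle; it only models the geometry discussed above.
public struct Box
{
    public readonly double Llx, Lly, Urx, Ury;

    public Box(double llx, double lly, double urx, double ury)
    {
        Llx = llx; Lly = lly; Urx = urx; Ury = ury;
    }

    // True when the two boxes share any area (partial overlap is enough).
    public bool Intersects(Box other)
    {
        return Llx < other.Urx && other.Llx < Urx
            && Lly < other.Ury && other.Lly < Ury;
    }

    // True only when 'other' lies entirely inside this box.
    public bool Contains(Box other)
    {
        return Llx <= other.Llx && Lly <= other.Lly
            && Urx >= other.Urx && Ury >= other.Ury;
    }
}

public static class ZoneDemo
{
    public static void Main()
    {
        // Zone rectangles taken from the thread.
        Box originalZone = new Box(92.16, 509.76, 391.68, 601.92);
        Box expandedZone = new Box(92.16, 509.76, 391.68, 607.68);

        // Assumed bounds of the first text line, straddling the original top edge.
        Box topLine = new Box(100.0, 596.0, 300.0, 606.0);

        Console.WriteLine(originalZone.Intersects(topLine)); // True: partial overlap
        Console.WriteLine(originalZone.Contains(topLine));   // False: line pokes above 601.92
        Console.WriteLine(expandedZone.Contains(topLine));   // True: 607.68 covers the line
    }
}
```

With these assumed bounds, an intersection test (redaction-like behaviour) already hits the line under the original zone, while a containment test succeeds only after the top edge is raised.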

Please consider the following code:

string dataDir = @"D:\";
string filename = "input.pdf";

using (Document pdfDocument = new Document(dataDir + filename))
{
    int count = pdfDocument.Pages.Count;
    Aspose.Pdf.Rectangle rect = new Aspose.Pdf.Rectangle(92.16, 509.76, 391.68, 607.68);

    for (int i = 1; i <= count; i++)
    {
        //Create TextAbsorber object to extract text
        Aspose.Pdf.Text.TextAbsorber absorber = new Aspose.Pdf.Text.TextAbsorber();
        absorber.TextSearchOptions.LimitToPageBounds = false;

        absorber.TextSearchOptions.Rectangle = rect;

        // Accept the absorber for the page
        pdfDocument.Pages[i].Accept(absorber);                    

        //  Get the extracted text
        string extractedText = absorber.Text;

        //  Save the extracted text
        string outTextPath = String.Format("{0}48618_p{1}_out.txt", dataDir, i.ToString());
        File.WriteAllText(outTextPath, extractedText);

        //  Draw the rectangle on the page, for diagnostic purposes (DrawRectangleOnPage is a helper not shown in this post)
        DrawRectangleOnPage(rect, pdfDocument.Pages[i]);
    }

    //  Save modified (with drawn rectangle) document
    filename = dataDir + "48618_output_corrected.pdf";
    pdfDocument.Save(filename);
}

It extracts the expected text: 48618_p1_out.txt

files.zip (807.4 KB)