PDF Text Extraction

talhamalik97 · January 28, 2021, 10:02am

Hello, I have attached two documents: the original, and a copy with a rectangle drawn around the area where I want to extract the text. The code I am using is:

    public string ExtractText(Page page, Rectangle rect)
    {
        Aspose.Pdf.Text.TextAbsorber textAbsorber = new Aspose.Pdf.Text.TextAbsorber
        {
            TextSearchOptions =
            {
                LimitToPageBounds = true,
                Rectangle = rect
            }
        };
        page.Accept(textAbsorber);
        return textAbsorber.Text.Trim();
    }

You just pass in the coordinates and it extracts the text from that region. It works fine for other documents, but for this particular document it cannot extract the text at that position. Is there a particular reason why this is happening, and how can I resolve it?

Thank You.

talhamalik97 · January 28, 2021, 10:12am

document59357 - Copy (1).pdf (7.3 MB)
document59357.pdf (7.4 MB)

asad.ali · January 28, 2021, 8:01pm

@talhamalik97

You are unable to extract the text because the PDF does not contain any text on its pages, apart from “PLEASE PRINT OR TYPE”, until page 32. You can verify this by searching in Adobe Reader or by using the following code snippet:

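    // Run one absorber over every page of the document.
    Document doc = new Document(dataDir + "document59357 - Copy (1).pdf");
    TextAbsorber textAbsorber = new TextAbsorber();
    doc.Pages.Accept(textAbsorber);
    // txt holds all the text that could be extracted.
    var txt = textAbsorber.Text;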
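
To see which pages actually carry a text layer, the same check can be run page by page. The following is only a minimal sketch, not an official sample: the class name PageTextReport and passing the file path as the first command-line argument are assumptions made for illustration, and it relies on the Aspose.PDF for .NET package being referenced.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Hypothetical helper: prints how much extractable text each page contains.
class PageTextReport
{
    static void Main(string[] args)
    {
        // Assumption: the path to the PDF is passed as the first argument.
        using (Document doc = new Document(args[0]))
        {
            foreach (Page page in doc.Pages)
            {
                // A fresh absorber per page keeps the text of each page separate.
                TextAbsorber absorber = new TextAbsorber();
                page.Accept(absorber);

                // Guard against a null result before trimming.
                string text = (absorber.Text ?? string.Empty).Trim();
                Console.WriteLine("Page {0}: {1} characters of text", page.Number, text.Length);
            }
        }
    }
}
```

Pages that report 0 characters have no text for TextAbsorber to return, so restricting the search to a rectangle on those pages will also come back empty.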