How can I extract all the text with co-ordinates x-y and all the information using Aspose.PDF?

AmitSinghbaghel · February 19, 2016, 1:05pm

Hi,

I want to extract text from a PDF along with its coordinates, i.e. the position information of the text. How can I achieve this?

Something like the hOCR result we can get from images.

And how can I extract information such as font size, font name, etc. from a PDF using Aspose.PDF?

tilal.ahmad · February 21, 2016, 11:30pm

Hi Amit,

Thanks for your inquiry. You can easily extract text information from a PDF document using Aspose.Pdf. Please check the following documentation link for details; it will help you accomplish the task.

Search and get text segment from PDF document.

Please feel free to contact us for any further assistance.

Best Regards,

AmitSinghbaghel · February 22, 2016, 1:20am

I want to extract all the data from the PDF, not just a single word.
The line below requires a search word to be supplied.

//create TextFragmentAbsorber object to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Figure");

But I wish to extract each and every word from the PDF.
Please help me out.

codewarior · February 22, 2016, 9:03pm

Hi Amit,

In order to extract all data/information from a PDF file, please follow the instructions in Extract Text from Pages using Text Device.

AmitSinghbaghel · February 25, 2016, 1:26am

Hi,
That’s a completely wrong answer. Buddy, I want to extract text with co-ordinates from all the pages of a particular PDF. Please help me out.

tilal.ahmad · February 25, 2016, 11:43pm

Hi Amit,

Thanks for your feedback. You can use the same code from the documentation link I shared in my reply above, with a slight change, to extract all text with its co-ordinates. Please check the following code snippet, which uses the TextFragmentAbsorber() constructor without a parameter.

//create TextFragmentAbsorber object without a search phrase so that it finds all text
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();

//accept the absorber for all the pages
pdfDocument.Pages.Accept(textFragmentAbsorber);

Please feel free to contact us for any further assistance.

Best Regards,

codewarior · February 26, 2016, 12:57am

Hi Amit,

Thanks for sharing the details and sorry for the confusion with earlier post.

The link shared earlier provides an option to extract all text from a PDF file, but as per your requirement you need to extract coordinates, text, and font formatting information. For this purpose, you may follow the instructions specified by Tilal, or you may also extract text and its related information using a regular expression. For further details, please visit Search and get Text from all pages using Regular Expression.

In case you face any issue, please share the sample PDF files and in case we are still unable to understand the requirement, please share some further details, so that we can reply accordingly.

We are sorry for this confusion and inconvenience.

AmitSinghbaghel · February 27, 2016, 1:50am

Hi,

I have got those co-ordinates. I have one more issue.

I have converted a PDF to images, i.e. each page to an image.

I want to develop a feature to search for keywords on those images.

Is there any formula or code sample to map these co-ordinates to pixel values so that I can show the searched keyword highlighted?

Did you get my point correctly?

I mean, instead of opening the PDF in some PDF software, I want to show it on the Web as images.

The images are shown one by one. When I search for some keyword, my code (the sample code that I want) will convert those co-ordinates into pixels with top, left, height and width properties, and I will create a new div over that word so that the word looks like it has been found.

Please reply asap.

codewarior · February 29, 2016, 10:54am

Hi Amit,

Thanks for sharing the details.

Please note that Aspose.Pdf for .NET provides features to create and manipulate PDF files; it does not offer OCR on image files to search for a particular string pattern. Aspose.Pdf for .NET does support searching for individual TextFragment objects, and with the CreatePolygon(…) method of the PdfContentEditor class you can draw a rectangle around each fragment. For a text paragraph, however, you may consider using a regular expression to determine the paragraph break and draw a rectangle around it. Please take a look at the following code snippet.

[C#]

var document = new Document(@"C:\pdftest\36886.pdf");

//create TextAbsorber object to find all the phrases matching the regular expression
TextFragmentAbsorber textAbsorber = new TextFragmentAbsorber(@"[\S]+");
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textAbsorber.TextSearchOptions = textSearchOptions;
document.Pages.Accept(textAbsorber);

var editor = new PdfContentEditor(document);
foreach (TextFragment textFragment in textAbsorber.TextFragments)
{
    foreach (TextSegment textSegment in textFragment.Segments)
    {
        DrawBox(editor, textFragment.Page.Number, textSegment, System.Drawing.Color.Red);
    }
}
document.Save(@"C:\pdftest\36886-edited.pdf");

private static void DrawBox(PdfContentEditor editor, int page, TextSegment segment, System.Drawing.Color color)
{
    var lineInfo = new LineInfo();
    lineInfo.VerticeCoordinate = new[] {
        (float)segment.Rectangle.LLX, (float)segment.Rectangle.LLY,
        (float)segment.Rectangle.LLX, (float)segment.Rectangle.URY,
        (float)segment.Rectangle.URX, (float)segment.Rectangle.URY,
        (float)segment.Rectangle.URX, (float)segment.Rectangle.LLY
    };
    lineInfo.Visibility = true;
    lineInfo.LineColor = color;
    editor.CreatePolygon(lineInfo, page, new System.Drawing.Rectangle(0, 0, 0, 0), null);
}

Once polygon is created, you may consider converting PDF pages to Image format and then render them in web browser. You may also consider having a look over AnnotationApp from our sister company named GroupDocs which provides the feature to display and annotate certain page regions.

AmitSinghbaghel · March 30, 2016, 7:05am

This is a completely wrong answer, Sir. I want the relation between text co-ordinates in the PDF and the left/top position in a Web browser. I have a license for Aspose.PDF. Why should I go with another tool as you suggested? Can’t you just tell me the relation instead of suggesting some other product, please? If you can’t, then please say so.

AmitSinghbaghel · March 30, 2016, 7:55am

FYI,

I have also tried to convert the PDF text co-ordinates by dividing each value by 72 to get inches and then multiplying by 150 (the DPI) to get pixel values, but when I apply these values (top, left, width, height) to an absolutely positioned div in HTML, it does not produce the desired result.
Please help me out.
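
A likely reason the values above do not line up (an assumption, since the thread never confirms the cause): PDF coordinates are measured in points (1/72 inch) from the bottom-left corner of the page with y increasing upward, while an absolutely positioned div is measured in pixels from the top-left corner with y increasing downward. Dividing by 72 and multiplying by the DPI handles the scale but not the flipped y-axis. The sketch below shows the conversion in Java. `PdfToPixel`, `Box`, and `toCssBox` are hypothetical names, not part of Aspose.PDF, and the code assumes an unrotated page whose origin is at (0, 0), rendered at exactly the given DPI and displayed in the browser at its native pixel size.

```java
/** Sketch: converts a PDF rectangle (points, origin bottom-left) to a CSS box (pixels, origin top-left). */
final class PdfToPixel {

    /** Hypothetical holder for the left/top/width/height of an absolutely positioned div. */
    static final class Box {
        final double left, top, width, height;

        Box(double left, double top, double width, double height) {
            this.left = left;
            this.top = top;
            this.width = width;
            this.height = height;
        }
    }

    /** One PDF point is 1/72 inch, so pixels = points * dpi / 72. */
    static double pointsToPixels(double points, double dpi) {
        return points * dpi / 72.0;
    }

    /**
     * llx/lly/urx/ury: lower-left and upper-right corners of the text rectangle, in points.
     * pageHeightPt: page height in points. dpi: resolution used to render the page image.
     */
    static Box toCssBox(double llx, double lly, double urx, double ury,
                        double pageHeightPt, double dpi) {
        double left = pointsToPixels(llx, dpi);
        // Flip the y-axis: the CSS top edge corresponds to the PDF upper edge (ury).
        double top = pointsToPixels(pageHeightPt - ury, dpi);
        double width = pointsToPixels(urx - llx, dpi);
        double height = pointsToPixels(ury - lly, dpi);
        return new Box(left, top, width, height);
    }

    public static void main(String[] args) {
        // A 1-inch square, one inch in from the top-left of a 612 x 792 pt page, at 150 DPI.
        Box b = toCssBox(72, 648, 144, 720, 792, 150);
        System.out.println("left=" + b.left + " top=" + b.top
                + " width=" + b.width + " height=" + b.height);
    }
}
```

The rectangle values would come from each text fragment's rectangle (LLX, LLY, URX, URY) and the page height in points. If the browser displays the image at a size other than its rendered pixel size, the four results need to be multiplied by that same display scale factor.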

codewarior · March 31, 2016, 11:46pm

Hi Amit,

Thanks for sharing the details. I am further looking into above stated details and will keep you posted with my findings. We are sorry for this delay and inconvenience.

AmitSinghbaghel · November 28, 2016, 1:47am

Any updates, Nayyar Shahbaz? Or do I have to wait till my last?

codewarior · November 29, 2016, 3:40am

Hi Amit,

Thanks for your patience and sorry for the delayed response.

I tried replicating the issue with some of my sample files, based on the solution shared earlier, but could not reproduce the scenario exactly. From your last description, you need to render the pages as images in a web browser, and end users should be able to search for a TextFragment within the rendered image. To accomplish this, you are searching inside the PDF file with TextFragmentAbsorber, taking the returned coordinates, and converting them from points to pixels so that they can be used to draw a rectangle over the image. Can you please share the sample project in which you are facing this issue? We will look further into the details of this problem and keep you posted with our findings.

Once again, please accept our sincere apologies for this delay and inconvenience.

paulpre · December 13, 2021, 4:03pm

Hello

Perhaps I understand his need better because I have the same one: for each page, how to find each piece of text with its coordinates and fonts.

Personally I need to put this information into a text file which will allow many times in the future to 1) once : index text in a database, 2) many times : generate a new pdf with exactly the same text at the same places.

I think the need is very simple, but I did not find any documentation that explains things this simply.
For example is there a way to have code similar to this one :

for (Page page : (Iterable<Page>) pdfDocument.getPages()) {
    for (AllTextInfo ati : page.getAllTextInfo()) {
        imgX = pdfX2imgX( ati.getX() );
        imgY = pdfY2imgY( ati.getY() );
        font = ati.getFont();
        text = ati.getText();
    }
}

Your examples on GitHub use many classes whose names do not convey their function to me. Very obscure.
TextDevice? TextAbsorber? Why these names? Why not make things simple for manipulating simple things?

asad.ali · December 13, 2021, 7:54pm

@paulpre

Can you please try using the below code snippet in order to extract font information and coordinates of text inside PDF and let us know in case you face any issues:

Document document = new Document(dataDir + "input.pdf");
for(Page page: document.getPages()) {
 TextFragmentAbsorber absorber = new TextFragmentAbsorber();
 page.accept(absorber);

 for(TextFragment textFragment: absorber.getTextFragments()) {
  String fontName = textFragment.getTextState().getFont().getFontName();
  Rectangle coordinates = textFragment.getRectangle();
 }
}

paulpre · December 13, 2021, 8:16pm

Oh thank you very much, it looks like it’s easy in fact, your code looks a lot like what I need.
I’m sorry I should have persevered and not be impressed by names like absorb or accept that don’t really “speak” to me …
Thank you very much !

asad.ali · December 14, 2021, 9:16am

@paulpre

Thanks for your feedback. Please keep using our API and feel free to create a new topic in case you need further information.

paulpre · December 19, 2021, 10:14am

Hello

The code in your answer does not give the desired result (knowing the position of all the text blocks).
Even when splitting the fragments into TextSegments, the returned text is badly split and therefore some positions are not defined.

Please see the attached screenshot: the text blocks are selected to show that they are different areas.
Screenshot.PNG (3.2 KB)

With this code:
TextFragmentAbsorber visitor = new TextFragmentAbsorber();
page.accept( visitor );
for (TextFragment tf : visitor.getTextFragments()) {
    for (TextSegment tseg : tf.getSegments()) {
        System.out.println( String.format( "- '%s'", tseg.getText() ) );
    }
}

Your lib concatenates some areas and returns:
- 'Account'
- 'Prior Period Current Month Actual' <<< bad
- 'Number Description' <<< bad
- 'Balance Actuals YearTo Date' <<< bad

Using a very old Adobe lib (that we want to replace with Aspose), a text enumeration returns the correct split:
- 33.24:638.25:64.93:638.25:'Account'
- 314.51:638.25:360.66:638.25:'Prior Period'
- 408.05:638.25:463.45:638.25:'Current Month'
- 522.23:638.25:546.29:638.25:'Actual'
- 33.24:625.89:63.58:625.89:'Number'
- 89.25:625.89:133.37:625.89:'Description'
- 322.31:625.89:353.03:625.89:'Balance'
- 421.49:625.89:450.06:625.89:'Actuals'
- 510.96:625.89:557.52:625.89:'YearTo Date'
This shows that the PDF document has separate zones corresponding to the selected text of the image, and the coordinates of each zone are correct (in our application, they allow us to reconstruct the PDF as the original).
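
The enumeration above is already a line-oriented format that could be stored in a text file and read back later. Here is a minimal Java sketch of parsing one such line, assuming (the post does not say) that the four numbers are start x, start y, end x, and end y in PDF points. `TextZoneLine` and `parse` are hypothetical names for illustration only and are not part of Aspose.PDF or the Adobe library.

```java
/** Sketch: one line of the form x1:y1:x2:y2:'text', with an optional leading "- ". */
final class TextZoneLine {
    final double x1, y1, x2, y2; // assumed meaning: start and end positions in PDF points
    final String text;

    TextZoneLine(double x1, double y1, double x2, double y2, String text) {
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
        this.text = text;
    }

    static TextZoneLine parse(String line) {
        String s = line.trim();
        if (s.startsWith("- ")) {
            s = s.substring(2); // drop the list marker used in the post
        }
        // Limit the split to 5 parts so that colons inside the text itself are preserved.
        String[] parts = s.split(":", 5);
        if (parts.length != 5) {
            throw new IllegalArgumentException("Expected x1:y1:x2:y2:'text' but got: " + line);
        }
        String text = parts[4];
        if (text.length() >= 2 && text.startsWith("'") && text.endsWith("'")) {
            text = text.substring(1, text.length() - 1); // strip the surrounding quotes
        }
        return new TextZoneLine(
                Double.parseDouble(parts[0]), Double.parseDouble(parts[1]),
                Double.parseDouble(parts[2]), Double.parseDouble(parts[3]),
                text);
    }

    public static void main(String[] args) {
        TextZoneLine z = parse("- 314.51:638.25:360.66:638.25:'Prior Period'");
        System.out.println(z.text + " starts at x=" + z.x1 + ", y=" + z.y1);
    }
}
```

A record like this keeps each zone's text and position together, which is what both the indexing and the PDF regeneration steps described earlier would need. Font information would have to be added as extra fields.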

This seems to reveal a bug in your lib, but anyway can you point us to another way to get a correct enumeration of the text?

Thank you in advance for your help

asad.ali · December 19, 2021, 6:43pm

@paulpre

Could you please share the sample PDF as well for our reference? We will test the scenario in our environment and address it accordingly.