Hi,
Hi Amit,
I want to extract all of the text from the PDF, not just a single word.
The line below requires a search word to be supplied.
//create TextAbsorber object to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Figure");
But I wish to extract each and every word from the PDF.
Please help me out.
Hi Amit,
Hi,
That’s a completely wrong answer. Buddy, I want to extract text with coordinates from all the pages of a particular PDF. Please help me out.
Hi Amit,
//create TextFragmentAbsorber object without a search phrase to extract all text
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
//accept the absorber for all the pages
pdfDocument.Pages.Accept(textFragmentAbsorber);
Hi Amit,
Hi,
Amit Singh baghel: See, I have converted a PDF to images, one image per page, and I want to build a feature that searches for keywords on those images. Is there a formula or code sample that maps the text coordinates to pixel values, so that I can show the searched keyword highlighted? Did you get my point? Instead of opening the PDF in some PDF software, I want to show it on the Web as images, one by one. When I search for a keyword, my code (the sample code that I am asking for) should convert those coordinates into pixels for the top, left, height and width properties, and I will create a new div over that word so that the word looks like a search hit.
Hi Amit,
Thanks for sharing the details. Please note that Aspose.Pdf for .NET provides the feature to create as well as manipulate existing PDF files; it does not offer the feature to perform OCR on an image file in order to search for a particular string pattern. Aspose.Pdf for .NET does support searching for individual TextFragment objects, and with the help of the CreatePolygon(…) method of the PdfContentEditor class you can draw a rectangle around each fragment. For a text paragraph, however, you may consider using a regular expression to determine the paragraph break and draw a rectangle around it. Please take a look at the following code snippet.
[C#]
var document = new Document(@"C:\pdftest\36886.pdf");
//create TextAbsorber object to find all the phrases matching the regular expression
TextFragmentAbsorber textAbsorber = new TextFragmentAbsorber(@"[\S]+");
TextSearchOptions textSearchOptions = new TextSearchOptions(true);
textAbsorber.TextSearchOptions = textSearchOptions;
document.Pages.Accept(textAbsorber);
var editor = new PdfContentEditor(document);
foreach (TextFragment textFragment in textAbsorber.TextFragments)
{
    foreach (TextSegment textSegment in textFragment.Segments)
    {
        DrawBox(editor, textFragment.Page.Number, textSegment, System.Drawing.Color.Red);
    }
}
document.Save(@"C:\pdftest\36886-edited.pdf");
private static void DrawBox(PdfContentEditor editor, int page, TextSegment segment, System.Drawing.Color color)
{
    var lineInfo = new LineInfo();
    // corners of the segment rectangle: lower-left, upper-left, upper-right, lower-right
    lineInfo.VerticeCoordinate = new[] {
        (float)segment.Rectangle.LLX, (float)segment.Rectangle.LLY,
        (float)segment.Rectangle.LLX, (float)segment.Rectangle.URY,
        (float)segment.Rectangle.URX, (float)segment.Rectangle.URY,
        (float)segment.Rectangle.URX, (float)segment.Rectangle.LLY
    };
    lineInfo.Visibility = true;
    lineInfo.LineColor = color;
    editor.CreatePolygon(lineInfo, page, new System.Drawing.Rectangle(0, 0, 0, 0), null);
}
Once the polygon is created, you may consider converting the PDF pages to an image format and then rendering them in a web browser. You may also consider having a look at AnnotationApp from our sister company GroupDocs, which provides the feature to display and annotate certain page regions.
This is a completely wrong answer, Sir. I want the relation between text coordinates in the PDF and the left/top position in a Web browser. I have a license for Aspose.PDF, so why should I go with another tool as you suggested? Can’t you just tell me the relation instead of suggesting some other product, please? If you can’t, then please say so.
FYI,
I have also tried converting the PDF text coordinates by dividing each value by 72 to get inches and then multiplying by 150 (the DPI) to get pixel values, but when I apply these values (top, left, width, height) to an absolutely positioned div in HTML, it does not produce the desired result.
Please help me out.
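One likely reason the divide-by-72, multiply-by-DPI conversion described above lands in the wrong place: PDF coordinates are measured in points from the bottom-left corner of the page, while top and left in HTML are measured from the top-left corner, so the Y value has to be flipped against the page height. Below is a minimal sketch in plain Java. It assumes an unrotated page whose coordinate origin is at (0, 0), an image rendered at the stated DPI, and an image displayed at its rendered size. The names PdfToPixel, PixelBox and toPixelBox are hypothetical and are not part of Aspose.PDF; the rectangle values and page height would have to come from the PDF library.

```java
import java.util.Locale;

/** Sketch: map a PDF rectangle (points, origin bottom-left) to image pixels (origin top-left). */
final class PdfToPixel {

    /** Pixel box for an absolutely positioned overlay div (hypothetical helper type). */
    static final class PixelBox {
        final double left, top, width, height;

        PixelBox(double left, double top, double width, double height) {
            this.left = left;
            this.top = top;
            this.width = width;
            this.height = height;
        }

        /** Inline CSS for a div layered over the page image. */
        String toCss() {
            return String.format(Locale.ROOT,
                "position:absolute;left:%.1fpx;top:%.1fpx;width:%.1fpx;height:%.1fpx;",
                left, top, width, height);
        }
    }

    /**
     * llx, lly, urx, ury: text rectangle in PDF points.
     * pageHeightPt: page height in points (assumption: unrotated page, origin at 0,0).
     * dpi: resolution used when the page was rendered to an image.
     */
    static PixelBox toPixelBox(double llx, double lly, double urx, double ury,
                               double pageHeightPt, double dpi) {
        double left = llx * dpi / 72.0;                  // 72 points per inch
        double top = (pageHeightPt - ury) * dpi / 72.0;  // flip Y: measure down from the page top
        double width = (urx - llx) * dpi / 72.0;
        double height = (ury - lly) * dpi / 72.0;
        return new PixelBox(left, top, width, height);
    }

    public static void main(String[] args) {
        // Illustrative values only: a fragment on a 612 x 792 pt page rendered at 150 DPI.
        PixelBox box = toPixelBox(72, 700, 144, 712, 792, 150);
        System.out.println(box.toCss());
    }
}
```

If the browser shows the image at a different size than it was rendered, each of the four values would additionally need to be multiplied by displayed width divided by rendered width.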
Hi Amit,
Any updates, Nayyar Shahbaz? Or do I have to wait till my last?
Hi Amit,
Once again, please accept our sincere apologies for this delay and inconvenience.
Hello
Perhaps I better understand his need because I have the same one: for each page, how to find each piece of text with its coordinates and fonts.
Personally, I need to put this information into a text file, which will later allow us to 1) once: index the text in a database, and 2) many times: generate a new PDF with exactly the same text in the same places.
I think the need is very simple, but I did not find any documentation that explains things as simply.
For example, is there a way to have code similar to this:
for (Page page : (Iterable<Page>) pdfDocument.getPages()) {
    for (AllTextInfo ati : page.getAllTextInfo()) {
        imgX = pdfX2imgX( ati.getX() );
        imgY = pdfY2imgY( ati.getY() );
        font = ati.getFont();
        text = ati.getText();
    }
}
Your examples on GitHub use classes whose names do not convey their function to me. Very obscure.
TextDevice? TextAbsorber? Why these names? Why not make simple things simple to manipulate?
Can you please try using the below code snippet in order to extract font information and coordinates of text inside PDF and let us know in case you face any issues:
Document document = new Document(dataDir + "input.pdf");
for (Page page : document.getPages()) {
    // a TextFragmentAbsorber without a search phrase collects all text on the page
    TextFragmentAbsorber absorber = new TextFragmentAbsorber();
    page.accept(absorber);
    for (TextFragment textFragment : absorber.getTextFragments()) {
        String fontName = textFragment.getTextState().getFont().getFontName();
        Rectangle coordinates = textFragment.getRectangle();
    }
}
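For the text-file use case mentioned earlier in the thread (index once, regenerate many times), the font name and rectangle obtained in a loop like the one above could be written out one fragment per line. The following is only a sketch in plain Java under assumptions: the tab-separated layout, the escaping scheme and the FragmentRecord class are hypothetical choices for illustration and are not part of Aspose.PDF.

```java
import java.util.Locale;

/** Sketch: one text fragment per line, tab-separated, so it can be indexed or replayed later. */
final class FragmentRecord {
    final int page;
    final double llx, lly, urx, ury;
    final String fontName;
    final String text;

    FragmentRecord(int page, double llx, double lly, double urx, double ury,
                   String fontName, String text) {
        this.page = page;
        this.llx = llx;
        this.lly = lly;
        this.urx = urx;
        this.ury = ury;
        this.fontName = fontName;
        this.text = text;
    }

    /** Escape backslash, tab and newline so the text cannot break the line format. */
    private static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\t", "\\t").replace("\n", "\\n");
    }

    /** Reverse of escape(): a backslash always introduces a two-character sequence. */
    private static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(++i);
                out.append(next == 't' ? '\t' : next == 'n' ? '\n' : next);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Serialize as: page, llx, lly, urx, ury, font, text (coordinates to two decimals). */
    String toLine() {
        return String.format(Locale.ROOT, "%d\t%.2f\t%.2f\t%.2f\t%.2f\t%s\t%s",
            page, llx, lly, urx, ury, escape(fontName), escape(text));
    }

    /** Parse a line produced by toLine(). */
    static FragmentRecord fromLine(String line) {
        String[] f = line.split("\t", -1);
        if (f.length != 7) {
            throw new IllegalArgumentException("expected 7 fields: " + line);
        }
        return new FragmentRecord(Integer.parseInt(f[0]),
            Double.parseDouble(f[1]), Double.parseDouble(f[2]),
            Double.parseDouble(f[3]), Double.parseDouble(f[4]),
            unescape(f[5]), unescape(f[6]));
    }

    public static void main(String[] args) {
        // Illustrative values only; real values would come from the PDF library.
        FragmentRecord r = new FragmentRecord(1, 33.24, 638.25, 64.93, 650.25, "Helvetica", "Account");
        String line = r.toLine();
        System.out.println(line);
        System.out.println(fromLine(line).text);
    }
}
```

Keeping the coordinates in PDF points, rather than pixels, means the same file can serve both the database index and any later regeneration of the page.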
Oh, thank you very much. It looks easy in fact; your code looks a lot like what I need.
I’m sorry, I should have persevered and not been put off by names like absorb or accept that don’t really “speak” to me …
Thank you very much !
Thanks for your feedback. Please keep using our API and feel free to create a new topic in case you need further information.
Hello
The code in your answer does not give the desired result (knowing the position of all the text blocks).
Even when splitting the fragments into TextSegments, the returned text is badly split and therefore some positions are not defined.
Please see the attached screenshot: the text blocks are selected to show that they are different areas.
Screenshot.PNG (3.2 KB)
With this code:
TextFragmentAbsorber visitor = new TextFragmentAbsorber();
page.accept( visitor );
for (TextFragment tf : visitor.getTextFragments()) {
    for (TextSegment tseg : tf.getSegments()) {
        System.out.println( String.format( "- '%s'", tseg.getText() ) );
    }
}
Your lib concatenates some areas and returns:
- 'Account'
- 'Prior Period Current Month Actual'   <<< bad
- 'Number Description'   <<< bad
- 'Balance Actuals YearTo Date'   <<< bad
Using a very old Adobe lib (that we want to replace with Aspose), a text enumeration returns the correct split:
- 33.24:638.25:64.93:638.25:'Account'
- 314.51:638.25:360.66:638.25:'Prior Period'
- 408.05:638.25:463.45:638.25:'Current Month'
- 522.23:638.25:546.29:638.25:'Actual'
- 33.24:625.89:63.58:625.89:'Number'
- 89.25:625.89:133.37:625.89:'Description'
- 322.31:625.89:353.03:625.89:'Balance'
- 421.49:625.89:450.06:625.89:'Actuals'
- 510.96:625.89:557.52:625.89:'YearTo Date'
This shows that the PDF document has separate zones corresponding to the selected text of the image, and the coordinates of each zone are correct (in our application, they allow us to reconstruct the PDF as the original).
This seems to reveal a bug in your lib, but anyway can you point us to another way to get a correct enumeration of the text?
Thank you in advance for your help
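To compare the two enumerations above programmatically, the lines from the old library can be parsed into coordinates and text. This is a sketch in plain Java under an assumption: the four numbers are read as start x, start y, end x, end y in PDF points, which is an interpretation of the sample lines and not a documented format. ZoneLine is a hypothetical helper class.

```java
import java.util.Locale;

/** Sketch: parse lines of the form "- x1:y1:x2:y2:'text'" (field meaning is an assumption). */
final class ZoneLine {
    final double x1, y1, x2, y2;
    final String text;

    ZoneLine(double x1, double y1, double x2, double y2, String text) {
        this.x1 = x1;
        this.y1 = y1;
        this.x2 = x2;
        this.y2 = y2;
        this.text = text;
    }

    static ZoneLine parse(String line) {
        String s = line.trim();
        if (s.startsWith("- ")) {
            s = s.substring(2);           // drop the list marker
        }
        String[] f = s.split(":", 5);     // limit 5 keeps any ':' inside the text itself
        if (f.length != 5) {
            throw new IllegalArgumentException("expected 5 fields: " + line);
        }
        String t = f[4];
        if (t.length() >= 2 && t.startsWith("'") && t.endsWith("'")) {
            t = t.substring(1, t.length() - 1);   // strip the surrounding quotes
        }
        return new ZoneLine(Double.parseDouble(f[0]), Double.parseDouble(f[1]),
                            Double.parseDouble(f[2]), Double.parseDouble(f[3]), t);
    }

    public static void main(String[] args) {
        // First header row, copied from the enumeration quoted in this post.
        String[] row = {
            "- 33.24:638.25:64.93:638.25:'Account'",
            "- 314.51:638.25:360.66:638.25:'Prior Period'",
            "- 408.05:638.25:463.45:638.25:'Current Month'",
            "- 522.23:638.25:546.29:638.25:'Actual'"
        };
        ZoneLine previous = null;
        for (String line : row) {
            ZoneLine zone = parse(line);
            if (previous != null) {
                // horizontal distance between the end of one zone and the start of the next
                System.out.printf(Locale.ROOT, "gap before '%s': %.2f pt%n",
                                  zone.text, zone.x1 - previous.x2);
            }
            previous = zone;
        }
    }
}
```

The printed gaps show how far apart the zones sit on the same line, which is the separation that the concatenated segments lose.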
Could you please share the sample PDF as well for our reference? We will test the scenario in our environment and address it accordingly.