Hello,
I extracted text properties using the method PdfExtractor.getFormattedText(), and I’m saving the X, Y, width, height and text in an object to be used later on.
In another class I’m calling contentEditor.getTextInRectangle(1, zone.getRect());
where zone.getRect() is the rectangle I saved earlier from the extracted X, Y, width and height.
Let’s say I saved an object with x=100, y=65, width=87, height=124 and text=trial.
If I call contentEditor.getTextInRectangle(1, new Rectangle(100,65,87,124)), I get a text different from “trial”, even though I’m using the same PDF file and the page number passed to getTextInRectangle is correct.
Is there a difference in the measurement unit?
Thank you
Hi Karine,
I am able to notice the same problem. We will look further into the details of this problem and will keep you updated on the status of a correction. Please be patient and spare us a little time. We are sorry for this inconvenience.
Hi Karine,
Thanks for using our products and sorry for the delayed response.
I have tested the scenario and I am able to reproduce the same problem. For the sake of correction, I have logged it in our issue tracking system as PDFNEWJAVA-33545. We will investigate this issue in detail and will keep you updated on the status of a correction.
We apologize for your inconvenience.
Hello
Any updates on this issue?
Thanks
Hi Karine,
Thanks for your patience.
The development team has been busy resolving other priority issues and I am afraid the above-stated problem is not yet resolved. Nevertheless, I have requested the development team to share any possible ETA. As soon as I have some updates regarding its resolution, I would be more than happy to update you with the status of correction. Please be patient and spare us a little time.
We are sorry for this delay and inconvenience.
Hi Karine,
new java.awt.Rectangle((int)text[1].getX(),665-(int)text[1].getY(),(int)text[1].getTextWidth(),(int)text[1].getTextHeight())));
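The 665 - y term in the fragment above suggests that the Y value has to be measured from the opposite edge of the page. A minimal sketch of that flip follows, assuming 665 is the page height and that only Y changes; ZoneYFlip, flipY and flip are hypothetical names used for illustration and are not part of any Aspose API.

```java
import java.awt.Rectangle;

// Hypothetical helper for illustration; not part of Aspose.Pdf.Kit.
final class ZoneYFlip {

    // Mirrors a Y value across the page: a distance from one horizontal
    // edge becomes the distance from the opposite edge.
    static int flipY(int pageHeight, int y) {
        return pageHeight - y;
    }

    // Keeps X, width and height and flips only Y, which is the same
    // arithmetic as the fragment above.
    static Rectangle flip(int pageHeight, Rectangle stored) {
        return new Rectangle(stored.x, flipY(pageHeight, stored.y), stored.width, stored.height);
    }

    public static void main(String[] args) {
        Rectangle stored = new Rectangle(100, 65, 87, 124); // values from the first post
        Rectangle flipped = flip(665, stored);               // 665 is assumed to be the page height
        System.out.println(flipped);
    }
}
```

Whether the rectangle height must also be subtracted depends on which corner the API treats as the rectangle origin; the thread does not say, so the sketch leaves it out.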
Thank you Nayyer.
We will try it asap.
Hello,
If I have a file file1 composed of 2 pages (page1 and page2)
and I create a different file file2 composed only of page2,
the coordinates extracted from page2 in file1 are different from the ones extracted
from page2 in file2.
My question is: do the coordinates depend on the page itself or on the whole file?
Thank you.
Hi Karine,
Thanks for your reply,
Kindly find attached the initial PDF facture.pdf and the second part of the PDF: smallfacture.pdf.
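First of all, we divided facture.pdf based on the facture zone. Basically, we have two facture texts: FACTURE1 on page 1 and FACTURE2 on page 3. We put the pages in vPages.
// Requires java.util.Arrays and java.util.Vector
Vector<Vector<Integer>> vPages = new Vector<Vector<Integer>>();
vPages.add(new Vector<Integer>(Arrays.asList(1, 2))); // part 1: pages 1-2
vPages.add(new Vector<Integer>(Arrays.asList(3)));    // part 2: page 3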
Then we divided the PDF into two parts using the code below:
for (Vector<Integer> pages : vPages) {
    String fileName = pdfFile.getName();
    // Temporary file that will hold the extracted page range
    File tempFile = File.createTempFile(
        FilenameUtils.removeExtension(fileName) + System.currentTimeMillis(),
        "." + FilenameUtils.getExtension(fileName), getWorkingDir()
    );
    PdfFileEditor pdfEditor = new PdfFileEditor();
    // Extract the range from the first to the last page of this part into tempFile
    extractSuccess = pdfEditor.extract(
        pdfFile.getPath(),
        pages.firstElement(),
        pages.lastElement(),
        tempFile.getPath()
    );
    String tmpPath = tempFile.getPath();
}
So now we have two PDFs: part1 and smallfacture.pdf.
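One thing worth checking after the split: a page that was page 3 in facture.pdf becomes page 1 in the extracted part, so a zone recorded against the whole document may carry a page number that no longer exists in the part. A minimal sketch of the renumbering follows, assuming 1-based page numbers and one contiguous range per part; PageRemap and toPartPage are hypothetical names, not Aspose API.

```java
// Hypothetical helper for illustration; not part of Aspose.Pdf.Kit.
final class PageRemap {

    // Converts a 1-based page number in the original file to the 1-based
    // page number inside a part extracted from firstPage..lastPage.
    // Returns -1 when the page is not contained in that part.
    static int toPartPage(int originalPage, int firstPage, int lastPage) {
        if (originalPage < firstPage || originalPage > lastPage) {
            return -1;
        }
        return originalPage - firstPage + 1;
    }

    public static void main(String[] args) {
        // Part 1 holds pages 1-2 and part 2 holds page 3, as in vPages.
        System.out.println(toPartPage(3, 3, 3)); // page 3 of the whole file -> 1
        System.out.println(toPartPage(2, 1, 2)); // -> 2
        System.out.println(toPartPage(3, 1, 2)); // not in part 1 -> -1
    }
}
```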
This is how we are extracting the text from the PDFs using the coordinates:
int page = zone.getPage();
if (page <= numOfpage) {
    int pageHeight = (int) pdfFileInfo.getPageHeight(page);
    Rectangle oldRect = zone.getRect();
    int y = (int) oldRect.getY();
    int x = (int) oldRect.getX();
    int height = (int) oldRect.getHeight();
    int width = (int) oldRect.getWidth();
    Rectangle rect = new Rectangle(x, pageHeight - y, width, height);
    String fieldValue = contentEditor.getTextInRectangle(page, rect);
}
The problem is that the coordinates of the zone containing the word “Carine Matta” are different between facture.pdf and smallfacture.pdf, so we are getting an empty string rather than “Carine Matta”.
WholePDF:
<zone>
<coordinates><x>72.024</x><y>2020.73</y></coordinates>
<width>76.95985</width><height>11.04</height>
<text>Carine Matta</text>
</zone>
Part2 of the PDF:
<zone>
<coordinates><x>73</x><y>363</y></coordinates>
<width>60</width><height>12</height>
<text>Carine Matta</text>
</zone>
Hi,
Thanks for sharing the details and sorry for the delayed response.
I have tested the scenario using the following code snippet (based on the new Document Object Model approach of com.aspose.pdf package) where I have first tried getting coordinates of string Carine Matta from the third page of facture.pdf file and then I have tried getting coordinates of the same string from smallfacture.pdf file. As per my observations, the same coordinates are being returned. Can you please try using the latest release of Aspose.Pdf for Java 23.5 and in case you still face the same problem, please share some further details which can help us in replicating this issue at our end. We are sorry for your inconvenience.
[Java]:
import com.aspose.pdf.Document;
import com.aspose.pdf.TextFragment;
import com.aspose.pdf.TextFragmentAbsorber;
import com.aspose.pdf.TextFragmentCollection;
import com.aspose.pdf.TextSegment;

public class SearchTextInPdf {
    public static void main(String[] args) {
        // Load the PDF document
        Document pdfDocument = new Document("c:/pdftest/facture.pdf");
        // Create a TextFragmentAbsorber to find all instances of the search phrase
        TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Carine Matta");
        // Accept the absorber for the third page of the document
        pdfDocument.getPages().get_Item(3).accept(textFragmentAbsorber);
        // Get the extracted text fragments into a collection
        TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
        // Loop through the text fragments
        for (TextFragment textFragment : (Iterable<TextFragment>) textFragmentCollection) {
            // Iterate through the text segments of each fragment
            for (TextSegment textSegment : (Iterable<TextSegment>) textFragment.getSegments()) {
                System.out.println("Text: " + textSegment.getText());
                System.out.println("Position: " + textSegment.getPosition());
                System.out.println("XIndent: " + textSegment.getPosition().getXIndent());
                System.out.println("YIndent: " + textSegment.getPosition().getYIndent());
                System.out.println("Font - Name: " + textSegment.getTextState().getFont().getFontName());
            }
        }
    }
}
Hello,
Thank you for your reply.
We tried to use Aspose.Pdf instead of Aspose.Pdf.Kit,
but the problem is that we need to get the text width and height from the TextSegment, and these are not available when using Aspose.Pdf.
The text width and height are necessary for our project.
Is there another way to extract them using Aspose.Pdf?
Thank you.
Hi,
textSegment.getText().length());
For the sake of correction, I have logged it in our issue tracking system as PDFNEWJAVA-33837. However, the same class is present in Aspose.Pdf for .NET and we will further look into this matter and will try porting the code to Aspose.Pdf for Java.
We apologize for your inconvenience.
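Until a direct width and height accessor is available, one possible workaround is to derive both values from a segment's bounding box. A minimal sketch follows, assuming the four corner values come from calls such as textSegment.getRectangle().getLLX(), getLLY(), getURX() and getURY(), which appear later in this thread; BoxSize is a hypothetical name, and the sample numbers are rounded from values posted in a later reply.

```java
// Hypothetical helper for illustration; not part of Aspose.Pdf.
final class BoxSize {

    // Width of a box given its left (llx) and right (urx) X coordinates.
    static double width(double llx, double urx) {
        return urx - llx;
    }

    // Height of a box given its bottom (lly) and top (ury) Y coordinates.
    static double height(double lly, double ury) {
        return ury - lly;
    }

    public static void main(String[] args) {
        // Corner values rounded from a later reply in this thread.
        double llx = 94.344, lly = 476.83, urx = 123.324, ury = 487.87;
        System.out.println("width = " + width(llx, urx));
        System.out.println("height = " + height(lly, ury));
    }
}
```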
Hello,
Any updates concerning this task PDFNEWJAVA-33837?
Thank you
Hi,
Thanks for your patience.
The development team has been busy resolving other priority issues and I am afraid the issue reported earlier is not yet resolved. Nevertheless, I have requested the team to share the ETA regarding its resolution. As soon as we have some definite updates regarding its resolution, we would be more than happy to update you with the status of correction. Please be patient and spare us a little more time.
We are really sorry for this inconvenience.
Hello,
Since you advised using Aspose.Pdf instead of Aspose.Pdf.Kit, below is a sample of my work using Aspose.Pdf.
The code below is used to define zones and their coordinates:
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(filePath);
com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber = new com.aspose.pdf.TextFragmentAbsorber();
TextSegment lastSegment = null;
pdfDocument.getPages().accept(textFragmentAbsorber);
// get the extracted text fragments into collection
com.aspose.pdf.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
// loop through the Text fragments
int index = 0;
for (com.aspose.pdf.TextFragment textFragment : (Iterable<com.aspose.pdf.TextFragment>) textFragmentCollection) {
    TextSegmentCollection textSegmentCollection = textFragment.getSegments();
    for (com.aspose.pdf.TextSegment textSegment : (Iterable<com.aspose.pdf.TextSegment>) textSegmentCollection) {
        zone = new Zone(
            currentSegment.getRectangle().getLLX(),
            currentSegment.getRectangle().getLLY(),
            currentSegment.getRectangle().getHeight(),
            currentSegment.getRectangle().getWidth(),
            currentSegment.getText());
    }
}
And to extract text for these zones I am using the code below:
com.aspose.pdf.TextAbsorber absorber = new com.aspose.pdf.TextAbsorber();
absorber.getExtractionOptions().setFormattingMode(com.aspose.pdf.TextExtractionOptions.TextFormattingMode.Raw);
// Limit text search area to page bounds
absorber.getTextSearchOptions().setLimitToPageBounds(true);
absorber.getTextSearchOptions().setRectangle(new com.aspose.pdf.Rectangle(
    x,
    y,
    width,
    height));
pdfDocument.getPages().get_Item(i).accept(absorber);
String currentText = absorber.getText();
The currentText I am getting is not the same as the text I extracted above with currentSegment.getText().
So the same problem also occurs with Aspose.Pdf. Is there something missing in the way I am using the API?
Can you advise please?
Thanks.
Hi Karine,
Thanks for contacting support.
I have tested the scenario using Aspose.Pdf for Java 9.0.0 where I have used one of your earlier shared PDF files (facture.pdf) and as per my observations, I am getting same TextSegment when trying to retrieve text from particular page region. Please note that I have used the following code snippet to test the scenario and in absorber.getTextSearchOptions().setRectangle(…) method, I have specified the coordinates retrieved while traversing through all TextSegments.
[Java]
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("c:/pdftest/facture.pdf");
com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber = new com.aspose.pdf.TextFragmentAbsorber();
TextSegment lastSegment = null;
pdfDocument.getPages().accept(textFragmentAbsorber);
// get the extracted text fragments into collection
com.aspose.pdf.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
// loop through the Text fragments
double llx = 0;
double lly = 0;
double urx = 0;
double ury = 0;
int index = 0;
for (com.aspose.pdf.TextFragment textFragment : (Iterable<com.aspose.pdf.TextFragment>) textFragmentCollection) {
    TextSegmentCollection textSegmentCollection = textFragment.getSegments();
    for (com.aspose.pdf.TextSegment textSegment : (Iterable<com.aspose.pdf.TextSegment>) textSegmentCollection) {
        llx = textSegment.getRectangle().getLLX();
        lly = textSegment.getRectangle().getLLY();
        urx = textSegment.getRectangle().getURX();
        ury = textSegment.getRectangle().getURY();
        System.out.println("TextSegment = " + textSegment.getText());
        System.out.println("LLX = " + llx + " LLY = " + lly + " URX = " + urx + " URY = " + ury);
    }
}
Get text from particular page region.
[Java]
com.aspose.pdf.TextAbsorber absorber = new com.aspose.pdf.TextAbsorber();
absorber.getExtractionOptions().setFormattingMode(com.aspose.pdf.TextExtractionOptions.TextFormattingMode.Raw);
// Limit text search area to page bounds
absorber.getTextSearchOptions().setLimitToPageBounds(true);
// Restrict the search to the region reported for the segment (LLX, LLY, URX, URY)
absorber.getTextSearchOptions().setRectangle(new com.aspose.pdf.Rectangle(
    94.344, 476.8300000095367, 123.32399989986419, 487.86999997138975));
pdfDocument.getPages().get_Item(1).accept(absorber);
String currentText = absorber.getText();
System.out.println(currentText);
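Note that the rectangle values in the snippet above are two corners (LLX, LLY, URX, URY), while the zones stored earlier in this thread hold X, Y, width and height. If the search rectangle is built from a stored zone, the width and height probably need converting to an upper-right corner first. A minimal sketch follows, assuming the stored X and Y are the lower-left corner; ZoneCorners and toCorners are hypothetical names, not Aspose API.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of Aspose.Pdf.
final class ZoneCorners {

    // Converts (x, y, width, height), with (x, y) assumed to be the
    // lower-left corner, into {llx, lly, urx, ury}.
    static double[] toCorners(double x, double y, double width, double height) {
        return new double[] { x, y, x + width, y + height };
    }

    public static void main(String[] args) {
        double[] c = toCorners(94.344, 476.83, 28.98, 11.04);
        // The four values could then be passed in this order to
        // new com.aspose.pdf.Rectangle(c[0], c[1], c[2], c[3]).
        System.out.println(Arrays.toString(c));
    }
}
```

Passing width and height directly where URX and URY are expected would describe a different region, which is one possible reason for getting unexpected text back.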