Hello,
I extracted text properties using the method PdfExtractor.getFormattedText(), and I’m saving the X, Y, width, height and text in an object to be used later on.
In another class I’m using the method contentEditor.getTextInRectangle(1, zone.getRect());
where zone.getRect() is the object I saved earlier using the extracted X, Y, width and height.
Let's say I saved an object having x=100, y=65, width=87, height=124 and text=trial.
If I call contentEditor.getTextInRectangle(1, new Rectangle(100,65,87,124)) I get a text different from “trial”, even though I’m using the same PDF file and the page number passed to getTextInRectangle is correct.
Is there a difference in the measurement unit?
Thank you
Hi Karine,
I am able to notice the same problem. We will further look into the details of this problem and will keep you updated on the status of the correction. Please be patient and spare us a little time. We are sorry for this inconvenience.
Hi Karine,
I have tested the scenario and I am able to reproduce the same problem. For the sake of correction, I have logged it in our issue tracking system as PDFNEWJAVA-33545. We will investigate this issue in detail and will keep you updated on the status of a correction.
We apologize for your inconvenience.
Hello
Any updates on this issue?
Thanks
Hi Karine,
Thanks for your patience.
The development team has been busy resolving other priority issues and I am afraid the problem stated above is not yet resolved. Nevertheless, I have requested the development team to share any possible ETA. As soon as I have some updates regarding its resolution, I will be more than happy to update you with the status of the correction. Please be patient and spare us a little time.
We are sorry for this delay and inconvenience.
Hi Karine,
new java.awt.Rectangle((int)text[1].getX(),665-(int)text[1].getY(),(int)text[1].getTextWidth(),(int)text[1].getTextHeight())));
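The call above flips the Y coordinate by subtracting it from 665 before building the rectangle. Below is a minimal sketch of that conversion, assuming 665 stands in for the page height and that the stored Y is measured from the opposite edge of the page than the one getTextInRectangle expects; neither point is confirmed in this thread. YAxisFlip and flip are hypothetical names used for illustration only, not part of any Aspose API.

```java
import java.awt.Rectangle;

/** Hypothetical helper for illustration; not part of any Aspose API. */
final class YAxisFlip {

    private YAxisFlip() { }

    /**
     * Mirrors the arithmetic of the snippet above: Y is subtracted from the
     * page height, while X, width and height are passed through unchanged.
     * pageHeight is an assumption: it must be in the same unit as the stored Y.
     */
    static Rectangle flip(Rectangle stored, int pageHeight) {
        return new Rectangle(stored.x, pageHeight - stored.y, stored.width, stored.height);
    }

    public static void main(String[] args) {
        // Values from the first post; 665 is the constant used in the reply.
        Rectangle stored = new Rectangle(100, 65, 87, 124);
        Rectangle flipped = flip(stored, 665);
        System.out.println("stored  = " + stored);
        System.out.println("flipped = " + flipped);
    }
}
```

Whether the rectangle height should also be subtracted depends on which corner the API treats as the rectangle origin, so it is worth checking the result against a piece of text whose position is already known.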
Thank you Nayyer.
We will try it asap.
Hello,
Say I have a file, file1, composed of 2 pages (page1 and page2),
and I create a different file, file2, composed only of page2.
The coordinates extracted from page2 in file1 are different from the ones extracted
from page2 in file2.
My question is: do the coordinates depend on the page itself or on the whole file?
Thank you.
Hi Karine,
Thanks for your reply,
Kindly find attached the initial pdf facture.pdf and the second part of the pdf : smallfacture.pdf.
First of all we divided facture.pdf based on the facture zone.
Basically we have two facture texts: FACTURE1 on page 1 and FACTURE2 on page 3.
We put the pages in vPages.
Vector<Vector<Integer>> vPages = new Vector<Vector<Integer>>();
// vPages then holds [[1, 2], [3]]
Then we divided the pdf into two parts using the code below:
for (Vector<Integer> pages : vPages) {
    String fileName = pdfFile.getName();
    // temp file named after the source file plus a timestamp, same extension
    File tempFile = File.createTempFile(
            FilenameUtils.removeExtension(fileName) + System.currentTimeMillis(),
            "." + FilenameUtils.getExtension(fileName),
            getWorkingDir());
    PdfFileEditor pdfEditor = new PdfFileEditor();
    // copy pages first..last of the source PDF into the temp file
    extractSuccess = pdfEditor.extract(pdfFile.getPath(),
            pages.firstElement(),
            pages.lastElement(),
            tempFile.getPath());
    String tmpPath = tempFile.getPath();
}
So now we have two PDFs: part1 and smallfacture.pdf.
This is how we are extracting the text from the PDFs using the coordinates:
int page = zone.getPage();
if (page <= numOfpage) {
    int pageHeight = (int) pdfFileInfo.getPageHeight(page);
    Rectangle oldRect = zone.getRect();
    int y = (int) oldRect.getY();
    int x = (int) oldRect.getX();
    int height = (int) oldRect.getHeight();
    int width = (int) oldRect.getWidth();
    Rectangle rect = new Rectangle(x, pageHeight - y, width, height);
    String fieldValue = contentEditor.getTextInRectangle(page, rect);
}
The problem is that the coordinates of the zone containing the words “carine matta” are different between facture.pdf and smallfacture.pdf, so we are getting an empty string rather than "carine matta".
Whole PDF:
72.0242020.7376.9598511.04 Carine Matta
Part2 of the pdf:
733636012 Carine Matta
Hi,
Thanks for sharing the details and sorry for the delayed response.
I have tested the scenario using the following code snippet (based on the new Document Object Model approach of the com.aspose.pdf package). I first retrieved the coordinates of the string Carine Matta from the third page of facture.pdf and then retrieved the coordinates of the same string from smallfacture.pdf, and as per my observations the same coordinates are returned. Can you please try using the latest release, Aspose.Pdf for Java 4.3.1? In case you still face the same problem, please share some further details which can help us in replicating this issue at our end. We are sorry for your inconvenience.
You may consider visit the following links for further information on
[Java]
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("c:/pdftest/facture.pdf");
// create TextFragmentAbsorber object to find all instances of the input search phrase
com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber = new com.aspose.pdf.TextFragmentAbsorber("Carine Matta");
// accept the absorber for the third page of the document
pdfDocument.getPages().get_Item(3).accept(textFragmentAbsorber);
// get the extracted text fragments into a collection
com.aspose.pdf.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
// loop through the text fragments
for (com.aspose.pdf.TextFragment textFragment : (Iterable<com.aspose.pdf.TextFragment>) textFragmentCollection)
{
    // iterate through text segments
    for (com.aspose.pdf.TextSegment textSegment : (Iterable<com.aspose.pdf.TextSegment>) textFragment.getSegments())
    {
        System.out.println("Text :- " + textSegment.getText());
        System.out.println("Position :- " + textSegment.getPosition());
        System.out.println("XIndent :- " + textSegment.getPosition().getXIndent());
        System.out.println("YIndent :- " + textSegment.getPosition().getYIndent());
        System.out.println("Font - Name :- " + textSegment.getTextState().getFont().getFontName());
    }
}
Hello,
Thank you for your reply.
We tried to use Aspose.pdf instead of Aspose.pdf.kit,
but the problem is that we need to get the text width and height from the TextSegment, and these are not available when using Aspose.pdf.
The text width and height are necessary for our project.
Is there another way to extract them using Aspose.pdf?
Thank you.
Hi,
textSegment.getText().length());
For the sake of correction, I have logged it in our issue tracking system as PDFNEWJAVA-33837. However, the same class is present in Aspose.Pdf for .NET, and we will look further into this matter and will try porting the code to Aspose.Pdf for Java.
We apologize for your inconvenience.
Hello,
Any updates concerning this task PDFNEWJAVA-33837?
Thank you
Hi,
Thanks for your patience.
The development team has been busy resolving other priority issues and I am afraid the issue reported earlier is not yet resolved. Nevertheless, I have requested the team to share the ETA regarding its resolution. As soon as we have some definite updates regarding its resolution, we will be more than happy to update you with the status of the correction. Please be patient and spare us a little more time.
We are really sorry for this inconvenience.
Hello,
Since you advised to use Aspose.pdf instead of Aspose.pdf.kit, below is a sample of my work using Aspose.pdf:
The code below is used to define zones and their coordinates:
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(filePath);
com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber = new com.aspose.pdf.TextFragmentAbsorber();
TextSegment lastSegment = null;
pdfDocument.getPages().accept(textFragmentAbsorber);
// get the extracted text fragments into collection
com.aspose.pdf.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
// loop through the Text fragments
int index = 0;
for (com.aspose.pdf.TextFragment textFragment : (Iterable<com.aspose.pdf.TextFragment>) textFragmentCollection)
{
    TextSegmentCollection textSegmentCollection = textFragment.getSegments();
    // the loop variable is named currentSegment so that it matches the calls below
    for (com.aspose.pdf.TextSegment currentSegment : (Iterable<com.aspose.pdf.TextSegment>) textSegmentCollection) {
        zone = new Zone(
                currentSegment.getRectangle().getLLX(),
                currentSegment.getRectangle().getLLY(),
                currentSegment.getRectangle().getHeight(),
                currentSegment.getRectangle().getWidth(),
                currentSegment.getText());
    }
}
And to extract the text for these zones I am using the code below:
com.aspose.pdf.TextAbsorber absorber = new com.aspose.pdf.TextAbsorber();
absorber.getExtractionOptions().setFormattingMode(com.aspose.pdf.TextExtractionOptions.TextFormattingMode.Raw);
// Limit text search area to page bounds
absorber.getTextSearchOptions().setLimitToPageBounds(true);
absorber.getTextSearchOptions().setRectangle(new com.aspose.pdf.Rectangle(
        x,
        y,
        width,
        height));
pdfDocument.getPages().get_Item(i).accept(absorber);
String currentText = absorber.getText();
The currentText I am getting is not the same as the text I extracted above with currentSegment.getText().
So the same problem is also occurring with Aspose.pdf. Is there something missing in my use of the API?
Can you advise please?
Thanks.
Hi Karine,
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("c:/pdftest/facture.pdf");
com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber = new com.aspose.pdf.TextFragmentAbsorber();
TextSegment lastSegment = null;
pdfDocument.getPages().accept(textFragmentAbsorber);
// get the extracted text fragments into collection
com.aspose.pdf.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
// loop through the Text fragments
double llx =0;
double lly=0;
double urx = 0;
double ury=0;
int index = 0;
for (com.aspose.pdf.TextFragment textFragment : (Iterable<com.aspose.pdf.TextFragment>) textFragmentCollection)
{
    TextSegmentCollection textSegmentCollection = textFragment.getSegments();
    for (com.aspose.pdf.TextSegment textSegment : (Iterable<com.aspose.pdf.TextSegment>) textSegmentCollection)
    {
        // corners of the segment: lower-left (LLX, LLY) and upper-right (URX, URY)
        llx = textSegment.getRectangle().getLLX();
        lly = textSegment.getRectangle().getLLY();
        urx = textSegment.getRectangle().getURX();
        ury = textSegment.getRectangle().getURY();
        System.out.println("TextSegment = " + textSegment.getText());
        System.out.println("LLX = " + llx + " LLY = " + lly + " URX = " + urx + " URY = " + ury);
    }
}
com.aspose.pdf.TextAbsorber absorber = new com.aspose.pdf.TextAbsorber();
absorber.getExtractionOptions().setFormattingMode(com.aspose.pdf.TextExtractionOptions.TextFormattingMode.Raw);
// Limit text search area to page bounds
absorber.getTextSearchOptions().setLimitToPageBounds(true);
absorber.getTextSearchOptions().setRectangle(new com.aspose.pdf.Rectangle(
94.344,
476.8300000095367,
123.32399989986419,
487.86999997138975));
pdfDocument.getPages().get_Item(1).accept(absorber);
String currentText = absorber.getText();
System.out.println(currentText);
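The reply above passes the search rectangle as four corner values (LLX, LLY, URX, URY) read from textSegment.getRectangle(), whereas the earlier zone code stored a lower-left corner plus a width and a height. Below is a minimal sketch of converting between the two forms, assuming the four arguments of com.aspose.pdf.Rectangle are corners, as the values in the reply suggest. ZoneCorners, toCorners and toSize are hypothetical names for illustration and are not part of the Aspose API.

```java
/** Hypothetical helper for illustration; not part of the Aspose API. */
final class ZoneCorners {

    private ZoneCorners() { }

    /** Returns {llx, lly, urx, ury} for a zone stored as lower-left corner plus width and height. */
    static double[] toCorners(double llx, double lly, double width, double height) {
        return new double[] { llx, lly, llx + width, lly + height };
    }

    /** Returns {width, height} for a rectangle given by its two corners. */
    static double[] toSize(double llx, double lly, double urx, double ury) {
        return new double[] { urx - llx, ury - lly };
    }

    public static void main(String[] args) {
        // Corner values copied from the reply above.
        double llx = 94.344;
        double lly = 476.8300000095367;
        double urx = 123.32399989986419;
        double ury = 487.86999997138975;

        // Text width and height derived from the corners.
        double[] size = toSize(llx, lly, urx, ury);
        System.out.println("width = " + size[0] + ", height = " + size[1]);

        // Going back from (llx, lly, width, height) to the four corners.
        double[] corners = toCorners(llx, lly, size[0], size[1]);
        System.out.println("urx = " + corners[2] + ", ury = " + corners[3]);
    }
}
```

With a zone stored as LLX, LLY, width and height, the values handed to setRectangle would then be the four entries returned by toCorners rather than the stored width and height themselves.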