Difference between coordinates

Hello,
I extracted text properties using the method PdfExtractor.getFormattedText(), and I’m saving the X, Y, width, height and text in an object to be used later on.
In another class I’m calling contentEditor.getTextInRectangle(1, zone.getRect());
where zone.getRect() is the rectangle I saved earlier from the extracted X, Y, width and height.

Let’s say I saved an object with x=100, y=65, width=87, height=124 and text=trial.
If I call contentEditor.getTextInRectangle(1, new Rectangle(100,65,87,124)) I get text other than “trial”, even though I’m using the same PDF file and the page number passed to getTextInRectangle is correct.
Is there a difference in the measurement unit?
Thank you

Hi Karine,


Thanks for contacting support and sorry for the delayed response.

I have tested the scenario and I am able to notice the same problem. We will look further into the details of this problem and will keep you updated on the status of the correction. Please be patient and allow us a little time. We are sorry for this inconvenience.

Hi Karine,


Thanks for using our products and sorry for the delayed response.

I have tested the scenario and I am able to reproduce the same problem. For the sake of correction, I have logged it in our issue tracking system as PDFNEWJAVA-33545. We will investigate this issue in detail and will keep you updated on the status of a correction.

We apologize for the inconvenience.

Hello
Any updates on this issue?
Thanks

Hi Karine,


Thanks for your patience.

The development team has been busy resolving other priority issues, and I am afraid the problem described above is not yet resolved. Nevertheless, I have asked the development team to share any possible ETA. As soon as I have an update regarding its resolution, I will be happy to share the status of the correction with you. Please be patient and allow us a little time.

We are sorry for this delay and inconvenience.

Hi Karine,


Thanks for your patience.

I have discussed this further with the development team. They are still investigating, and there seems to be an issue with the vertical scaling definition. However, as a workaround, you may consider using the height of the page to get the desired result.

System.out.println(editor.getTextInRectangle(1,
    new java.awt.Rectangle((int) text[1].getX(), 665 - (int) text[1].getY(), (int) text[1].getTextWidth(), (int) text[1].getTextHeight())));

In this example, 665 is the page height.
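
The workaround above amounts to subtracting the saved Y value from the page height before building the rectangle. A minimal sketch of that arithmetic on its own, using only java.awt.Rectangle from the standard library; the class and method names (YFlip, flipY) are hypothetical and not part of any Aspose API, and the page height of 665 is simply the example value quoted above:

```java
import java.awt.Rectangle;

// Hypothetical helper for illustration only; not part of any Aspose API.
class YFlip {
    // Returns a copy of rect whose y is replaced by (pageHeight - y);
    // x, width and height are left unchanged.
    static Rectangle flipY(Rectangle rect, int pageHeight) {
        return new Rectangle(rect.x, pageHeight - rect.y, rect.width, rect.height);
    }

    public static void main(String[] args) {
        Rectangle saved = new Rectangle(100, 65, 87, 124); // values from the first post
        Rectangle flipped = flipY(saved, 665);             // 665 = example page height
        System.out.println(flipped.x + "," + flipped.y + "," + flipped.width + "," + flipped.height);
        // prints 100,600,87,124
    }
}
```

In real use the page height would presumably be read per page rather than hard-coded, since pages in one file can differ in size.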

Thank you Nayyer.
We will try it asap.

Hello,

We tried your solution, which fixed our problem, but another issue showed up.
If I have a file file1 composed of two pages, page1 and page2,
and I create a different file file2 composed only of page2,
the coordinates extracted from page2 in file1 are different from the ones extracted
from page2 in file2.
My question is: do the coordinates depend on the page itself or on the whole file?
Thank you.

Hi Karine,


Can you please share the source PDF files and the code snippet you are using, so that we can test the scenario at our end? Furthermore, please note that individual pages in a PDF file can have different orientations and sizes.

We are sorry for the inconvenience.

Thanks for your reply,
Kindly find attached the initial PDF, facture.pdf, and the second part of the PDF, smallfacture.pdf.
First of all, we divided facture.pdf based on the facture zone.
Basically we have two facture texts: FACTURE1 on page 1 and
FACTURE2 on page 3.
We put the pages in vPages, i.e. vPages = [[1,2],[3]]:
Vector<Vector<Integer>> vPages = new Vector<Vector<Integer>>();
vPages.add(new Vector<Integer>(java.util.Arrays.asList(1, 2)));
vPages.add(new Vector<Integer>(java.util.Arrays.asList(3)));

Then we divided the pdf into two parts using the code below:

for (Vector<Integer> pages : vPages) {
    String fileName = pdfFile.getName();
    File tempFile = File.createTempFile(
            FilenameUtils.removeExtension(fileName) + System.currentTimeMillis(),
            "." + FilenameUtils.getExtension(fileName), getWorkingDir());
    PdfFileEditor pdfEditor = new PdfFileEditor();
    // extract the pages from the first to the last element of this part into the temp file
    extractSuccess = pdfEditor.extract(pdfFile.getPath(),
            pages.firstElement(),
            pages.lastElement(),
            tempFile.getPath());
    String tmpPath = tempFile.getPath();
}

So now we have two pdfs : part1 and smallfacture.pdf.

This is how we are extracting the text from the pdfs using the coordinates :
int page = zone.getPage();
if (page <= numOfpage) {
    int pageHeight = (int) pdfFileInfo.getPageHeight(page);
    Rectangle oldRect = zone.getRect();
    int y = (int) oldRect.getY();
    int x = (int) oldRect.getX();
    int height = (int) oldRect.getHeight();
    int width = (int) oldRect.getWidth();
    Rectangle rect = new Rectangle(x, pageHeight - y, width, height);
    String fieldValue = contentEditor.getTextInRectangle(page, rect);
}

The problem is that the coordinates of the zone containing the text “Carine Matta”
differ between facture.pdf and smallfacture.pdf, so we get an empty string rather than "Carine Matta".

WholePDF
72.0242020.7376.9598511.04Carine Matta

Part2 of the pdf :
733636012Carine Matta

Hi,

Thanks for sharing the details and sorry for the delayed response.

I have tested the scenario using the following code snippet (based on the new Document Object Model approach of the com.aspose.pdf package). I first retrieved the coordinates of the string Carine Matta from the third page of facture.pdf, then retrieved the coordinates of the same string from smallfacture.pdf, and as per my observations the same coordinates are returned. Can you please try the latest release, Aspose.Pdf for Java 4.3.1? If you still face the same problem, please share further details that can help us replicate this issue at our end. We are sorry for the inconvenience.

You may consider visit the following links for further information on

[Java]

com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("c:/pdftest/facture.pdf");

// create a TextFragmentAbsorber object to find all instances of the input search phrase
com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber = new com.aspose.pdf.TextFragmentAbsorber("Carine Matta");

// accept the absorber for the third page of the document
pdfDocument.getPages().get_Item(3).accept(textFragmentAbsorber);

// get the extracted text fragments into a collection
com.aspose.pdf.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();

// loop through the text fragments
for (com.aspose.pdf.TextFragment textFragment : (Iterable<com.aspose.pdf.TextFragment>) textFragmentCollection)
{
    // iterate through the text segments
    for (com.aspose.pdf.TextSegment textSegment : (Iterable<com.aspose.pdf.TextSegment>) textFragment.getSegments())
    {
        System.out.println("Text :- " + textSegment.getText());
        System.out.println("Position :- " + textSegment.getPosition());
        System.out.println("XIndent :- " + textSegment.getPosition().getXIndent());
        System.out.println("YIndent :- " + textSegment.getPosition().getYIndent());
        System.out.println("Font - Name :- " + textSegment.getTextState().getFont().getFontName());
    }
}

Hello,

Thank you for your reply.
We tried to use Aspose.Pdf instead of Aspose.Pdf.Kit,
but the problem is that we need to get the text width and height from the TextSegment, and these are not available when using Aspose.Pdf.
The text width and height are necessary for our project.
Is there another way to extract them using Aspose.Pdf?


Thank you.

Hi,


Thanks for your patience.

When using the Document Object Model, you can search for a particular text string and get the width of a TextFragment, measured in characters, using the following code line.

System.out.println("Text Width :- " + textSegment.getText().length());

But I am afraid text height currently cannot be retrieved, because the getRectangle(…) method, which would provide the TextSegment height, is not yet implemented. For the sake of correction, I have logged it in our issue tracking system as PDFNEWJAVA-33837. However, the same class is present in Aspose.Pdf for .NET; we will look further into this matter and will try porting the code to Aspose.Pdf for Java.

Please be patient and allow us a little time. We apologize for the inconvenience.

Hello,
Any updates concerning this task PDFNEWJAVA-33837?

Thank you

Hi,


Thanks for your patience.

The development team has been busy resolving other priority issues, and I am afraid the issue reported earlier is not yet resolved. Nevertheless, I have asked the team to share an ETA for its resolution. As soon as we have a definite update, we will be happy to share the status of the correction with you. Please be patient and allow us a little more time.

We are really sorry for this inconvenience.

Hello,
Since you advised using Aspose.Pdf instead of Aspose.Pdf.Kit, below is a sample of my work using Aspose.Pdf.
The code below is used to define zones and their coordinates:
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(filePath);
com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber = new com.aspose.pdf.TextFragmentAbsorber();
TextSegment lastSegment = null;
pdfDocument.getPages().accept(textFragmentAbsorber);
// get the extracted text fragments into a collection
com.aspose.pdf.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
// loop through the text fragments
int index = 0;
for (com.aspose.pdf.TextFragment textFragment : (Iterable<com.aspose.pdf.TextFragment>) textFragmentCollection)
{
    TextSegmentCollection textSegmentCollection = textFragment.getSegments();
    for (com.aspose.pdf.TextSegment currentSegment : (Iterable<com.aspose.pdf.TextSegment>) textSegmentCollection) {
        zone = new Zone(
                currentSegment.getRectangle().getLLX(),
                currentSegment.getRectangle().getLLY(),
                currentSegment.getRectangle().getHeight(),
                currentSegment.getRectangle().getWidth(),
                currentSegment.getText());
    }
}

And to extract text for these zones I am using the code below:
com.aspose.pdf.TextAbsorber absorber = new com.aspose.pdf.TextAbsorber();
absorber.getExtractionOptions().setFormattingMode(com.aspose.pdf.TextExtractionOptions.TextFormattingMode.Raw);
// Limit text search area to page bounds
absorber.getTextSearchOptions().setLimitToPageBounds(true);
absorber.getTextSearchOptions().setRectangle(new com.aspose.pdf.Rectangle(
        x,
        y,
        width,
        height));
pdfDocument.getPages().get_Item(i).accept(absorber);
String currentText = absorber.getText();


The currentText I get is not the same as the text I extracted above with currentSegment.getText().
So the same problem also occurs with Aspose.Pdf. Is there something missing in my use of the API?
Can you advise please?

Thanks.

Hi Karine,


Thanks for contacting support.

I have tested the scenario using Aspose.Pdf for Java 9.0.0 with one of the PDF files you shared earlier (facture.pdf), and as per my observations, I get the same TextSegment when retrieving text from a particular page region. Please note that I used the following code snippet to test the scenario, and in the absorber.getTextSearchOptions().setRectangle(…) method I specified the coordinates retrieved while traversing through all TextSegments.

[Java]

com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document("c:/pdftest/facture.pdf");
com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber = new com.aspose.pdf.TextFragmentAbsorber();
TextSegment lastSegment = null;
pdfDocument.getPages().accept(textFragmentAbsorber);

// get the extracted text fragments into a collection
com.aspose.pdf.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();

// loop through the text fragments
double llx = 0;
double lly = 0;
double urx = 0;
double ury = 0;
int index = 0;
for (com.aspose.pdf.TextFragment textFragment : (Iterable<com.aspose.pdf.TextFragment>) textFragmentCollection)
{
    TextSegmentCollection textSegmentCollection = textFragment.getSegments();
    for (com.aspose.pdf.TextSegment textSegment : (Iterable<com.aspose.pdf.TextSegment>) textSegmentCollection)
    {
        llx = textSegment.getRectangle().getLLX();
        lly = textSegment.getRectangle().getLLY();
        urx = textSegment.getRectangle().getURX();
        ury = textSegment.getRectangle().getURY();
        System.out.println("TextSegment = " + textSegment.getText());
        System.out.println("LLX = " + llx + " LLY = " + lly + " URX = " + urx + " URY = " + ury);
    }
}


Get text from a particular page region.

[Java]

com.aspose.pdf.TextAbsorber absorber = new com.aspose.pdf.TextAbsorber();
absorber.getExtractionOptions().setFormattingMode(com.aspose.pdf.TextExtractionOptions.TextFormattingMode.Raw);

// Limit text search area to page bounds
absorber.getTextSearchOptions().setLimitToPageBounds(true);
absorber.getTextSearchOptions().setRectangle(new com.aspose.pdf.Rectangle(
        94.344,
        476.8300000095367,
        123.32399989986419,
        487.86999997138975));
pdfDocument.getPages().get_Item(1).accept(absorber);
String currentText = absorber.getText();
System.out.println(currentText);
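
Judging by the values passed to setRectangle(…) above, the four arguments look like the LLX, LLY, URX and URY corner coordinates printed by the first snippet, not an origin plus a width and a height. If a zone was saved as x, y, width and height, it would presumably need converting before being passed in. A minimal sketch of that arithmetic in plain Java; the class and method names (ZoneCorners, toCorners, toOriginAndSize) are hypothetical and not part of any Aspose API, and the sketch assumes x and y denote the lower-left corner:

```java
import java.util.Arrays;

// Hypothetical helper for illustration only; not part of any Aspose API.
class ZoneCorners {
    // Converts (x, y, width, height), with (x, y) assumed to be the lower-left
    // corner, into {llx, lly, urx, ury}.
    static double[] toCorners(double x, double y, double width, double height) {
        return new double[] { x, y, x + width, y + height };
    }

    // Converts {llx, lly, urx, ury} back into {x, y, width, height}.
    static double[] toOriginAndSize(double llx, double lly, double urx, double ury) {
        return new double[] { llx, lly, urx - llx, ury - lly };
    }

    public static void main(String[] args) {
        double[] corners = toCorners(100, 65, 87, 124);
        System.out.println(Arrays.toString(corners));
        // prints [100.0, 65.0, 187.0, 189.0]
    }
}
```

With a helper like this, a saved zone would be converted once with toCorners before the rectangle is constructed, so that the width and height are never read as upper-right coordinates.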