Difference between coordinates

karine_87 · June 12, 2013, 5:21am

Hello,
I extracted text properties using the method PdfExtractor.getFormattedText(), and I’m saving the X, Y, width, height, and text in an object to be used later on.
In another class I’m using the method contentEditor.getTextInRectangle(1, zone.getRect());
where zone.getRect() is the object I saved earlier using the extracted X, Y, width, and height.

Let’s say I saved an object with x=100, y=65, width=87, height=124, and text=trial.
If I call contentEditor.getTextInRectangle(1, new Rectangle(100,65,87,124)), I get a text different from “trial”, even though I’m using the same PDF file and the page number passed to getTextInRectangle is correct.
Is there a difference in the measurement unit?
Thank you

codewarior · June 16, 2013, 9:09pm

Hi Karine,

Thanks for contacting support and sorry for the delayed response.

I have tested the scenario and I am able to notice the same problem. We will look further into the details of this problem and will keep you updated on the status of the correction. Please be patient and give us a little time. We are sorry for this inconvenience.

codewarior · June 17, 2013, 6:07am

Hi Karine,

Thanks for using our products and sorry for the delayed response.

I have tested the scenario and I am able to reproduce the same problem. For the sake of correction, I have logged it in our issue tracking system as PDFNEWJAVA-33545. We will investigate this issue in detail and will keep you updated on the status of the correction.

We apologize for your inconvenience.

karine_87 · July 29, 2013, 7:50am

Hello
Any updates on this issue?
Thanks

codewarior · July 30, 2013, 8:10am

Hi Karine,

Thanks for your patience.

The development team has been busy resolving other priority issues and I am afraid the problem stated above is not yet resolved. Nevertheless, I have asked the development team to share any possible ETA. As soon as I have an update regarding its resolution, I will be more than happy to share the status of the correction. Please be patient and give us a little time.

We are sorry for this delay and inconvenience.

codewarior · July 31, 2013, 6:54am

Hi Karine,

Thanks for your patience.

I have discussed this further with the development team. They are still investigating, and there seems to be an issue with the vertical scaling definition. As a workaround, however, you may consider subtracting the Y value from the page height to get the desired result.

System.out.println(editor.getTextInRectangle(1,
    new java.awt.Rectangle(
        (int) text[1].getX(),
        665 - (int) text[1].getY(),
        (int) text[1].getTextWidth(),
        (int) text[1].getTextHeight())));

In this example, 665 is the page height.
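
The workaround above can be sketched without a hard-coded page height. This is only an illustration: the class and the flipVertically helper are hypothetical names, and the page height is assumed to be supplied by the caller, since pages in one file can differ in size.

```java
import java.awt.Rectangle;

final class PageFlip {
    // Mirrors the stored Y value against the page height, as in the workaround:
    // the new Y is pageHeight - storedY; X, width and height are left unchanged.
    static Rectangle flipVertically(Rectangle stored, int pageHeight) {
        return new Rectangle(stored.x, pageHeight - stored.y, stored.width, stored.height);
    }

    public static void main(String[] args) {
        Rectangle stored = new Rectangle(100, 65, 87, 124); // values from the original question
        Rectangle flipped = flipVertically(stored, 665);    // 665 is the example page height
        System.out.println(flipped.y);                      // 665 - 65
    }
}
```

With the values from the question, the flipped rectangle starts at y = 600 instead of y = 65.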

karine_87 · August 1, 2013, 6:28am

Thank you Nayyer.
We will try it asap.

karine_87 · November 12, 2013, 2:27am

Hello,

We tried your solution, which fixed our problem, but another issue showed up.
If I have a file, file1, composed of two pages (page1 and page2),
and I create a different file, file2, composed only of page2,
the coordinates extracted from page2 in file1 are different from the ones extracted
from page2 in file2.
My question is: do the coordinates depend on the page itself or on the whole file?
Thank you.

codewarior · November 12, 2013, 7:15am

Hi Karine,

Can you please share the source PDF files and the code snippet you are using so that we can test the scenario on our end? Furthermore, please note that individual pages in a PDF file can have different orientations and sizes.

We are sorry for your inconvenience.

karine_87 · November 13, 2013, 2:41am

Thanks for your reply,

Kindly find attached the initial PDF facture.pdf and the second part of the PDF: smallfacture.pdf.

First of all, we divided facture.pdf based on the facture zone. Basically, we have two facture texts: FACTURE1 on page 1 and FACTURE2 on page 3. We put the pages in vPages.

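Vector<Vector<Integer>> vPages = new Vector<Vector<Integer>>();
vPages.add(new Vector<Integer>(Arrays.asList(1, 2))); // pages 1-2: first part
vPages.add(new Vector<Integer>(Arrays.asList(3)));    // page 3: second part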

Then we divided the PDF into two parts using the code below:

for (Vector<Integer> pages : vPages) {
    String fileName = pdfFile.getName();
    File tempFile = File.createTempFile(
        FilenameUtils.removeExtension(fileName) + System.currentTimeMillis(),
        "." + FilenameUtils.getExtension(fileName), getWorkingDir()
    );
    PdfFileEditor pdfEditor = new PdfFileEditor();
    extractSuccess = pdfEditor.extract(
        pdfFile.getPath(),
        pages.firstElement(),
        pages.lastElement(),
        tempFile.getPath()
    );
    String tmpPath = tempFile.getPath();
}

So now we have two PDFs: part1 and smallfacture.pdf.
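
One thing that may be worth checking after the split is the page number stored with each zone. Assuming the pages of an extracted part are renumbered from 1 (an assumption, not something confirmed in this thread), a zone recorded on page 3 of facture.pdf would sit on page 1 of smallfacture.pdf. A minimal sketch of that remapping, where PageRemap and toSplitPage are hypothetical names:

```java
final class PageRemap {
    // Maps a page number in the original file to its number in a part
    // extracted from firstPage..lastPage, assuming the part is renumbered from 1.
    // Returns -1 when the page is not inside the extracted range.
    static int toSplitPage(int originalPage, int firstPage, int lastPage) {
        if (originalPage < firstPage || originalPage > lastPage) {
            return -1;
        }
        return originalPage - firstPage + 1;
    }

    public static void main(String[] args) {
        System.out.println(toSplitPage(3, 3, 3)); // page 3 in the part built from [3]
        System.out.println(toSplitPage(2, 1, 2)); // page 2 in the part built from [1,2]
        System.out.println(toSplitPage(3, 1, 2)); // page 3 is not in the part built from [1,2]
    }
}
```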

This is how we are extracting the text from the PDFs using the coordinates:

int page = zone.getPage();
if (page <= numOfpage) {
    int pageHeight = (int) pdfFileInfo.getPageHeight(page);
    Rectangle oldRect = zone.getRect();
    int y = (int)oldRect.getY();
    int x = (int)oldRect.getX();
    int height = (int)oldRect.getHeight();
    int width = (int)oldRect.getWidth();
    Rectangle rect = new Rectangle(x, pageHeight -y, width, height);
    String fieldValue = contentEditor.getTextInRectangle(page, rect);
}

The problem is that the coordinates of the zone containing the word “Carine Matta” are different between facture.pdf and smallfacture.pdf, so we are getting an empty string rather than “Carine Matta”.

WholePDF:

<zone>
    <coordinates><x>72.024</x><y>2020.73</y></coordinates>
    <width>76.95985</width><height>11.04</height>
    <text>Carine Matta</text>
</zone>

Part2 of the PDF:

<zone>
    <coordinates><x>73</x><y>363</y></coordinates>
    <width>60</width><height>12</height>
    <text>Carine Matta</text>
</zone>
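
A side note on the extraction code above: the (int) casts drop the fractional part of each value, so a width of 76.95985 becomes 76 and a height of 11.04 becomes 11. This is unlikely to explain the large difference between the two zones, but it can make the search rectangle slightly smaller than the stored zone. A hedged sketch of rounding outward instead, where ZoneRounding and enclose are hypothetical names:

```java
import java.util.Arrays;

final class ZoneRounding {
    // Returns {x, y, width, height} as integers that fully enclose the
    // fractional box: the minimum corner is rounded down, the maximum corner up.
    static int[] enclose(double x, double y, double width, double height) {
        int minX = (int) Math.floor(x);
        int minY = (int) Math.floor(y);
        int maxX = (int) Math.ceil(x + width);
        int maxY = (int) Math.ceil(y + height);
        return new int[] { minX, minY, maxX - minX, maxY - minY };
    }

    public static void main(String[] args) {
        // Fractional values taken from the WholePDF zone above
        int[] box = enclose(72.024, 2020.73, 76.95985, 11.04);
        System.out.println(Arrays.toString(box));
    }
}
```

For the WholePDF zone this gives a 77 by 12 box at (72, 2020), where plain casts would give 76 by 11.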

codewarior · November 15, 2013, 12:40pm

Hi,

Thanks for sharing the details and sorry for the delayed response.

I have tested the scenario using the following code snippet (based on the new Document Object Model approach of com.aspose.pdf package) where I have first tried getting coordinates of string Carine Matta from the third page of facture.pdf file and then I have tried getting coordinates of the same string from smallfacture.pdf file. As per my observations, the same coordinates are being returned. Can you please try using the latest release of Aspose.Pdf for Java 23.5 and in case you still face the same problem, please share some further details which can help us in replicating this issue at our end. We are sorry for your inconvenience.

[Java]:

import com.aspose.pdf.Document;
import com.aspose.pdf.TextFragmentAbsorber;
import com.aspose.pdf.TextFragmentCollection;
import com.aspose.pdf.TextFragment;
import com.aspose.pdf.TextSegment;

public class SearchTextInPdf {
    public static void main(String[] args) {
        // Load the PDF document
        Document pdfDocument = new Document("c:/pdftest/facture.pdf");

        // Create TextFragmentAbsorber object to find all instances of the input search phrase
        TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Carine Matta");

        // Accept the absorber for the third page of the document
        pdfDocument.getPages().get_Item(3).accept(textFragmentAbsorber);

        // Get the extracted text fragments into collection
        TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();

        // Loop through the text fragments
        for (TextFragment textFragment : textFragmentCollection) {
            // Iterate through text segments
            for (TextSegment textSegment : textFragment.getSegments()) {
                System.out.println("Text: " + textSegment.getText());
                System.out.println("Position: " + textSegment.getPosition());
                System.out.println("XIndent: " + textSegment.getPosition().getXIndent());
                System.out.println("YIndent: " + textSegment.getPosition().getYIndent());
                System.out.println("Font - Name: " + textSegment.getTextState().getFont().getFontName());
            }
        }
    }
}

karine_87 · November 18, 2013, 4:13am

Hello,

Thank you for your reply.
We tried to use Aspose.Pdf instead of Aspose.Pdf.Kit,
but the problem is that we need to get the text width and height from the TextSegment, and these are not available when using Aspose.Pdf.
The text width and height are necessary for our project.
Is there another way to extract them using Aspose.Pdf?

Thank you.

codewarior · November 19, 2013, 4:15am

Hi,

Thanks for your patience.

When using the Document Object Model, you can search for a particular text string and get the width of a TextSegment, measured as the number of characters in its text, using the following code line.

System.out.println("Text Width :- " + textSegment.getText().length());

But I am afraid the text height cannot currently be retrieved, because the getRectangle(…) method, which would provide the TextSegment height, is not yet implemented. For the sake of correction, I have logged it in our issue tracking system as PDFNEWJAVA-33837. However, the same class is present in Aspose.Pdf for .NET, and we will look further into this matter and try porting the code to Aspose.Pdf for Java.

Please be patient and give us a little time. We apologize for the inconvenience.

karine_87 · April 7, 2014, 12:35am

Hello,
Any updates concerning this task PDFNEWJAVA-33837?

Thank you

codewarior · April 7, 2014, 2:15pm

Hi,

Thanks for your patience.

The development team has been busy resolving other priority issues and I am afraid the issue reported earlier is not yet resolved. Nevertheless, I have requested the team to share the ETA regarding its resolution. As soon as we have some definite updates regarding its resolution, we would be more than happy to update you with the status of correction. Please be patient and spare us a little more time.

We are really sorry for this inconvenience.

karine_87 · May 7, 2014, 12:32am

Hello,
Since you advised using Aspose.Pdf instead of Aspose.Pdf.Kit, below is a sample of my work using Aspose.Pdf.
The code below is used to define zones and their coordinates:
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(filePath);
com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber = new com.aspose.pdf.TextFragmentAbsorber();
TextSegment lastSegment = null;
pdfDocument.getPages().accept(textFragmentAbsorber);
// get the extracted text fragments into collection
com.aspose.pdf.TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();
// loop through the Text fragments
int index = 0;
for (com.aspose.pdf.TextFragment textFragment : (Iterable<com.aspose.pdf.TextFragment>) textFragmentCollection) {
    TextSegmentCollection textSegmentCollection = textFragment.getSegments();
    for (com.aspose.pdf.TextSegment textSegment : (Iterable<com.aspose.pdf.TextSegment>) textSegmentCollection) {
        zone = new Zone(
            currentSegment.getRectangle().getLLX(),
            currentSegment.getRectangle().getLLY(),
            currentSegment.getRectangle().getHeight(),
            currentSegment.getRectangle().getWidth(),
            currentSegment.getText());
    }
}
And to extract text for these zones, I am using the code below:
com.aspose.pdf.TextAbsorber absorber = new com.aspose.pdf.TextAbsorber();
absorber.getExtractionOptions().setFormattingMode(com.aspose.pdf.TextExtractionOptions.TextFormattingMode.Raw);
// Limit text search area to page bounds
absorber.getTextSearchOptions().setLimitToPageBounds(true);
absorber.getTextSearchOptions().setRectangle(new com.aspose.pdf.Rectangle(
    x,
    y,
    width,
    height));
pdfDocument.getPages().get_Item(i).accept(absorber);
String currentText = absorber.getText();

The currentText I am getting is not the same as the text I extracted above with currentSegment.getText().
So the same problem also occurs with Aspose.Pdf. Is there something missing in my use of the API?
Can you advise, please?

Thanks.

codewarior · May 8, 2014, 12:09am

Hi Karine,

Thanks for contacting support.

I have tested the scenario using Aspose.Pdf for Java 9.0.0 with one of the PDF files you shared earlier (facture.pdf) and, as per my observations, I am getting the same TextSegment when retrieving text from a particular page region. Please note that I used the following code snippet to test the scenario, and in the absorber.getTextSearchOptions().setRectangle(…) method I specified the coordinates retrieved while traversing all TextSegments.

[Java]

com.aspose.pdf.Document pdfDocument =
    new com.aspose.pdf.Document("c:/pdftest/facture.pdf");

com.aspose.pdf.TextFragmentAbsorber textFragmentAbsorber =
    new com.aspose.pdf.TextFragmentAbsorber();
TextSegment lastSegment = null;

pdfDocument.getPages().accept(textFragmentAbsorber);

// get the extracted text fragments into collection
com.aspose.pdf.TextFragmentCollection textFragmentCollection =
    textFragmentAbsorber.getTextFragments();

double llx = 0;
double lly = 0;
double urx = 0;
double ury = 0;
int index = 0;

// loop through the text fragments
for (com.aspose.pdf.TextFragment textFragment :
        (Iterable<com.aspose.pdf.TextFragment>) textFragmentCollection) {

    TextSegmentCollection textSegmentCollection = textFragment.getSegments();
    for (com.aspose.pdf.TextSegment textSegment :
            (Iterable<com.aspose.pdf.TextSegment>) textSegmentCollection) {

        llx = textSegment.getRectangle().getLLX();
        lly = textSegment.getRectangle().getLLY();
        urx = textSegment.getRectangle().getURX();
        ury = textSegment.getRectangle().getURY();
        System.out.println("TextSegment = " +
            textSegment.getText());
        System.out.println("LLX = " + llx + "   LLY = " +
            lly + "   URX = " + urx + "   URY = " + ury);
    }
}

Get text from particular page region.

[Java]

com.aspose.pdf.TextAbsorber absorber =
new com.aspose.pdf.TextAbsorber();

absorber.getExtractionOptions().setFormattingMode(com.aspose.pdf.TextExtractionOptions.TextFormattingMode.Raw);

// Limit text search area to page bounds
absorber.getTextSearchOptions().setLimitToPageBounds(true);

// Restrict extraction to the region of a TextSegment found above (LLX, LLY, URX, URY)
absorber.getTextSearchOptions().setRectangle(new com.aspose.pdf.Rectangle(
    94.344, 476.8300000095367, 123.32399989986419, 487.86999997138975));

pdfDocument.getPages().get_Item(1).accept(absorber);
String currentText = absorber.getText();

System.out.println(currentText);
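
The Rectangle values in the snippet above follow the LLX, LLY, URX, URY order printed by the first snippet, that is, two corners rather than a corner plus a size. If a zone was saved as x, y, width, and height, it would presumably need converting before being used this way. The sketch below is a hedged illustration in plain Java: ZoneCorners, toCorners, and toZone are hypothetical names, and (x, y) is assumed to be the lower-left corner of the zone.

```java
final class ZoneCorners {
    // Converts a zone stored as (x, y, width, height), with (x, y) assumed to be
    // the lower-left corner, into {llx, lly, urx, ury}.
    static double[] toCorners(double x, double y, double width, double height) {
        return new double[] { x, y, x + width, y + height };
    }

    // Inverse direction: derives {x, y, width, height} from the two corners.
    static double[] toZone(double llx, double lly, double urx, double ury) {
        return new double[] { llx, lly, urx - llx, ury - lly };
    }

    public static void main(String[] args) {
        // Corner values rounded from the region used in the snippet above
        double[] zone = toZone(94.344, 476.83, 123.324, 487.87);
        double[] corners = toCorners(zone[0], zone[1], zone[2], zone[3]);
        System.out.println("width=" + zone[2] + " height=" + zone[3]);
        System.out.println("urx=" + corners[2] + " ury=" + corners[3]);
    }
}
```

Under these assumptions, passing the stored width and height where URX and URY are expected would describe a different region, which might account for text that does not match.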