Incorrect X coordinates for PDF converted from DOC by Aspose.Words

akorznikov · December 21, 2018, 12:55pm

I am trying to get the coordinates of every word on a page image converted from a DOC file by Aspose. It is not possible to get word coordinates directly from the DOC, so I convert the DOC to PDF (with Aspose.Words), then render the PDF to an image and get the coordinates of all words. However, when a line of text is justified (“fit to page” alignment, done with whitespace), the X coordinates of the last words on the line are smaller than they should be.

asad.ali · December 21, 2018, 6:09pm

@akorznikov

Thanks for contacting support.

Would you please share your sample file(s) along with the code snippet that you are using to extract the coordinates? Also, please share some more details about your complete requirements. We will test the scenario in our environment and address it accordingly.

akorznikov · December 24, 2018, 11:28am

I created a .doc file as an example: sample-cyr.zip (38.5 KB)

This file was converted to PDF with this code:

@Test
public void testAsposeWordToPDFRender() throws Exception {
    Document doc = new Document(workingFolder + "word/" + fileName);

    PdfSaveOptions options = new PdfSaveOptions();
    options.setEmbedFullFonts(true);
    options.setPageMode(PdfPageMode.USE_NONE);

    doc.save(workingFolder + "output/text-aspose-cyr.pdf", options);
}

The result is here: text-aspose-cyr.pdf (726.3 KB)

It was then rendered to PNG, with the extracted text fragments marked, by this code:

public void extractBlocksFromPDF(String inputFileName) throws IOException {
    Document doc = new Document(workingFolder + "pdf/"+ inputFileName + ".pdf");
    Page page = doc.getPages().get_Item(1);
    page.getPageInfo().setMargin(new MarginInfo(0,0,0,0));

    FileOutputStream stream = getFileOutputStream("out-" + inputFileName + "-pdf");
    Resolution resolution = new Resolution(dpi);
    PngDevice pngDevice = new PngDevice(resolution);
    BufferedImage bi = pngDevice.processToBufferedImage(page);

    TextFragmentAbsorber tfabs = new TextFragmentAbsorber();
    page.accept(tfabs);

    tfabs.getTextFragments().iterator().forEachRemaining(tf -> {
        drawRect(tf.getRectangle(), bi, Color.RED);
    });

    ImageIO.write(bi, "PNG", stream);
    // Close the stream
    stream.close();
}

protected void drawRect(Rectangle r, BufferedImage bi, Color color) {
    int height = bi.getHeight();

    Graphics g = bi.getGraphics();

    // Convert the fragment rectangle from PDF points to pixels
    int pxLLX = toPX(r.getLLX());
    int pxLLY = toPX(r.getLLY());
    int pxW = toPX(r.getWidth());
    int pxH = toPX(r.getHeight());

    g.setColor(color);
    // The PDF Y axis points up and the image Y axis points down, so flip it
    g.drawRect(pxLLX, height - pxLLY - pxH, pxW, pxH);
    // Release the graphics context once drawing is done
    g.dispose();
}

protected int toPX(double i) {
    return (int) (i*dpi/72f);
}

The result is here: out-text-aspose-cyr-pdf.png (281.1 KB)
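
As a side note on the drawRect/toPX conversion above: truncating the origin and the width separately can move the right edge of a rectangle inward by a pixel or two. That is probably too small to account for the cut-off word endings, but it is easy to rule out. Below is a minimal sketch that rounds each edge outward instead. The class and method names are hypothetical, it does not use any Aspose API, and it takes plain numbers for the lower-left and upper-right corners in PDF points.

```java
// Hypothetical helper, not part of Aspose.PDF: converts a rectangle given in
// PDF points (origin at bottom-left) to image pixels (origin at top-left).
final class PointRectToPixels {

    // Returns {x, y, width, height} in pixels.
    static int[] toPixelRect(double llx, double lly, double urx, double ury,
                             int dpi, int imageHeightPx) {
        double scale = dpi / 72.0;                      // 72 points per inch
        // Round every edge outward so the box never shrinks
        int left = (int) Math.floor(llx * scale);
        int right = (int) Math.ceil(urx * scale);
        // Flip the Y axis: the PDF origin is at the bottom, the image origin at the top
        int top = imageHeightPx - (int) Math.ceil(ury * scale);
        int bottom = imageHeightPx - (int) Math.floor(lly * scale);
        return new int[] {left, top, right - left, bottom - top};
    }

    public static void main(String[] args) {
        // Assumed values: an A4 page (842 pt high) rendered at 300 dpi is about 3508 px high
        int[] r = toPixelRect(72.3, 700.2, 120.7, 712.9, 300, 3508);
        System.out.println(java.util.Arrays.toString(r));
    }
}
```

If the rectangles still cut the ends of words with outward rounding, the error is in the rectangles themselves rather than in the pixel conversion.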

asad.ali · December 24, 2018, 6:45pm

@akorznikov

Thanks for sharing the requested information.

Could you please also share the definition of the getFileOutputStream() method and the value of the dpi variable used in your code snippet? It would help us test the scenario in our environment and address it accordingly.

akorznikov · December 25, 2018, 11:27am

Sorry

protected int dpi = 300;

protected FileOutputStream getFileOutputStream(String fileName) throws IOException {
    File file = new File(workingFolder + "output/" + fileName + ".png");

    // FileOutputStream creates the file or truncates an existing one,
    // so no explicit delete/create step is needed
    return new FileOutputStream(file);
}

asad.ali · December 25, 2018, 7:17pm

@akorznikov

Thanks for providing the requested information.

We have tested the scenario in our environment and examined the image generated using Aspose.PDF for Java 18.11. As we understand it, you are concerned about the rectangles drawn around the last words of each line on the PDF page. We noticed that some of these rectangles were not drawn correctly.

We have attached a screenshot showing the words we noticed. Would you please check it and confirm whether you have any other requirements? We will then proceed to help you further.

Coordinates.png (94.3 KB)

akorznikov · December 26, 2018, 10:06am

Yes, I confirm. I’m not really concerned about the white space before the blocks, but I am concerned about the ends of words being cut off.

I disabled printout of blocks with spaces only, see the snippet:

tfabs.getTextFragments().iterator().forEachRemaining(tf -> {
    if (tf.getText().trim().length() > 0)
        drawRect(tf.getRectangle(), bi, Color.RED);
});

And got clearer results: out-text-aspose-cyr-pdf.png (544.5 KB)
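
One caveat with the trim() check above: String.trim() only strips characters up to U+0020, so a fragment made only of non-breaking spaces (U+00A0), which Word documents often contain, would still count as non-empty. Whether this sample contains such characters is an assumption. A minimal sketch of a stricter check, using a hypothetical helper class and only standard JDK calls:

```java
// Hypothetical helper, not part of Aspose.PDF.
final class BlankText {

    // True when the string is null, empty, or made only of whitespace,
    // including non-breaking spaces that String.trim() leaves in place.
    static boolean isBlank(String s) {
        if (s == null) {
            return true;
        }
        return s.codePoints().allMatch(cp ->
                Character.isWhitespace(cp) || Character.isSpaceChar(cp));
    }

    public static void main(String[] args) {
        String nbspOnly = "\u00A0\u00A0";
        // Compare the trim()-based check with the stricter one
        System.out.println(nbspOnly.trim().length() > 0);
        System.out.println(isBlank(nbspOnly));
    }
}
```

In the snippet above, the condition would then read `if (!BlankText.isBlank(tf.getText()))`.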

I also converted the same Word file to XPS and then rendered it through Aspose.PDF with XpsLoadOptions; the results were more accurate but not perfect.
XPS result file: out-sample-cyr.doc.xps.zip (94.2 KB)

Image result file (with marks): out-sample-cyr-xps.png (942.5 KB)

Snippet:

public void testAsposeWordToXPSRender() throws Exception {
    Document doc = new Document(workingFolder + "word/" + fileName);

    doc.save(workingFolder + "output/out-" + fileName + ".xps", SaveFormat.XPS);

}


public void extractBlocksFromXPS(String inputFileName) throws IOException {
    XpsLoadOptions options = new XpsLoadOptions();

    Document doc = new Document(workingFolder + "xps/"+ inputFileName + ".xps", options);

    assert doc.getPages() != null;
    assert doc.getPages().size() > 0;

    Page page = doc.getPages().get_Item(1);
    page.getPageInfo().setMargin(new MarginInfo(0,0,0,0));

    FileOutputStream stream = getFileOutputStream("out-" + inputFileName + "-xps");
    Resolution resolution = new Resolution(dpi);
    PngDevice pngDevice = new PngDevice(resolution);

    BufferedImage bi = pngDevice.processToBufferedImage(page);

    TextFragmentAbsorber tfabs = new TextFragmentAbsorber();

    page.accept(tfabs);

    tfabs.getTextFragments().iterator().forEachRemaining(tf -> {
        if (tf.getText().trim().length() > 0)
            drawRect(tf.getRectangle(), bi, Color.RED);
    });

    ImageIO.write(bi, "PNG", stream);
    // Close the stream
    stream.close();
}

asad.ali · December 26, 2018, 6:10pm

@akorznikov

Thanks for elaborating further and providing more details.

We were able to observe the issue in our environment and have logged an investigation ticket, PDFJAVA-38251, in our issue tracking system. We will look further into the details of this API behavior and keep you posted on the status of the ticket. Please be patient and allow us some time.

We are sorry for the inconvenience.