Get Coordinates of Text, Tables, Pictures in Word Document & Convert DOC/DOCX to PDF using Java

beralex · January 9, 2020, 11:08am

Good afternoon!
We ran into a problem while working with Aspose.Words for Java (DOCX and DOC documents).
We need to measure the distance from the text (table, image, or other element) to the header and footer (see Fig. “Distance between text (table, figure) and header (footer).png”):

The distance from the upper border of the text to the lower border of the header (there is always a table with visible borders in the header)
The distance from the lower border of the text to the upper border of the footer (there is always a table with visible borders in the footer)
Because we could not find a way to measure this distance with the standard Aspose.Words for Java functionality, we did the following:
1. Convert a DOCX or DOC file to PDF
2. Measure the distance pixel by pixel in the resulting file
But this implementation ran into problems:
1. For example, the file “Example.docx” was taken.
2. The file “Example.doc” (from point 1) was converted to PDF. The conversion result (Example_1.pdf) is the same as the source file.
3. The file “Example.doc” (from point 1) was converted to PDF again. The conversion result (Example_2.pdf) does not match the source file: on pages 8-11, the header and footer are missing.

Request:

Is there a way to measure the distance from the text (table, image, or other element) to the header and footer using Aspose.Words for Java? Perhaps there is a way to access the MS Word ruler?
What could cause the headers and footers to disappear when converting DOC / DOCX to PDF, and how can it be solved?

All files are attached in the archive Example_document.zip (518.9 KB).

awais.hafeez · January 10, 2020, 5:33am

@beralex,

We are working on your query and will get back to you soon.

awais.hafeez · January 10, 2020, 7:22am

@beralex,

Please check:

For example, the following code prints the bounding rectangle, as [(left, top)] and [(width, height)], of every Shape (image) in the body of the first section of a Word document:

Document doc = new Document("E:\\Temp\\in.docx");

LayoutCollector collector = new LayoutCollector(doc);
LayoutEnumerator enumerator = new LayoutEnumerator(doc);

for (Shape shape : (Iterable<Shape>) doc.getFirstSection().getBody().getChildNodes(NodeType.SHAPE, true)) {
    enumerator.setCurrent(collector.getEntity(shape));

    String left = String.format("%.2f", enumerator.getRectangle().getX());
    String top = String.format("%.2f", enumerator.getRectangle().getY());
    String width = String.format("%.2f", enumerator.getRectangle().getWidth());
    String height = String.format("%.2f", enumerator.getRectangle().getHeight());

    System.out.print("[(x, y) = (" + left + ", " + top + ")]");
    System.out.println(" AND [(width, height) = (" + width + ", " + height + ")]");
}

You can use the same logic to calculate coordinates of any node in Word document.
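The values printed by the snippet above are layout coordinates. To our knowledge, Aspose.Words reports them in points (1/72 inch), measured from the top-left corner of the page. If the rules being checked are stated in millimetres or pixels, a small conversion helper is enough. The sketch below uses only the JDK; the class name `PointUnits` and its methods are hypothetical helpers for illustration, not part of the Aspose.Words API.

```java
// Hypothetical helper (not Aspose API): converts layout values given in points.
final class PointUnits {
    private static final double POINTS_PER_INCH = 72.0;
    private static final double MM_PER_INCH = 25.4;

    private PointUnits() {
    }

    /** Converts a length in points to millimetres. */
    static double toMillimeters(double points) {
        return points / POINTS_PER_INCH * MM_PER_INCH;
    }

    /** Converts a length in points to pixels at the given resolution (dots per inch). */
    static double toPixels(double points, double dpi) {
        return points / POINTS_PER_INCH * dpi;
    }

    public static void main(String[] args) {
        // Sample values only; real input would be the x, y, width or height printed above.
        System.out.println("72 pt in mm: " + toMillimeters(72.0));
        System.out.println("36 pt in px at 96 dpi: " + toPixels(36.0, 96.0));
    }
}
```

Keeping the comparison in points and converting only for display avoids rounding differences when a measured distance is checked against a threshold.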

Secondly, in an initial test with the latest licensed version of Aspose.Words for Java, i.e. 20.1, we were unable to reproduce this issue (the one visible on pages 8-11 of “Example_2.pdf”) on our end. We used the following simple code to produce “awjava-20.1.pdf”:

awjava-20.1.pdf (210.2 KB)

Java Code:

Document doc = new Document("E:\\Temp\\example_document\\Example.doc");
doc.save("E:\\temp\\example_document\\awjava-20.1.pdf");

So, please upgrade to the latest version, i.e. 20.1. Hope this helps.

beralex · January 10, 2020, 9:44am

Good afternoon!
Thanks, updating to the new version helped.

Question about obtaining coordinates in MS Word: is it possible to get the coordinates of the elements of the first line on each page? If so, can you give an example?

We tried using the code specified at LayoutEntity.java

We managed to get the lines, but a new problem arose: “null” strings began to appear in the text of the lines, although no extra tags or objects are visible in the document. When the text is taken from a paragraph (rather than a line), the nulls do not appear.

Example:
Source document: ExampleDoc.docx
The result of getting the lines on page 4 of the original document (note the first 2 lines): RenderedDocumentResultPage4.txt

Can you tell me, please, what could be the problem? Files are attached in the archive Example.zip (81.2 KB)

awais.hafeez · January 10, 2020, 2:28pm

@beralex,

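Please see the following code, which prints the (x, y) position of the first character on every page. Note that it modifies the in-memory document: it unlinks fields, splits every run into single-character runs, and inserts bookmarks. Hope this helps in achieving what you are looking for: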

Document doc = new Document("E:\\Temp\\Example\\ExampleDoc.docx");

for (Field field : doc.getRange().getFields())
    field.unlink();

Node[] runs = doc.getChildNodes(NodeType.RUN, true).toArray();
for (int i = 0; i < runs.length; i++)
{
    Run run = (Run)runs[i];
    int length = run.getText().length();

    Run currentNode = run;
    for (int x = 1; x < length; x++)
    {
        currentNode = SplitRun(currentNode, 1);
    }
}

NodeCollection smallRuns = doc.getChildNodes(NodeType.RUN, true);
LayoutCollector collector = new LayoutCollector(doc);

ArrayList<Run> list = new ArrayList<>(); // first run found on each page, in page order
int pageIndex = 1;
for (int i=0; i< smallRuns.getCount() ; i++) {
    Run run = (Run) smallRuns.get(i);
    if (/*!run.getText().trim().equals("") && */collector.getStartPageIndex(run) == pageIndex ) {
        list.add(run);
        pageIndex++;
    }
}

DocumentBuilder builder = new DocumentBuilder(doc);
for (int i =0 ; i< list.size(); i++) {
    Run run = (Run) list.get(i);
    builder.moveTo(run);
    builder.startBookmark("bm_" + i);
    BookmarkEnd end = builder.endBookmark("bm_" + i);
    run.getParentNode().insertAfter(end, run);
}

doc.updatePageLayout();

collector = new LayoutCollector(doc);
LayoutEnumerator enumerator =  new LayoutEnumerator(doc);

for (Bookmark bm : doc.getRange().getBookmarks()) {
    if (bm.getName().startsWith("bm_")) {
        enumerator.setCurrent(collector.getEntity(bm.getBookmarkStart()));

        String left = String.format("%.2f", enumerator.getRectangle().getX());
        String top = String.format("%.2f", enumerator.getRectangle().getY());

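        // Bookmark bm_0 marks the first run on page 1, so add 1 to get the page number.
        int pageNumber = Integer.parseInt(bm.getName().substring("bm_".length())) + 1;
        System.out.println("First character '" + bm.getText() + "' on page " + pageNumber + " has (x, y) = (" + left + ", " + top + ")");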
    }
}

private static Run SplitRun(Run run, int position) throws Exception {
    Run afterRun = (Run) run.deepClone(true);
    afterRun.setText(run.getText().substring(position));
    run.setText(run.getText().substring(0, position));
    run.getParentNode().insertAfter(afterRun, run);
    return afterRun;
}

beralex · January 13, 2020, 4:51pm

@awais.hafeez,
Good afternoon!
Unfortunately, this option is not suitable for us, because it does not always handle all special cases correctly (for example, when a paragraph is split across several pages).

We are satisfied with the approach of finding the first / last line on the page and working with the resulting text. But we do not know how to solve the following problems:

“null” appearing in lines where there should be none.
Getting the coordinates of all characters of the found line

See quote

In addition, please note that we cannot modify the document using Aspose.Words. The document should only be checked for compliance with the rules. The original structure should remain unchanged.

Please help us solve these problems.

awais.hafeez · January 14, 2020, 5:07am

@beralex,

When you convert a Word document to PDF format, for example by using the following two lines of code, Aspose.Words should on its own preserve all elements of the Word document, their position/layout, their formatting etc. in the generated PDF.

Document doc = new Document(dataDir + "input.doc");
doc.save(dataDir + "output.pdf");

You do not need to write any additional code to calculate/determine the coordinates of different document elements yourself. If you find any misplacement or content overlapping in the generated PDF, that may well be because of a bug in Aspose.Words’ API which needs to be fixed.

Generally, Aspose.Words mimics the behavior of MS Word, i.e. if you convert your Word documents (DOC, DOCX files etc.) to PDF format by using Aspose.Words, the output will look similar to what MS Word produces. We strive to ensure that all conversions are performed with high fidelity, exactly as Microsoft Word® would have done it. If you still find any issues during conversion, please feel free to report them in this forum and we will fix them in Aspose.Words’ API. Hope this helps.

beralex · January 14, 2020, 9:27am

@awais.hafeez,
Good afternoon!

Converting to PDF does not interest us. What matters to us is obtaining information from a file in the Word format in order to solve the tasks described above. Converting to PDF will not help us in this case:

It takes too much run time
It does not provide the functionality that we need

Once again: while solving our tasks, we ran into the problems described in the quote:

beralex:

Unfortunately, this option is not suitable for us, because it does not always handle all special cases correctly (for example, when a paragraph is split across several pages).

We are satisfied with the approach of finding the first / last line on the page and working with the resulting text. But we do not know how to solve the following problems:

“null” appearing in lines where there should be none.

Getting the coordinates of all characters of the found line

See quote

beralex:

We tried using the code specified at https://github.com/aspose-words/Aspose.Words-for-Java.
We managed to get the lines, but a new problem arose: “null” strings began to appear in the text of the lines, although no extra tags or objects are visible in the document. When the text is taken from a paragraph (rather than a line), the nulls do not appear.

Example:
Source document: ExampleDoc.docx
The result of getting the lines on page 4 of the original document (note the first 2 lines): RenderedDocumentResultPage4.txt

Can you tell me, please, what could be the problem? Files are attached in the archive Example.zip (81.2 KB)

In addition, please note that we cannot modify the document using Aspose.Words. The document should only be checked for compliance with the rules. The original structure should remain unchanged.

Can you help us with these tasks, directly in the form in which the problem is described? Please help us solve these problems.

awais.hafeez · January 15, 2020, 4:38am

@beralex,

We are checking this scenario and will get back to you soon.

beralex · January 16, 2020, 8:16am

@awais.hafeez,

Good afternoon!

Is there any understanding yet of what the problem is and how to solve it? We are eagerly waiting for an answer, because the development of our project has stopped on this task.

awais.hafeez · January 17, 2020, 3:04am

@beralex,

Please spare us some time. We are checking these scenarios and will get back to you with our findings soon.

awais.hafeez · January 24, 2020, 10:14am

@beralex,

Thanks for being patient. We have logged the following tickets in our issue tracking system and linked them to your thread, so you will be notified as soon as work on these tickets is completed.

WORDSNET-19858: Code to measure the distance between body text and the Header/Footer
WORDSNET-19860: Code to get page coordinates of every character on a Line
WORDSJAVA-2297: To fix RenderedDocument example producing unwanted content and ‘NULLs’

We will keep you posted on any further updates.

beralex · February 24, 2020, 6:40pm

@awais.hafeez
Good afternoon!
Is there any information on the above issues? It seems that the tasks are closed:
WORDSNET-19858 ---- Status : Closed
WORDSNET-19860 ---- Status : Closed
Can you tell me the results of these tasks?

awais.hafeez · February 25, 2020, 6:00am

@beralex,

It might be possible to find the distance from the last inline content in the text column to the first content line in the footer. You may use Aspose.Words’ Layout APIs, i.e. LayoutEnumerator + LayoutCollector, to achieve this on your end.
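The arithmetic behind this suggestion can be sketched independently of the library. The sketch assumes that the bounding rectangles of the body content and of the header and footer tables have already been obtained, for instance from LayoutEnumerator as in the earlier shape example, and that they are available as `java.awt.geom.Rectangle2D` values with y growing downward from the top of the page (an assumption suggested by the getX()/getY()/getWidth()/getHeight() calls above). The class `HeaderFooterGap` and its methods are hypothetical names, and the numbers in `main` are made-up sample values.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper (not Aspose API): vertical gaps between page regions.
// Rectangles use page coordinates with y growing downward.
final class HeaderFooterGap {
    private HeaderFooterGap() {
    }

    /** Distance from the lower border of the header to the upper border of the content. */
    static double gapBelowHeader(Rectangle2D header, Rectangle2D content) {
        return content.getMinY() - header.getMaxY();
    }

    /** Distance from the lower border of the content to the upper border of the footer. */
    static double gapAboveFooter(Rectangle2D content, Rectangle2D footer) {
        return footer.getMinY() - content.getMaxY();
    }

    public static void main(String[] args) {
        // Made-up rectangles (x, y, width, height) for one page.
        Rectangle2D header = new Rectangle2D.Double(50, 20, 500, 40);
        Rectangle2D content = new Rectangle2D.Double(50, 72, 500, 600);
        Rectangle2D footer = new Rectangle2D.Double(50, 700, 500, 40);

        System.out.println("Gap below header: " + gapBelowHeader(header, content));
        System.out.println("Gap above footer: " + gapAboveFooter(content, footer));
    }
}
```

A negative result means the two rectangles overlap vertically, which is itself a useful signal when a document is only being checked against layout rules and not modified.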

Secondly, the Layout model of Aspose.Words does not record character positions, and the layout does not have characters as such but glyphs. So, the requested functionality to ‘get page coordinates of every character on a Line’ is not available.

So, regarding WORDSNET-19858 and WORDSNET-19860, we have completed our work on these issues and decided to close them with “Won’t fix” status. I am afraid we will not be able to implement this functionality in Aspose.Words’ API. We apologize for the inconvenience.

aspose.notifier · August 15, 2022, 5:06am

The issues you found earlier (filed as WORDSJAVA-2297) have been fixed in the Aspose.Words for Java 22.8 update, which is also available on Maven.