Free Support Forum - aspose.com

Get Coordinates of Text Tables Pictures in Word Document & Convert DOC DOCX to PDF using Java

Good afternoon!
We ran into a problem while working with Aspose.Words for Java (DOCX and DOC documents).
We need to measure the distance from the text (table, image, or other element) to the header and footer (see the figure “Distance between text (table, figure) and header (footer).png”):

  1. The distance from the upper border of the text to the lower border of the header (the header always contains a table with visible borders)
  2. The distance from the lower border of the text to the upper border of the footer (the footer always contains a table with visible borders)

Because we could not find a way to measure this distance with the standard Aspose.Words for Java functionality, we did the following:

  1. Convert the DOCX or DOC file to PDF
  2. Measure the distance pixel by pixel in the resulting file

But this implementation ran into problems:

  1. For example, we took the file “Example.docx”.
  2. The file “Example.doc” (from step 1) was converted to PDF. The conversion result (Example_1.pdf) matches the source file.
  3. The file “Example.doc” (from step 1) was converted to PDF again. The conversion result (Example_2.pdf) does not match the source file: on pages 8-11, the header and footer are missing.
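
If the PDF route is kept, a distance measured in pixels only becomes comparable with Word's measurements after it is converted to physical units. A minimal sketch of that arithmetic, assuming the PDF page was rasterized at a known resolution; the class `PixelUnits` and its method names are hypothetical and not part of Aspose.Words:

```java
/** Hypothetical helper: converts a pixel distance on a rasterized page to points and millimetres. */
final class PixelUnits {
    static final double POINTS_PER_INCH = 72.0;
    static final double MM_PER_INCH = 25.4;

    /** Converts pixels at the given render resolution (dots per inch) to typographic points. */
    static double pixelsToPoints(double pixels, double dpi) {
        if (dpi <= 0) {
            throw new IllegalArgumentException("dpi must be positive");
        }
        return pixels * POINTS_PER_INCH / dpi;
    }

    /** Converts typographic points to millimetres. */
    static double pointsToMillimetres(double points) {
        return points * MM_PER_INCH / POINTS_PER_INCH;
    }

    public static void main(String[] args) {
        // Assumed example: a gap of 96 pixels on a page rendered at 96 DPI.
        double points = pixelsToPoints(96, 96);
        System.out.printf("%.2f pt = %.2f mm%n", points, pointsToMillimetres(points));
    }
}
```

The result depends entirely on the resolution used for rasterization, so that value has to be taken from the rendering step rather than guessed.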

Request:

  1. Is there a way to measure the distance from the text (table, image, element) to the header and footer using Aspose.Words for Java? Perhaps there is a way to access the MS Word Ruler tool?
  2. What could cause the headers and footers to disappear when converting DOC / DOCX to PDF, and how can we solve it?

All files are attached in the archive Example_document.zip (518.9 KB).

@beralex,

We are working on your query and will get back to you soon.

@beralex,

Please check:

For example, the following code prints the bounding rectangle, as [(left, top)] and [(width, height)] in points, of every Shape (image) in the body of the first section of a Word document:

import com.aspose.words.*;

Document doc = new Document("E:\\Temp\\in.docx");

// LayoutCollector maps document nodes to layout entities;
// LayoutEnumerator reads the geometry of a layout entity.
LayoutCollector collector = new LayoutCollector(doc);
LayoutEnumerator enumerator = new LayoutEnumerator(doc);

// Visit every Shape in the body of the first section, including nested ones.
for (Shape shape : (Iterable<Shape>) doc.getFirstSection().getBody().getChildNodes(NodeType.SHAPE, true)) {
    // Position the enumerator on the layout entity that corresponds to this shape.
    enumerator.setCurrent(collector.getEntity(shape));

    String left = String.format("%.2f", enumerator.getRectangle().getX());
    String top = String.format("%.2f", enumerator.getRectangle().getY());
    String width = String.format("%.2f", enumerator.getRectangle().getWidth());
    String height = String.format("%.2f", enumerator.getRectangle().getHeight());

    System.out.print("[(x, y) = (" + left + ", " + top + ")]");
    System.out.println(" AND [(width, height) = (" + width + ", " + height + ")]");
}

You can use the same logic to calculate the coordinates of any node in a Word document.
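
Once bounding rectangles are available, the two distances from the original question reduce to rectangle arithmetic. The following is only a sketch under stated assumptions: all rectangles are on the same page and in the same coordinate system, with the origin at the top-left corner of the page, Y growing downward, and values in points. The class `HeaderFooterGaps`, its method names, and the sample rectangles are hypothetical; how the header and footer rectangles are obtained is not shown here.

```java
import java.awt.geom.Rectangle2D;

/** Hypothetical helper: vertical gaps between a body element and the header/footer on one page. */
final class HeaderFooterGaps {
    /** Distance from the bottom edge of the header to the top edge of the content. */
    static double gapBelowHeader(Rectangle2D header, Rectangle2D content) {
        return content.getMinY() - header.getMaxY();
    }

    /** Distance from the bottom edge of the content to the top edge of the footer. */
    static double gapAboveFooter(Rectangle2D content, Rectangle2D footer) {
        return footer.getMinY() - content.getMaxY();
    }

    public static void main(String[] args) {
        // Made-up rectangles (x, y, width, height) standing in for layout results.
        Rectangle2D header = new Rectangle2D.Double(50, 30, 500, 40);
        Rectangle2D content = new Rectangle2D.Double(70, 100, 460, 500);
        Rectangle2D footer = new Rectangle2D.Double(50, 720, 500, 40);

        System.out.println("Gap below header: " + gapBelowHeader(header, content));
        System.out.println("Gap above footer: " + gapAboveFooter(content, footer));
    }
}
```

A negative result would indicate that the content rectangle overlaps the header or footer rectangle.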

Secondly, in an initial test with the latest licensed version of Aspose.Words for Java (20.1), we were unable to reproduce this issue (the missing headers and footers on pages 8-11 of “Example_2.pdf”) on our end. We used the following simple code to produce “awjava-20.1.pdf”:

Java Code:

Document doc = new Document("E:\\Temp\\example_document\\Example.doc");
doc.save("E:\\temp\\example_document\\awjava-20.1.pdf");

So, please upgrade to the latest version (20.1). We hope this helps.

Good afternoon!
Thanks, updating to the new version helped.

Question about obtaining coordinates in MS Word: is it possible to get the coordinates of the elements of the first line on each page? If so, can you give an example?

We tried using the code at https://github.com/aspose-words/Aspose.Words-for-Java/blob/master/Examples/src/main/java/com/aspose/words/examples/rendering_printing/LayoutEntity.java.
We managed to get the lines, but a new problem arose: “null” began to appear in the text of the lines, although no extra tags or objects are visible in the document. When we get the text from a paragraph (rather than a line), the “null” values do not appear.

Example:
Source document: ExampleDoc.docx
The result of getting the lines on page 4 of the source document (note the first 2 lines): RenderedDocumentResultPage4.txt

Can you tell me, please, what could be the problem? Files are attached in the archive Example.zip (81.2 KB)
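
One general Java behaviour produces exactly this symptom: appending a `null` reference to a `StringBuilder` (or concatenating it with `+`) yields the literal text “null”. Whether that is the cause here is an assumption. It would apply if the line text is assembled from per-span strings and some spans with no visible text report `null`. A minimal sketch with plain Java strings standing in for span texts; the class `LineText` and both method names are hypothetical:

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: two ways of joining span texts into the text of a line. */
final class LineText {
    /** Naive join: a null element ends up as the four characters "null". */
    static String joinNaive(List<String> spanTexts) {
        StringBuilder sb = new StringBuilder();
        for (String text : spanTexts) {
            sb.append(text);
        }
        return sb.toString();
    }

    /** Null-safe join: spans that report no text are skipped. */
    static String joinSpanTexts(List<String> spanTexts) {
        StringBuilder sb = new StringBuilder();
        for (String text : spanTexts) {
            if (text != null) {
                sb.append(text);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Assumed input: two spans without text surround the visible text of a line.
        List<String> spans = Arrays.asList(null, "First ", null, "line");
        System.out.println(joinNaive(spans));     // prints "nullFirst nullline"
        System.out.println(joinSpanTexts(spans)); // prints "First line"
    }
}
```

Neither variant modifies the document; only the way the text is collected changes.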

@beralex,

Please see the following code, which prints the (x, y) position of the first character on every page. We hope this helps you achieve what you are looking for:

Document doc = new Document("E:\\Temp\\Example\\ExampleDoc.docx");

for (Field field : doc.getRange().getFields())
    field.unlink();

// Split every run into single-character runs so that each character
// becomes a separate node that can be mapped to a page.
Node[] runs = doc.getChildNodes(NodeType.RUN, true).toArray();
for (int i = 0; i < runs.length; i++)
{
    Run run = (Run)runs[i];
    int length = run.getText().length();

    Run currentNode = run;
    for (int x = 1; x < length; x++)
    {
        currentNode = SplitRun(currentNode, 1);
    }
}

NodeCollection smallRuns = doc.getChildNodes(NodeType.RUN, true);
LayoutCollector collector = new LayoutCollector(doc);

// Collect the first run that starts on each page (page indexes are 1-based).
ArrayList<Run> list = new ArrayList<>();
int pageIndex = 1;
for (int i = 0; i < smallRuns.getCount(); i++) {
    Run run = (Run) smallRuns.get(i);
    if (/*!run.getText().trim().equals("") && */collector.getStartPageIndex(run) == pageIndex) {
        list.add(run);
        pageIndex++;
    }
}

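// Wrap each collected run in a bookmark: the bookmark start is inserted before
// the run and the bookmark end is then moved to just after it.
DocumentBuilder builder = new DocumentBuilder(doc);
for (int i = 0; i < list.size(); i++) {
    Run run = (Run) list.get(i);
    builder.moveTo(run);
    builder.startBookmark("bm_" + i);
    BookmarkEnd end = builder.endBookmark("bm_" + i);
    run.getParentNode().insertAfter(end, run);
}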

doc.updatePageLayout();

collector = new LayoutCollector(doc);
LayoutEnumerator enumerator =  new LayoutEnumerator(doc);

// Report the position of each marker bookmark inserted above.
for (Bookmark bm : doc.getRange().getBookmarks()) {
    if (bm.getName().startsWith("bm_")) {
        enumerator.setCurrent(collector.getEntity(bm.getBookmarkStart()));

        String left = String.format("%.2f", enumerator.getRectangle().getX());
        String top = String.format("%.2f", enumerator.getRectangle().getY());

        // Bookmark suffixes are zero-based list indexes; list entry i belongs to page i + 1.
        int page = Integer.parseInt(bm.getName().substring("bm_".length())) + 1;
        System.out.println("First character '" + bm.getText() + "' on page " + page + " has (x, y) = (" + left + ", " + top + ")");
    }
}

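// Helper (declare it as a method of the enclosing class): splits a run at the given
// character position and returns the new run that holds the text after the split.
private static Run SplitRun(Run run, int position) throws Exception {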
    Run afterRun = (Run) run.deepClone(true);
    afterRun.setText(run.getText().substring(position));
    run.setText(run.getText().substring(0, position));
    run.getParentNode().insertAfter(afterRun, run);
    return afterRun;
}
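
The page-by-page selection in the code above advances its page counter only when a run starts on the expected page, so a page on which no run starts would stall the counter and every later page would be skipped. A sketch of a variant that tolerates such gaps, using plain integers in place of the values returned by `collector.getStartPageIndex(run)`. The class `FirstPerPage` and the method `firstIndexPerPage` are hypothetical, and treating 0 as "not mapped to a page" is an assumption:

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper: finds the first item that starts on each page. */
final class FirstPerPage {
    /**
     * Maps each page number to the index of the first item that starts on it.
     * startPages[i] is the 1-based start page of item i; values below 1 are
     * treated as "not mapped to a page" (an assumption) and ignored.
     */
    static Map<Integer, Integer> firstIndexPerPage(int[] startPages) {
        Map<Integer, Integer> first = new TreeMap<>();
        for (int i = 0; i < startPages.length; i++) {
            if (startPages[i] > 0) {
                // Keep only the earliest item seen for each page.
                first.putIfAbsent(startPages[i], i);
            }
        }
        return first;
    }

    public static void main(String[] args) {
        // No item starts on page 3; a strictly incrementing counter would stop at page 2.
        int[] startPages = {1, 1, 2, 2, 4, 4};
        System.out.println(firstIndexPerPage(startPages)); // prints {1=0, 2=2, 4=4}
    }
}
```

Because the map is keyed by the actual page number, the page can be reported directly instead of being derived from a list position.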

@awais.hafeez,
Good afternoon!
Unfortunately, this option is not suitable for us, because it does not handle all special cases correctly (for example, when a paragraph is split across several pages).

We are happy with the approach of finding the first / last line on the page and working with the resulting text. But we do not know how to solve the following problems:

  1. “null” appears in the text of lines where nothing corresponding to it is present in the document.
  2. Getting the coordinates of all characters of the found line.

See quote

In addition, please note that we cannot modify the document using Aspose.Words. The document should only be checked for compliance with the rules. The original structure must remain unchanged.

Please help us solve these problems.

@beralex,

When you convert a Word document to PDF, for example with the following two lines of code, Aspose.Words should by itself preserve all elements of the Word document, along with their position, layout, and formatting, in the generated PDF.

Document doc = new Document(dataDir + "input.doc");
doc.save(dataDir + "output.pdf");

You do not need to write any additional code to calculate the coordinates of document elements yourself. If you find any misplaced or overlapping content in the generated PDF, that may well be due to a bug in the Aspose.Words API that needs to be fixed.

Generally, Aspose.Words mimics the behavior of MS Word: if you convert your Word documents (DOC, DOCX, etc.) to PDF using Aspose.Words, the output will look similar to what MS Word produces. We strive to perform all conversions with high fidelity, exactly as Microsoft Word® would. If you still find any issues during conversion, please feel free to report them in this forum and we will fix them in the Aspose.Words API. We hope this helps.

@awais.hafeez,
Good afternoon!

Converting to PDF does not interest us. We need to obtain information from a file in Word format in order to solve the tasks described above. Converting to PDF will not help us here:

  1. The code takes too long to run
  2. It does not provide the functionality we need

Again: while working on these tasks, we ran into the problems described in the quote:

Can you help us solve them, exactly as the problems are described? Please help us solve these problems.

@beralex,

We are checking this scenario and will get back to you soon.

@awais.hafeez,

Good afternoon!

Is there any understanding yet of what the problem is and how to solve it? We are eagerly awaiting an answer, because development of this part of our project has stopped until the task is resolved.

@beralex,

Please allow us some time. We are checking these scenarios and will get back to you with our findings soon.