Offset information in Docx

Kusumanchi.Rajesh · August 18, 2015, 2:17am

Hi,

I have a requirement where I need the offset location of a word in a docx file, using Java.

Suppose the word “Aspose.Words” is repeated in a paragraph, as below:

Aspose.Words 15.7.0 has been released. This month’s release contains over 115 useful new features, enhancements and bug fixes to the Aspose.Words products. You can download the latest releases of Aspose.Words from the following links: Aspose.Words for .NET 15.7.0 Aspose.Words

I need the offset of each occurrence of “Aspose.Words”, traversing the document line by line, including the header.

Is there any way to do this in Aspose.Words?

I have attached a docx file with a header and body.

Thanks
Rajesh

tahir.manzoor · August 18, 2015, 6:40am

Hi Rajesh,

Thanks for your inquiry. In your case, I suggest you check the attached “documentLayoutHelper” example project. This sample demonstrates how to work with the layout elements of a document and access the pages, lines, spans, etc.

Please use the following code example to get the text of each line on a page. Once you have the text of each line, you can search it for the desired text. Hope this helps you. Please let us know if you have any more queries.

Document doc = new Document(MyDir + "Offset.docx");
// Create a new RenderedDocument class from a Document object.
RenderedDocument layoutDoc = new RenderedDocument(doc);
String text = "Aspose.Words";
// Loop through each page and print every line that contains the search text.
for (RenderedPage page : layoutDoc.getPages())
{
    System.out.println("Page number : " + page.getPageIndex());
    LayoutCollection lines = page.GetChildEntities(LayoutEntityType.LINE, true);
    // A raw Iterable yields Object, so the cast needs the element type.
    for (LayoutEntity line : (Iterable<LayoutEntity>) lines)
    {
        if (line.getText().trim().contains(text))
        {
            System.out.println("Line text : " + line.getText());
            System.out.println("Offset information of text : " + line.getText().indexOf(text));
        }
    }
}
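
Note that indexOf returns only the first match in a line's text. A minimal sketch of collecting every match in one line, written in plain Java with no Aspose.Words calls; the class name LineOffsets and the method findAll are hypothetical names used only for illustration, and the line text is assumed to come from line.getText() as in the code above:

```java
import java.util.ArrayList;
import java.util.List;

class LineOffsets {
    /** Returns the start index of every non-overlapping occurrence of term in lineText. */
    static List<Integer> findAll(String lineText, String term) {
        List<Integer> starts = new ArrayList<>();
        if (term.isEmpty()) {
            return starts; // an empty term would otherwise match at every position
        }
        int index = lineText.indexOf(term);
        while (index >= 0) {
            starts.add(index);
            // Continue searching just past the match that was found.
            index = lineText.indexOf(term, index + term.length());
        }
        return starts;
    }

    public static void main(String[] args) {
        String term = "Aspose.Words";
        String line = "Aspose.Words 15.7.0 has been released for Aspose.Words users.";
        for (int start : findAll(line, term)) {
            // The end offset is exclusive: it points just past the last character of the match.
            System.out.println("start=" + start + " end=" + (start + term.length()));
        }
    }
}
```

The offsets here are relative to the start of the line, in the same way as the indexOf call above.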

Kusumanchi.Rajesh · August 18, 2015, 9:02am

Hi Manzoor,

I am facing issues with the Java code you shared.

Aspose.Words has no RenderedDocument API, and the code fails with the error “RenderedDocument cannot be resolved to a type”.

We are using the latest version of the Aspose.Words jars, 15.7.0.

Is this API part of Aspose.Words or some other API?

Thanks,
Rajesh

tahir.manzoor · August 18, 2015, 12:42pm

Hi Rajesh,

Thanks for your inquiry. Please check the “documentLayoutHelper.zip” from my previous post. This zip file contains RenderedDocument.java and LayoutEntity.java. Please include these files in your example project.

Kusumanchi.Rajesh · August 19, 2015, 5:58am

Hi Tahir,

Thanks for sharing the valuable information, but I have one concern: what if the input string is split across two lines (i.e. the first half of the string is at the end of one line and the second half is at the start of the next)? For example, see the paragraph below (here we need the offsets for “Abc. II, & 2, de.111”):
Florida’s Abc. II, & 2, de.111 electors in the Aspose vote Aspose words Aspose words Aspose words Aspose words of the Abc. II, & 2, de.111 Amici
have a Words right the Art. Florida’s electors in the Aspose.Words vote of the the
and electoral Abc. II, & 2,
de.111 and the Latest version is 15.7.0

The expected output is (not the exact counts):
Page number : 1
1st Occurrence: Start offset of text : 10 End offset of text : 30
2nd Occurrence: Start offset of text : 66 End offset of text : 86
3rd Occurrence: Start offset of text : 191 End offset of text : 201

If possible, could you also share the line number?
For reference, I have attached the input .docx file.
Thanks,
Rajesh

tahir.manzoor · August 20, 2015, 6:02am

Hi Rajesh,

Thanks for your inquiry. I have modified the code according to your requirements. Please check the following code example. Hope this helps you.

Please note that with the documentLayoutHelper utility, you can read the text of each line of the page body and also of the header/footer. Once you have the text of each line, you just need to write Java code according to your requirements. Please let us know if you have any more queries.

Document doc = new Document(dataDir + "inputfile+(2).docx");
// Create a new RenderedDocument class from a Document object.
RenderedDocument layoutDoc = new RenderedDocument(doc);
String text = "Abc. II, & 2, de.111";
// Loop through each page and search every pair of adjacent lines for the text.
for (RenderedPage page : layoutDoc.getPages())
{
    System.out.println("Page number : " + page.getPageIndex());
    LayoutCollection lines = page.GetChildEntities(LayoutEntityType.LINE, true);
    String linetext = "";
    String nextlinetext = "";
    for (int i = 0; i < lines.getCount() - 1; i++)
    {
        LayoutEntity line = lines.getItem(i);
        LayoutEntity nextline = lines.getItem(i + 1);
        linetext = line.getText().replace("null", "");
        nextlinetext = nextline.getText().replace("null", "");
        String bothlines = linetext + nextlinetext;
        if (bothlines.contains(text))
        {
            int index = bothlines.indexOf(text);
            while (index >= 0)
            {
                System.out.print(index);
                System.out.println(" -> " + linetext);
                if (index > linetext.length())
                    index = bothlines.indexOf(text, index + 1);
                else
                    index = linetext.indexOf(text, index + 1);
            }
            System.out.println("---------------------------------");
        }
    }
}
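
With the pairwise check above, a match that lies entirely in the second line of a pair can be printed again on the next iteration. One alternative, sketched here in plain Java with no Aspose.Words calls, is to join all line texts of a page into one string, remember where each line starts, and map each match back to its line. The names PageTextSearch, Match and find are hypothetical, and the sketch assumes the line texts (for example from line.getText()) concatenate without any extra separator:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PageTextSearch {
    /** A match: 1-based line where it starts, plus start/end offsets in the page text (end exclusive). */
    static final class Match {
        final int line;
        final int start;
        final int end;

        Match(int line, int start, int end) {
            this.line = line;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "line " + line + ": start=" + start + " end=" + end;
        }
    }

    /** Finds every non-overlapping occurrence of term, including ones that span a line boundary. */
    static List<Match> find(List<String> lines, String term) {
        List<Match> matches = new ArrayList<>();
        if (term.isEmpty()) {
            return matches;
        }
        StringBuilder page = new StringBuilder();
        int[] lineStart = new int[lines.size()];
        for (int i = 0; i < lines.size(); i++) {
            lineStart[i] = page.length(); // offset of line i within the joined page text
            page.append(lines.get(i));
        }
        int line = 0;
        int index = page.indexOf(term);
        while (index >= 0) {
            // Advance to the last line whose start offset is not beyond the match start.
            while (line + 1 < lineStart.length && lineStart[line + 1] <= index) {
                line++;
            }
            matches.add(new Match(line + 1, index, index + term.length()));
            index = page.indexOf(term, index + term.length());
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "and electoral Abc. II, & 2, ",
                "de.111 and the Latest version is 15.7.0");
        for (Match m : find(lines, "Abc. II, & 2, de.111")) {
            System.out.println(m);
        }
    }
}
```

Each occurrence is reported once, and the offsets are relative to the start of the page text rather than to a single line.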

tahir.manzoor · August 21, 2015, 3:08am

Hi Rajesh,

Thanks for your inquiry via live chat. The Aspose.Words.Layout namespace provides classes that let you find out on which page, and where on that page, particular document elements are positioned once the document is formatted into pages.

Please check the following code snippet to get the line number of a line. Note that with the documentLayoutHelper utility, you can get the text of each line of a page using the RenderedPage.GetChildEntities(LayoutEntityType.LINE, true) method. However, to work with a line’s text, you need to modify your Java code according to your requirements.

Please feel free to ask if you have any questions about Aspose.Words; we will be happy to help you.

Document doc = new Document(dataDir + "inputfile.docx");
// Create a new RenderedDocument class from a Document object.
RenderedDocument layoutDoc = new RenderedDocument(doc);
String text = "Abc. II, & 2, de.111";
// Loop through each page and report the line number and offsets of each match.
for (RenderedPage page : layoutDoc.getPages())
{
    System.out.println("Page number : " + page.getPageIndex());
    LayoutCollection lines = page.GetChildEntities(LayoutEntityType.LINE, true);
    int linenumber = 1;
    boolean isPagecomplete = false;
    String linetext = "";
    String nextlinetext = "";
    for (int i = 0; i < lines.getCount() - 1; i++)
    {
        LayoutEntity line = lines.getItem(i);
        LayoutEntity nextline = lines.getItem(i + 1);
        linetext = line.getText().replace("null", "");
        nextlinetext = nextline.getText().replace("null", "");
        String bothlines = linetext + nextlinetext;
        if (bothlines.contains(text))
        {
            if (linetext.contains("Section Break"))
                isPagecomplete = true;
            if (isPagecomplete == false)
                System.out.println("line number : " + linenumber);
            int index = bothlines.indexOf(text);
            while (index >= 0)
            {
                System.out.print(index);
                System.out.println(" -> " + linetext);
                if (index > linetext.length())
                    index = bothlines.indexOf(text, index + 1);
                else
                    index = linetext.indexOf(text, index + 1);
            }
            System.out.println("---------------------------------");
        }
        if (isPagecomplete == false)
            linenumber++;
    }
}

Kusumanchi.Rajesh · September 7, 2015, 8:26am

Hi Tahir,

This is the snippet I wrote for fetching footnotes:

for (Paragraph para : (Iterable<Paragraph>) doc.getChildNodes(NodeType.PARAGRAPH, true)) {
    // getAncestor returns null when the paragraph is not inside a footnote.
    if (para.getAncestor(NodeType.FOOTNOTE) != null) {
        // para belongs to a footnote
    }
}

I need to find the page number for each of these footnote paragraphs.
How can I do this?

tahir.manzoor · September 8, 2015, 6:40am

Hi Rajesh,

Thanks for your inquiry and for sharing the details via live chat.

In your case, I suggest you use the PageSplitter code to achieve your requirements. Please check the “PageSplitter” example project in the Aspose.Words for Java examples repository on GitHub.

Please use the following code example to get the footnote count on each page of a Word document. Hope this helps you.

Document doc = new Document(MyDir + "Sample_input.docx");
LayoutCollector layoutCollector = new LayoutCollector(doc);
doc.updatePageLayout();
DocumentPageSplitter splitter = new DocumentPageSplitter(layoutCollector);
for (int i = 1; i <= doc.getPageCount(); i++)
{
    Document pageDoc = splitter.getDocumentOfPage(i);
    System.out.println("Footnotes at Page " + i + " = " + pageDoc.getChildNodes(NodeType.FOOTNOTE, true).getCount());
}

Kusumanchi.Rajesh · September 8, 2015, 8:10am

Hi Tahir,
Thanks for sharing the code.

But it is not working as we discussed.

Current output is :
Footnotes at Page 1 = 10
Footnotes at Page 2 = 10
Footnotes at Page 3 = 10
Footnotes at Page 4 = 10

Expected output is (the output we discussed):
Footnotes at Page 1 = 2
Footnotes at Page 2 = 2
Footnotes at Page 3 = 3
Footnotes at Page 4 = 3

I have attached the document for the above outputs.

Thanks,
Rajesh

tahir.manzoor · September 8, 2015, 11:22am

Hi Rajesh,

Thanks for your inquiry. I have tested the scenario using the latest version of Aspose.Words for Java, 15.8.0, and could not reproduce the issue. Please use Aspose.Words for Java 15.8.0 and let us know if you have any more queries.

Kusumanchi.Rajesh · September 30, 2015, 3:32am

Hi Tahir,

I have used the following code to fetch the page information.

DocumentPageSplitter splitter = new DocumentPageSplitter(collector);
TxtSaveOptions options = new TxtSaveOptions();
options.setExportHeadersFooters(false);
for (int i = 1; i <= doc.getPageCount(); i++)
{
    Document pageDoc = splitter.getDocumentOfPage(i);
    for (Section section : pageDoc.getSections())
        section.getHeadersFooters().clear();
    System.out.println("Text of Page " + i + " = " + pageDoc.toString(options).replaceAll("\r", ""));
}

But I am not getting the desired output. The paragraph on the second page is treated as being on the first page.

The input file is attached.

Thanks,
Rajesh

tahir.manzoor · September 30, 2015, 7:13am

Hi Rajesh,

Thanks for your inquiry. I have tested the scenario and noticed the same issue. This issue is related to the page layout of the document. Could you please share the source document that was used to create this document?

Please note that Aspose.Words mimics the behavior of MS Word. If you convert the Word document to a fixed-page format, e.g. PDF or XPS, you will get the same output. MS Word also changes the page layout of the document when it is saved to a fixed-page format such as PDF. The output PDF file generated by MS Word has one page. Please check the attachments.

Kusumanchi.Rajesh · October 9, 2015, 1:59am

Hi Tahir,
The input document and the required output file are attached to the mail. Please provide suitable code for achieving this.

Regards,
Rajesh

tahir.manzoor · October 11, 2015, 1:06pm

Hi Rajesh,

Thanks for your inquiry. I suggest you read about the Aspose.Words Document Object Model here:
https://docs.aspose.com/words/java/aspose-words-document-object-model/

Please check the attached DOM image of your input document. HeaderFooter is a section-level node and can only be a child of Section. There can only be one HeaderFooter of each HeaderFooterType in a Section.

The contents of the header/footer (the Page and Numpages fields) are repeated on each page. However, they exist in the header/footer only once. The Page and Numpages fields are updated accordingly.

In your case, I suggest the following solution:

1) Copy the footer’s contents to an empty document.
2) Move the cursor to the start of the document.
3) Insert page breaks equal to the total number of pages of the input document (input.docx).
4) Use the following code example to get the text of the footer.

Hope this helps you. Please let us know if you have any more queries.

RenderedDocument layoutDoc = new RenderedDocument(doc);
for (RenderedPage page : layoutDoc.getPages())
{ 
    System.out.println("Page number : " + page.getPageIndex() + "---> " + page.getText());
}

Kusumanchi.Rajesh · October 14, 2015, 4:50am

Hi Tahir,

In this thread, you shared code to get the text of each page separately. It throws the following exception:

Exception in thread "main" java.lang.NullPointerException
at test.main.SectionSplitter.visitParagraphStart(SectionSplitter.java:85)
at com.aspose.words.Paragraph.zzZ(Unknown Source)
at com.aspose.words.CompositeNode.acceptCore(Unknown Source)
at com.aspose.words.Paragraph.accept(Unknown Source)
at com.aspose.words.CompositeNode.acceptChildren(Unknown Source)
at com.aspose.words.CompositeNode.acceptCore(Unknown Source)
at com.aspose.words.Body.accept(Unknown Source)
at com.aspose.words.CompositeNode.acceptChildren(Unknown Source)
at com.aspose.words.CompositeNode.acceptCore(Unknown Source)
at com.aspose.words.Section.accept(Unknown Source)
at com.aspose.words.CompositeNode.acceptChildren(Unknown Source)
at com.aspose.words.CompositeNode.acceptCore(Unknown Source)
at com.aspose.words.Document.accept(Unknown Source)
at test.main.PageNumberFinder.SplitNodesAcrossPages(PageNumberFinder.java:91)
at test.main.DocumentPageSplitter.<init>(DocumentPageSplitter.java:24)

Regards,
Rajesh

Kusumanchi.Rajesh · October 15, 2015, 2:21am

int page_endoffset = 0;
int page_startoffset = 0;
DocumentPageSplitter splitter = new DocumentPageSplitter(collector);
TxtSaveOptions options = new TxtSaveOptions();
options.setExportHeadersFooters(false);
for (int i = 1; i <= doc.getPageCount(); i++)
{
    Document pageDoc = splitter.getDocumentOfPage(i);
    for (Section section : pageDoc.getSections())
        section.getHeadersFooters().clear();
    String page_text = pageDoc.toString(options).replaceAll("\r", "");
    page_endoffset = page_startoffset + page_text.length() - 1;
    System.out.println();
    System.out.println("");
    System.out.println("PageNumber");
    System.out.println("" + page_startoffset + "");
    if (doc.getPageCount() == 1)
        System.out.println("" + (page_endoffset - 1) + "");
    else
        System.out.println("" + (page_endoffset - 1) + "");
    System.out.println("" + i + "");
    System.out.println("" + i + "");
    System.out.println("");
    page_startoffset = page_endoffset;
}
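
In the loop above, the end offset is computed as start + length - 1, printed as end - 1, and then reused as the next page's start, so adjacent page ranges can overlap or be off by one. A minimal sketch of one consistent convention, using half-open ranges (start inclusive, end exclusive), in plain Java with no Aspose.Words calls. The names PageOffsets and compute are hypothetical, and the page texts are assumed to have been extracted already, for example with pageDoc.toString(options) as above:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class PageOffsets {
    /**
     * Returns one {start, end} pair per page, where start is inclusive and end is
     * exclusive, so each page's end equals the next page's start.
     */
    static List<int[]> compute(List<String> pageTexts) {
        List<int[]> ranges = new ArrayList<>();
        int start = 0;
        for (String text : pageTexts) {
            int end = start + text.length();
            ranges.add(new int[] {start, end});
            start = end; // the next page begins exactly where this one ends
        }
        return ranges;
    }

    public static void main(String[] args) {
        List<String> pages = Arrays.asList("Text of page one.", "Text of page two.");
        List<int[]> ranges = compute(pages);
        for (int i = 0; i < ranges.size(); i++) {
            System.out.println("Page " + (i + 1) + ": start=" + ranges.get(i)[0] + " end=" + ranges.get(i)[1]);
        }
    }
}
```

If an inclusive end offset is needed instead, it is end - 1 for any page whose text is not empty.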

Kusumanchi.Rajesh · October 15, 2015, 2:42am

Hi Tahir,

I would like the code to perform the following functions:

1) It should avoid reading the header and footer in the document.
2) It should print the exact content of each page as we see it in the Word document.
3) It should be able to handle different formats of page numbers, for example Page a of 1, Page 1 of 1, roman page numbers, etc.
4) It should be able to handle section breaks and page breaks. In case of section breaks, it should print the content of the page as visible in the Word document page.

Thanks and Regards,
Rajesh

tahir.manzoor · October 15, 2015, 3:27am

Hi Rajesh,

Thanks for sharing the detail.

*Kusumanchi.Rajesh:

In the thread, U have shared me a code to get the text of each page separately. It is giving the following exception*

Please use the attached modified visitParagraphStart method of the SectionSplitter class. This will fix the exception.

*Kusumanchi.Rajesh:

It should avoid reading the header and footer in the document.*

Please use the HeaderFooterCollection.clear method to remove all nodes from this collection and from the document.

*Kusumanchi.Rajesh:

2)It should print the exact content of each page as we see it in the word document.
4)It should be able to handle section breaks and page breaks. In case of section breaks, it should print the content of the page as visible in the word document page.*

As you are using the PageSplitter and documentLayoutHelper utilities, this works without any issue. If you face any issue with these utilities, please let us know.

*Kusumanchi.Rajesh:

3)It should be able to handle different formats of page numbers. For example, Page a of 1, Page 1 of 1, roman page numbers, etc.*

You are using the PageSplitter utility to convert each page to a separate document. In this case, there will be only one header/footer. You can get the text of the header/footer by using the Node.toString method.

If you still face problems, please share the following details for investigation purposes:

Please attach your input Word document.
Please create a standalone, runnable, simple Java application that demonstrates the Aspose.Words code you used to generate your output document.
Please attach the output document that shows the undesired behavior.
Please attach your target document showing the desired behavior. I will investigate how you expect your final document to be generated.