Numbering is not continuous after PageSplitter

craig.w.su · May 3, 2015, 10:18pm

Hi
I am using Aspose.Words 15.3.0 and PageSplitter to convert a Word file into separate HTML files, one per page.

The problem:
In the original file, the list is numbered A, B, C, D, E, F.
After conversion, the page 1 HTML file shows A, B, C, and the page 2 HTML file shows A, B, C again instead of D, E, F.

For details, please see the attachment, which contains the original file and the result.

And here is my code:

Document doc = new Document("custom/input/docx/20150504013123.docx");
Document pageDoc;
LayoutCollector layoutCollector;
DocumentPageSplitter splitter;
ByteArrayOutputStream output = new ByteArrayOutputStream();
HtmlSaveOptions saveOp = new HtmlSaveOptions();
saveOp.setExportImagesAsBase64(true);
saveOp.setExportTextInputFormFieldAsText(false);
saveOp.setExportTocPageNumbers(true);
saveOp.setExportPageSetup(true);
saveOp.setExportDocumentProperties(true);
saveOp.setExportRelativeFontSize(false);

layoutCollector = new LayoutCollector(doc);
doc.updatePageLayout();
splitter = new DocumentPageSplitter(layoutCollector);

byte[] outputContent;
String outputPath = "custom/output/docx";
String blockID = UUID.randomUUID().toString();

for (int page = 1; page <= doc.getPageCount(); page++)
{
    pageDoc = splitter.getDocumentOfPage(page);
    output.reset();
    pageDoc.save(output, saveOp);
    outputContent = output.toByteArray();
    File outputDir = new File(outputPath + "/" + blockID + "/");
    if (!outputDir.exists())
        outputDir.mkdir();
    IOUtils.write(outputContent, new FileOutputStream(
        outputPath + "/" + blockID + "/" + page + ".html"));
}

Is there any way to make the output match the numbering in the original file?
Please help me solve this problem. Thanks.

muhammad.ijaz · May 4, 2015, 9:24pm

Hi Craigabyss,
There is no page break in your input document. If you insert page breaks at the end of every page (as in the attached updated document), you can set the following save option to convert each page to a separate HTML file without using PageSplitter. The output is correct in that case.
saveOp.setDocumentSplitCriteria(com.aspose.words.DocumentSplitCriteria.PAGE_BREAK);
Best Regards,

craig.w.su · May 4, 2015, 10:50pm

Hi
Thanks for your advice!

There are still some problems:

1. How can I insert page breaks at the end of every page in code, instead of opening the file in Microsoft Word?

2. If some pages already end with a page break and others do not, how can I detect this and insert page breaks only where they are needed?

muhammad.ijaz · May 5, 2015, 9:54pm

Hi Craigabyss,
We are working on the code example for your scenario and will update you soon.
Best Regards,

muhammad.ijaz · June 5, 2015, 6:27am

Hi Craigabyss,
Once you split the document using PageSplitter, the second page becomes a separate document and has no connection with the list on the first page. You will have to start the numbering of the list on page two where the list on page one ended (start at 4 in this case). The following code can be used.
// doc is the document extracted for page two.
com.aspose.words.List list = doc.getLists().get(0);

// Set the starting number of the first list level.
ListLevel level1 = list.getListLevels().get(0);
level1.setStartAt(4);
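
To make the arithmetic concrete: the start value for each page is one more than the number of list items on all earlier pages. The following is a minimal plain-Java sketch of that calculation, not Aspose.Words code. The class name PageListStarts, the method startValues, and the idea of passing per-page item counts as an array are assumptions made for illustration, and the sketch only covers a single continuous, single-level list that begins at 1.

```java
import java.util.Arrays;

class PageListStarts {
    /**
     * itemsPerPage[i] is the number of items of one continuous,
     * single-level list that fall on page i + 1 (hypothetical input).
     * Returns the value at which the list should start on each page.
     */
    static int[] startValues(int[] itemsPerPage) {
        int[] starts = new int[itemsPerPage.length];
        int next = 1; // assumption: the original list begins at 1
        for (int i = 0; i < itemsPerPage.length; i++) {
            starts[i] = next;
            next += itemsPerPage[i];
        }
        return starts;
    }

    public static void main(String[] args) {
        // Six items (A..F) split three per page, as in the question.
        System.out.println(Arrays.toString(startValues(new int[] {3, 3})));
    }
}
```

For the document in the question (six items, three per page) this gives a start value of 1 for page one and 4 for page two, which is the value 4 used above.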

Best Regards,

craig.w.su · September 4, 2016, 8:44pm

Hi Muhammad Ijaz

Thanks for your information.

How can I detect a list that continues across pages and programmatically obtain the last list number on the previous page, so that the next page starts at the right value (4 in this case)?

tahir.manzoor · September 5, 2016, 11:48am

Hi Craig,

Thanks for your inquiry. Please check the following code snippet, which gets the numeric value of the last list label in the page document. Hope this helps you.

pageDoc = splitter.getDocumentOfPage(page);
pageDoc.updateListLabels();
int labelvalue = 0;
Node[] nodes = pageDoc.getChildNodes(NodeType.PARAGRAPH, true).toArray();
for (int i = nodes.length - 1; i >= 0; i--)
{
    Paragraph paragraph = (Paragraph)nodes[i];
    if (paragraph.getListFormat().isListItem())
    {
        ListLabel label = paragraph.getListLabel();
        labelvalue = label.getLabelValue();
        break;
    }
}
System.out.println(labelvalue);

craig.w.su · October 26, 2016, 4:24am

Hi Tahir Manzoor

Thanks for your information!

Here is the code we modified for conversion test:

Document doc = new Document("custom/input/docx/20150504013123.docx");
Document pageDoc;
LayoutCollector layoutCollector;
DocumentPageSplitter splitter;
ByteArrayOutputStream output = new ByteArrayOutputStream();
HtmlSaveOptions saveOp = new HtmlSaveOptions();
saveOp.setExportImagesAsBase64(true);
saveOp.setExportTextInputFormFieldAsText(false);
saveOp.setExportTocPageNumbers(true);
saveOp.setExportPageSetup(true);
saveOp.setExportDocumentProperties(true);
saveOp.setExportRelativeFontSize(false);

layoutCollector = new LayoutCollector(doc);
doc.updatePageLayout();
splitter = new DocumentPageSplitter(layoutCollector);

byte[] outputContent;
String outputPath = "custom/output/docx";
String blockID = UUID.randomUUID().toString();

Integer priviousListLevel = null;
for (int page = 1; page <= doc.getPageCount(); page++)
{
    pageDoc = splitter.getDocumentOfPage(page);

    pageDoc.updateListLabels();

    if (priviousListLevel != null)
    {
        System.out.println("aaa" + priviousListLevel);
        com.aspose.words.List list = pageDoc.getLists().get(0);
        list.getListLevels().get(0).setStartAt(priviousListLevel + 1);
    }

    priviousListLevel = null;
    int labelvalue = 0;
    Node[] nodes = pageDoc.getChildNodes(NodeType.PARAGRAPH, true).toArray();
    for (int i = nodes.length - 1; i >= 0; i--)
    {
        Paragraph paragraph = (Paragraph)nodes[i];
        if (paragraph.getListFormat().isListItem())
        {
            ListLabel label = paragraph.getListLabel();
            labelvalue = label.getLabelValue();
            priviousListLevel = labelvalue;
            break;
        }
    }

    output.reset();
    pageDoc.save(output, saveOp);

    outputContent = output.toByteArray();

    File outputDir = new File(outputPath + "/" + blockID + "/");
    if (!outputDir.exists())
        outputDir.mkdir();

    IOUtils.write(outputContent, new FileOutputStream(outputPath + "/" + blockID + "/" + page + ".html"));

}

But it does not fix the issue.
Could you please check it? Maybe something is wrong in this segment of the code.

We also have another problem:
How can we determine whether a list on a page is connected to the list on the previous page?

A different situation arises in the Word file I have newly uploaded in the attachment.
There, the lists on the two pages are separate lists to begin with, so the code above is not suitable for that case.

Thanks for your help,

Craig

tahir.manzoor · October 27, 2016, 2:14am

Hi Craig,

Thanks for your inquiry. In this case, you need to get the list of the first list paragraph on the page and set the starting number for its list level. Please use the following modified code example to get the desired output. Hope this helps you.

Document doc = new Document(MyDir + "20150504013123_2.docx");
Document pageDoc;
LayoutCollector layoutCollector;
DocumentPageSplitter splitter;
ByteArrayOutputStream output = new ByteArrayOutputStream();
HtmlSaveOptions saveOp = new HtmlSaveOptions();
saveOp.setExportImagesAsBase64(true);
saveOp.setExportTextInputFormFieldAsText(false);
saveOp.setExportTocPageNumbers(true);
saveOp.setExportPageSetup(true);
saveOp.setExportDocumentProperties(true);
saveOp.setExportRelativeFontSize(false);
layoutCollector = new LayoutCollector(doc);
doc.updatePageLayout();
splitter = new DocumentPageSplitter(layoutCollector);
byte[] outputContent;
String outputPath = "custom/output/docx";
String blockID = UUID.randomUUID().toString();
Integer priviousListLevel = null;
for (int page = 1; page <= doc.getPageCount(); page++) {
    pageDoc = splitter.getDocumentOfPage(page);
    pageDoc.updateListLabels();
    if (priviousListLevel != null) {
        for (Paragraph para : (Iterable<Paragraph>) pageDoc.getChildNodes(NodeType.PARAGRAPH, true))
        {
            if(para.isListItem())
            {
                com.aspose.words.List list = para.getListFormat().getList();
                list.getListLevels().get(0).setStartAt(priviousListLevel+1);
                break;
            }
        }
    }
    priviousListLevel = null;
    int labelvalue = 0;
    Node[] nodes = pageDoc.getChildNodes(NodeType.PARAGRAPH, true).toArray();
    for (int i = nodes.length - 1; i >= 0; i--) {
        Paragraph paragraph = (Paragraph) nodes[i];
        if (paragraph.getListFormat().isListItem()) {
            ListLabel label = paragraph.getListLabel();
            labelvalue = label.getLabelValue();
            priviousListLevel = labelvalue;
            break;
        }
    }
    pageDoc.save(MyDir + "Out_"+page+".html", saveOp);
}
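
The loop above always renumbers the first list it finds on a page, which is only right when that list continues from the previous page. One possible way to tell a continued list from a new one is to compare list identifiers, sketched below in plain Java. This is an assumption rather than confirmed Aspose.Words behaviour: it presumes that the page documents produced by the splitter keep the identifier of the original list (for example the value returned by List.getListId()), which should be verified against real documents. The class and method names are hypothetical.

```java
class ListContinuation {
    /**
     * Decides whether the first list on the current page continues the
     * last list of the previous page. Either id may be null when the
     * corresponding page contains no list items.
     */
    static boolean continuesPreviousList(Integer lastListIdOnPreviousPage,
                                         Integer firstListIdOnPage) {
        return lastListIdOnPreviousPage != null
                && lastListIdOnPreviousPage.equals(firstListIdOnPage);
    }

    /**
     * Returns the start value to apply on the current page: the previous
     * label value plus one for a continued list, otherwise the start value
     * the list already has.
     */
    static int startValue(Integer lastListIdOnPreviousPage, int lastLabelOnPreviousPage,
                          Integer firstListIdOnPage, int originalStart) {
        return continuesPreviousList(lastListIdOnPreviousPage, firstListIdOnPage)
                ? lastLabelOnPreviousPage + 1
                : originalStart;
    }

    public static void main(String[] args) {
        // Same list on both pages: continue after label 3.
        System.out.println(startValue(7, 3, 7, 1));
        // A different list begins on the new page: keep its own start.
        System.out.println(startValue(7, 3, 9, 1));
    }
}
```

The sketch only looks at the first list on a page. A page on which a new list appears before the continued one would need the same comparison for every list on that page.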

craig.w.su · January 23, 2017, 11:47pm

Hi Tahir.Manzoor

Here is our test code for fixing the numbering of multi-level lists:

@Test
public void testWithAsposeForList()
{
    try
    {
        Document doc = new Document("custom/input/docx/分項符號2.docx");
        Document pageDoc;
        LayoutCollector layoutCollector;
        DocumentPageSplitter splitter;
        ByteArrayOutputStream output = new ByteArrayOutputStream();
        HtmlSaveOptions saveOp = new HtmlSaveOptions();
        saveOp.setExportImagesAsBase64(true);
        saveOp.setExportTextInputFormFieldAsText(false);
        saveOp.setExportTocPageNumbers(true);
        saveOp.setExportPageSetup(true);
        saveOp.setExportDocumentProperties(true);
        saveOp.setExportRelativeFontSize(false);

        layoutCollector = new LayoutCollector(doc);
        doc.updatePageLayout();
        splitter = new DocumentPageSplitter(layoutCollector);

        byte[] outputContent;
        String outputPath = "custom/output/docx";
        String blockID = UUID.randomUUID().toString();

        Table<Integer, Integer, Integer> maxListLevelMap = HashBasedTable.create();
        for (int page = 1; page <= doc.getPageCount(); page++)
        {
            System.out.println("page:" + page);
            pageDoc = splitter.getDocumentOfPage(page);

            pageDoc.updateListLabels();
            Node[] nodes = pageDoc.getChildNodes(NodeType.PARAGRAPH, true).toArray();
            // which list level has been marked
            int listIdToCorrect = -1;
            int listLevelToCorrect = -1;
            for (Node node : nodes) {
                Paragraph paragraph = (Paragraph)node;
                ListFormat listFormat = paragraph.getListFormat();
                if (listFormat.isListItem() == false)
                {
                    continue;
                }
                ListLabel listLabel = paragraph.getListLabel();
                int listId = listFormat.getList().getListId();
                int listLevel = listFormat.getListLevelNumber();
                Integer listLabelValue = maxListLevelMap.get(listId, listLevel);
                // save new list value
                listLabelValue = (listLabelValue == null ? listLabel.getLabelValue() : listLabelValue + 1);
                maxListLevelMap.put(listId, listLevel, listLabelValue);
                // save first list id and level
                if (listIdToCorrect < 0)
                {
                    listIdToCorrect = listId;
                    listLevelToCorrect = listLevel;
                }
                // reset list item's startAt
                if (page > 1 && listIdToCorrect == listId && listLevel <= listLevelToCorrect && listLabelValue != null)
                {
                    int strangeOffset = listLabel.getLabelValue() - 1;
                    listFormat.getListLevel().setStartAt(listLabelValue - strangeOffset);
                    System.out.println(strangeOffset);
                    System.out.println("change~~~" + listId + ";" + listLevel + ";" + listLabelValue);
                    // correct smaller level only at next
                    listLevelToCorrect = listLevel - 1;
                }
            }
            System.out.println(maxListLevelMap);
            output.reset();
            pageDoc.save(output, saveOp);
            outputContent = output.toByteArray();

            File outputDir = new File(outputPath + "/" + blockID + "/");
            if (!outputDir.exists())
                outputDir.mkdir();

            IOUtils.write(outputContent, new FileOutputStream(outputPath + "/" + blockID + "/" + page + ".html"));
        }
    } catch (Exception e)
    {
        e.printStackTrace();
    }
}

This code generates correct HTML when the pages are processed one after another.
While generating the HTML for one page in the sequence, it relies on the list information collected from the previous pages.

However, we need a method that can generate the HTML for a single page in the middle of the document directly.

Is there a way to find out the numbers at which the lists should start on a specific page, without walking through the lists from the beginning of the document?
(This is to optimize speed and memory usage.)

Craig

tahir.manzoor · January 25, 2017, 12:52am

Hi Craig,

Thanks for your inquiry. The Aspose.Words.Layout namespace provides classes that let you find out which page a particular document element is on, and where on that page it is positioned, once the document has been laid out into pages.

In this case, we suggest the following solution.

1. Iterate through all sections of the document.
2. Get the paragraphs of each Section.
3. Iterate through all paragraphs and use the LayoutCollector.GetStartPageIndex method (getStartPageIndex in Java) to get the page number of each paragraph.
4. Once you reach the paragraphs of your desired page, get the list number as you are doing in your code.
5. Extract the document's page using the PageSplitter utility.
6. Use the ListLevel.StartAt property (setStartAt in Java) to set the starting number for the list.
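
The steps above can be sketched as follows. The sketch is plain Java over a simplified model rather than Aspose.Words calls: ListItem is a hypothetical stand-in for a list paragraph whose page has already been looked up (for example through LayoutCollector), and startAtForPage is an illustrative name. It handles only single-level numbering and assumes every list starts at 1.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class StartNumberScan {
    /** Hypothetical stand-in for a list paragraph: its page and its list id. */
    static final class ListItem {
        final int page;
        final int listId;

        ListItem(int page, int listId) {
            this.page = page;
            this.listId = listId;
        }
    }

    /**
     * Scans the items that precede targetPage (items must be in document
     * order) and returns, per list id, the number at which that list should
     * start on targetPage. Lists with no earlier items are absent from the map.
     */
    static Map<Integer, Integer> startAtForPage(List<ListItem> items, int targetPage) {
        Map<Integer, Integer> starts = new HashMap<>();
        for (ListItem item : items) {
            if (item.page >= targetPage) {
                break; // everything from here on is on the target page or later
            }
            // The first item seen moves the start to 2; each further item adds one.
            starts.merge(item.listId, 2, (current, ignored) -> current + 1);
        }
        return starts;
    }

    public static void main(String[] args) {
        List<ListItem> items = Arrays.asList(
                new ListItem(1, 7), new ListItem(1, 7), new ListItem(1, 7),
                new ListItem(2, 7), new ListItem(2, 7), new ListItem(2, 7));
        System.out.println(startAtForPage(items, 2).get(7));
    }
}
```

For a six-item list split three per page, startAtForPage(items, 2) maps the list to 4, which is the value to set as the start number on the extracted page.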

Hope this helps you. Please let us know if you have any more queries.

craig.w.su · February 8, 2017, 9:59pm

Hi Tahir.Manzoor

As far as I understand, if I need a single HTML page for page 10 of a 10-page Word document,
it is unavoidable to iterate over all the paragraphs before page 10 to get the correct starting number for the list on page 10, right?

Craig

tahir.manzoor · February 11, 2017, 9:14am

Hi Craig,

Thanks for your inquiry. Please note that a Microsoft Word document is a flow document and does not contain any information about its layout into lines and pages. Technically, therefore, there is no “page” or “line” concept in a Word document. Pages and lines are created by Microsoft Word on the fly.

Yes, your understanding is correct: to get the list number of a paragraph on a specific page, the preceding paragraphs have to be processed. Please let us know if you have any more queries.