Numbering is not continuous after PageSplitter

Hi
I am using Aspose.Words 15.3.0 and PageSplitter to convert a Word file into separate HTML files, one per page.

There is a problem:
In the original file, the numbering is A, B, C, D, E, F.
After converting, it becomes A, B, C in the page 1 HTML file, and then A, B, C again in the page 2 HTML file.

If you need details, please see the attachment. I put the original file and the result in there.

And here is my code:

Document doc = new Document("custom/input/docx/20150504013123.docx");
Document pageDoc;
LayoutCollector layoutCollector;
DocumentPageSplitter splitter;
ByteArrayOutputStream output = new ByteArrayOutputStream();
HtmlSaveOptions saveOp = new HtmlSaveOptions();
saveOp.setExportImagesAsBase64(true);
saveOp.setExportTextInputFormFieldAsText(false);
saveOp.setExportTocPageNumbers(true);
saveOp.setExportPageSetup(true);
saveOp.setExportDocumentProperties(true);
saveOp.setExportRelativeFontSize(false);

layoutCollector = new LayoutCollector(doc);
doc.updatePageLayout();
splitter = new DocumentPageSplitter(layoutCollector);

byte[] outputContent;
String outputPath = "custom/output/docx";
String blockID = UUID.randomUUID().toString();

for (int page = 1; page <= doc.getPageCount(); page++)
{
    pageDoc = splitter.getDocumentOfPage(page);
    output.reset();
    pageDoc.save(output, saveOp);
    outputContent = output.toByteArray();
    File outputDir = new File(outputPath + "/" + blockID + "/");
    if (!outputDir.exists())
        outputDir.mkdir();
    IOUtils.write(outputContent, new FileOutputStream(
    outputPath + "/" + blockID + "/" + page +".html"));

}

Is there any way to make the result match the numbering shown in the original file?
Please help me solve this problem, thanks.

Hi Craigabyss,
There is no page break in your input document. If you insert page breaks at the end of every page (as in the attached updated document), you can set the following save option to convert each page to a separate HTML file without using PageSplitter, and the output is correct in this case.
saveOp.setDocumentSplitCriteria(com.aspose.words.DocumentSplitCriteria.PAGE_BREAK);
Best Regards,

Hi
Thanks for your advice!

But there are still some problems:

1. How can I insert page breaks at the end of every page in code, instead of opening the file in Word?

2. If some pages end with page breaks and some do not, how can I detect this and insert page breaks where they are missing?

Hi Craigabyss,
We are working on the code example for your scenario and will update you soon.
Best Regards,

Hi Craigabyss,
Once you split the document using PageSplitter, the second page becomes a separate document and has no connection with the list on the first page. You will have to start the numbering of the list on page two where the list on page one ended, so the first item on page two is numbered 4 in this case. The following code (shown in .NET syntax) can be used.
Aspose.Words.Lists.List list = doc.Lists[0];

// Completely customize one list level.
ListLevel level1 = list.ListLevels[0];
level1.StartAt = 4;

Best Regards,

Hi Muhammad Ijaz

Thanks for your information.

How can I detect a cross-page list and obtain the last list number on the previous page programmatically, so that the next page can start at 4 in this case?

Hi Craig,

Thanks for your inquiry. Please check the following code snippet, which gets the numeric value of the last list label in the page document. Hope this helps you.

pageDoc = splitter.getDocumentOfPage(page);
pageDoc.updateListLabels();
int labelvalue = 0;
Node[] nodes = pageDoc.getChildNodes(NodeType.PARAGRAPH, true).toArray();
for (int i = nodes.length - 1; i >= 0; i--)
{
    Paragraph paragraph = (Paragraph)nodes[i];
    if (paragraph.getListFormat().isListItem())
    {
        ListLabel label = paragraph.getListLabel();
        labelvalue = label.getLabelValue();
        break;
    }
}
System.out.println(labelvalue);

Hi Tahir Manzoor

Thanks for your information!

Here is the code we modified for conversion test:

Document doc = new Document("custom/input/docx/20150504013123.docx");
Document pageDoc;
LayoutCollector layoutCollector;
DocumentPageSplitter splitter;
ByteArrayOutputStream output = new ByteArrayOutputStream();
HtmlSaveOptions saveOp = new HtmlSaveOptions();
saveOp.setExportImagesAsBase64(true);
saveOp.setExportTextInputFormFieldAsText(false);
saveOp.setExportTocPageNumbers(true);
saveOp.setExportPageSetup(true);
saveOp.setExportDocumentProperties(true);
saveOp.setExportRelativeFontSize(false);

layoutCollector = new LayoutCollector(doc);
doc.updatePageLayout();
splitter = new DocumentPageSplitter(layoutCollector);

byte[] outputContent;
String outputPath = "custom/output/docx";
String blockID = UUID.randomUUID().toString();

Integer priviousListLevel = null;
for (int page = 1; page <= doc.getPageCount(); page++)
{
    pageDoc = splitter.getDocumentOfPage(page);

    pageDoc.updateListLabels();

    if (priviousListLevel != null)
    {
        System.out.println("aaa" + priviousListLevel);
        com.aspose.words.List list = pageDoc.getLists().get(0);
        list.getListLevels().get(0).setStartAt(priviousListLevel + 1);
    }

    priviousListLevel = null;
    int labelvalue = 0;
    Node[] nodes = pageDoc.getChildNodes(NodeType.PARAGRAPH, true).toArray();
    for (int i = nodes.length - 1; i >= 0; i--)
    {
        Paragraph paragraph = (Paragraph)nodes[i];
        if (paragraph.getListFormat().isListItem())
        {
            ListLabel label = paragraph.getListLabel();
            labelvalue = label.getLabelValue();
            priviousListLevel = labelvalue;
            break;
        }
    }

    output.reset();
    pageDoc.save(output, saveOp);

    outputContent = output.toByteArray();

    File outputDir = new File(outputPath + "/" + blockID + "/");
    if (!outputDir.exists())
        outputDir.mkdir();

    IOUtils.write(outputContent, new FileOutputStream(outputPath + "/" + blockID + "/" + page + ".html"));

}

But it does not fix the issue.
Could you please check this? Maybe there is something wrong with this segment of the code.

We also have another problem:
How can we determine whether a list on a page is connected to the list on the previous page?

There is another situation, shown in the Word file I have newly uploaded as an attachment.
There, the lists on the two pages are separate lists to begin with, so this code is not suitable for that case.

Thanks for your help,

Craig
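
One possible way to think about the "is this list connected to the previous page?" question is sketched below in plain Java. It is only a model, not Aspose code: the `Para` class and the `continuesFromEarlierPage` method are hypothetical names, and the sketch assumes that each paragraph's page number and list ID have already been read from the original (unsplit) document, and that a list continuing across pages keeps the same list ID.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ListContinuationSketch {
    /** Hypothetical stand-in for a paragraph: its page and list ID (-1 when it is not a list item). */
    static final class Para {
        final int page;
        final int listId;
        Para(int page, int listId) { this.page = page; this.listId = listId; }
    }

    /**
     * Returns true when the first list item on 'page' belongs to a list
     * that already has at least one item on an earlier page.
     * Assumes the paragraphs are given in document order.
     */
    static boolean continuesFromEarlierPage(List<Para> parasInDocumentOrder, int page) {
        Set<Integer> listsSeenEarlier = new HashSet<>();
        for (Para p : parasInDocumentOrder) {
            if (p.listId < 0) {
                continue; // not a list item
            }
            if (p.page < page) {
                listsSeenEarlier.add(p.listId);
            } else if (p.page == page) {
                // first list item on the requested page decides the answer
                return listsSeenEarlier.contains(p.listId);
            } else {
                break; // already past the requested page
            }
        }
        return false; // no list item on that page
    }
}
```

With this model, a page whose first list item uses a list ID not seen before would keep its own numbering, and only a continued list would need its start number adjusted.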

Hi Craig,

Thanks for your inquiry. In this case, you need to get the list of the first list paragraph on the page and set the starting number for its list level. Please use the following modified code example to get the desired output. Hope this helps you.

Document doc = new Document(MyDir + "20150504013123_2.docx");
Document pageDoc;
LayoutCollector layoutCollector;
DocumentPageSplitter splitter;
ByteArrayOutputStream output = new ByteArrayOutputStream();
HtmlSaveOptions saveOp = new HtmlSaveOptions();
saveOp.setExportImagesAsBase64(true);
saveOp.setExportTextInputFormFieldAsText(false);
saveOp.setExportTocPageNumbers(true);
saveOp.setExportPageSetup(true);
saveOp.setExportDocumentProperties(true);
saveOp.setExportRelativeFontSize(false);
layoutCollector = new LayoutCollector(doc);
doc.updatePageLayout();
splitter = new DocumentPageSplitter(layoutCollector);
byte[] outputContent;
String outputPath = "custom/output/docx";
String blockID = UUID.randomUUID().toString();
Integer priviousListLevel = null;
for (int page = 1; page <= doc.getPageCount(); page++) {
    pageDoc = splitter.getDocumentOfPage(page);
    pageDoc.updateListLabels();
    if (priviousListLevel != null) {
        for (Paragraph para : (Iterable<Paragraph>) pageDoc.getChildNodes(NodeType.PARAGRAPH, true))
        {
            if(para.isListItem())
            {
                com.aspose.words.List list = para.getListFormat().getList();
                list.getListLevels().get(0).setStartAt(priviousListLevel+1);
                break;
            }
        }
    }
    priviousListLevel = null;
    int labelvalue = 0;
    Node[] nodes = pageDoc.getChildNodes(NodeType.PARAGRAPH, true).toArray();
    for (int i = nodes.length - 1; i >= 0; i--) {
        Paragraph paragraph = (Paragraph) nodes[i];
        if (paragraph.getListFormat().isListItem()) {
            ListLabel label = paragraph.getListLabel();
            labelvalue = label.getLabelValue();
            priviousListLevel = labelvalue;
            break;
        }
    }
    pageDoc.save(MyDir + "Out_"+page+".html", saveOp);
}

Hi Tahir.Manzoor

Here is our test code for fixing the numbering of multi-level lists:

@Test
public void testWithAsposeForList()
{
    try
    {
        Document doc = new Document("custom/input/docx/分項符號2.docx");
        Document pageDoc;
        LayoutCollector layoutCollector;
        DocumentPageSplitter splitter;
        ByteArrayOutputStream output = new ByteArrayOutputStream();
        HtmlSaveOptions saveOp = new HtmlSaveOptions();
        saveOp.setExportImagesAsBase64(true);
        saveOp.setExportTextInputFormFieldAsText(false);
        saveOp.setExportTocPageNumbers(true);
        saveOp.setExportPageSetup(true);
        saveOp.setExportDocumentProperties(true);
        saveOp.setExportRelativeFontSize(false);

        layoutCollector = new LayoutCollector(doc);
        doc.updatePageLayout();
        splitter = new DocumentPageSplitter(layoutCollector);

        byte[] outputContent;
        String outputPath = "custom/output/docx";
        String blockID = UUID.randomUUID().toString();

        Table<Integer, Integer, Integer> maxListLevelMap = HashBasedTable.create();
        for (int page = 1; page <= doc.getPageCount(); page++)
        {
            System.out.println("page:" + page);
            pageDoc = splitter.getDocumentOfPage(page);

            pageDoc.updateListLabels();
            Node[] nodes = pageDoc.getChildNodes(NodeType.PARAGRAPH, true).toArray();
            // which list level has been marked
            int listIdToCorrect = -1;
            int listLevelToCorrect = -1;
            for (Node node : nodes) {
                Paragraph paragraph = (Paragraph)node;
                ListFormat listFormat = paragraph.getListFormat();
                if (listFormat.isListItem() == false)
                {
                    continue;
                }
                ListLabel listLabel = paragraph.getListLabel();
                int listId = listFormat.getList().getListId();
                int listLevel = listFormat.getListLevelNumber();
                Integer listLabelValue = maxListLevelMap.get(listId, listLevel);
                // save new list value
                listLabelValue = (listLabelValue == null ? listLabel.getLabelValue() : listLabelValue + 1);
                maxListLevelMap.put(listId, listLevel, listLabelValue);
                // save first list id and level
                if (listIdToCorrect < 0)
                {
                    listIdToCorrect = listId;
                    listLevelToCorrect = listLevel;
                }
                // reset list item's startAt
                if (page > 1 && listIdToCorrect == listId && listLevel <= listLevelToCorrect && listLabelValue != null)
                {
                    int strangeOffset = listLabel.getLabelValue() - 1;
                    listFormat.getListLevel().setStartAt(listLabelValue - strangeOffset);
                    System.out.println(strangeOffset);
                    System.out.println("change~~~" + listId + ";" + listLevel + ";" + listLabelValue);
                    // correct smaller level only at next
                    listLevelToCorrect = listLevel - 1;
                }
            }
            System.out.println(maxListLevelMap);
            output.reset();
            pageDoc.save(output, saveOp);
            outputContent = output.toByteArray();

            File outputDir = new File(outputPath + "/" + blockID + "/");
            if (!outputDir.exists())
                outputDir.mkdir();

            IOUtils.write(outputContent, new FileOutputStream(outputPath + "/" + blockID + "/" + page + ".html"));

        }
    } catch (Exception e)
    {
        e.printStackTrace();
    }
}

This code generates correct HTML when the pages are processed one after another.
While generating the HTML for one page in that sequence, the list information is carried over from the previous pages.

However, we need a method that can generate the HTML for a single page in the middle of the document directly.

Is there a way to know the numbers at which the lists should start on a specific page, without walking through the lists from the beginning of the document?
(This is to optimize speed and memory usage.)

Craig

Hi Craig,

Thanks for your inquiry. The Aspose.Words.Layout namespace provides classes that let you access information such as which page a particular document element is on, and where on that page it is positioned, once the document has been formatted into pages.

In this case, we suggest the following solution.

  1. Iterate through all sections of the document.
  2. Get the paragraphs of each Section.
  3. Iterate through all paragraphs and use the LayoutCollector.GetStartPageIndex method to get the page number of each paragraph.
  4. Once you reach the paragraphs of your desired page number, get the list number as you are doing in your code.
  5. Extract the document's page using the PageSplitter utility.
  6. Use the ListLevel.StartAt property to set the starting number for the list.

Hope this helps you. Please let us know if you have any more queries.
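
The counting part of the steps above (steps 1 to 4) can be sketched in plain Java as follows. This is a simplified model under stated assumptions, not Aspose code: `Item`, `startValuesForPage` and the `"listId:level"` key format are hypothetical, the page number, list ID and level of each list paragraph are assumed to have been collected already, and every list level is assumed to start at 1 and to restart whenever a shallower level gets a new item.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class ListStartSketch {
    static final int MAX_LEVELS = 9; // Word lists have up to nine levels

    /** Hypothetical stand-in for a list paragraph: start page, list ID, 0-based level. */
    static final class Item {
        final int page;
        final int listId;
        final int level;
        Item(int page, int listId, int level) {
            this.page = page;
            this.listId = listId;
            this.level = level;
        }
    }

    /**
     * Walks the list items that precede targetPage (in document order) and returns,
     * keyed by "listId:level", the number the next item at that level would take.
     */
    static Map<String, Integer> startValuesForPage(List<Item> itemsInDocumentOrder, int targetPage) {
        Map<Integer, int[]> counters = new HashMap<>(); // listId -> counter per level
        for (Item item : itemsInDocumentOrder) {
            if (item.page >= targetPage) {
                break; // only earlier pages contribute
            }
            int[] c = counters.computeIfAbsent(item.listId, k -> new int[MAX_LEVELS]);
            c[item.level]++;
            // a new item at this level restarts all deeper levels
            Arrays.fill(c, item.level + 1, MAX_LEVELS, 0);
        }
        Map<String, Integer> starts = new TreeMap<>();
        for (Map.Entry<Integer, int[]> e : counters.entrySet()) {
            int[] c = e.getValue();
            for (int level = 0; level < MAX_LEVELS; level++) {
                if (c[level] > 0) {
                    starts.put(e.getKey() + ":" + level, c[level] + 1);
                }
            }
        }
        return starts;
    }
}
```

For the A to F document discussed earlier (three items on page 1, three on page 2, one list, one level), the result for page 2 contains a single entry with the value 4, which is the number that would then be passed to ListLevel.StartAt in step 6.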

Hi Tahir.Manzoor

As far as I understand, if I need a single HTML page for page 10 of a 10-page Word document,
it is unavoidable to iterate over all the paragraphs before page 10 to get the correct starting number for the list on page 10, right?

Craig

Hi Craig,

Thanks for your inquiry. Please note that an MS Word document is a flow document and does not contain any information about its layout into lines and pages. Therefore, technically there is no "page" or "line" concept in a Word document. Pages and lines are created by Microsoft Word on the fly.

Yes, your understanding about getting the list number of a paragraph on a specific page is correct. Please let us know if you have any more queries.
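
Since the earlier paragraphs have to be visited anyway, one way to limit the cost when several pages are requested is to make that pass once and remember the running counts at the start of each page. The sketch below models this in plain Java for single-level lists only. `PageStartCache`, `Item` and `startAt` are hypothetical names rather than Aspose APIs, and the page number and list ID of each list paragraph are assumed to be known already.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class PageStartCache {
    /** Hypothetical stand-in for a list paragraph: its page and list ID. */
    static final class Item {
        final int page;
        final int listId;
        Item(int page, int listId) { this.page = page; this.listId = listId; }
    }

    // page -> (listId -> number of items of that list on earlier pages)
    private final Map<Integer, Map<Integer, Integer>> countsBeforePage = new HashMap<>();

    /** Single pass over the items; assumes they are sorted in document order. */
    PageStartCache(List<Item> itemsInDocumentOrder, int pageCount) {
        Map<Integer, Integer> running = new HashMap<>();
        int next = 0;
        for (int page = 1; page <= pageCount; page++) {
            // snapshot taken before this page's own items are counted
            countsBeforePage.put(page, new HashMap<>(running));
            while (next < itemsInDocumentOrder.size()
                    && itemsInDocumentOrder.get(next).page == page) {
                running.merge(itemsInDocumentOrder.get(next).listId, 1, Integer::sum);
                next++;
            }
        }
    }

    /** Start number for the given list on the given page (1 when the list has no earlier items). */
    int startAt(int page, int listId) {
        Map<Integer, Integer> counts = countsBeforePage.get(page);
        if (counts == null) {
            throw new IllegalArgumentException("unknown page: " + page);
        }
        return counts.getOrDefault(listId, 0) + 1;
    }
}
```

After the cache is built, asking for the start number of any page is a map lookup, so only the first request pays for the full traversal. The trade-off is the memory held by one small map per page.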