Aspose.Words Reading Different/Incorrect Values for Numbered Lists (Numbering "Arbitrarily" Restarts)

jjoconnell · February 10, 2022, 7:35pm

Our application uses the Aspose products to ingest different document types and then use their contents. We have recently encountered an issue where Microsoft Word .DOCX files contain numbered lists, and everything looks normal in MS Word. However, when Aspose parses out the text, the numbers for the list items are different: the numbering seems to randomly start over at 1. For example, in Word you’ll see:
1.
2.
3.
4.
5.
6.

But the content that we end up with is:
1.
2.
3.
4.
1.
2.

There does not seem to be any pattern to it except that it appears to happen when a numbered list spans multiple pages - the restart seems to occur on the second to last item that appears on a given page when viewing it in MS Word.

Wondering if anyone has encountered this issue - and if so, if you’ve been able to resolve it or at least lock down the root cause. We’ve tried editing formats in Word and a variety of documents, and the results are always the same.

Thanks!

sergey.lobanov · February 11, 2022, 12:43am

@jjoconnell,
Could you please ZIP and attach a sample problem input document which could help us reproduce the problem? We will investigate the issue and provide you information on it.

jjoconnell · February 14, 2022, 5:11pm

Hello @sergey.lobanov - please see attached. Running this DOCX, which contains a single numbered list consisting of items 1 - 28, ends up as items 1 - 9, then 1 - 9, then 2 - 10 in the data that we end up with.

restarting_word_doc_numbering.zip (33.8 KB)

sergey.lobanov · February 14, 2022, 8:37pm

@jjoconnell,
Unfortunately, we were unable to reproduce the issue on our side. The document you provided looks fine on my side (with list items numbered 1-28).

Checking the list labels with Aspose.Words also reports the items as numbered 1-28. Please check the following code example:

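using System;
using Aspose.Words;

Document doc = new Document("restarting_word_doc_numbering.docx");
// Recalculate list labels so that ListLabel.LabelString is up to date.
doc.UpdateListLabels();

// Print the computed label of every paragraph that belongs to a list.
foreach (Paragraph p in doc.GetChildNodes(NodeType.Paragraph, true))
    if (p.IsListItem)
        Console.WriteLine(p.ListLabel.LabelString);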

The output:

1.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23.
24.
25.
26.
27.
28.

Could you please share all steps that you are following to reproduce the same issue at our end? Thanks for your cooperation.

jjoconnell · February 15, 2022, 7:28pm

We were able to track this down a little bit further. Our end goal is to extract the text of each page while keeping the page numbers. So if the document had 20 pages, we’d get a string array with 20 entries.

The example code below is essentially what we’re trying to do. It seems like the issue lies in saving the extracted page (to a stream, but presumably any output would have the same problem). When there’s a numbered list that spans more than one page, subsequent pages sometimes restart or change the numbering.

var pages = new List<string>();
for (var page = 0; page < doc.PageCount; page++)
{
    var extractedPage = doc.ExtractPages(page, 1);

    using (var stream = new MemoryStream())
    {
        extractedPage.Save(stream, SaveFormat.Text);

        var content = Encoding.UTF8.GetString(stream.ToArray()).Trim('\r', '\n');

        pages.Add(content);
    }
}

The “why” makes sense: each page is its own document, and probably assumes it has to be the one to start the numbering of the list. But for the purposes of extracting the text, where we need those list numbers to be accurate across page breaks, it becomes an issue.

We’re open to any alternatives you might have for extracting the text of each page of the document. We were originally using a PageSplitter class that was found on your forums a while back. After troubleshooting this issue, we checked for updates to the library and saw that ExtractPages was now an option. Unfortunately, the issue still remains.
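One possible alternative to splitting the document, offered only as a rough sketch and not as something confirmed in this thread: compute the list labels once on the whole document, then use LayoutCollector to decide which page each paragraph starts on. It assumes that Paragraph.ToString(SaveFormat.Text) omits the list label (so the label is prepended manually), assigns a paragraph that spans two pages to its starting page, and skips header/footer text. Table cell layout and other formatting are not reproduced.

```csharp
using System;
using System.Text;
using Aspose.Words;
using Aspose.Words.Layout;

Document doc = new Document("restarting_word_doc_numbering.docx");

// Compute labels on the full document, so numbering never restarts per page.
doc.UpdateListLabels();

// The collector maps document nodes to pages once the layout is built.
LayoutCollector collector = new LayoutCollector(doc);
doc.UpdatePageLayout();

var pages = new StringBuilder[doc.PageCount];
for (int i = 0; i < pages.Length; i++)
    pages[i] = new StringBuilder();

foreach (Paragraph p in doc.GetChildNodes(NodeType.Paragraph, true))
{
    // Skip header/footer paragraphs; only body text is collected here.
    if (p.GetAncestor(NodeType.HeaderFooter) != null)
        continue;

    // Page indexes are 1-based; skip anything the layout did not map.
    int pageIndex = collector.GetStartPageIndex(p);
    if (pageIndex < 1 || pageIndex > pages.Length)
        continue;

    // Assumption: this text does not include the list label.
    string text = p.ToString(SaveFormat.Text).Trim();
    if (p.ListFormat.IsListItem)
        text = p.ListLabel.LabelString + " " + text;

    pages[pageIndex - 1].AppendLine(text);
}

for (int i = 0; i < pages.Length; i++)
    Console.WriteLine($"--- page {i + 1} ---\n{pages[i]}");
```

The trade-off is that this rebuilds the page text by hand instead of relying on the text exporter, so the output will not match SaveFormat.Text exactly.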

sergey.lobanov · February 15, 2022, 9:31pm

@jjoconnell,
Unfortunately, we still weren’t able to reproduce the issue on our side using the latest 22.2 version of Aspose.Words. The numbering of items is preserved on the extracted pages. Please check the following code and our generated output:

Document doc = new Document("restarting_word_doc_numbering.docx");

var pages = new List<string>();
for (var page = 0; page < doc.PageCount; page++)
{
    var extractedPage = doc.ExtractPages(page, 1);

    using (var stream = new MemoryStream())
    {
        extractedPage.Save(stream, SaveFormat.Text);
        extractedPage.Save(page + "_page.docx");

        var content = Encoding.UTF8.GetString(stream.ToArray()).Trim('\r', '\n');

        pages.Add(content);
    }
}

foreach (string p in pages)
    Console.WriteLine(p);

22.2_output.zip (84.2 KB)

Could you please share the extracted output pages generated on your side? Also, could you please tell us which version of Aspose.Words you use in your solution?

jjoconnell · February 17, 2022, 9:55pm

We are using version 22.2.0. Attached is the output we were getting (with dashes between pages). After some further troubleshooting, we were able to resolve the issue. The code above did work successfully, and kept the bullet number ordering correct.

The problem was with some stale code left from the original implementation. We were using a prior version of the library that did not include the ExtractPages method. Instead, we were utilizing code from a sample project found on your forums called PageSplitter (https://github.com/aspose-words/Aspose.Words-for-.NET/blob/master/Examples/CSharp/Loading-and-Saving/PageSplitter.cs). That sample project is no longer available, but at the time, it was the best way to retain page numbers when extracting the document text. Apparently it had the bug with the bullet numbering mentioned in the original post.

We still had a reference to that class in code that ran before the ExtractPages method was called, and it was, unbeknownst to us, modifying the original document. By the time ExtractPages was run, the numbering was already wrong, and that’s why we didn’t spot the issue earlier.

Long story short, our issue was solved using the code in this thread (ExtractPages). Anyone else using the PageSplitter code should update to a newer version of the library and use ExtractPages instead.

Thanks again for your help troubleshooting the issue and your timely responses.

output.zip (2.8 KB)
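The root cause described above (a helper modifying the document before ExtractPages ran) can be guarded against by never handing the loaded document to code that might change it. A minimal sketch, assuming a hypothetical LegacyHelper.Process method that stands in for any such helper; the file name is also a placeholder:

```csharp
using System;
using Aspose.Words;

Document original = new Document("input.docx");

// Deep copy: changes made to the copy do not affect the original.
Document scratch = original.Clone();

// LegacyHelper.Process is hypothetical, standing in for any code that may
// mutate the document it is given (as PageSplitter reportedly did here).
// LegacyHelper.Process(scratch);

// The untouched original is what gets split into pages.
for (int page = 0; page < original.PageCount; page++)
{
    Document extractedPage = original.ExtractPages(page, 1);
    Console.WriteLine(extractedPage.ToString(SaveFormat.Text));
}
```

Reloading the file from disk for each consumer would have the same effect as cloning, at the cost of parsing the document again.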

sergey.lobanov · February 19, 2022, 1:20am

@jjoconnell,
Thank you for reporting. It is great that you managed to solve your issue. Please feel free to ask in case of any other problems. We will be glad to help you.

jjoconnell · February 28, 2022, 6:03pm

@sergey.lobanov Sorry to open this back up again, but the issue has returned. The prior solution did work for all of the test cases we had at the time, but here is a new one that’s causing the line numbers to be off again.

I’ve attached a new sample file and the sample code that we’re using. All it does is extract the entire document at once, and then again one page at a time. The results of both are also attached. The only difference, besides some whitespace, is the line numbers.
image.png (20.7 KB)

This is using the ExtractPages method that was talked about before. It’s not shown in the code, but it’s done immediately after loading the file; nothing else is done prior to extracting the text.

We are using v22.3.0, which looks like it was just released (and the issue is still there). The release notes include this line: “WORDSNET-23469 Issue with Document.ExtractPages(…)”. Can you elaborate on what that fix was, since it might be related to what we’re talking about here?

AsposeLineNumbers.zip (21.9 KB)

sergey.lobanov · February 28, 2022, 11:01pm

@jjoconnell,
Thank you for reporting this problem to us. It seems that the problem is related to the fact that the list items on the third page of your input document are located inside a table.
We have logged this problem in our issue tracking system as WORDSNET-23538. You will be notified via this forum thread once this issue is resolved.

We apologize for the inconvenience.

aspose.notifier · April 1, 2022, 3:53pm

The issues you have found earlier (filed as WORDSNET-23538) have been fixed in this Aspose.Words for .NET 22.4 update also available on NuGet.

jjoconnell · April 7, 2022, 2:55pm

I upgraded to version 22.4, and while it did correct the sample file we sent, the first real file that I ran through still had an incorrect line number.

Can you:

1. Let us know what the fix involved, in case it’s relevant to our test file?
2. Provide us a location where we can upload our real file, since we cannot upload it to a public forum?

alexey.noskov · April 7, 2022, 3:14pm

@jjoconnell The fix involved support for updating ListLevel.StartAt values in case numbering continues in the table.
It is safe to attach documents in the forum: only you and Aspose staff can download them. Alternatively, you can send the document via private message. Just click my login and then press Message.
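For anyone wanting to see what the ListLevel.StartAt values look like on their own files, the sketch below prints them for the source document and for one extracted page. It is only a diagnostic aid and not the fix itself; the file name and the page index are placeholders.

```csharp
using System;
using Aspose.Words;
using Aspose.Words.Lists;

// Prints the start value of every level of every list in a document.
static void DumpStartValues(string title, Document d)
{
    Console.WriteLine(title);
    foreach (List list in d.Lists)
        for (int level = 0; level < list.ListLevels.Count; level++)
            Console.WriteLine(
                $"  list {list.ListId}, level {level}: StartAt = {list.ListLevels[level].StartAt}");
}

Document doc = new Document("Example.docx");
DumpStartValues("Source document", doc);

// ExtractPages takes a zero-based page index, so 6 is the seventh page.
Document page = doc.ExtractPages(6, 1);
DumpStartValues("Extracted page 7", page);
```

If a list continues from an earlier page, one would expect its StartAt on the extracted page to be greater than 1; a value of 1 there would be consistent with the restart seen in the text output.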

jjoconnell · April 8, 2022, 2:27pm

Thanks for providing the details. Our problem file is attached:
Example.zip (96.0 KB)

The issue is on pages 6 and 7. There’s a list that starts on page 6, going from 1 - 4. Then on page 7, it continues with 5, but in the output it’s showing up as 1.

alexey.noskov · April 8, 2022, 3:33pm

@jjoconnell Thank you for additional information. The problem has been logged as WORDSNET-23718. We will keep you informed and let you know once it is resolved.

aspose.notifier · May 4, 2022, 1:16pm

The issues you have found earlier (filed as WORDSNET-23718) have been fixed in this Aspose.Words for .NET 22.5 update also available on NuGet.

jjoconnell · May 17, 2022, 5:17pm

Thanks for the fix. I can confirm it is working properly in our environment as well, and we can go ahead and close out this thread. Thanks again!