Layout collector returning wrong page number in specific conditions

cornelcc · April 5, 2018, 12:54am

Hi,

I’m trying to figure out why the LayoutCollector is returning the wrong page number for a couple of paragraphs.
Open the provided sample.docx in Word and note that page 2 contains the paragraphs starting with “Energy Financial Services revenues increased 37%”

In code, using Aspose.Words for .NET, all the paragraphs on page 2 except the last one return 1 instead of 2 when calling LayoutCollector.GetStartPageIndex.

Why does this happen? This should be corrected, since the API should return the same page numbers as Word.

Note: Unfortunately I can’t change the input file (if I remove “Keep with next” from a couple of paragraphs, Word changes the layout, but Aspose.Words still reports the wrong page).

Thank you,
Cornel T.

NoPage2.zip (566.4 KB)

awais.hafeez · April 5, 2018, 7:19am

@cornelcc,

However, when I open your “sample.docx” document in MS Word 2016, the target paragraph is displayed on page 1 (see screenshot).

Also, the latest version of Aspose.Words for .NET, 18.4, mimics the way Microsoft Word works. I converted your ‘sample.docx’ file to PDF using Microsoft Word 2016 and attached the resulting file here for your reference. You can see that Aspose.Words 18.4 produces output similar to that of Microsoft Word 2016. So this seems to be expected behavior. If we can help you with anything else, please feel free to ask.

Output PDF Documents: comparison.zip (180.6 KB)

cornelcc · April 6, 2018, 12:31pm

Thanks for the quick response. If you click on “Show/Hide paragraph markers and other hidden formatting symbols”, some extra paragraphs in yellow are displayed. I understand your comment and you are right. The snippet was from a larger document; I’m trying to isolate the problem and will send a new sample soon.

cornelcc · April 6, 2018, 8:31pm

Please see the attached new sample, which shows 2 new problems.
Problem 1. Paragraphs are not assigned to the correct pages.
See Image 1 and Image 2 (the values returned by LayoutCollector.GetStartPageIndex are not correct; Word shows the pages differently).

Problem 2. Consecutive paragraphs are flagged as being on different pages; see the red text in the console app output (a paragraph is identified as part of page 8 but is surrounded by a couple of paragraphs reported on page 7).

In essence, all I need is to detect where pages start in a Word document, so I’m going over all the paragraphs in the document and trying to detect when a new page begins. If there is a faster or better way to detect that, please let me know.
InconsistentPageNoInTables.zip (535.4 KB)
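
The paragraph-by-paragraph approach described in this post can be sketched as follows. This is only a minimal sketch under stated assumptions, not code from the thread: it assumes the start page of every paragraph has already been collected in document order (for example by calling LayoutCollector.GetStartPageIndex on each paragraph), and the names PageStarts and FindFirstOnEachPage are hypothetical. It reports what the layout model says, which may differ from the Word editor view.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Aspose.Words.
public static class PageStarts
{
    // pageIndices[i] is assumed to be the 1-based start page of the i-th
    // paragraph in document order, e.g. gathered with
    // collector.GetStartPageIndex(paragraph) for each paragraph.
    // Returns the positions of the first paragraph seen on each new page.
    public static List<int> FindFirstOnEachPage(IReadOnlyList<int> pageIndices)
    {
        var firsts = new List<int>();
        int highestSeen = 0; // page indices start at 1, so 0 means "none yet"

        for (int i = 0; i < pageIndices.Count; i++)
        {
            // Only a page number higher than any seen so far counts as a new
            // page; a later drop back to an earlier page is ignored here.
            if (pageIndices[i] > highestSeen)
            {
                firsts.Add(i);
                highestSeen = pageIndices[i];
            }
        }
        return firsts;
    }
}
```

For the page indices 1, 1, 2, 2, 2, 3 the helper returns positions 0, 2 and 5. Comparing against the highest page seen, rather than against the previous paragraph, keeps a single out-of-order value from being counted as a new page.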

cornelcc · April 6, 2018, 8:34pm

images.zip (85.7 KB)

awais.hafeez · April 7, 2018, 4:04am

@cornelcc,

We are working over your query and will get back to you soon.

awais.hafeez · April 7, 2018, 12:45pm

@cornelcc,

We have logged this problem in our issue tracking system. The ID of this issue is WORDSNET-16672. We will further look into the details of this problem and will keep you updated on the status of this issue. We apologize for your inconvenience.

cornelcc · April 10, 2018, 12:07pm

awais.hafeez,

This is a very important defect for us and we would like to have a resolution as soon as possible.

Thank you,

Cornel T.

awais.hafeez · April 10, 2018, 10:42pm

@cornelcc,

Unfortunately, your issue is not resolved yet. We are currently doing analysis on this issue to determine the root cause of this problem. Once the analysis of this issue is completed, we may then be able to share estimates (ETA) with you. We will keep you posted on further updates and let you know when this issue is resolved. We apologize for any inconvenience.

awais.hafeez · April 12, 2018, 2:48am

@cornelcc,

Regarding WORDSNET-16672, we have completed the work on your issue and concluded that it is not actually a bug in the layout collector of Aspose.Words. So, we will close this issue as ‘Not a Bug’. Please see the following details:

The PDF output generated by Aspose.Words matches the PDF output generated by MS Word, so LayoutCollector returns correct page numbers. The confusion arises because the MS Word UI layout does NOT even match MS Word’s own PDF output (see where the first page ends in msw-2016.pdf (948.5 KB)).

We’ve analyzed MS Word’s behavior, and it seems MS Word has a bug when a table cell contains only an SDT control and the cell break is placed outside this SDT control. Please see the attached sdt-in-table-test.zip (34.9 KB).

We are closing this issue mainly because Aspose.Words’ output is meant to match MS Word’s PDF output (printed output).

We hope this explains the situation, in which MS Word improperly calculates the row height in its UI layout.

cornelcc · April 12, 2018, 7:48pm

awais.hafeez,

This defect should be reopened. The resolution does not make any sense. I’m using the API to calculate the position of a paragraph in the document. The PDF output is not my concern; I’m not converting the file to PDF, and I did not mention at any point that I’m interested in doing so.

The problem is within LayoutCollector. In the sample document I provided, a few things are wrong. If you just call GetStartPageIndex and GetEndPageIndex on the first paragraph, you will see that the values are 1 (start page) and 8 (end page). How is that possible? The paragraph has only a few words in it, and it starts and ends on page 1.

If you go on to process the paragraphs inside the table, you will see the other 2 problems I already mentioned.

If this is not a problem in the API and the LayoutCollector works as expected, please suggest a way to avoid this issue. How can I reliably detect when a page starts and where a paragraph starts and ends on a page?

awais.hafeez · April 13, 2018, 10:05am

@cornelcc,

Thanks for the additional information. We are looking into the feasibility of mimicking MS Word’s UI layout rather than MS Word’s PDF (printed) output. We will keep you informed of any further updates. We apologize for any inconvenience.

awais.hafeez · April 18, 2018, 8:20am

@cornelcc,

The layout collector class merely collects document object properties when the layout model is built. It does nothing else. The layout model of the document represents where the document’s objects appear in fixed-page output, such as PDF, XPS or image formats.

Even though you did not explicitly request conversion to PDF, it still happens behind the scenes, because the layout of the document must be built in order to answer the question “where that object is”.

Whenever Aspose.Words is asked to return information about pages, or about any objects appearing on the pages of a document, the layout model of the document is built. What this layout model contains can be visualized by exporting it to fixed-page formats such as PDF, XPS or bitmaps.
Thus it is not possible to tell where a paragraph appears on a page unless the document layout is built.
The layout collector does nothing but retrieve information from the layout model, so by design it cannot return “wrong” information. It does not compute anything; it just reads from the layout model. The only kind of error that could occur in the layout collector is a failure to return what the layout model actually contains.
Thus, if the layout model has a paragraph on a specific page, the layout collector must report that the paragraph is indeed on that page.
Aspose.Words is built to mimic the behavior of the latest (current) version of Microsoft Word with regard to PDF export. Whatever PDF output Microsoft Word generates for a given document is what Aspose.Words tries to replicate. In some cases, when Microsoft Word’s PDF export generates output inferior to the document’s representation in the user interface, we may try to mimic the UI behavior, but this is extremely rare and is done only as a workaround for bugs in Microsoft Word’s PDF export.

Given the answers above and what was found previously about the PDF export of the document in question, we consider the behavior of Aspose.Words and the layout collector to be correct, and there is no bug to fix here.

Please understand that Aspose.Words generates output that matches what Microsoft Word generates on PDF export, and that according to this output the page number reported for a paragraph is correct.

cornelcc · April 24, 2018, 3:48am

awais.hafeez,

I appreciate that you took the time to explain how LayoutCollector works under the hood, but I still believe this is a problem. Regardless of the internals of LayoutCollector, it behaves correctly in most cases, but in some instances, like the sample I provided, it does NOT return the proper values.

It does not make sense for a paragraph in a table to be reported on the next page while the paragraphs that follow it are reported with a smaller page number.
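
The situation described here, a page number that goes down again later in the document, can be located with a simple check. This is a minimal sketch under stated assumptions rather than anything from the thread: the page indices are assumed to have been collected in document order beforehand (for example via LayoutCollector.GetStartPageIndex), and PageOrderCheck and FindDrops are hypothetical names. A drop is not necessarily an error. One possible cause, not confirmed in this thread, is a table row that breaks across pages: document order visits one cell completely before the next, so the end of one cell can sit on a later page than the start of the following cell.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Aspose.Words.
public static class PageOrderCheck
{
    // pageIndices[i] is assumed to be the 1-based page reported for the
    // i-th paragraph in document order.
    // Returns every position whose page number is lower than that of the
    // paragraph immediately before it.
    public static List<int> FindDrops(IReadOnlyList<int> pageIndices)
    {
        var drops = new List<int>();
        for (int i = 1; i < pageIndices.Count; i++)
        {
            if (pageIndices[i] < pageIndices[i - 1])
            {
                drops.Add(i);
            }
        }
        return drops;
    }
}
```

For the page indices 7, 7, 8, 7, 7, 8 the helper returns position 3, the first paragraph reported on page 7 after one reported on page 8. The flagged positions are a starting point for inspecting the document, for instance to see whether they fall inside table cells.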

awais.hafeez · April 24, 2018, 12:50pm

@cornelcc,

We are checking this scenario further and will let you know if we have any more updates for you.

awais.hafeez · August 31, 2018, 10:11am

@cornelcc,

Taking this as the input Word document (see sample.zip (530.2 KB)), we opened it in MS Word 2016 and converted it (Save As) to PDF (see msw-2016.pdf (719.1 KB)). We also converted Sample.docx to PDF on our end using the latest version of Aspose.Words for .NET, i.e. 18.8 (see 18.8.pdf (174.8 KB)). Now, please open these PDF files side by side to see what the first lines on each page are. We found that each page starts with the same content in both PDFs. So in this case, there is no difference in how MS Word 2016 and Aspose.Words for .NET 18.8 render this Word document to PDF.

Next, we wrote the following small piece of code to detect which paragraph is the first one on each page.

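using System;
using Aspose.Words;
using Aspose.Words.Layout;

Document doc = new Document("D:\\Temp\\sample.docx");

// Split every run into single-character runs so that page boundaries
// can be located at character granularity.
// NOTE: SplitRun is not defined in this post. It is assumed to be a helper
// that splits the run at the given character offset and returns the new run
// holding the remainder.
Node[] runs = doc.GetChildNodes(NodeType.Run, true).ToArray();
for (int i = 0; i < runs.Length; i++)
{
    Run run = (Run)runs[i];
    int length = run.Text.Length;

    Run currentNode = run;
    for (int x = 1; x < length; x++)
    {
        currentNode = SplitRun(currentNode, 1);
    }
}

// Only the runs of the first section are examined.
NodeCollection smallRuns = doc.FirstSection.GetChildNodes(NodeType.Run, true);
LayoutCollector collector = new LayoutCollector(doc);

// Page indices are 1-based. Print the parent paragraph of the first run
// found on each page, then move on to the next page.
int pageIndex = 1;
foreach (Run run in smallRuns)
{
    if (collector.GetStartPageIndex(run) == pageIndex)
    {
        Console.WriteLine($"First Paragraph in Page # {pageIndex} is --> {run.ParentParagraph.ToString(SaveFormat.Text)} ");

        pageIndex++;
    }
}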

The code correctly identifies the first paragraph on every page, as can be verified against msw-2016.pdf. However, none of this matches what is shown in the MS Word 2016 editor. We believe this is expected behavior for Aspose.Words for .NET. Please also note that Aspose.Words for .NET’s output may not always match what is shown in the MS Word editor. The Aspose.Words for .NET layout engine reports what is in its own layout model, not what is in the MS Word editor view. Thus, you need to output the document to PDF before you analyze it with the layout collector, and then compare the results with the PDF output rather than with what MS Word renders in its editor. Hope this helps.
