ParagraphAbsorber and page rotation and odd line breaks

jesper-1 · June 25, 2018, 7:36pm

There seem to be some issues with the nice new ParagraphAbsorber on rotated pages.

The flow reported for the same page stored at 0, 90, 180, and 270 degrees in a text document is different.
This is especially so for 180 and 90 degrees clockwise, where the paragraphs in each section as well as the lines in each paragraph are returned in the opposite order.

So rather than the straightforward

foreach (MarkupParagraph paragraph in section.Paragraphs) {
    foreach (List<TextFragment> line in paragraph.Lines) {

We currently need something like this to swap it…

int YOrderDelta = (int)pag.Rotate == 0 || (int)pag.Rotate == 3 ? +1 : -1;
int parMax = section.Paragraphs.Count-1;
for (int ixP= parMax * (1 - YOrderDelta) / 2; 0<=ixP && ixP<=parMax; ixP+=YOrderDelta) {
    MarkupParagraph paragraph = section.Paragraphs[ixP];
    int linMax = paragraph.Lines.Count-1;
    for(int ixL= linMax * (1 - YOrderDelta) / 2; 0<=ixL && ixL<=linMax; ixL+=YOrderDelta) {
        List<TextFragment> line = paragraph.Lines[ixL];

Even with this swap the flow is not entirely the same.

PS: it would give cleaner code if there were a “Line” class rather than the List<TextFragment>.

The logical flow sometimes reports a new section or paragraph in the middle of a line, sometimes even inside what is clearly the same word, so currently quite a bit of post-processing of the flow is needed.
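
The start-index arithmetic in the workaround above can be checked in isolation. This is only a minimal sketch on plain lists, with no Aspose types involved; the helper name Traverse is hypothetical and not part of any library.

```csharp
using System;
using System.Collections.Generic;

static class OrderDemo
{
    // Hypothetical helper: walks a list forwards (delta = +1) or backwards
    // (delta = -1) using the same start-index arithmetic as the workaround.
    public static List<T> Traverse<T>(IList<T> items, int delta)
    {
        var result = new List<T>();
        int max = items.Count - 1;
        // delta = +1 gives a start index of 0;
        // delta = -1 gives max * 2 / 2 = max, i.e. the last element.
        for (int i = max * (1 - delta) / 2; 0 <= i && i <= max; i += delta)
            result.Add(items[i]);
        return result;
    }

    static void Main()
    {
        var lines = new List<string> { "first", "second", "third" };
        Console.WriteLine(string.Join(",", Traverse(lines, +1))); // first,second,third
        Console.WriteLine(string.Join(",", Traverse(lines, -1))); // third,second,first
    }
}
```

An empty list is handled by the loop condition in both directions, so no separate check is needed.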

imran.rafique · June 25, 2018, 8:27pm

@jesper-1,

Please send us the source PDF document and the complete code of each test case. We will investigate your scenarios in our environment and share our findings with you.

jesper-1 · June 25, 2018, 8:43pm

Thanks for the quick reply. Here is a sample PDF file (with gibberish text): PDF_Insight_ROT_test.pdf (259.2 KB)

and some more elaborate sample code:

ParagraphAbsorber absorberP2 = new ParagraphAbsorber();
absorberP2.Visit(pag);
string res = "";
foreach (PageMarkup markup in absorberP2.PageMarkups) {
    foreach (MarkupSection section in markup.Sections) {
        //foreach (MarkupParagraph paragraph in section.Paragraphs) {
            //foreach (List<TextFragment> line in paragraph.Lines) {
        int YOrderDelta = (int)pag.Rotate == 0 || (int)pag.Rotate == 3 ? +1 : -1;
        int parMax = section.Paragraphs.Count-1;
        for (int ixP= parMax * (1 - YOrderDelta) / 2; 0<=ixP && ixP<=parMax; ixP+=YOrderDelta) {
            MarkupParagraph paragraph = section.Paragraphs[ixP];
            int linMax = paragraph.Lines.Count-1;
            for(int ixL= linMax * (1 - YOrderDelta) / 2; 0<=ixL && ixL<=linMax; ixL+=YOrderDelta) { 
                List<TextFragment> line = paragraph.Lines[ixL];
                foreach (TextFragment tFrag in line) {
                    foreach (TextSegment tSegm in tFrag.Segments) { 
                        //for (int gi = 0; gi < tSegm.Characters.Count; gi++) {
                        //    process each char....
                        //}
                        res += tSegm.Text;
                        res += "<eoSeg>";
                    }
                    res += "<eoFrag>";
                }
                res += "\r\n";
            }
            res += "<Para>\r\n";
        }
        res += "<Section>\r\n";
    }
    res += "<Markup>\r\n";
}

I haven’t got a PDF file I can share that shows the second issue.

jesper-1 · June 25, 2018, 9:58pm

Also note that three columns, say A B C (left to right), are returned as sections in the order B A C in this sample:
PDF_Test_from_word_v3.pdf (174.3 KB)

(the issues are not seen with the TextAbsorber BTW)

Suggestion on coordinate system:
I know I can apply the reverse PageMatrix to the segment coordinates to get the logical coordinates on the page. It would be great to have a flag, similar to the one on GetPageRect(…), for getting the coordinates in either document coordinates or within-page coordinates. It is so much easier to post-process text if it always appears in a coordinate system with increasing X along the lines and falling Y between the lines, no matter the page orientation.
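
The mapping being asked for can be sketched as plain geometry. The block below does not use the Aspose API at all: the name ToDisplayed and its parameters are hypothetical, and it assumes coordinates are given in unrotated page space (origin at the lower-left) and that the rotation is clockwise in steps of 90 degrees. Which space the absorber actually reports in is not established in this thread, so the direction of the mapping may need to be inverted.

```csharp
using System;

static class RotationDemo
{
    // Hypothetical helper: maps a point from unrotated page space
    // (width w, height h, origin at the lower-left corner) to the space of
    // the page as displayed after a clockwise rotation of 0/90/180/270 degrees.
    public static (double X, double Y) ToDisplayed(
        double x, double y, double w, double h, int rotation)
    {
        switch (rotation)
        {
            case 0:   return (x, y);
            case 90:  return (y, w - x);     // displayed page is h wide, w high
            case 180: return (w - x, h - y);
            case 270: return (h - y, x);     // displayed page is h wide, w high
            default:
                throw new ArgumentException("rotation must be 0, 90, 180 or 270");
        }
    }

    static void Main()
    {
        // A 200 x 300 page: after a 90 degree clockwise rotation the original
        // lower-left corner (0, 0) ends up at the upper-left corner (0, 200).
        var p = ToDisplayed(0, 0, 200, 300, 90);
        Console.WriteLine($"{p.X}, {p.Y}");
    }
}
```

With such a mapping applied up front, text on any of the four page orientations could be post-processed with the same left-to-right, top-to-bottom logic.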

imran.rafique · June 26, 2018, 6:28am

@jesper-1,

We have tested your scenario with the latest version 18.6 of Aspose.PDF for .NET API and are unable to replicate the said issues in our environment. Please send us the complete code and highlight the problematic regions with the help of snapshots.

jesper-1 · June 26, 2018, 8:44am

Note that the code was WITH the workaround included (and this fixes the page rotation issues, though NOT the text that is rotated ON the page). You get substantially different output for pages 5, 6, 7 and 8.

To see the issue, simplify and REMOVE the workaround, then compare pages 5-8.
(Here with angle brackets replaced by square brackets, as the forum does not like angle brackets.)

ParagraphAbsorber absorberP2 = new ParagraphAbsorber();
absorberP2.Visit(pag);
string res = "";
foreach (PageMarkup markup in absorberP2.PageMarkups) {
    foreach (MarkupSection section in markup.Sections) {
        foreach (MarkupParagraph paragraph in section.Paragraphs) {
            foreach (List[TextFragment] line in paragraph.Lines) {
                foreach (TextFragment tFrag in line) {
                    foreach (TextSegment tSegm in tFrag.Segments) { 
                        //for (int gi = 0; gi [ tSegm.Characters.Count; gi++) {
                        //    process each char....
                        //}
                        res += tSegm.Text;
                        res += "[eoSeg]";
                    }
                    res += "[eoFrag]";
                }
                res += "\r\n";
            }
            res += "[Para]\r\n";
        }
        res += "[Section]\r\n";
    }
    res += "[Markup]\r\n";
}

Here are the lines for the first block (first 4 output lines skipped) from pages 5 and 6.
Note that on page 6 it comes as three paragraphs in reverse order, and that the lines in the paragraphs are reversed too…

Page 5:

[Para]
[Section]
Aliquam iaculis vehicula sapien nec aliquam. [eoSeg][eoFrag]
Nullam sem libero, posuere vitae erat [eoSeg][eoFrag]
pellentesque, vehicula tincidunt leo. Cras metus [eoSeg][eoFrag]
nibh, scelerisque non posuere nec, suscipit id [eoSeg][eoFrag]
leo. Nunc vehicula augue a quam lacinia, ac [eoSeg][eoFrag]
porttitor urna tempus. Phasellus nec elit ut dui [eoSeg][eoFrag]
placerat sodales sed id elit. Donec varius ac [eoSeg][eoFrag]
neque eu luctus. Etiam dictum quis nunc et [eoSeg][eoFrag]
ultricies. Donec rutrum maximus semper.[eoSeg][eoFrag]
[Para]
[Section]

Page 6:
[Para]
[Section]
ultricies. Donec rutrum maximus semper.[eoSeg][eoFrag]
neque eu luctus. Etiam dictum quis nunc et [eoSeg][eoFrag]
placerat sodales sed id elit. Donec varius ac [eoSeg][eoFrag]
porttitor urna tempus. Phasellus nec elit ut dui [eoSeg][eoFrag]
leo. Nunc vehicula augue a quam lacinia, ac [eoSeg][eoFrag]
nibh, scelerisque non posuere nec, suscipit id [eoSeg][eoFrag]
pellentesque, vehicula tincidunt leo. Cras metus [eoSeg][eoFrag]
[Para]
Nullam sem libero, posuere vitae erat [eoSeg][eoFrag]
[Para]
Aliquam iaculis vehicula sapien nec aliquam. [eoSeg][eoFrag]
[Para]
[Section]

imran.rafique · June 27, 2018, 2:34am

@jesper-1,

With the latest version 18.6 of Aspose.PDF for .NET API, the output PDF documents look fine. This is the ZIP of the output PDF documents: OutputPDFs.zip (308.1 KB). If you can see the problem in your environment, please highlight it with the help of a snapshot.

jesper-1 · June 27, 2018, 5:06am

@imran.rafique
It seems like you have zipped the input files and not the output???

(And I AM using latest 18.6 of Jun 22).

jesper-1 · June 27, 2018, 8:47am

I have made an improved workaround that catches more cases and is simpler to read.

First, a helper extension for managing the direction of a list:

public static IEnumerable<T> Directional<T>(this IList<T> items, bool Forwards) {
    if (Forwards) foreach (T item in items) yield return item;
    else for (int i = items.Count-1; 0<=i; i--) yield return items[i];
}

Then I calculate the LL-to-LR corner vector of the first paragraph in each section and use that to determine in what order the next THREE levels should be traversed. Note the .Directional(forwards):

foreach (PageMarkup markup in absorberP.PageMarkups)
    foreach (MarkupSection section in markup.Sections) {
        if (section.Paragraphs.Count == 0) continue; //Academic...
        var p0 = (section.Paragraphs[0].Points[0]);
        var p1 = (section.Paragraphs[0].Points[1]);
        double vX = p1.X - p0.X; //LL to LR vector (Use paragraph, as only offered)
        double vY = p1.Y - p0.Y;
        bool forwards = -0.00001<vX && -0.00001<vY;
        foreach (MarkupParagraph paragraph in section.Paragraphs.Directional(forwards)) {
            foreach (List<TextFragment> line in paragraph.Lines.Directional(forwards)) {
                foreach (TextFragment tFrag in line.Directional(forwards)) {
                    .
                    .

Apart from that, the ParagraphAbsorber seems to have some issues MIXING different sections of the text: they are returned as TextFragments of the same line in paragraph.Lines even if the blocks have substantially different text rotation (the two slanted blocks of text at the bottom of page 5).
I guess it is because they overlap in the Y-direction of the page?
Page 8 is quite similar to page 5.
On pages 6 and 7 it is a bit worse: only part of the text in the bottom slanted text blocks is returned at all.
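
The direction test and the Directional helper from the workaround above can be tried out on plain data. This is only a sketch: IsForwards is a hypothetical helper that mirrors the LL-to-LR vector test, and the corner points are made-up numbers rather than values read from a real MarkupParagraph.

```csharp
using System;
using System.Collections.Generic;

static class ListExtensions
{
    // Same idea as the helper in the post: enumerate a list in either direction.
    public static IEnumerable<T> Directional<T>(this IList<T> items, bool forwards)
    {
        if (forwards) { foreach (T item in items) yield return item; }
        else { for (int i = items.Count - 1; 0 <= i; i--) yield return items[i]; }
    }
}

static class DirectionDemo
{
    // Hypothetical helper: true when the vector from the lower-left corner
    // (x0, y0) to the lower-right corner (x1, y1) has no negative component,
    // allowing a small tolerance for rounding.
    public static bool IsForwards(double x0, double y0, double x1, double y1)
    {
        const double eps = 0.00001;
        return -eps < (x1 - x0) && -eps < (y1 - y0);
    }

    static void Main()
    {
        var lines = new List<string> { "a", "b", "c" };
        bool upright = IsForwards(0, 0, 100, 0);   // vector points right: true
        bool flipped = IsForwards(100, 0, 0, 0);   // vector points left: false
        Console.WriteLine(string.Join(",", lines.Directional(upright))); // a,b,c
        Console.WriteLine(string.Join(",", lines.Directional(flipped))); // c,b,a
    }
}
```

Because the test only looks at the sign of the vector components, it covers the four axis-aligned page rotations; slanted text, as in the blocks mentioned above, would need a finer rule.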

imran.rafique · June 27, 2018, 3:14pm

@jesper-1,

This is the ZIP of output PDF documents. We could not track the problematic area, and the output and input PDF documents are identical. We can also view overlapping of text in your input PDF documents. Please do not incorporate workarounds, and create a small application project, which reproduces the said problems in your environment, and then send us a ZIP of this project. Please also highlight the problematic area with the help of a snapshot. Your response is awaited.

jesper-1 · June 27, 2018, 7:25pm

But the output that illustrates the problem is a text string, not a pdf…

jesper-1 · June 27, 2018, 8:42pm

Here it is, in a simplified spelled out project
Aspose_ParagraphAbsorber_Issue.zip (144.5 KB)

Unzip the file
Add Aspose 18.6.1 nuget package-files
Open project in Visual Studio
Compile
Add licence file
Run the program with the pdf as a parameter
Compare the OUTPUT textfile 1.txt to 4.txt and 5.txt to 8.txt

Just to be sure to avoid further confusion I attached the output files too…
Output.zip (4.6 KB)

imran.rafique · June 28, 2018, 3:04am

@jesper-1,

Thank you for the sample project. We managed to replicate the said issues in our environment. The investigations have been logged into our issue tracking system as follows:

File name: PDF_Insight_ROT_test.pdf
PDFNET-44979: ParagraphAbsorber - Incorrect retrieval of the text

File name: PDF_Test_from_word_v3.pdf
PDFNET-44980: ParagraphAbsorber - Incorrect retrieval of the text

We have linked your post to these tickets and will keep you informed regarding any available updates.

jesper-1 · August 31, 2018, 1:55am

I do not want to appear too impatient, but it has been over two months. Any news on this bug?

Farhan.Raza · August 31, 2018, 10:01am

@jesper-1

Thank you for getting back to us.

We would like to update you that both tickets are still pending investigation owing to previously logged tickets, and they can take some more months to resolve. We will let you know as soon as they are resolved.

However, we also offer Paid Support, where issues are investigated with higher priority. Customers who have a paid support subscription report their issues there, and those issues are investigated urgently. In case your reported issue is a blocker, please consider subscribing to Paid Support. For further information, please visit Paid Support FAQs.