Text fragments found in PageMarkup not available at paragraph level

louis.a · May 5, 2025, 1:06pm

Hi,

On this specific PDF the text is extracted but not correctly ordered in sections and paragraphs.
ecclesiastes.pdf (260.7 KB)

I slightly modified the sample code to showcase this.

        public static void Run() // original
        {
            // ExStart:1
            // The path to the documents directory.
            string dataDir = RunExamples.GetDataDir_AsposePdf_Text();
            // Open an existing PDF file
            Document doc = new Document(dataDir + "ecclesiastes.pdf");
            // Instantiate ParagraphAbsorber
            ParagraphAbsorber absorber = new ParagraphAbsorber();
            absorber.Visit(doc);


            foreach (PageMarkup markup in absorber.PageMarkups)
            {
                int cpt = 0;
                Console.WriteLine("### Number of markups found on page {0} : {1}", markup.Number, markup.TextFragments.Count);

                int i = 1;
                foreach (MarkupSection section in markup.Sections)
                {
                    int j = 1;

                    foreach (MarkupParagraph paragraph in section.Paragraphs)
                    {
                        StringBuilder paragraphText = new StringBuilder();

                        foreach (List<TextFragment> line in paragraph.Lines)
                        {
                            foreach (TextFragment fragment in line)
                            {
                                paragraphText.Append(fragment.Text);
                                cpt++;
                            }
                            paragraphText.Append("\r\n");
                        }
                        paragraphText.Append("\r\n");

                        Console.WriteLine("Paragraph {0} of section {1} on page {2}:", j, i, markup.Number);
                        Console.WriteLine(paragraphText.ToString());

                        j++;
                    }
                    i++;
                }

                Console.WriteLine("### Markups displayed on page {0} : {1}", markup.Number, cpt);
            }
            // ExEnd:1
        }

And the output is:

### Number of markups found on page 1 : 144
### Markups displayed on page 1 : 0
### Number of markups found on page 2 : 144
### Markups displayed on page 2 : 0
### Number of markups found on page 3 : 207
### Markups displayed on page 3 : 0
### Number of markups found on page 4 : 170
### Markups displayed on page 4 : 0
...

No text is actually displayed, because none of the text fragments found on the page are available at the paragraph level.

Regards

Professionalize.Discourse · May 5, 2025, 1:06pm

@louis.a

It seems that you are experiencing issues with extracting text from a PDF document where the text is not being displayed correctly at the paragraph level. This can happen due to various reasons, including the structure of the PDF itself.

To troubleshoot this issue, you can try the following steps:

1. Ensure Proper Use of ParagraphAbsorber: Make sure you are correctly using the ParagraphAbsorber class to extract paragraphs. The code you provided looks correct, but ensure that the PDF you are working with is structured in a way that allows for proper extraction.
2. Check for Multicolumn Layouts: If the PDF contains multicolumn layouts, you may need to enable the handling of multicolumn paragraphs. You can do this by setting the IsMulticolumnParagraphsAllowed property to true on the markup object after visiting the document with the ParagraphAbsorber.

Here’s a snippet to illustrate this:
```
absorber.Visit(doc);
foreach (PageMarkup markup in absorber.PageMarkups)
{
    markup.IsMulticolumnParagraphsAllowed = true; // Enable multicolumn handling
    // Proceed with your extraction logic
}
```
3. Review the PDF Structure: Sometimes, the way text is encoded in the PDF can affect extraction. If the text is not structured in paragraphs or sections, it may not be extracted as expected. You might want to check the PDF’s structure using a PDF viewer that can show you how the text is laid out.
4. Use Alternative Extraction Methods: If the ParagraphAbsorber does not yield the desired results, consider using the TextFragmentAbsorber or TextAbsorber classes to extract text in a more granular way. This might help you retrieve the text even if it is not organized in paragraphs.

Here’s an example of using TextFragmentAbsorber:

```
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber();
doc.Pages[1].Accept(textFragmentAbsorber);
foreach (TextFragment fragment in textFragmentAbsorber.TextFragments)
{
    Console.WriteLine(fragment.Text);
}
```
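
If the fragments come back without paragraph structure, one rough fallback is to rebuild lines from the fragment positions yourself. The following is only a minimal sketch under assumptions: `Frag` and `LineGrouping` are hypothetical names, not Aspose.PDF types, and the sketch assumes you have already copied each fragment's text and position into a `Frag`. It needs C# 9 or later. It groups fragments whose vertical positions are within a small tolerance into one line, and it does not attempt real paragraph or section detection.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for an extracted fragment: its text plus its position in points.
public record Frag(string Text, double X, double Y);

public static class LineGrouping
{
    // Groups fragments into lines, ordered top to bottom and then left to right.
    // A fragment joins the current line when its Y is within 'tolerance'
    // of the first fragment placed on that line.
    public static List<List<Frag>> GroupIntoLines(IEnumerable<Frag> fragments, double tolerance = 2.0)
    {
        var lines = new List<List<Frag>>();

        // PDF y coordinates grow upward, so descending Y means top of page first.
        foreach (var frag in fragments.OrderByDescending(f => f.Y))
        {
            var current = lines.Count > 0 ? lines[^1] : null;
            if (current != null && Math.Abs(current[0].Y - frag.Y) <= tolerance)
                current.Add(frag);
            else
                lines.Add(new List<Frag> { frag });
        }

        // Within each line, read left to right.
        return lines.Select(line => line.OrderBy(f => f.X).ToList()).ToList();
    }

    public static void Main()
    {
        // Made-up sample data, not taken from the attached PDF.
        var fragments = new List<Frag>
        {
            new Frag("world", 120, 700),
            new Frag("Hello ", 72, 700.5),
            new Frag("Next", 72, 686),
        };

        foreach (var line in GroupIntoLines(fragments))
            Console.WriteLine(string.Concat(line.Select(f => f.Text)));
    }
}
```

Because each `Frag` keeps its coordinates, the position of every element is still available after grouping. The tolerance value is a guess and would need tuning per document.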

By following these steps, you should be able to improve the text extraction process from your PDF document. If the problem persists, consider sharing more details about the PDF structure or any specific errors you encounter.

For further reference, you can check the official Aspose.PDF documentation on extracting paragraphs from PDFs.

If you have any more questions or need additional assistance, feel free to ask!

louis.a · May 5, 2025, 2:27pm

IsMulticolumnParagraphsAllowed doesn’t solve the problem.

Also, I can probably extract the text with other methods, but I need it structured in paragraphs, with the exact position of each element, for subsequent operations.

asad.ali · May 5, 2025, 9:26pm

@louis.a

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFNET-59853

You can obtain Paid Support Services if you need support on a priority basis, along with the direct access to our Paid Support management team.

louis.a · May 6, 2025, 12:33pm

Thank you for opening the ticket.

While trying to find a workaround on my side, I found that the origin of the problem seems to be in the MediaBox.
It is defined outside the dimensions of the page, so the coordinates of each returned fragment are also outside the page.
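
To illustrate the kind of check I mean, here is a minimal sketch. It makes several assumptions: `Box` and `MediaBoxCheck` are hypothetical stand-ins, not Aspose.PDF classes, the numbers are made up rather than read from the attached file, and it assumes the page area is expected to start at (0, 0). It needs C# 10 or later. It tests whether a fragment rectangle falls inside the zero-based page area, and shifts it by the MediaBox offset when it does not.

```csharp
using System;

// Hypothetical stand-in for a PDF rectangle: lower-left and upper-right corners, in points.
public readonly record struct Box(double LLX, double LLY, double URX, double URY)
{
    public double Width => URX - LLX;
    public double Height => URY - LLY;

    // True when 'inner' lies entirely within this box.
    public bool Contains(Box inner) =>
        inner.LLX >= LLX && inner.LLY >= LLY && inner.URX <= URX && inner.URY <= URY;

    // Returns a copy of this box shifted by (dx, dy).
    public Box Translate(double dx, double dy) =>
        new Box(LLX + dx, LLY + dy, URX + dx, URY + dy);
}

public static class MediaBoxCheck
{
    // The page area as it would be if the MediaBox origin were (0, 0).
    public static Box ZeroBasedPage(Box mediaBox) =>
        new Box(0, 0, mediaBox.Width, mediaBox.Height);

    // Moves a rectangle from MediaBox coordinates into zero-based page coordinates.
    public static Box Normalize(Box mediaBox, Box rect) =>
        rect.Translate(-mediaBox.LLX, -mediaBox.LLY);

    public static void Main()
    {
        var mediaBox = new Box(0, -792, 612, 0);     // made-up offset MediaBox
        var fragment = new Box(72, -100, 300, -88);  // made-up fragment rectangle

        var page = ZeroBasedPage(mediaBox);
        Console.WriteLine(page.Contains(fragment));                       // False
        Console.WriteLine(page.Contains(Normalize(mediaBox, fragment)));  // True
    }
}
```

Whether shifting the coordinates like this is enough for the paragraph detection to work is something I have not confirmed.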

asad.ali · May 6, 2025, 7:45pm

@louis.a

Thanks for sharing this information. We will surely include it in our investigation and as soon as we have some updates, we will share with you. Please spare us some time.

We are sorry for the inconvenience.