PDF - Identify & Extract - Table and MultiColumn At Page Level

Sudharsann01 · December 4, 2020, 6:49pm

We are evaluating Aspose.Pdf (version 20.11.0) for generating JSON from PDF. We have written a custom parser on top of Aspose to generate the JSON, since there is no out-of-the-box support for it yet. One of our use cases is to parse multi-column PDFs and PDFs that contain tables.
Below are my queries in this regard.

How do we identify at page level whether a page is multi-column or single-column? We want to know this per page so that we can apply TextFragmentAbsorber only to particular pages rather than to all of them.
Is there a way to identify whether a page contains a table? Again, the goal is to apply TableAbsorber only to that particular page.
Based on a bookmark's location on a page, is there a way to extract the text corresponding to that bookmark? Say I have bookmarks like the ones below:
1. Test
2. Another Test
  I want to extract the text between the bookmarks “Test” and “Another Test”, as that text corresponds to the bookmark “Test”. Similarly for the rest of the bookmarks.

asad.ali · December 6, 2020, 8:26pm

@Sudharsann01

We have noted your requirements and will test the scenario in our environment to determine whether all of your cases can be satisfied. Would you kindly share a sample PDF document that covers all of the cases above? We will try to prepare a sample code snippet and share it with you.

Sudharsann01 · December 7, 2020, 2:41pm

Hi Asad,
I have attached a sample PDF. Do let me know if you need any more information.

CBRE Employee_Handbook_For_EHA.pdf (1.7 MB)

asad.ali · December 7, 2020, 10:40pm

@Sudharsann01

Thanks for sharing the sample PDF.

Please check the following code snippet, which extracts text and table data separately from every page. If a page does not contain any table, the table data will be empty:

using System;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// dataDir is a placeholder for the folder that holds the sample file
Document pdfDocument = new Document(dataDir + "CBRE Employee_Handbook_For_EHA.pdf");

foreach(Page page in pdfDocument.Pages)
{
 // Fresh absorbers and buffers per page, so nothing carries over from earlier pages
 TextFragmentAbsorber tfa = new TextFragmentAbsorber();
 TableAbsorber ta = new TableAbsorber();
 string tempTable = "";
 string text = "";
 ta.Visit(page);
 tfa.Visit(page);

 // Concatenate all text found on the page
 foreach(TextFragment tf in tfa.TextFragments)
 {
  text += tf.Text;
 }

 foreach(AbsorbedTable table in ta.TableList)
 {
  // Collect the text of this table, cell by cell
  string tableText = "";
  foreach(AbsorbedRow row in table.RowList)
  {
   foreach(AbsorbedCell cell in row.CellList)
   {
    foreach(TextFragment tf in cell.TextFragments)
    {
     tableText += tf.Text;
    }
   }
  }
  // Remove the table text from the page text; this only works when both
  // absorbers return the fragments in the same order
  if (tableText.Length > 0)
   text = text.Replace(tableText, "");
  tempTable += tableText;
 }
 Console.WriteLine("Page No: " + page.Number);
 Console.WriteLine("Table Data:");
 Console.WriteLine(tempTable);
 Console.WriteLine("Text Only Data:");
 Console.WriteLine(text);
}

Regarding your bookmark requirement, we have logged an investigation ticket as PDFNET-48387 in our issue tracking system. We have linked it with this forum thread so that you will receive a notification as soon as the feature is available. Please be patient and give us some time.

We are sorry for the inconvenience.

Sudharsann01 · December 7, 2020, 10:46pm

Hi Asad,
Thanks for the information.

We are able to retrieve tables from a page if it does contain a table.

What we are looking for is a way to identify whether there is a table on a page before applying the TableAbsorber class to it, since we are working with PDFs that are more than 100 pages long.
You didn’t mention whether we can identify if a page is single-column or multi-column. Is there any page-level property that will tell us this? Can you let us know?

asad.ali · December 8, 2020, 7:54pm

@Sudharsann01

Aspose.PDF offers the TableAbsorber class, which extracts tables from a PDF; if the PDF does not have any tables, it will return Null for the TableList property. At the moment, we cannot determine whether a table exists without using the TableAbsorber class. This is why we offered a workaround that extracts the table and text separately.

Nevertheless, we have logged an investigation ticket as PDFNET-49155 in our issue tracking system to further analyze whether it is possible or not. We will further let you know as soon as the ticket is resolved.

In the older Aspose.Pdf.Generator model, there used to be a property to determine whether a section is divided into multiple columns. However, that functionality was later replaced with tables in the new DOM, which are used to lay out text in columns at the time of PDF generation. Another ticket, PDFNET-49156, has been logged in our issue tracking system for this requirement.

We will look into the details of both logged tickets and keep you posted with the status of their resolution. Please be patient and spare us some time.

We are sorry for the inconvenience.

Sudharsann01 · December 8, 2020, 10:03pm

Is there a workaround for the second issue, multi-column PDFs? We would like to identify those pages and extract the text in the same order as it appears in the PDF. Since not all pages in a PDF need to be multi-column, I assume that we shouldn’t apply the TextFragmentAbsorber to all pages.
Any suggestions?

Also, is there an out-of-the-box conversion from PDF to JSON?

asad.ali · December 9, 2020, 5:40pm

@Sudharsann01

Aspose.PDF can extract paragraphs that run over two columns using the ParagraphAbsorber class. However, detecting whether a page is multi-column has not yet been investigated, and we are afraid that we cannot share any workaround until the related ticket PDFNET-49156 is resolved.
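
Independently of that ticket, a rough geometric check can be sketched outside the API. The following is only a heuristic under stated assumptions, not an Aspose.PDF feature: it assumes the horizontal extents (left and right edges) of the text fragments on a page have already been collected, for example from each TextFragment's Rectangle, and it counts how many vertical bands of text are separated by a clear gutter. The class name, method name, and thresholds are hypothetical and would need tuning per document; tables, right-aligned page numbers, and sidebars can all produce false positives.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ColumnHeuristic
{
    // Counts vertical text bands by merging the horizontal extents of fragments.
    // Fragments wider than maxFragmentWidth (e.g. page-wide headings) are ignored,
    // and a new band starts only when the gap to the previous band is >= minGutter.
    public static int CountColumns(
        IEnumerable<(double Left, double Right)> extents,
        double minGutter,
        double maxFragmentWidth)
    {
        var sorted = extents
            .Where(e => e.Right - e.Left <= maxFragmentWidth)
            .OrderBy(e => e.Left)
            .ToList();
        if (sorted.Count == 0)
            return 0;

        int columns = 1;
        double bandRight = sorted[0].Right;
        foreach (var e in sorted.Skip(1))
        {
            if (e.Left - bandRight >= minGutter)
                columns++; // clear vertical gap: a new band starts here
            bandRight = Math.Max(bandRight, e.Right);
        }
        return columns;
    }

    public static void Main()
    {
        // Illustrative extents in points: one page-wide heading and two columns of lines.
        var extents = new List<(double Left, double Right)>
        {
            (72.0, 540.0),
            (72.0, 290.0), (72.0, 280.0),
            (320.0, 540.0), (320.0, 500.0)
        };
        Console.WriteLine(CountColumns(extents, 18.0, 300.0)); // prints 2
    }
}
```

A page would be treated as multi-column when the count is greater than 1. The check deliberately ignores vertical position, so a page that mixes a single-column region with a two-column region is reported as multi-column as a whole.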

Could you please share a sample JSON file showing the output you want to obtain from converting a PDF? We will check the feasibility of this feature and share our feedback with you.

Sudharsann01 · December 9, 2020, 5:47pm

@asad.ali
Using the ParagraphAbsorber on a multi-column PDF doesn’t retain the order in which the text appears. Can you share a sample PDF and code where the order is retained?

For PDF to JSON, you can use the PDF that I attached earlier as sample.

Sudharsann01 · December 9, 2020, 8:19pm

@asad.ali On a different note, is there a way to generate bookmarks from an existing TOC in a PDF?

asad.ali · December 10, 2020, 6:11pm

@Sudharsann01

Please check the following code snippet, which extracts paragraphs that run over two columns.

Document doc = new Document(myDir + "MultiColumnPdf.pdf");

ParagraphAbsorber absorber = new ParagraphAbsorber();
absorber.Visit(doc);

PageMarkup markup = absorber.PageMarkups[0];

Console.WriteLine("IsMulticolumnParagraphsAllowed == false\r\n");

MarkupSection section = markup.Sections[2];
MarkupParagraph paragraph = section.Paragraphs[section.Paragraphs.Count - 1];

Console.WriteLine("Section at {0} last paragraph text:\r\n", section.Rectangle.ToString());
Console.WriteLine(paragraph.Text);

section = markup.Sections[1];
paragraph = section.Paragraphs[0];

Console.WriteLine("\r\nSection at {0} first paragraph text:\r\n", section.Rectangle.ToString());
Console.WriteLine(paragraph.Text);

markup.IsMulticolumnParagraphsAllowed = true;
Console.WriteLine("\r\nIsMulticolumnParagraphsAllowed == true\r\n");

section = markup.Sections[2];
paragraph = section.Paragraphs[section.Paragraphs.Count - 1];

Console.WriteLine("Section at {0} last paragraph text:\r\n", section.Rectangle.ToString());
Console.WriteLine(paragraph.Text);

section = markup.Sections[1];
paragraph = section.Paragraphs[0];

Console.WriteLine("\r\nSection at {0} first paragraph text:\r\n", section.Rectangle.ToString());
Console.WriteLine(paragraph.Text);

You can switch the paragraph presentation between default and multi-column mode using the IsMulticolumnParagraphsAllowed property of the PageMarkup object.

With IsMulticolumnParagraphsAllowed set to ‘false’, the parts of a paragraph that fall in different sections are exposed as independent MarkupParagraph objects. With ‘true’, the parts are exposed as a single MarkupParagraph object.

Please take into account that this works for paragraphs that run over two columns. The case where a paragraph runs over three (or more) columns is much more complicated, and its solution will require additional time. We have created a separate task, PDFNET-45323, for this.

We requested a JSON file (expected output) which could be a sample JSON format file so that we can know in what style you want to obtain it from PDF.

We need to check this feasibility and are currently investigating it at our end. We will soon get back to you with our feedback.

Sudharsann01 · December 10, 2020, 8:29pm

@asad.ali Thanks for the information.

Can you share the MultiColumnPdf.pdf that you used in the code, so that we can understand the code?
If converting a table of contents to bookmarks is under investigation, what about just identifying the table of contents section and retrieving it? Is that possible?
I am unable to attach the JSON because .json is not one of the allowed file extensions. Below is what we are looking for.
{
  "bookmarks": [
    {
      "name": "Bookmark1",
      "text": "Contents of Bookmark1",
      "sub_bookmarks": [
        {
          "name": "SubBookmark1",
          "text": "Contents of SubBookmark1",
          "sub_bookmarks": [
            {
              "name": "SubBookmark2",
              "text": "Contents of SubBookmark2"
            }
          ]
        }
      ]
    },
    {
      "name": "Bookmark2",
      "text": "Contents of Bookmark2"
    }
  ]
}
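
For reference, the nested shape above can be modelled with a small recursive type. The following is a minimal sketch using System.Text.Json from .NET, not anything provided by Aspose.PDF; the type names BookmarkNode, BookmarkDocument, and BookmarkJson are hypothetical, and filling in the "text" of each node is left to the custom parser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.Json.Serialization;

// Hypothetical node type mirroring one entry of the JSON above.
public class BookmarkNode
{
    [JsonPropertyName("name")]
    public string Name { get; set; }

    [JsonPropertyName("text")]
    public string Text { get; set; }

    // Left null for leaf bookmarks so that the key is omitted from the output.
    [JsonPropertyName("sub_bookmarks")]
    public List<BookmarkNode> SubBookmarks { get; set; }
}

// Hypothetical root object holding the top-level bookmarks.
public class BookmarkDocument
{
    [JsonPropertyName("bookmarks")]
    public List<BookmarkNode> Bookmarks { get; set; } = new List<BookmarkNode>();
}

public static class BookmarkJson
{
    public static string Serialize(BookmarkDocument document)
    {
        var options = new JsonSerializerOptions
        {
            WriteIndented = true,
            DefaultIgnoreCondition = JsonIgnoreCondition.WhenWritingNull
        };
        return JsonSerializer.Serialize(document, options);
    }

    public static void Main()
    {
        var document = new BookmarkDocument();
        document.Bookmarks.Add(new BookmarkNode
        {
            Name = "Bookmark1",
            Text = "Contents of Bookmark1",
            SubBookmarks = new List<BookmarkNode>
            {
                new BookmarkNode { Name = "SubBookmark1", Text = "Contents of SubBookmark1" }
            }
        });
        document.Bookmarks.Add(new BookmarkNode { Name = "Bookmark2", Text = "Contents of Bookmark2" });

        Console.WriteLine(Serialize(document));
    }
}
```

Keeping the children as a nullable list lets leaf bookmarks serialize without an empty "sub_bookmarks" array, which matches the sample.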

asad.ali · December 11, 2020, 10:44pm

@Sudharsann01

We tried to convert the TOC into bookmarks using your document and the following code snippet, but without much success:

Document doc = new Document(dataDir + "CBRE Employee_Handbook_For_EHA.pdf");
var tocPage = doc.Pages[2];
List<Facades.Bookmark> lstBookmarks = new List<Facades.Bookmark>();
foreach(Annotation annotation in tocPage.Annotations)
{
 // Only handle links whose action is a GoTo with an explicit XYZ destination;
 // the pattern checks avoid an invalid cast or a null reference on other links
 if(annotation is LinkAnnotation lnkAnnot
    && lnkAnnot.Action is GoToAction goToAction
    && goToAction.Destination is XYZExplicitDestination xyzdest)
 {
  var title = lnkAnnot.Contents;
  Facades.Bookmark bookmark = new Facades.Bookmark();
  bookmark.PageNumber = xyzdest.PageNumber;
  bookmark.Title = title;
  bookmark.PageDisplay_Top = Convert.ToInt32(xyzdest.Top);
  bookmark.PageDisplay_Left = Convert.ToInt32(xyzdest.Left);
  lstBookmarks.Add(bookmark);
 }
}

Facades.PdfBookmarkEditor editor = new Facades.PdfBookmarkEditor();
editor.BindPdf(doc);
foreach(Facades.Bookmark bookmark in lstBookmarks)
{
 editor.CreateBookmarks(bookmark);
}

doc.Save(dataDir + "output.pdf");

Therefore, a ticket has been logged for further investigation as PDFNET-49170 in our issue tracking system. As the above code snippet shows, we were able to extract the links from the TOC, but we could not extract the link text using the Annotation.Contents property, which was always NULL.

Looking at the JSON format you shared, it seems that you want to export bookmarks to a JSON file. The API already offers similar functionality as an XML export. Please try the following code snippet, in which the API generates an XML file containing the bookmark definitions in a similar format, and let us know if it does not suit you:

Document doc = new Document(dataDir + "source.pdf");
Facades.PdfBookmarkEditor editor = new Facades.PdfBookmarkEditor();
editor.BindPdf(doc);
editor.ExportBookmarksToXML(dataDir + "bookmarks.xml");

MultiColumnPdf.pdf (3.1 KB)

Sudharsann01 · December 14, 2020, 10:04pm

@asad.ali In the JSON that I shared, we are not just looking for each bookmark but also for the actual content enclosed within it.

Below is what I’ve come up with to convert the TOC to bookmarks. It is not 100% accurate due to the text extraction issue. Please check whether you can fix that issue.

Bookmarks bookmarks = new Bookmarks();

        Aspose.Pdf.Document pdfDocument = new Aspose.Pdf.Document(@"C:\Users\narayanasamys\Downloads\Employee Handbook Examples\SampleEmployeeHandbook for RETAIL.pdf");

        for (int index = 0; index < 10; index++)
        {
            var annotations = pdfDocument.Pages[index + 1].Annotations.Where(x => x.AnnotationType == AnnotationType.Link).ToList();
            foreach (var anno in annotations)
            {
                var linkAnnotation = (anno as Aspose.Pdf.Annotations.LinkAnnotation);
                if (linkAnnotation.Action == null)
                {
                    TextAbsorber absorber = new TextAbsorber();
                    absorber.TextSearchOptions.LimitToPageBounds = true;
                    absorber.TextSearchOptions.Rectangle = anno.Rect;
                    pdfDocument.Pages[index + 1].Accept(absorber);
                    string extractedText = absorber.Text;
                    System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(@"(\d+)\s*$");
                    var match = regex.Match(extractedText);
                    if (linkAnnotation.Destination is ExplicitDestination destination)
                    {
                        string title = "";
                        if (!string.IsNullOrEmpty(extractedText) && match.Success)
                        {
                            int pageNumber = Convert.ToInt32(match.Groups[1].Value);
                            // Guard against a non-positive index, which would make Substring throw
                            int pageIndex = extractedText.LastIndexOf(pageNumber.ToString());
                            if (pageIndex > 0)
                            {
                                title = extractedText.Substring(0, pageIndex - 1)
                                        .Replace(Environment.NewLine, string.Empty).Replace(".", string.Empty).Trim();
                            }
                        }
                        var bookmark = new Aspose.Pdf.Facades.Bookmark
                        {
                            Title = title,
                            Level = 1,
                            ChildItems = new Bookmarks(),
                            PageNumber = destination.PageNumber
                        };
                        bookmarks.Add(bookmark);
                    }
                    else
                    {
                        if (!string.IsNullOrEmpty(extractedText) && match.Success)
                        {
                            int pageNumber = Convert.ToInt32(match.Groups[1].Value);
                            if (extractedText.LastIndexOf(pageNumber.ToString()) > 0)
                            {
                                string title = extractedText.Substring(0, extractedText.LastIndexOf(pageNumber.ToString()) - 1)
                                    .Replace(Environment.NewLine, string.Empty).Replace(".", string.Empty).Trim();
                                var bookmark = new Aspose.Pdf.Facades.Bookmark
                                {
                                    Title = title,
                                    Level = 1,
                                    ChildItems = new Bookmarks(),
                                    PageNumber = pageNumber
                                };
                                bookmarks.Add(bookmark);
                            }
                        }
                    }
                }
                else if(linkAnnotation.Action is Aspose.Pdf.Annotations.GoToAction action)
                {
                    TextAbsorber absorber = new TextAbsorber();
                    absorber.TextSearchOptions.LimitToPageBounds = true;
                    absorber.TextSearchOptions.Rectangle = anno.Rect;
                    pdfDocument.Pages[index + 1].Accept(absorber);
                    string extractedText = absorber.Text;
                    System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(@"(\d+)\s*$");
                    var match = regex.Match(extractedText);
                    if (!string.IsNullOrEmpty(extractedText) && match.Success)
                    {
                        int pageNumber = Convert.ToInt32(match.Groups[1].Value);
                        if (extractedText.LastIndexOf(pageNumber.ToString()) > 0)
                        {
                            string title = extractedText.Substring(0, extractedText.LastIndexOf(pageNumber.ToString()) - 1)
                                .Replace(Environment.NewLine, string.Empty).Replace(".", string.Empty).Trim();
                            var bookmark = new Aspose.Pdf.Facades.Bookmark
                            {
                                Title = title,
                                Level = 1,
                                ChildItems = new Bookmarks(),
                                PageNumber = pageNumber
                            };
                            bookmarks.Add(bookmark);
                        }
                    }
                    else
                    {
                        if (action.Destination is ExplicitDestination destination)
                        {
                            string title = extractedText.Replace(Environment.NewLine, string.Empty).Replace(".", string.Empty).Trim();
                            var bookmark = new Aspose.Pdf.Facades.Bookmark
                            {
                                Title = title,
                                Level = 1,
                                ChildItems = new Bookmarks(),
                                PageNumber = destination.PageNumber
                            };
                            bookmarks.Add(bookmark);
                        }
                    }
                }
            }
        }
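
The split of a TOC line into title and page number is repeated in each branch above, so it could be isolated in one helper. This is a minimal sketch using only System.Text.RegularExpressions; the class name TocEntryParser and its TryParse method are hypothetical. Unlike the snippet above, it strips only the trailing dot leaders, so dots inside a title such as "1.2 Overview" are kept.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TocEntryParser
{
    // Title (lazy), then optional dot leaders / whitespace, then the trailing page number.
    private static readonly Regex EntryPattern =
        new Regex(@"^(?<title>.*?)[\s.]*(?<page>\d+)\s*$", RegexOptions.Singleline);

    // Returns true when the text ends in a page number and has a non-empty title.
    public static bool TryParse(string extractedText, out string title, out int pageNumber)
    {
        title = string.Empty;
        pageNumber = 0;
        if (string.IsNullOrWhiteSpace(extractedText))
            return false;

        Match match = EntryPattern.Match(extractedText);
        if (!match.Success)
            return false;
        if (!int.TryParse(match.Groups["page"].Value, out pageNumber))
            return false;

        // Drop line breaks that the text extraction may have inserted inside the title.
        title = match.Groups["title"].Value
            .Replace("\r", string.Empty)
            .Replace("\n", string.Empty)
            .Trim();
        return title.Length > 0;
    }

    public static void Main()
    {
        if (TryParse("Introduction ........ 12", out string title, out int page))
            Console.WriteLine(title + " -> " + page); // prints "Introduction -> 12"
    }
}
```

Each branch could then call TryParse on the absorbed text and fall back to the link destination's page number when it returns false.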

asad.ali · December 15, 2020, 10:15pm

@Sudharsann01

We are looking into it and will get back to you shortly.

Sudharsann01 · December 30, 2020, 6:07pm

@asad.ali Any update on this?
Also, is there a way to identify whether a page is multi-column or single-column?

asad.ali · January 4, 2021, 3:49pm

@Sudharsann01

We have tested your code snippet and obtained the attached output PDF.

output.pdf (1.7 MB)

Would you kindly check it and point out the exact issues you want to report?

As shared previously, we have logged an enhancement ticket as PDFNET-49156 in our issue tracking system in order to implement a property that can help to identify whether a Page is multi-column or not. As soon as it is implemented, we will update you within this forum thread. Please give us some time.

We apologize for the inconvenience.

Sudharsann01 · January 13, 2021, 1:50pm

I see that you’ve attached a PDF, but I am not sure why or what you want me to check in it. Can you please clarify?

The code I shared only generates in-memory bookmarks. Later in the code, we use these bookmarks to extract the actual content within each bookmark entry.

asad.ali · January 13, 2021, 8:50pm

@Sudharsann01

We modified the code snippet in order to add bookmarks in the PDF obtained from the part of the code that you had shared. The complete code snippet that was used is as below:

Document pdfDocument = new Document(dataDir + "CBRE Employee_Handbook_For_EHA.pdf");
            List<Facades.Bookmark> bookmarks = new List<Facades.Bookmark>();
            for (int index = 0; index < 10; index++)
            {
                var annotations = pdfDocument.Pages[index + 1].Annotations.Where(x => x.AnnotationType == AnnotationType.Link).ToList();
                foreach (var anno in annotations)
                {
                    var linkAnnotation = (anno as Aspose.Pdf.Annotations.LinkAnnotation);
                    if (linkAnnotation.Action == null)
                    {
                        TextAbsorber absorber = new TextAbsorber();
                        absorber.TextSearchOptions.LimitToPageBounds = true;
                        absorber.TextSearchOptions.Rectangle = anno.Rect;
                        pdfDocument.Pages[index + 1].Accept(absorber);
                        string extractedText = absorber.Text;
                        System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(@"(\d+)\s*$");
                        var match = regex.Match(extractedText);
                        if (linkAnnotation.Destination is ExplicitDestination destination)
                        {
                            string title = "";
                            if (!string.IsNullOrEmpty(extractedText) && match.Success)
                            {
                                int pageNumber = Convert.ToInt32(match.Groups[1].Value);
                                // Guard against a non-positive index, which would make Substring throw
                                int pageIndex = extractedText.LastIndexOf(pageNumber.ToString());
                                if (pageIndex > 0)
                                {
                                    title = extractedText.Substring(0, pageIndex - 1)
                                            .Replace(Environment.NewLine, string.Empty).Replace(".", string.Empty).Trim();
                                }
                            }
                            var bookmark = new Aspose.Pdf.Facades.Bookmark
                            {
                                Title = title,
                                Level = 1,
                                ChildItems = new Facades.Bookmarks(),
                                PageNumber = destination.PageNumber
                            };
                            bookmarks.Add(bookmark);
                        }
                        else
                        {
                            if (!string.IsNullOrEmpty(extractedText) && match.Success)
                            {
                                int pageNumber = Convert.ToInt32(match.Groups[1].Value);
                                if (extractedText.LastIndexOf(pageNumber.ToString()) > 0)
                                {
                                    string title = extractedText.Substring(0, extractedText.LastIndexOf(pageNumber.ToString()) - 1)
                                        .Replace(Environment.NewLine, string.Empty).Replace(".", string.Empty).Trim();
                                    var bookmark = new Aspose.Pdf.Facades.Bookmark
                                    {
                                        Title = title,
                                        Level = 1,
                                        ChildItems = new Facades.Bookmarks(),
                                        PageNumber = pageNumber
                                    };
                                    bookmarks.Add(bookmark);
                                }
                            }
                        }
                    }
                    else if (linkAnnotation.Action is Aspose.Pdf.Annotations.GoToAction action)
                    {
                        TextAbsorber absorber = new TextAbsorber();
                        absorber.TextSearchOptions.LimitToPageBounds = true;
                        absorber.TextSearchOptions.Rectangle = anno.Rect;
                        pdfDocument.Pages[index + 1].Accept(absorber);
                        string extractedText = absorber.Text;
                        System.Text.RegularExpressions.Regex regex = new System.Text.RegularExpressions.Regex(@"(\d+)\s*$");
                        var match = regex.Match(extractedText);
                        if (!string.IsNullOrEmpty(extractedText) && match.Success)
                        {
                            int pageNumber = Convert.ToInt32(match.Groups[1].Value);
                            if (extractedText.LastIndexOf(pageNumber.ToString()) > 0)
                            {
                                string title = extractedText.Substring(0, extractedText.LastIndexOf(pageNumber.ToString()) - 1)
                                    .Replace(Environment.NewLine, string.Empty).Replace(".", string.Empty).Trim();
                                var bookmark = new Aspose.Pdf.Facades.Bookmark
                                {
                                    Title = title,
                                    Level = 1,
                                    ChildItems = new Facades.Bookmarks(),
                                    PageNumber = pageNumber
                                };
                                bookmarks.Add(bookmark);
                            }
                        }
                        else
                        {
                            if (action.Destination is ExplicitDestination destination)
                            {
                                string title = extractedText.Replace(Environment.NewLine, string.Empty).Replace(".", string.Empty).Trim();
                                var bookmark = new Aspose.Pdf.Facades.Bookmark
                                {
                                    Title = title,
                                    Level = 1,
                                    ChildItems = new Facades.Bookmarks(),
                                    PageNumber = destination.PageNumber
                                };
                                bookmarks.Add(bookmark);
                            }
                        }
                    }
                }
            }
            Facades.PdfBookmarkEditor editor = new Facades.PdfBookmarkEditor();
            editor.BindPdf(pdfDocument);
            foreach (Facades.Bookmark bookmark in bookmarks)
            {
                editor.CreateBookmarks(bookmark);
            }

            pdfDocument.Save(dataDir + "output.pdf");

The shared output was generated by this code, and you can see that it contains the bookmarks as well. We asked you to check it and share your feedback if you notice any issue with it.

We did not notice any error during text extraction on our side, as you mentioned. Would you please explain the problem in a bit more detail so that we can assist you accordingly?