Read PDF Into A Traversable Object Tree

JessIsCoding · November 26, 2023, 3:10am

Howdy!

I’m writing a program with Aspose.PDF.NET to extract information from a PDF file created by a form (i.e. the structure of the PDF is fixed and known in advance). I can read the file text and find what I need, but that feels like brute force and probably doesn’t take advantage of what Aspose.PDF.NET has to offer.

Does Aspose.PDF.NET include functionality to read the PDF file into a traversable object? By traversable object I mean something where I can access the contents of the PDF file as pdfFile.Contents.Headings[1].Text (I hope that example makes sense), instead of having to know that the header text is on line 60 of a predetermined PDF format used by company X.

Thanks!

Jess

asad.ali · November 26, 2023, 1:16pm

@JessIsCoding

Sadly, Aspose.PDF does not offer a feature to read or traverse the document structure. However, we can log a feature request if you provide a sample PDF for our reference along with the expected output. We will then create a ticket in our issue tracking system and share its ID with you.

JessIsCoding · November 27, 2023, 5:02am

I’m afraid I can’t share the PDF file, because it’s proprietary.

asad.ali · November 27, 2023, 2:00pm

@JessIsCoding

Below is a sample code snippet to traverse a tagged PDF document and its content:

using System;
using System.Diagnostics;
using Aspose.Pdf;
using Aspose.Pdf.LogicalStructure;

private static void TraverseTaggedContent(string dataDir)
{
    var pdfDocument = new Document(dataDir + "FinRep - Example.pdf");

    // Walk the structure tree of the tagged PDF, starting at its root element.
    Traverse(pdfDocument.TaggedContent.RootElement);

    // Time how long saving the document takes.
    var sw = Stopwatch.StartNew();
    pdfDocument.Save(dataDir + "output.pdf");
    sw.Stop();
    Console.WriteLine("Total time taken: " + sw.Elapsed.TotalSeconds);
}

private static void Traverse(StructureElement element)
{
    // Inspect the current element here before descending into its children.
    foreach (var child in element.ChildElements)
    {
        // Recurse only into children that are structure elements.
        if (child is StructureElement structureChild)
        {
            Traverse(structureChild);
        }
    }
}
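
The Traverse method above only visits each node and does not read anything from it. As a rough sketch of how the same walk could print what it finds, the variant below writes one indented line per structure element. It assumes the Aspose.PDF for .NET package is referenced and that the PDF is actually tagged. The class name TaggedTreeDump, the Dump method, and the "input.pdf" path are made up for illustration. ActualText and Title are optional attributes in a tagged PDF, so they may well be empty for a given file.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.LogicalStructure;

// Hypothetical helper class, not part of Aspose.PDF.
static class TaggedTreeDump
{
    // Prints one line per structure element, indented by its depth in the tree.
    public static void Dump(StructureElement element, int depth = 0)
    {
        // Plain .NET reflection: the concrete class names you see depend on
        // how the library maps the tags found in your particular file.
        string kind = element.GetType().Name;

        // Both attributes are optional and are often null.
        string text = element.ActualText ?? element.Title ?? "(no text attribute)";

        Console.WriteLine(new string(' ', depth * 2) + kind + ": " + text);

        foreach (var child in element.ChildElements)
        {
            // Recurse only into children that are structure elements.
            if (child is StructureElement structureChild)
            {
                Dump(structureChild, depth + 1);
            }
        }
    }

    static void Main(string[] args)
    {
        // "input.pdf" is a placeholder; pass the real path as the first argument.
        string path = args.Length > 0 ? args[0] : "input.pdf";

        using (var pdfDocument = new Document(path))
        {
            Dump(pdfDocument.TaggedContent.RootElement);
        }
    }
}
```

If the output shows only the root element, the file is probably not tagged, and this approach will not expose headings or other logical structure for it.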

Please let us know if that helps.