Howdy!
I’m writing a program with Aspose.PDF.NET to extract information from a PDF file created from a form (i.e. the structure of the PDF is fixed and known in advance). I can read the file text and find what I need, but that feels like brute force and probably doesn’t take advantage of what Aspose.PDF.NET has to offer.
Does Aspose.PDF.NET include functionality to read the PDF file into a traversable object? By traversable object I mean something where I can access the contents of the PDF file as pdfFile.Contents.Headings[1].Text (I hope that example makes sense), instead of having to know that the header text is on line 60 of a predetermined PDF format used by company X.
Thanks!
Jess
@JessIsCoding
Sadly, Aspose.PDF does not offer a feature to read or traverse the document structure. However, we can log a feature request if you provide a sample PDF for our reference, along with a sample of the expected output. We will then create a ticket in our issue tracking system and share its ID with you.
I’m afraid I can’t share the PDF file, because it’s proprietary.
@JessIsCoding
Below is a sample code snippet to traverse a tagged PDF document and its content:
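using System;
using Aspose.Pdf;
using Aspose.Pdf.LogicalStructure;

private static void TraverseTaggedContent(string dataDir)
{
    // Load the tagged PDF and walk its logical structure tree from the root.
    var pdfDocument = new Document(dataDir + "FinRep - Example.pdf");
    Traverse(pdfDocument.TaggedContent.RootElement);

    // Note: the stopwatch times only the Save call, not the traversal above.
    System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
    sw.Start();
    pdfDocument.Save(dataDir + "output.pdf");
    sw.Stop();
    Console.WriteLine("Total time taken: " + sw.Elapsed.TotalSeconds);
}

private static void Traverse(StructureElement element)
{
    // Inspect "element" here as needed; this sample only walks the tree.
    foreach (var child in element.ChildElements)
    {
        // Recurse only into children that are themselves structure elements.
        if (child is StructureElement structureChild)
        {
            Traverse(structureChild);
        }
    }
}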
Please let us know if that helps.
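For anyone who wants to see what the traversal actually visits, here is a minimal sketch that prints one line per structure element, indented by depth. It uses only the Aspose calls already shown in the snippet above (Document, TaggedContent.RootElement, ChildElements, StructureElement) plus plain .NET reflection. The class name TaggedOutline, the method PrintOutline, and the path "input.pdf" are placeholders invented for illustration, and the sketch assumes the PDF is tagged; an untagged file may produce little or no output.

```csharp
using System;
using Aspose.Pdf;
using Aspose.Pdf.LogicalStructure;

internal static class TaggedOutline
{
    // Hypothetical helper: prints the runtime class name of each structure
    // element, indented two spaces per level of nesting.
    private static void PrintOutline(StructureElement element, int depth)
    {
        // GetType().Name is standard .NET reflection, not an Aspose API.
        Console.WriteLine(new string(' ', depth * 2) + element.GetType().Name);

        foreach (var child in element.ChildElements)
        {
            // Recurse only into children that are themselves structure elements.
            if (child is StructureElement structureChild)
            {
                PrintOutline(structureChild, depth + 1);
            }
        }
    }

    private static void Main(string[] args)
    {
        // "input.pdf" is a placeholder; pass the real path as the first argument.
        var path = args.Length > 0 ? args[0] : "input.pdf";
        var pdfDocument = new Document(path);
        PrintOutline(pdfDocument.TaggedContent.RootElement, 0);
    }
}
```

Printing the class names first is a cheap way to check whether the form's PDF carries enough tagging to be worth navigating by structure rather than by text position.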