Extract Chapter / Sections from PDF text layer

paulrinckens · October 24, 2018, 11:37am

Hi all,

I want to split PDF files not by page but by chapter, section or heading. The input PDF files have a text layer.
How can this be achieved using Aspose.PDF for Java? So far I have not discovered any methods to access the text layer’s hierarchy/structure.

Thank you for your support!

Paul

Farhan.Raza · October 24, 2018, 6:40pm

@paulrinckens

Thank you for contacting support.

Would you please share a sample PDF file and specify the page numbers and layer names by which you want to split the document, so that we can investigate it in our environment and help you out?

paulrinckens · October 25, 2018, 6:24am

@Farhan.Raza

Thank you for the quick reply. Here is a link to an example PDF file: https://www.einfach-fuer-alle.de/download/pdf_barrierefrei.pdf

Best regards,

Paul

Farhan.Raza · October 25, 2018, 5:03pm

@paulrinckens

Would you please clarify whether you are referring to page layers or to document bookmarks? Adobe Acrobat does not display any layer name or layer information for the shared PDF document. Please share a screenshot to illustrate your requirements in more detail.

paulrinckens · October 29, 2018, 12:21pm

Hi Farhan,

Thanks for your reply!

I’m referring to page layers, not bookmarks.
I would like to split the linked document by its content, that is:

First split: 1. Einleitung
[all text from the part 1. Einleitung]

Second split: 2. Welche Einstellungsoptionen bietet der Acrobat Reader von sich aus
für Menschen mit körperlichen Einschränkungen?
[all text from the part 2. Welche Einstellungsoptionen bietet … ]

Third split: 3. Was macht ein PDF-Dokument grundsätzlich unzugänglich?
[all text from the part 3. Was macht ein PDF-Dokument …]

I hope this clarifies my intent.
How does Aspose parse PDF text layers in detail? Is there a structural representation of the document’s text hierarchy that can be accessed through the API?

Best regards, Paul
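
The split described in the post above can be sketched without relying on any layer or structure API, assuming the document's text has already been extracted into a plain string. The sketch below is only an illustration under that assumption: the class name `SectionSplitter`, the method `splitByHeadings` and the heading pattern are hypothetical and not part of Aspose.PDF. It treats every line of the form "<number>. <title>" as the start of a new top-level section and collects the text up to the next such line.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionSplitter {

    // Assumption: a top-level heading is a line that starts with "<number>. "
    // followed by the title, e.g. "1. Einleitung". Sub-headings such as
    // "2.1 ..." do not match because a space or tab must follow the first dot.
    private static final Pattern HEADING =
            Pattern.compile("^(\\d+)\\.[ \\t]+(\\S.*)$", Pattern.MULTILINE);

    // Returns a map from heading line to the text that follows it, in document
    // order. Text that appears before the first heading is dropped.
    public static Map<String, String> splitByHeadings(String text) {
        Map<String, String> sections = new LinkedHashMap<>();
        Matcher m = HEADING.matcher(text);
        String currentTitle = null;
        int bodyStart = 0;
        while (m.find()) {
            if (currentTitle != null) {
                // Close the previous section at the start of this heading.
                sections.put(currentTitle, text.substring(bodyStart, m.start()).trim());
            }
            currentTitle = m.group(1) + ". " + m.group(2).trim();
            bodyStart = m.end();
        }
        if (currentTitle != null) {
            // The last section runs to the end of the text.
            sections.put(currentTitle, text.substring(bodyStart).trim());
        }
        return sections;
    }

    public static void main(String[] args) {
        // Placeholder text for illustration only, not taken from the sample PDF.
        String text = String.join("\n",
                "1. Einleitung",
                "Intro text.",
                "2. Welche Einstellungsoptionen bietet der Acrobat Reader?",
                "Options text, line one.",
                "Options text, line two.");
        splitByHeadings(text).forEach((title, body) ->
                System.out.println("[" + title + "] " + body.length() + " chars"));
    }
}
```

This heuristic has clear limits: a numbered list item in the body text would be mistaken for a heading, and a heading that wraps onto a second line (as the second heading in the example document does) keeps only its first line as the title, with the remainder ending up in the section body.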

Farhan.Raza · October 29, 2018, 7:07pm

@paulrinckens

Thank you for elaborating.

Would you please share how you are identifying the layers? Adobe Acrobat, Adobe Reader and Foxit Reader do not display any layer for the shared PDF file. We have also checked with the Aspose.PDF for Java API, using the code below, but no layer could be detected. Please explain with the help of screenshots, and mention the application you are using to view the layers.

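import java.util.List;
import com.aspose.pdf.Document;
import com.aspose.pdf.Layer;
import com.aspose.pdf.Page;

// dataDir is the folder that contains the sample PDF
Document document = new Document(dataDir + "pdf_barrierefrei.pdf");
for (Page page : document.getPages())
{
    // Print the names of any layers found on this page
    List<Layer> layers = page.getLayers();
    if (layers != null && !layers.isEmpty())
    {
        for (Layer layer : layers)
        {
            System.out.println(layer.getName());
        }
    }
    else
    {
        System.out.println("Page number " + page.getNumber() + " does not contain any layer");
    }
}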