Determine if PDF File contains images or text using Aspose.PDF for Java

saipravina · May 20, 2022, 11:35am

Hello,
I found that checking whether a PDF file contains images or text is implemented in Aspose.PDF for .NET (Find whether PDF contains images or text|Aspose.PDF for .NET). Is there similar support in Aspose.PDF for Java to find whether a PDF contains images or text? If so, could you please share the API code?

Thanks,
Saipravina

tahir.manzoor · May 20, 2022, 4:13pm

@saipravina

You can use the code example below to meet your requirement. We also suggest reading the following articles:
Extract Text from PDF File
Extract Images using PdfExtractor

PdfExtractor extractor = new PdfExtractor();

// Bind the input PDF document to extractor
extractor.bindPdf("input.pdf");
// Extract text from the input PDF document
extractor.extractText();
// Save the extracted text to a text file
extractor.getText("out.txt");

// Extract images from the input PDF document
extractor.extractImage();

// hasNextImage() returns true if at least one image was extracted from the document
boolean containsImage = extractor.hasNextImage();

saipravina · May 23, 2022, 8:12am

@tahir.manzoor
Do we have API support to determine PDF page by page to identify page is an image or text?

tahir.manzoor · May 23, 2022, 12:16pm

@saipravina

Yes, you can check images of a PDF page. Please check the code example from here:
Extract Images from PDF (facades)

The following code example shows how to extract images from specific pages of a PDF.


    //Create an extractor and bind it to the document
    Document document = new Document(_dataDir + "sample.pdf");
    PdfExtractor extractor = new PdfExtractor(document);
    extractor.setStartPage(1);
    extractor.setEndPage(3);            

    //Run the extractor
    extractor.extractImage();
    int imageNumber = 1;
    //Iterate through extracted images collection
    while (extractor.hasNextImage())
    {
        //Retrieve image from collection and save it in a file 
        extractor.getNextImage(_dataDir + String.format("image%03d.png", imageNumber++),ImageType.getPng());
    }

You can use the following code example to extract the text of each page.

    // open document
    Document pdfDocument = new Document("input.pdf");
    // text file in which extracted text will be saved
    // (opened once, before the loop, so every page is written to the same stream)
    java.io.OutputStream text_stream = new java.io.FileOutputStream("ExtractedText.txt", false);
    // iterate through all the pages of PDF file
    for (Page page : (Iterable<Page>) pdfDocument.getPages()) {
        // create text device
        TextDevice textDevice = new TextDevice();
        // set text extraction options - set text extraction mode (Raw or
        // Pure)
        TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
        textDevice.setExtractionOptions(textExtOptions);
        // get the text from the current page and write it to the OutputStream object
        textDevice.process(page, text_stream);
    }
    // close stream object
    text_stream.close();

You can also use TextAbsorber to get the text of a specific page. In this case, call the accept method on a particular page of the Document object. The index is the number of the page from which the text is to be extracted.
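
Once you have the extracted text and the number of extracted images for a single page (for example, by setting the extractor's start and end page to the same page number and counting the `hasNextImage()` iterations), deciding whether that page is text, image, or both is plain Java. The sketch below is only an illustration of that last step: `PageContentClassifier`, `Kind`, and `classify` are hypothetical names and not part of Aspose.PDF, and it assumes that whitespace-only text counts as "no text".

```java
// Hypothetical helper, not part of the Aspose.PDF API.
final class PageContentClassifier {

    // Possible outcomes for a single page.
    enum Kind { EMPTY, TEXT_ONLY, IMAGE_ONLY, TEXT_AND_IMAGES }

    // extractedText: text obtained for one page (e.g. via TextAbsorber or TextDevice).
    // imageCount: number of images found on the same page (e.g. counted with PdfExtractor).
    static Kind classify(String extractedText, int imageCount) {
        // Assumption: null or whitespace-only text means the page has no text.
        boolean hasText = extractedText != null && !extractedText.trim().isEmpty();
        boolean hasImages = imageCount > 0;

        if (hasText && hasImages) {
            return Kind.TEXT_AND_IMAGES;
        }
        if (hasText) {
            return Kind.TEXT_ONLY;
        }
        if (hasImages) {
            return Kind.IMAGE_ONLY;
        }
        return Kind.EMPTY;
    }

    public static void main(String[] args) {
        // Sample inputs stand in for values extracted from real pages.
        System.out.println(classify("Invoice 42", 0)); // prints TEXT_ONLY
        System.out.println(classify("   ", 2));        // prints IMAGE_ONLY
        System.out.println(classify("Figure 1", 1));   // prints TEXT_AND_IMAGES
    }
}
```

Note that a scanned page usually yields an image and no text, so `IMAGE_ONLY` is the case to look for when detecting scanned pages.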