Free Support Forum - aspose.com

Determine if PDF File contains images or text using Aspose.PDF for Java

Hello,
I found that Aspose.PDF for .NET can detect whether a PDF file contains images or text only (https://docs.aspose.com/pdf/net/find-whether-pdf-file-contains-images-or-text-only/). Does Aspose.PDF for Java offer similar support for finding out whether a PDF contains images or text? If so, could you please share the API code?

Thanks,
Saipravina

@saipravina

You can use the following code example to meet your requirement. We also suggest reading the following articles:
Extract Text from PDF File
Extract Images using PdfExtractor

PdfExtractor extractor = new PdfExtractor();

// Bind the input PDF document to extractor
extractor.bindPdf("input.pdf");
// Extract text from the input PDF document
extractor.extractText();
// Save the extracted text to a text file
extractor.getText("out.txt");

// Extract images from the input PDF document
extractor.extractImage();

// hasNextImage() returns true if at least one extracted image is available,
// so a single call is enough to tell whether the document contains images
boolean containsImage = extractor.hasNextImage();

@tahir.manzoor
Is there API support for checking a PDF page by page, to identify whether each page is an image or text?

@saipravina

Yes, you can check for images on individual pages of a PDF. Please see the code example here:
Extract Images from PDF (facades)

The following code example shows how to extract images from specific pages of a PDF.


    // Create an extractor and bind it to the document
    Document document = new Document(_dataDir + "sample.pdf");
    PdfExtractor extractor = new PdfExtractor(document);
    // Limit extraction to pages 1 through 3
    extractor.setStartPage(1);
    extractor.setEndPage(3);

    // Run the extractor
    extractor.extractImage();
    int imageNumber = 1;
    // Iterate through the extracted images collection
    while (extractor.hasNextImage())
    {
        // Retrieve the next image from the collection and save it to a file
        extractor.getNextImage(_dataDir + String.format("image%03d.png", imageNumber++), ImageType.getPng());
    }

You can use the following code example to extract the text of each page.

    // Open the document
    Document pdfDocument = new Document("input.pdf");
    // Text file in which the extracted text will be saved. It is opened once,
    // before the loop, so that every page is written to the same stream
    java.io.OutputStream text_stream = new java.io.FileOutputStream("ExtractedText.txt", false);

    // Iterate through all the pages of the PDF file
    for (Page page : (Iterable<Page>) pdfDocument.getPages()) {
        // Create a text device
        TextDevice textDevice = new TextDevice();
        // Set the text extraction mode (Raw or Pure)
        TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
        textDevice.setExtractionOptions(textExtOptions);
        // Extract the text of this page and write it to the output stream
        textDevice.process(page, text_stream);
    }
    // Close the stream object
    text_stream.close();

You can also use TextAbsorber to get the text of a specific page. In this case, call the accept method on that particular page of the Document object. The index used to select the page is the number of the page from which the text needs to be extracted.
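
To answer the page-by-page question directly, the image check and the text check from this thread can be combined into one decision per page. The following is only a minimal sketch of that decision logic, not an Aspose API: the names PageContent, PageProbe and PageContentClassifier are hypothetical and exist only for this illustration. The Aspose calls that could back the probe are mentioned in comments only and are assumptions to verify against the API reference for your version.

```java
// Hypothetical result type (not part of Aspose.PDF).
enum PageContent { TEXT_ONLY, IMAGE_ONLY, TEXT_AND_IMAGES, EMPTY }

// Hypothetical abstraction over the PDF library, so the logic can be tried
// without a PDF file. With Aspose.PDF, an implementation could be based on
// the calls discussed in this thread (assumptions, please verify):
//   hasImages: PdfExtractor with setStartPage(n) and setEndPage(n), then
//              extractImage() followed by hasNextImage()
//   hasText:   a TextAbsorber accepted by page n, then checking whether
//              getText() is non-blank
interface PageProbe {
    int pageCount();
    boolean hasText(int pageNumber);
    boolean hasImages(int pageNumber);
}

final class PageContentClassifier {
    private PageContentClassifier() {}

    // Maps the two facts about a page to a single category.
    static PageContent classify(boolean hasText, boolean hasImages) {
        if (hasText && hasImages) return PageContent.TEXT_AND_IMAGES;
        if (hasText) return PageContent.TEXT_ONLY;
        if (hasImages) return PageContent.IMAGE_ONLY;
        return PageContent.EMPTY;
    }

    // Classifies every page. Page numbers start at 1, matching
    // setStartPage(1) in the example above; result[0] describes page 1.
    static PageContent[] classifyAll(PageProbe probe) {
        int count = probe.pageCount();
        PageContent[] result = new PageContent[count];
        for (int page = 1; page <= count; page++) {
            result[page - 1] = classify(probe.hasText(page), probe.hasImages(page));
        }
        return result;
    }
}
```

Keeping the library calls behind a small interface is a design choice made for this sketch: the classification can be checked with a fake probe first, and the Aspose-backed implementation can be added afterwards.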