Free Support Forum - aspose.com

Determine if PDF File contains images or text using Aspose.PDF for Java

I found that checking whether a PDF file contains images or text only is supported in Aspose.PDF for .NET (https://docs.aspose.com/pdf/net/find-whether-pdf-file-contains-images-or-text-only/). Is there similar support in Aspose.PDF for Java to find out whether a PDF contains images or text? If so, can you please share the API code?



You can use the following code example to achieve your requirement. We also suggest reading the following articles:
Extract Text from PDF File
Extract Images using PdfExtractor

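// In-memory stream that receives the extracted text
java.io.ByteArrayOutputStream textStream = new java.io.ByteArrayOutputStream();
PdfExtractor extractor = new PdfExtractor();

// Bind the input PDF document to extractor ("input.pdf" is a placeholder path)
extractor.bindPdf("input.pdf");
// Extract text from the input PDF document
extractor.extractText();
// Write the extracted text to the stream; the document contains text if anything was written
extractor.getText(textStream);
boolean containsText = textStream.size() > 0;

// Extract images from the input PDF document
extractor.extractImage();
// hasNextImage returns true if at least one image was extracted
boolean containsImage = extractor.hasNextImage();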

Is there API support for checking a PDF page by page, to identify whether each page contains images or text?


Yes, you can check the images on a PDF page. Please see the code example here:
Extract Images from PDF (facades)

The following code example shows how to extract images from specific pages of a PDF.

    //Create an extractor and bind it to the document
    Document document = new Document(_dataDir + "sample.pdf");
    PdfExtractor extractor = new PdfExtractor(document);
    //Restrict extraction to a single page (page numbers start at 1)
    extractor.setStartPage(1);
    extractor.setEndPage(1);

    //Run the extractor
    extractor.extractImage();
    int imageNumber = 1;
    //Iterate through the extracted images collection
    while (extractor.hasNextImage()) {
        //Retrieve image from collection and save it in a file
        extractor.getNextImage(_dataDir + String.format("image%03d.png", imageNumber++), ImageType.getPng());
    }

You can use the following code example to extract the text of each page.

    // open document
    Document pdfDocument = new Document("input.pdf");
    // text file in which extracted text will be saved (opened once so pages do not overwrite each other)
    java.io.OutputStream text_stream = new java.io.FileOutputStream("ExtractedText.txt", false);
    // iterate through all the pages of PDF file
    for (Page page : (Iterable<Page>) pdfDocument.getPages()) {
        // create text device
        TextDevice textDevice = new TextDevice();
        // set text extraction options - set text extraction mode (Raw or Pure)
        TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
        textDevice.setExtractionOptions(textExtOptions);
        // get the text from pages of PDF and save it to OutputStream object
        textDevice.process(page, text_stream);
    }
    // close stream object
    text_stream.close();

You can also use TextAbsorber to get the text of a specific page. In this case, call the accept method (Accept in the .NET API) on that particular page of the Document object. The index is the number of the page from which text needs to be extracted; page numbering starts at 1.
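
Once you have the text and the images for a page, deciding whether that page is "text" or "image" is a small piece of plain Java. The sketch below is only an illustration: PageContentClassifier, Kind and classify are hypothetical names, not part of Aspose.PDF. It assumes you have already obtained the page's extracted text (for example from TextAbsorber) and the number of images found on that page, and it treats whitespace-only text as no text.

```java
/** Hypothetical helper, not part of Aspose.PDF: classifies a page from already-extracted data. */
final class PageContentClassifier {

    /** Possible outcomes for a single page. */
    enum Kind { EMPTY, TEXT_ONLY, IMAGE_ONLY, TEXT_AND_IMAGES }

    private PageContentClassifier() { }

    /**
     * @param pageText   text extracted from the page (may be null or blank)
     * @param imageCount number of images found on the page
     */
    static Kind classify(String pageText, int imageCount) {
        // Assumption: whitespace-only output counts as "no text"
        boolean hasText = pageText != null && !pageText.trim().isEmpty();
        boolean hasImages = imageCount > 0;

        if (hasText && hasImages) {
            return Kind.TEXT_AND_IMAGES;
        }
        if (hasText) {
            return Kind.TEXT_ONLY;
        }
        if (hasImages) {
            return Kind.IMAGE_ONLY;
        }
        return Kind.EMPTY;
    }

    public static void main(String[] args) {
        // Stand-in values; in real use these would come from the per-page
        // text and image extraction shown in the replies above.
        String[] texts = { "Invoice 2024", "", "  \n" };
        int[] imageCounts = { 0, 2, 0 };

        for (int i = 0; i < texts.length; i++) {
            System.out.println("Page " + (i + 1) + ": " + classify(texts[i], imageCounts[i]));
        }
    }
}
```

With the stand-in values in main, the three pages are reported as TEXT_ONLY, IMAGE_ONLY and EMPTY. Note that a scanned page normally has images and no extractable text, so it would come out as IMAGE_ONLY.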