How to make a PDF file allow users to select text in Java

Shivah6 · March 20, 2025, 12:17pm

I have a PDF file in which every page is an image. I want to provide copy-and-paste functionality so that a user can select text in the PDF, copy it, and paste it elsewhere. However, since each page is an image, the user cannot select any text.
I think I have to OCR the pages. I just want to allow the user to select, copy, and paste text.

Please suggest options using Aspose APIs in Java.

Professionalize.Discourse · March 20, 2025, 12:17pm

@Shivah6

To enable text selection and copying in a PDF file where each page is an image, you will need to perform Optical Character Recognition (OCR) on the images to convert them into searchable text. You can achieve this using Aspose.PDF for Java along with an external OCR tool like Tesseract.

Here’s a step-by-step guide on how to implement this:

1. Load the PDF document: Start by loading the PDF document that contains the page images.
2. Set up the OCR callback: Create a callback that handles the OCR processing for each image in the PDF. This callback runs Tesseract on the image and returns the recognition result in hOCR format.
3. Convert the PDF: Use the convert method of the Document class to process the images through the callback.
4. Save the output: Finally, save the modified PDF, which now contains selectable text.

Here is a sample code snippet demonstrating this process:

package com.aspose.pdf.examples;

import com.aspose.pdf.Document;
import com.aspose.pdf.Document.CallBackGetHocr;

import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.Scanner;

public class ConvertNonSearchablePDFToSearchablePDF {
    public static void main(String[] args) {
        final String myDir = "PathToDir/"; // Specify your directory path, including the trailing separator
        Document doc = new Document(myDir + "input.pdf"); // Load your PDF file

        // Create a callback for OCR processing
        CallBackGetHocr cbgh = new CallBackGetHocr() {
            @Override
            public String invoke(BufferedImage img) {
                File outputfile = new File(myDir + "temp.jpg");
                try {
                    ImageIO.write(img, "jpg", outputfile); // Save the image temporarily
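                    // Pass the arguments separately so that paths containing spaces are handled correctly
                    Process process = Runtime.getRuntime().exec(new String[] {"tesseract", myDir + "temp.jpg", myDir + "output", "hocr"});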
                    process.waitFor(); // Wait for the OCR process to complete
                } catch (IOException | InterruptedException e) {
                    e.printStackTrace();
                }

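                // Read the hOCR file written by Tesseract. Recent Tesseract versions write
                // output.hocr; older ones wrote output.html, so fall back to that name.
                File file = new File(myDir + "output.hocr");
                if (!file.exists()) {
                    file = new File(myDir + "output.html");
                }
                StringBuilder fileContents = new StringBuilder();
                try (Scanner scanner = new Scanner(file, "UTF-8")) {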
                    while (scanner.hasNextLine()) {
                        fileContents.append(scanner.nextLine()).append(System.lineSeparator());
                    }
                } catch (IOException e) {
                    e.printStackTrace();
                }

                // Clean up temporary files
                outputfile.delete();
                file.delete();
                return fileContents.toString(); // Return the hOCR markup for the image
            }
        };

        // Convert the PDF using the OCR callback
        doc.convert(cbgh);
        doc.save(myDir + "output.pdf"); // Save the searchable PDF
    }
}
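
The Tesseract call inside the callback ignores the process exit code, so a failed OCR run only shows up later as a missing output file. A minimal sketch of a stricter helper follows, assuming Java 11 or later; the class name TesseractRunner and its method names are hypothetical and not part of Aspose.PDF or Tesseract. It only wraps the same "tesseract <image> <outputbase> hocr" command used above.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of Aspose.PDF or Tesseract.
class TesseractRunner {

    // Builds the argument list; separate arguments keep paths with spaces intact.
    static List<String> hocrCommand(Path image, Path outputBase) {
        return List.of("tesseract", image.toString(), outputBase.toString(), "hocr");
    }

    // Finds the hOCR file for an output base, preferring ".hocr" over the older ".html".
    static Path locateHocr(Path outputBase) throws IOException {
        Path hocr = Path.of(outputBase + ".hocr");
        if (Files.exists(hocr)) {
            return hocr;
        }
        Path html = Path.of(outputBase + ".html");
        if (Files.exists(html)) {
            return html;
        }
        throw new IOException("No hOCR output found for " + outputBase);
    }

    // Runs Tesseract and returns the hOCR markup; throws on a non-zero exit code.
    static String runHocr(Path image, Path outputBase) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(hocrCommand(image, outputBase))
                .redirectErrorStream(true)
                .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }
        return Files.readString(locateHocr(outputBase)); // read as UTF-8
    }
}
```

Inside the callback, the temporary image and output base could be created with Files.createTempFile so that repeated or parallel conversions do not overwrite each other's temp.jpg and output files.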

Important Notes:

Ensure that Tesseract is installed on your system and is accessible from the command line.
Adjust the paths in the code to point to your actual file locations.
This code will create a new PDF file (output.pdf) that contains the recognized text, allowing users to select and copy text from it.
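
To confirm the first note before running the conversion, a quick check along these lines may help. This is a sketch only: the class name TesseractCheck and the 10-second timeout are arbitrary choices, and it simply runs "tesseract --version" and reports whether the command started and exited cleanly.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

// Hypothetical helper for checking that a command-line tool is reachable.
class TesseractCheck {

    // Returns true if the command starts and exits with code 0 within the timeout.
    static boolean isAvailable(String... command) {
        try {
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD)
                    .start();
            if (!process.waitFor(10, TimeUnit.SECONDS)) {
                process.destroyForcibly(); // did not finish in time
                return false;
            }
            return process.exitValue() == 0;
        } catch (IOException e) {
            return false; // typically means the executable was not found on PATH
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        if (isAvailable("tesseract", "--version")) {
            System.out.println("Tesseract found on PATH");
        } else {
            System.out.println("Tesseract not found; install it or add it to PATH");
        }
    }
}
```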

By following these steps, you can successfully convert a non-searchable PDF into a searchable one, enabling text selection and copying functionality for users.

Shivah6 · March 20, 2025, 12:35pm

My PDF document has tabular columns, and I want to be able to select the text in it.
I am actually converting ipx images to PDF, but the resulting PDF is not searchable. I need the text in my PDF to be selectable so that it can be copied and pasted.

Please suggest an Aspose Java API for this.

asad.ali · March 20, 2025, 7:38pm

@Shivah6

Aspose.PDF does not provide a feature to perform OCR on scanned PDF documents. It does, however, let you use the external Google Tesseract engine to perform OCR and place a hidden layer of text over the images. The code provided by the bot serves this purpose, and you can use it. Please let us know whether you have tried it and whether you ran into any issues.
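
For pages with tabular columns, it can help to inspect the hOCR that Tesseract returns before handing it to the callback. hOCR records a bounding box for each recognized word, which is presumably what allows the hidden text to line up with the image. The sketch below is a rough debugging aid, not an Aspose API: the class name HocrWords is hypothetical, and the regular expression assumes Tesseract's usual attribute order (class before title) rather than parsing the markup properly.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical debugging helper: lists words and bounding boxes found in hOCR markup.
class HocrWords {

    // One recognized word with its bounding box in image pixel coordinates.
    static final class Word {
        final String text;
        final int x0, y0, x1, y1;

        Word(String text, int x0, int y0, int x1, int y1) {
            this.text = text;
            this.x0 = x0;
            this.y0 = y0;
            this.x1 = x1;
            this.y1 = y1;
        }
    }

    // Matches <span class='ocrx_word' ... title='bbox x0 y0 x1 y1; ...'>text</span>.
    private static final Pattern WORD = Pattern.compile(
            "<span[^>]*class=['\"]ocrx_word['\"][^>]*title=['\"]bbox (\\d+) (\\d+) (\\d+) (\\d+)"
                    + "[^'\"]*['\"][^>]*>(.*?)</span>",
            Pattern.DOTALL);

    static List<Word> parse(String hocr) {
        List<Word> words = new ArrayList<>();
        Matcher m = WORD.matcher(hocr);
        while (m.find()) {
            // Drop inline tags such as <strong>; HTML entities are left as they are.
            String text = m.group(5).replaceAll("<[^>]+>", "").trim();
            words.add(new Word(text,
                    Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
                    Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4))));
        }
        return words;
    }
}
```

Printing the x0 values of the words on one line gives a quick view of whether the columns of a table were recognized as separate words in the expected positions.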