Could not read images in PDFs using Aspose.PDF Java

Ozer_Uyanik · April 3, 2023, 10:53am

I’m trying to read the attached PDF file using the Aspose.PDF for Java library. The PDF contains an element (image.png (482 Bytes)) that I’m unable to extract with Aspose.PDF. Is there any way to capture this kind of element using Aspose.PDF? Below is the code snippet I’m using to extract text from the PDF.

public static void ExtractTextFromPDF() {
    // The path to the documents directory.
    String _dataDir = "D:\\Users\\A\\Downloads\\extractpdf\\";
    String filePath = _dataDir + "1372333CB_20181105_00920.pdf";

    // Apply the license file.
    com.aspose.pdf.License license = new com.aspose.pdf.License();
    try {
        license.setLicense("D:\\Users\\A\\Downloads\\extractpdf\\Aspose.PDF.Java.lic");
    } catch (Exception e1) {
        e1.printStackTrace();
    }

    // Open document
    com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(filePath);

    // Create TextAbsorber object to extract text
    com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();

    // Accept the absorber for all the pages
    pdfDocument.getPages().accept(textAbsorber);

    // Get the extracted text
    String extractedText = textAbsorber.getText();

    // Append the extracted text to the output file.
    // try-with-resources closes the writer even if the write fails.
    try (java.io.FileWriter writer = new java.io.FileWriter(_dataDir + "extracted-text.txt", true)) {
        writer.write(extractedText);
    } catch (java.io.IOException e) {
        e.printStackTrace();
    }
}

Here is the PDF file - 0AB4CD_1.PDF (195.2 KB)

carlos.molina · April 3, 2023, 7:31pm

@Ozer_Uyanik,

Can you tell me which Aspose.PDF API version you are using?

I just did the same exercise in C# and Java, and both worked out fine:

C# Aspose.Pdf Version 23.3

private void Logic()
{
    var doc = new Document($"{PartialPath}_input.pdf");

    TextAbsorber textAbsorber = new TextAbsorber();

    doc.Pages.Accept(textAbsorber);

    string extractedText = textAbsorber.Text;

    Console.WriteLine(extractedText);

    Console.ReadKey();
}

Java Aspose.Pdf Version 23.2

public void Logic() throws Exception
{
    var doc = new Document(PartialPath + "_input.pdf");

    TextAbsorber textAbsorber = new TextAbsorber();

    doc.getPages().accept(textAbsorber);

    String extractedText = textAbsorber.getText();

    System.out.println(extractedText);
}

EDIT: At first I thought it was a missing character.
Are you talking about the checkbox?

After inspecting it with Adobe Acrobat Pro, I noticed that it is not a checkbox; it is a square with two diagonal lines.
So the TextAbsorber won’t display it, and the form fields won’t pick it up either.
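
A shape like this is normally drawn with path operators in the page's content stream (`re` for a rectangle, `m` and `l` for lines), which is why text and form-field extraction do not see it. The following is only a minimal sketch of how such a shape could be recognized, assuming the decoded content-stream text is already available as a string. The class name, the sample stream, and the tolerance are hypothetical, coordinate transforms (`cm`) are ignored, and no Aspose.PDF call is used; how the attached PDF actually draws the box has not been verified.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: find rectangles crossed by both diagonals in decoded PDF content-stream text. */
final class CrossedBoxSketch {

    // Matching tolerance in user-space units (an arbitrary choice for this sketch).
    private static final double EPS = 0.5;

    /** Returns each crossed rectangle as {x, y, width, height}. */
    static List<double[]> findCrossedBoxes(String contentStream) {
        List<double[]> boxes = new ArrayList<>();
        List<double[]> segments = new ArrayList<>();
        List<Double> operands = new ArrayList<>();
        double curX = 0, curY = 0;

        for (String token : contentStream.trim().split("\\s+")) {
            Double number = parseNumber(token);
            if (number != null) {                      // operands precede their operator
                operands.add(number);
                continue;
            }
            int n = operands.size();
            if (token.equals("re") && n >= 4) {        // x y w h re: rectangle
                boxes.add(new double[] {operands.get(n - 4), operands.get(n - 3),
                                        operands.get(n - 2), operands.get(n - 1)});
            } else if (token.equals("m") && n >= 2) {  // x y m: start a new subpath
                curX = operands.get(n - 2);
                curY = operands.get(n - 1);
            } else if (token.equals("l") && n >= 2) {  // x y l: line from the current point
                double x = operands.get(n - 2), y = operands.get(n - 1);
                segments.add(new double[] {curX, curY, x, y});
                curX = x;
                curY = y;
            }
            operands.clear();                          // any operator ends the operand run
        }

        List<double[]> crossed = new ArrayList<>();
        for (double[] b : boxes) {
            double x1 = b[0], y1 = b[1], x2 = b[0] + b[2], y2 = b[1] + b[3];
            // A crossed box needs both corner-to-corner diagonals.
            if (hasSegment(segments, x1, y1, x2, y2) && hasSegment(segments, x1, y2, x2, y1)) {
                crossed.add(b);
            }
        }
        return crossed;
    }

    private static Double parseNumber(String token) {
        try {
            return Double.valueOf(token);
        } catch (NumberFormatException e) {
            return null;                               // not a number, so treat it as an operator
        }
    }

    /** True if some segment joins (ax, ay) and (bx, by), in either direction. */
    private static boolean hasSegment(List<double[]> segments,
                                      double ax, double ay, double bx, double by) {
        for (double[] s : segments) {
            boolean forward = near(s[0], ax) && near(s[1], ay) && near(s[2], bx) && near(s[3], by);
            boolean reverse = near(s[0], bx) && near(s[1], by) && near(s[2], ax) && near(s[3], ay);
            if (forward || reverse) {
                return true;
            }
        }
        return false;
    }

    private static boolean near(double a, double b) {
        return Math.abs(a - b) <= EPS;
    }

    public static void main(String[] args) {
        // Hypothetical stream: one crossed box at (100, 100) and one empty box at (200, 100).
        String stream = "100 100 10 10 re S 100 100 m 110 110 l S "
                      + "100 110 m 110 100 l S 200 100 10 10 re S";
        System.out.println(findCrossedBoxes(stream).size() + " crossed box(es) found");
    }
}
```

The sketch compares coordinates only, so a box drawn under a scaling or translation matrix, or built from four separate lines instead of `re`, would need extra handling.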

Ozer_Uyanik · April 4, 2023, 7:17am

Your inspection is correct. Is there any way to read this square with two diagonal lines? Maybe as an image?

carlos.molina · April 4, 2023, 8:22pm

@Ozer_Uyanik,

Sorry for the late response. I was experimenting with it (trying to figure out a way to read it before responding), and sadly I was unable to find one.

Even a TextFragmentAbsorber (with coordinates) only works for text. There is no way to retrieve anything besides that.

Ozer_Uyanik · April 5, 2023, 6:33am

@carlos.molina

Thank you very much for your effort. I have another idea. What about extracting the checkboxes as images, like this?

XImage xImage = pdfDocument.getPages().get_Item(1).getResources().getImages().get_Item(1);

When I try to execute the line above, I get the following error message.

Exception in thread "main" class com.aspose.pdf.internal.ms.System.lh: Invalid index: index should be in the range [1…n] where n equals to the images count.

carlos.molina · April 5, 2023, 2:10pm

@Ozer_Uyanik,

That happens because it is not an image either. If you check the image count, you will see it is 0, so requesting index 1 from an empty collection throws that exception.
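
The count check described above can be sketched generically. The helper below is hypothetical and uses a plain `java.util.List` as a stand-in for a 1-based collection; it does not call the Aspose.PDF API and only illustrates validating the index against the count before the lookup.

```java
import java.util.List;
import java.util.Optional;

/** Sketch: count-checked access to a collection addressed with 1-based indexes. */
final class OneBasedAccess {

    /** Returns the item at the 1-based index, or empty if the index is outside [1, size]. */
    static <T> Optional<T> itemAt(List<T> items, int index) {
        if (index < 1 || index > items.size()) {
            return Optional.empty();       // covers the empty collection: size 0, index 1
        }
        return Optional.ofNullable(items.get(index - 1));
    }
}
```

With an empty collection, `itemAt(items, 1)` returns an empty `Optional` instead of throwing, so the caller can report that the page holds no images.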

Ozer_Uyanik · April 6, 2023, 3:57pm

@carlos.molina

When I convert the PDF to HTML using Adobe Acrobat Reader, I can extract these checkboxes as images. I’m attaching the images here for your reference: 0AB4CD_1.zip (1.6 KB)

Does Aspose already offer the same possibility, or can you suggest this feature to the development team for a future update?

carlos.molina · April 6, 2023, 4:20pm

@Ozer_Uyanik,

Since I am not a developer, I do not know how Acrobat Reader’s conversion to HTML works.

Also, that is different from what we were discussing before: converting to HTML is not the same as reading a PDF that has no checkboxes.

I cannot suggest new features to the developers or request them on your behalf. I have no input in that regard. I can only create tickets if I find a bug in existing functionality.

If you need custom functionality or have a special request, you can submit it via Paid Support.