Can't extract text from PDF

brett5d90e · January 12, 2023, 2:28am

I tried the demo, but I can’t extract any text from another PDF: https://www.nmlegis.gov/Sessions/22%20Regular/bills/senate/SB0014.pdf

package com.aspose.ocr.examples.OcrFeatures;

import com.aspose.ocr.DocumentRecognitionSettings;
import com.aspose.ocr.Language;
import com.aspose.ocr.RecognitionResult;
import com.aspose.ocr.RecognitionResult.LinesResult;
import com.aspose.ocr.pdf.AsposeOCRPdf;
import com.aspose.ocr.examples.Utils;

import java.awt.*;
import java.util.ArrayList;

public class OCRRecognizePdf {

public static void main(String[] args) {
	// ExStart:1
	// The path to the documents directory.
	String dataDir = Utils.getSharedDataDir(OCRRecognizePdf.class);

	// The path to the PDF file
	String file = dataDir + "SB0014.pdf";

	// Create api instance
	AsposeOCRPdf api = new AsposeOCRPdf();

	// Set recognition options
	DocumentRecognitionSettings settings = new DocumentRecognitionSettings(2);
	settings.setLanguage(Language.Eng);

	// Get result list
	ArrayList<RecognitionResult> result = api.RecognizePdf(file, settings);

	// print result		
	for(RecognitionResult r: result) {
		printResult(r);
	}

	// ExEnd:1
	System.out.println("OCRRecognizePdf: execution complete");
}

static void printResult(RecognitionResult result) {
	// TEXT
	System.out.println("TEXT:\n" + result.recognitionText);

	// WARNINGS
	System.out.println("WARNINGS:");
	for (String warning : result.warnings) {
		System.out.print(warning);
	}
}

}

brett5d90e · January 12, 2023, 2:44am

I also tried converting the PDF to a multipage tiff.
I also requested a trial key, but nothing seems to return any actual text from my documents.

asad.ali · January 12, 2023, 6:02pm

@brett5d90e

We were unable to download the PDF file. Can you please download it and attach it here with your reply so that we can test the scenario in our environment and address it accordingly?

brett5d90e · January 12, 2023, 9:30pm

I’m not sure I need OCR to get just the text from the PDF. I tried the Aspose cloud OCR and PDF to text and got what I wanted, but I cannot reproduce it with the Java SDK.

HB0037.pdf (163.3 KB)

I have also tried TextAbsorber, but cannot get the PDF text from it.

package com.aspose.ocr.examples.OcrFeatures;

import com.aspose.ocr.AsposeOCR;
import com.aspose.pdf.*;
import com.aspose.ocr.examples.Utils;
import com.aspose.ocr.examples.License.SetLicense;

import javax.imageio.ImageIO;

import java.io.IOException;

public class ExtractFromAllPages {

public static void main(String[] args) {
	SetLicense.main(null);
	// The path to the documents directory.
	String _dataDir = "/Users/brettkokinadis/git/Aspose.OCR-for-Java/Examples/src/main/resources/";
	String filePath = _dataDir + "a14.pdf";

	// Open document
	com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(filePath);

	// Create TextAbsorber object to extract text
	com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();

	// Accept the absorber for all the pages
	pdfDocument.getPages().accept(textAbsorber);

	// Get the extracted text
	String extractedText = textAbsorber.getText();

	//java.io.FileWriter writer = new java.io.FileWriter(_dataDir + "extracted-text.txt", true);
	// Write a line of text to the file
	//writer.write(extractedText);
	// Close the stream
	//writer.close();
	System.out.println("Result: " + extractedText);
}
}

brett5d90e · January 12, 2023, 9:58pm

Another failed attempt: Results:
Evaluation Only. Created with Aspose.PDF. Copyright 2002-2022 Aspose Pty Ltd.
1

package com.aspose.ocr.examples.OcrFeatures;

import com.aspose.pdf.Document;
import com.aspose.pdf.MemoryCleaner;
import com.aspose.pdf.TextAbsorber;

public class ExtractFromAllPages {

public static void main(String[] args) throws Exception {
	// Open document
	Document pdfDocument = new Document("/Users/brettkokinadis/git/Aspose.OCR-for-Java/Examples/src/main/resources/a14.pdf");

	// Create TextAbsorber object to extract text
	TextAbsorber textAbsorber = new TextAbsorber();

	// Accept the absorber for all the pages
	pdfDocument.getPages().accept(textAbsorber);

	// Get the extracted text
	String extractedText = textAbsorber.getText();

	// Create a writer and open the file
	java.io.FileWriter writer = new java.io.FileWriter(new java.io.File("/Users/brettkokinadis/git/Aspose.OCR-for-Java/Examples/src/main/resources/Extracted_text1.txt"));

	// Write the extracted text to the file
	writer.write(extractedText);

	// Close the stream
	writer.close();

	/*
	// ExStart:Info1
	// Accept the absorber for particular PDF page
	pdfDocument.getPages().get_Item(1).accept(textAbsorber);
	// ExEnd:Info1

	// ExStart:Info2
	MemoryCleaner.clear();
	// ExEnd:Info2
	*/
}

}
asad.ali · January 13, 2023, 1:13am

@brett5d90e

We are checking it and will get back to you shortly.

brett5d90e · January 13, 2023, 2:01am

Thanks. This worked; however, I get odd things in the extract that I don’t see in the SaaS Aspose demo. How do I remove fragments like these:

" ] = delete = ]"
] = delete = ] 19 E. “division” means the energy conservation and

“underscored material = new = material underscored material [bracketed”
Source PDF attached.
TextAbsorber output attached.
Code below:
Text.txt.zip (9.0 KB)
HB0037.pdf (163.3 KB)

package asposepfmaven;
import com.aspose.pdf.*;

public class go {

public static void main(String[] args) {
	// TODO Auto-generated method stub
	com.aspose.pdf.License license = new com.aspose.pdf.License();
	String file = "/Users/brettkokinadis/dev/eclipse/AsposeOCR/asposepfmaven/src/main/resources/Aspose.PDF.Java.lic"; 
	try {
		license.setLicense(file);
	} catch (Exception e) {
		// TODO Auto-generated catch block
		e.printStackTrace();
	}

	//Check license
	boolean resLicense = License.isInternalFIPSSecurity();
	System.out.println("License is set: " + resLicense);
	// ExEnd:1
		    // Open document
		    Document pdfDocument = new Document("/Users/brettkokinadis/dev/eclipse/AsposeOCR/asposepfmaven/src/main/resources/HB0037.pdf");
		    // Create TextAbsorber object to extract text
		    TextAbsorber textAbsorber = new TextAbsorber();
		    // Accept the absorber for all the pages
		    pdfDocument.getPages().accept(textAbsorber);
		    // Get the extracted text
		    String extractedText = textAbsorber.getText();
		    // Print the extracted text
		    System.out.println(extractedText);
		    
		  /*  java.io.FileWriter writer = new java.io.FileWriter(new java.io.File("Extracted_text.txt"));
		    writer.write(extractedText);
		    // Write a line of text to the file
		    // tw.WriteLine(extractedText);
		    // Close the stream
		    writer.close();
		/*
		    // ExStart:Info1
		    // Accept the absorber for particular PDF page
		    pdfDocument.getPages().get_Item(1).accept(textAbsorber);
		    // ExEnd:Info1

		    // ExStart:Info2
		    MemoryCleaner.clear();
		    // ExEnd:Info2
		     * 
		     */
		  }
		
}

asad.ali · January 13, 2023, 7:05pm

@brett5d90e

We believe that this information is present in the PDF that you are processing. Please check the attached screenshot: image.png (83.4 KB)

brett5d90e · January 13, 2023, 7:25pm

I see! Thanks. Is there a way to skip the vertical text in the left margin, such as [bracketed material], during extraction?

asad.ali · January 13, 2023, 10:18pm

@brett5d90e

You can first find the text that you want to ignore, replace it with an empty string, and then extract all text from the PDF to get the desired results.
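
For anyone landing here later, below is a minimal sketch of a related approach in plain Java. Note that it is a swapped-in variant, not the exact method described above: instead of replacing the text inside the PDF before extraction, it filters the string that `TextAbsorber.getText()` has already returned. `MarginTextFilter` and `stripMarginText` are hypothetical names, not part of any Aspose API, and the margin phrases are simply the fragments quoted earlier in this thread. It also assumes the margin text comes out as those exact strings, which may not hold for every page.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not an Aspose class): post-processes text that has
// already been extracted, e.g. the String returned by TextAbsorber.getText().
final class MarginTextFilter {

    // Removes every literal occurrence of each margin phrase, then trims each
    // line and drops lines that end up empty. Note that this also discards
    // indentation and blank lines that were in the original extract.
    static String stripMarginText(String extractedText, List<String> marginPhrases) {
        String cleaned = extractedText;
        for (String phrase : marginPhrases) {
            cleaned = cleaned.replace(phrase, "");
        }
        StringBuilder out = new StringBuilder();
        for (String line : cleaned.split("\\R")) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty()) {
                if (out.length() > 0) {
                    out.append('\n');
                }
                out.append(trimmed);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Margin fragments as they appeared in the extract quoted in this thread.
        List<String> marginPhrases = Arrays.asList(
            "underscored material = new = material underscored material [bracketed",
            "] = delete = ]");

        // Stand-in for the real extract; in practice this would be the
        // extractedText variable from the code posted above.
        String sample =
            "] = delete = ] 19 E. \"division\" means the energy conservation and\n"
            + "underscored material = new = material underscored material [bracketed\n";

        System.out.println(stripMarginText(sample, marginPhrases));
    }
}
```

In the code posted earlier in the thread, this would be called as `MarginTextFilter.stripMarginText(extractedText, marginPhrases)` right after `textAbsorber.getText()`. Literal matching is the simplest option; if the margin fragments vary from page to page, a regular expression or a region-limited extraction may be a better fit.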