Polish characters not preserved in output PDF after OCR

kamilszurgot · March 29, 2024, 9:58am

Hello,
I’m using the Aspose.OCR library in Java to perform OCR on scanned multipage PDF documents, and I create the output document with the SaveMultipageDocument method. It works pretty well, but I noticed a problem: when I open the output document, highlight text that contains typical Polish letters, and copy it to a text editor, I only get a “?” instead of the Polish letter, for example “ę”. I checked whether the text recognized by the library also contains question marks, but it does not; the library recognizes the Polish letters correctly. I use the Eclipse IDE to run the code and Adobe Acrobat Reader to open the PDF files. I’m uploading a sample PDF, and here is my sample code:
Wyspa_skan.pdf (466.0 KB)

public void makeSearchablePdf(InputStream inputStream) throws AsposeOCRException, IOException {
	License.setLicense(path_to_license_file);
	AsposeOCR api = new AsposeOCR();
	PreprocessingFilter filters = new PreprocessingFilter();
	RecognitionSettings settings = new RecognitionSettings();

	// Correct page skew before recognition
	filters.add(PreprocessingFilter.AutoSkew());
	// Recognize Polish text using document-style area detection
	settings.setLanguage(Language.Pol);
	settings.setDetectAreasMode(DetectAreasMode.DOCUMENT);

	// Load the scanned PDF from the stream
	OcrInput input = new OcrInput(InputType.PDF, filters);
	input.add(inputStream);

	ArrayList<RecognitionResult> res = api.Recognize(input, settings);

	// Print the recognized text; Polish letters appear correctly here
	res.forEach((result) -> {
		System.out.println(result.recognitionText);
	});

	// Save all results as one PDF; copied text shows "?" for Polish letters
	AsposeOCR.SaveMultipageDocument(path_to_output_file, Format.Pdf, res);
}
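
To narrow down where the characters are lost, the recognized text can be checked independently of the PDF. The following is a minimal sketch that uses only the JDK. The class and method names (PolishTextCheck, countPolishLetters, lossyRoundTrip) are hypothetical and are not part of Aspose.OCR. It also shows one common way a “?” can appear in Java, namely encoding text into a charset that lacks the Polish letters. Whether that is what happens inside the library's PDF export is an assumption and is not confirmed anywhere in this thread.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Aspose.OCR.
class PolishTextCheck {
    // Polish letters with diacritics, written as Unicode escapes so the
    // source compiles regardless of file encoding:
    // lowercase a-ogonek, c-acute, e-ogonek, l-stroke, n-acute, o-acute,
    // s-acute, z-acute, z-dot, followed by their uppercase forms.
    private static final String POLISH_LETTERS =
            "\u0105\u0107\u0119\u0142\u0144\u00f3\u015b\u017a\u017c"
          + "\u0104\u0106\u0118\u0141\u0143\u00d3\u015a\u0179\u017b";

    /** Counts the Polish diacritic letters in the given text. */
    static int countPolishLetters(String text) {
        int count = 0;
        for (int i = 0; i < text.length(); i++) {
            if (POLISH_LETTERS.indexOf(text.charAt(i)) >= 0) {
                count++;
            }
        }
        return count;
    }

    /**
     * Encodes the text as ISO-8859-1 and decodes it again.
     * Characters that ISO-8859-1 cannot represent come back as '?'.
     */
    static String lossyRoundTrip(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Stand-in for result.recognitionText ("Wyspa wezy" with diacritics).
        String recognized = "Wyspa w\u0119\u017cy";
        System.out.println(countPolishLetters(recognized)); // prints 2
        System.out.println(lossyRoundTrip(recognized));     // prints Wyspa w??y
    }
}
```

If countPolishLetters returns a non-zero value for the recognized text while the text copied from the PDF contains “?” in the same positions, the loss happens when the document is written and not during recognition, which matches the behavior described above.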

asad.ali · March 29, 2024, 6:05pm

@kamilszurgot

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): OCRJAVA-364

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.

kamilszurgot · April 2, 2024, 12:18pm

@asad.ali
Thank you for the reply. I did some digging on my own and found the GitHub repositories with examples for Aspose.OCR for Java and for .NET. In both repositories I noticed a table of “Supported characters”. The .NET version of the table contains the Polish symbols, but the Java version does not (it only contains “ó”). Is it intentional that only the .NET version works with those characters? Sadly, the lack of support for Polish characters in Java makes buying the full license impossible for us.
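
A comparison like the one described above can also be done programmatically. The sketch below lists which Polish letters are absent from a given set of supported characters. The class name SupportedCharsCheck, the method missingPolishLetters, and the sample string in main are all hypothetical; the sample string is a made-up stand-in and does not reproduce the actual table from either repository.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Aspose.OCR.
class SupportedCharsCheck {
    // The 18 Polish letters with diacritics (9 lowercase, 9 uppercase),
    // written as Unicode escapes so the source is encoding-independent.
    private static final String POLISH_LETTERS =
            "\u0105\u0107\u0119\u0142\u0144\u00f3\u015b\u017a\u017c"
          + "\u0104\u0106\u0118\u0141\u0143\u00d3\u015a\u0179\u017b";

    /** Returns the Polish letters that do not occur in the supported set. */
    static List<Character> missingPolishLetters(String supported) {
        List<Character> missing = new ArrayList<>();
        for (char c : POLISH_LETTERS.toCharArray()) {
            if (supported.indexOf(c) < 0) {
                missing.add(c);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Made-up supported set: basic Latin letters plus o-acute only.
        String supported = "abcdefghijklmnopqrstuvwxyz\u00f3";
        System.out.println(missingPolishLetters(supported).size()); // prints 17
    }
}
```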

asad.ali · April 2, 2024, 8:45pm

@kamilszurgot

We try our best to keep features consistent across every flavor of the API, e.g. .NET and Java. The ticket has been logged so that we can analyze and investigate the whole scenario and gather as many details as we can in order to offer a solution. We have recorded your concerns as well and will inform you as soon as we make progress towards resolving the ticket. Please be patient and give us some time.

We are sorry for the inconvenience.

kamilszurgot · April 18, 2024, 9:49am

@asad.ali
Hi, I see that the status of the ticket is “Resolved”. Can I get an update on this subject?

asad.ali · April 18, 2024, 7:12pm

@kamilszurgot

We are glad to inform you that the issue has been resolved in version 24.4 of the API.

kamilszurgot · April 22, 2024, 7:53am

@asad.ali
Thank you guys for the help. It works now.