Convert PDF To Text Using JAVA - Rendering junk chars for Files with Tables, images

abusaleham · March 8, 2022, 9:52am

Aspose.PDF for Java 21.12 produces junk characters when converting to text any input PDF file with form-type content (check boxes, text fields, images, etc.). Please refer to the attached files:
4_83269c06-792a-4e2d-8544-c52edf40f31d.pdf (83.6 KB)
1_5af5ef10-c247-4524-9280-0d809caeef98.pdf (119.5 KB)

tahir.manzoor · March 8, 2022, 1:26pm

@abusaleham

To ensure a timely and accurate response, please attach the following resources here for testing:

Please attach the output PDF file that shows the undesired behavior.
Please create a sample Java application (source code without compilation errors) that helps us to reproduce your problem on our end and attach it here for testing.

As soon as you have these pieces of information ready, we will start investigating your issue and provide you with more information. Thanks for your cooperation.

PS: To attach these resources, please zip and upload them.

abusaleham · March 8, 2022, 2:06pm

Please find the details below, with sample input and output attached:
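import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import com.aspose.pdf.Document;
import com.aspose.pdf.Page;
import com.aspose.pdf.TextExtractionOptions;
import com.aspose.pdf.devices.TextDevice;

// _dataDir (data folder) and paginator (a format string for the page marker, or null)
// are fields defined elsewhere in the class; they are not shown in this post.
public static void ConvertPDFtoText() throws IOException
{
    Path pdfFile = Paths.get(_dataDir.toString(), "Sample-Report.pdf");
    Path textFile = Paths.get(_dataDir.toString(), "Sample-Report.txt");
    Document pdfDocument = new Document(pdfFile.toString());
    TextDevice textDevice = new TextDevice(StandardCharsets.UTF_8);
    TextExtractionOptions textExtOptions = new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw);
    textDevice.setExtractionOptions(textExtOptions);
    int pageNumber = 1;
    // try-with-resources closes the stream even if extraction fails
    try (OutputStream textStream = new FileOutputStream(textFile.toFile()))
    {
        for (Page page : pdfDocument.getPages())
        {
            if (paginator != null)
            {
                // write the page marker before the page text
                textStream.write(String.format(paginator, pageNumber).getBytes(StandardCharsets.UTF_8));
            }
            textStream.write(System.getProperty("line.separator").getBytes(StandardCharsets.UTF_8));
            pageNumber++;
            // extract the text of this page into the output stream
            textDevice.process(page, textStream);
        }
    }
}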
SampleFiles.7z (126.0 KB)

tahir.manzoor · March 8, 2022, 4:06pm

@abusaleham

We have logged this problem in our issue tracking system as PDFJAVA-41380. You will be notified via this forum thread once this issue is resolved.

We apologize for the inconvenience.

abusaleham · March 30, 2022, 6:55am

Hi,

We are from Thomson Reuters and have a paid support subscription with Aspose. Could you please look into options to expedite a fix for the above issue?

tahir.manzoor · March 30, 2022, 10:49am

@abusaleham

Please note that the issue you are facing is not actually a bug in Aspose.PDF, so we have closed this issue (PDFJAVA-41380) as ‘Not a Bug’.

Correct text extraction is not possible for this document. Adobe Acrobat returns a similar result.

abusaleham · March 31, 2022, 10:54am

Can you please share the Adobe Acrobat result? Also, is there any way for Aspose to raise an exception in this case instead of returning junk characters as output?

tahir.manzoor · March 31, 2022, 3:12pm

@abusaleham

Please open the PDF file in Adobe Acrobat or Acrobat Reader, select all content with Ctrl+A, and copy it with Ctrl+C. Paste the content into Notepad and you will see the same result as shown in the attached image.

image.png (211.7 KB)

abusaleham · April 1, 2022, 7:34am

Can you also check and confirm my other question, about Aspose raising an exception in this kind of scenario?

tahir.manzoor · April 1, 2022, 12:59pm

@abusaleham

Unfortunately, there is no API to throw an exception in this case. Here, Aspose.PDF mimics Adobe Acrobat's behavior.
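
Since the library does not raise an exception here, one possible workaround is for the calling code to inspect the extracted text itself and raise its own exception. The following is only a minimal sketch of that idea in plain Java, not an Aspose.PDF feature: the class name JunkTextCheck, the exception UnreadableTextException, the list of "suspicious" character categories and the threshold are all assumptions made for illustration. It can only flag junk that shows up as control, private-use, unassigned or replacement characters; text that was mapped to valid but wrong letters would pass unnoticed.

```java
public class JunkTextCheck {

    /** Hypothetical exception type for this sketch; not part of Aspose.PDF. */
    public static class UnreadableTextException extends RuntimeException {
        public UnreadableTextException(String message) {
            super(message);
        }
    }

    /** Returns the fraction of non-whitespace code points that look like junk. */
    public static double junkRatio(String text) {
        int total = 0;
        int junk = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // spaces, tabs and line breaks are not counted
            }
            total++;
            int type = Character.getType(cp);
            // Assumed definition of "junk": replacement character, control,
            // private-use, unassigned, or unpaired surrogate code points.
            boolean suspicious = cp == 0xFFFD
                    || type == Character.CONTROL
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE;
            if (suspicious) {
                junk++;
            }
        }
        return total == 0 ? 0.0 : (double) junk / total;
    }

    /** Returns the text unchanged, or throws if too much of it looks like junk. */
    public static String requireReadable(String text, double maxJunkRatio) {
        double ratio = junkRatio(text);
        if (ratio > maxJunkRatio) {
            throw new UnreadableTextException(
                    String.format("Extracted text looks unreadable (junk ratio %.2f)", ratio));
        }
        return text;
    }
}
```

A caller could read the text file written by the extraction step back as UTF-8 and pass its content to requireReadable with a threshold chosen by testing on known good and known bad documents.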