Hi,
We are trying to convert a searchable PDF to Word, but the converted Word document contains images instead of text. When I open the PDF, I can search for text in it. We have also written code to check whether a PDF is searchable (contains text) or not:
Code to check whether the PDF is searchable:
private boolean checkIfSearchablePDF(InputStream originalInputStream) {
    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    PdfExtractor extractor = new PdfExtractor();
    try {
        extractor.bindPdf(originalInputStream);
        // Only pages 1 to 4 are sampled.
        extractor.setStartPage(1);
        extractor.setEndPage(4);
        extractor.extractText();
        extractor.getText(byteArrayOutputStream);
        // Heuristic: more than 300 bytes of extracted text means "searchable".
        return byteArrayOutputStream.size() > 300;
    } finally {
        // Release the extractor on every path, including exceptions.
        extractor.close();
    }
}
Logic: if the extracted byte array is larger than 300 bytes, we assume that the PDF contains text.
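A note on this heuristic: the raw byte count can be inflated by whitespace, line breaks, and the output encoding (UTF-16 uses two bytes per character), so 300 bytes may not mean 300 meaningful characters. Below is a minimal sketch of a stricter check that counts only letters and digits. The class and method names are hypothetical and not part of Aspose.PDF, and the charset passed in is an assumption that has to match whatever encoding the extractor actually writes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Aspose.PDF.
class TextHeuristic {

    // Decodes the extracted bytes and counts only letters and digits,
    // so whitespace and line breaks do not count toward the threshold.
    static int countLettersAndDigits(byte[] extracted, Charset charset) {
        String text = new String(extracted, charset);
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.isLetterOrDigit(codePoint)) {
                count++;
            }
            i += Character.charCount(codePoint);
        }
        return count;
    }

    // minChars is an arbitrary threshold chosen by the caller.
    static boolean looksSearchable(byte[] extracted, Charset charset, int minChars) {
        return countLettersAndDigits(extracted, charset) >= minChars;
    }

    public static void main(String[] args) {
        // Assumption: the bytes are UTF-16LE; adjust to the extractor's real encoding.
        byte[] sample = "Hello world 123".getBytes(StandardCharsets.UTF_16LE);
        System.out.println(looksSearchable(sample, StandardCharsets.UTF_16LE, 10));
    }
}
```

In our method this would be called with byteArrayOutputStream.toByteArray() in place of the size() comparison.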
Code to convert the PDF to Word:
// Load the source PDF from the input stream.
com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(inputStream);
// tempFile and document are declared elsewhere in our code.
String docPrefix = UUID.randomUUID().toString() + "temp_file";
tempFile = File.createTempFile(docPrefix, ".docx");
// Save as DOCX using Flow recognition mode.
com.aspose.pdf.DocSaveOptions saveOptions = new com.aspose.pdf.DocSaveOptions();
saveOptions.setFormat(com.aspose.pdf.DocSaveOptions.DocFormat.DocX);
saveOptions.setMode(com.aspose.pdf.DocSaveOptions.RecognitionMode.Flow);
saveOptions.setRelativeHorizontalProximity(2.5f);
saveOptions.setRecognizeBullets(true);
pdfDocument.save(tempFile.getAbsolutePath(), saveOptions);
// Reopen the converted file.
document = new Document(tempFile.getAbsolutePath());
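To confirm without opening Word that the converted file really has no text, one rough check is to read the DOCX as a ZIP archive and count the characters inside the `<w:t>` elements of `word/document.xml`. This is only a sketch using the JDK (Java 9+ for `readAllBytes`); the class and method names are hypothetical, and the regular expression is an approximation rather than a real XML parse.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical diagnostic helper; it does not use any Aspose API.
class DocxTextCheck {

    // Matches <w:t> and <w:t xml:space="preserve"> but not <w:tab/>, <w:tbl>, or <w:tc>.
    private static final Pattern TEXT_RUN =
            Pattern.compile("<w:t(?:\\s[^>]*)?>([^<]*)</w:t>");

    // Returns the approximate number of text characters in the main document part,
    // or 0 if the archive has no word/document.xml entry.
    static int countTextChars(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    // readAllBytes() stops at the end of the current ZIP entry.
                    String xml = new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                    Matcher matcher = TEXT_RUN.matcher(xml);
                    int total = 0;
                    while (matcher.find()) {
                        total += matcher.group(1).length();
                    }
                    return total;
                }
            }
        }
        return 0;
    }
}
```

A result of 0 (or close to it) for the converted file would be consistent with the pages having been written as images only.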
Attaching the documents (the PDF as well as the converted Word file):
documents.zip (5.4 MB)