Problem while converting PDF to DOCX


#1

Hi,

We are trying to convert a searchable PDF to Word, but the converted Word document consists of images rather than text. When I open the PDF, I can search for text in it. We have also written code to check whether a PDF is searchable (contains text).

Code to check whether the PDF is searchable:

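import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import com.aspose.pdf.facades.PdfExtractor;

private boolean checkIfSearchablePDF(InputStream originalInputStream) {

    ByteArrayOutputStream byteArrayOutputStream = new ByteArrayOutputStream();
    PdfExtractor extractor = new PdfExtractor();
    try {
        extractor.bindPdf(originalInputStream);
        // Only the first four pages are sampled.
        extractor.setStartPage(1);
        extractor.setEndPage(4);
        extractor.extractText();
        extractor.getText(byteArrayOutputStream);

        // Heuristic: more than 300 bytes of extracted text means "searchable".
        return byteArrayOutputStream.size() > 300;
    } finally {
        // Release the extractor even if extraction throws.
        extractor.close();
    }

}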

Logic: if the extracted byte array is larger than 300 bytes, we assume that the PDF contains text.

Code to convert PDF to Word:
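A side note on the 300-byte check above: a raw byte count can also be reached by whitespace, line breaks, or a byte-order mark alone. Below is a minimal sketch of a stricter variant that decodes the bytes and counts only letters and digits. The class name, the method names, and the threshold of 100 characters are hypothetical, and the charset is passed in as a parameter because the encoding written by the extractor is an assumption here and should be checked against your Aspose version.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class SearchableTextHeuristic {

    // Hypothetical threshold: minimum number of letters/digits required.
    static final int MIN_MEANINGFUL_CHARS = 100;

    /** Counts letters and digits in the decoded text; whitespace, BOMs and control characters are ignored. */
    static int countMeaningfulChars(byte[] extractedBytes, Charset charset) {
        String text = new String(extractedBytes, charset);
        int count = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.isLetterOrDigit(codePoint)) {
                count++;
            }
            i += Character.charCount(codePoint);
        }
        return count;
    }

    /** Returns true when the decoded text holds at least MIN_MEANINGFUL_CHARS letters or digits. */
    static boolean looksSearchable(byte[] extractedBytes, Charset charset) {
        return countMeaningfulChars(extractedBytes, charset) >= MIN_MEANINGFUL_CHARS;
    }

    public static void main(String[] args) {
        // 400 spaces: 800 bytes in UTF-16LE, so a plain size check would accept it.
        byte[] blank = new String(new char[400]).replace('\0', ' ')
                .getBytes(StandardCharsets.UTF_16LE);
        System.out.println("bytes=" + blank.length
                + " searchable=" + looksSearchable(blank, StandardCharsets.UTF_16LE));
    }
}
```

In the method above, this would be called with `byteArrayOutputStream.toByteArray()` and whichever charset the extractor actually writes.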

            // Load the source PDF from the input stream.
            com.aspose.pdf.Document pdfDocument = new com.aspose.pdf.Document(inputStream);
            String docPrefix = UUID.randomUUID().toString() + "temp_file";
            tempFile = File.createTempFile(docPrefix, ".docx");

            // Save as DOCX using the Flow recognition mode.
            com.aspose.pdf.DocSaveOptions saveOptions = new com.aspose.pdf.DocSaveOptions();
            saveOptions.setFormat(com.aspose.pdf.DocSaveOptions.DocFormat.DocX);
            saveOptions.setMode(com.aspose.pdf.DocSaveOptions.RecognitionMode.Flow);
            saveOptions.setRelativeHorizontalProximity(2.5f);
            saveOptions.setRecognizeBullets(true);
            pdfDocument.save(tempFile.getAbsolutePath(), saveOptions);

            // Load the converted DOCX for further processing.
            document = new Document(tempFile.getAbsolutePath());
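
To confirm from code that a converted file holds images rather than text, one option is to inspect the DOCX itself. A DOCX is a ZIP archive whose main body is stored in word/document.xml, so a rough check can count the characters inside `<w:t>` text runs and the number of `<w:drawing>` / `<w:pict>` elements. The sketch below uses only the JDK; the class and method names are hypothetical, the regex-based count is approximate (XML entities are counted as written), and it assumes document.xml is UTF-8 encoded.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

final class DocxTextCheck {

    // Matches text runs such as <w:t>Hello</w:t> or <w:t xml:space="preserve"> a </w:t>,
    // but not other elements that start with "w:t" (for example <w:tbl> or <w:tab/>).
    private static final Pattern TEXT_RUN = Pattern.compile("<w:t(?:\\s[^>]*)?>([^<]*)</w:t>");
    // Matches the opening tags of embedded pictures.
    private static final Pattern PICTURE = Pattern.compile("<w:(?:drawing|pict)[\\s>]");

    /** Returns {textCharCount, pictureCount} for the given document.xml content. */
    static int[] countTextCharsAndPictures(String documentXml) {
        int chars = 0;
        Matcher text = TEXT_RUN.matcher(documentXml);
        while (text.find()) {
            chars += text.group(1).trim().length();
        }
        int pictures = 0;
        Matcher picture = PICTURE.matcher(documentXml);
        while (picture.find()) {
            pictures++;
        }
        return new int[] {chars, pictures};
    }

    /** Reads word/document.xml out of the DOCX archive (assumed to be UTF-8). */
    static String readDocumentXml(String docxPath) throws IOException {
        try (ZipFile zip = new ZipFile(docxPath)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docxPath);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buffer = new byte[8192];
                int read;
                while ((read = in.read(buffer)) != -1) {
                    out.write(buffer, 0, read);
                }
                return new String(out.toByteArray(), StandardCharsets.UTF_8);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java DocxTextCheck <file.docx>");
            return;
        }
        int[] counts = countTextCharsAndPictures(readDocumentXml(args[0]));
        System.out.println("text characters: " + counts[0] + ", pictures: " + counts[1]);
    }
}
```

A result with many pictures and almost no text characters would match the behaviour described in this post; in the snippet above it could be run against `tempFile.getAbsolutePath()` right after the save call.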

Attaching the documents (the PDF as well as the converted Word file):
documents.zip (5.4 MB)


#2

@saurabh.arora

Thank you for contacting support.

We have tested with the data you shared and were able to reproduce the issue in our environment. A ticket with ID PDFJAVA-38623 has been logged in our issue management system for further investigation and resolution. The ticket ID has been linked to this thread so that you will receive a notification as soon as the ticket is resolved.

We are sorry for the inconvenience.


#3

Thanks Farhan for the reply.

Can you please tell us how long this will take? We are already live with Aspose in production.


#4

@saurabh.arora

Please note that the issue was logged only recently in our issue management system and is pending analysis. Issues under the free support model have low priority and are resolved on a first come, first served basis. We will inform you as soon as we have a definite update regarding the resolution. Please allow us some time.

We are sorry for the inconvenience.