Convert PDF to DOCX - no text is extracted

lucka · May 24, 2018, 10:54am

Hi,

I need to convert PDF documents to the Word (DOCX) format. Most of these documents are scanned PDFs. I use the following code for the conversion:

public String saveAsDocx() {

    String fn = parent.getNewFileName(file, "DOCX");

    DocSaveOptions saveOption = new DocSaveOptions();
    saveOption.setFormat(DocSaveOptions.DocFormat.DocX);
    saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);

    saveOption.setRecognizeBullets(true);

    document.save(fn, saveOption);

    parent.addResult(fn, false);

    this.processedFile.otherFormats.add(new ProcessedFile(fn));

    return fn;

}

Here, document is of type com.aspose.pdf.Document.

The output is a Word document in which each page contains only an image of the original PDF content. I understand that this is a valid result when the text of a scanned PDF cannot be recognized reliably. However, when I extract the text from the same PDF document, it works well. For the text extraction I use the following code:

public String saveAsText() throws Exception {

    String fn = parent.getNewFileName(file, "TXT");

    try (Writer writer = Files.newBufferedWriter(Paths.get(fn), StandardCharsets.UTF_8)) {
        for (int i = 0; i < document.getPages().size(); i++) {

            com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
            textAbsorber.getExtractionOptions().setFormattingMode(TextExtractionOptions.TextFormattingMode.Pure);
            document.getPages().get_Item(i + 1).accept(textAbsorber);
            String extractedText = textAbsorber.getText();

            writer.write(extractedText);
            writer.write("\n ----------------- PAGE --------------- \n");
            writer.flush();

        }
    }

    parent.addResult(fn, false);

    this.processedFile.otherFormats.add(new ProcessedFile(fn));

    return fn;

}

The output of this method is a TXT file containing the text of the original PDF document, with no errors in it.
So my question is: how can I convert the PDF to DOCX with real text, given that the text can be extracted without problems? I can also provide the document if necessary.

Thanks!

imran.rafique · May 24, 2018, 3:28pm

@lucia.becvarova,

Please send us your source PDF and the complete code, along with a screenshot of the problematic text area. We will investigate your scenario in our environment and share our findings with you.

lucka · May 25, 2018, 8:35am

Hi,

This is the entire code I am using:

public void testConvertToDocx() throws Exception {

    File file = new File("C:\\Users\\Lucka\\Documents\\Datlowe\\Officer\\in\\test.pdf");
    String outputPath = "C:\\Users\\Lucka\\Documents\\Datlowe\\Officer\\out";

    Document document = new Document(file.getAbsolutePath());

    // convert to txt

    String fn = outputPath + "\\output.txt";

    try (Writer writer = Files.newBufferedWriter(Paths.get(fn), StandardCharsets.UTF_8)) {
        for (int i = 0; i < document.getPages().size(); i++) {

            com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
            textAbsorber.getExtractionOptions().setFormattingMode(TextExtractionOptions.TextFormattingMode.Pure);
            document.getPages().get_Item(i + 1).accept(textAbsorber);
            String extractedText = textAbsorber.getText();


            writer.write(extractedText);
            writer.write("\n ----------------- PAGE --------------- \n");
            writer.flush();

        }
    }

    // convert to docx

    fn = outputPath + "\\output.docx";

    DocSaveOptions saveOption = new DocSaveOptions();
    saveOption.setFormat(DocSaveOptions.DocFormat.DocX);
    saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);
    saveOption.setRecognizeBullets(true);

    document.save(fn, saveOption);
}

The original PDF I am trying to convert is ‘test.pdf’.
The outputs I get (both TXT and DOCX) are in ‘out.zip’. The text file contains the text of the original PDF (very well recognized), whereas the Word document contains only images of the page content (no text is recognized).

test.pdf (596.4 KB)
out.zip (2.7 MB)
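
For anyone who wants to confirm the "images only" observation without opening Word, here is a minimal sketch. It assumes only that a DOCX file is a ZIP archive whose main part is word/document.xml, with text stored in w:t elements. The class DocxTextCheck and its methods are hypothetical helpers and not part of Aspose.PDF; the regex-based counting is a rough heuristic rather than a full XML parse, and it requires Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper (not part of Aspose.PDF): inspects a DOCX file to see
// whether its main document part contains text runs or only drawings.
final class DocxTextCheck {

    // Matches <w:t>...</w:t> text runs, with or without attributes.
    // It does not match other elements such as <w:tbl> or <w:tab/>.
    private static final Pattern TEXT_RUN =
            Pattern.compile("<w:t(?:\\s[^>]*)?>([^<]*)</w:t>");

    // Matches the opening tag of a DrawingML or VML picture container.
    private static final Pattern DRAWING =
            Pattern.compile("<w:(?:drawing|pict)[\\s>]");

    // Reads word/document.xml from the DOCX (ZIP) archive as a UTF-8 string.
    public static String readMainPart(Path docx) throws IOException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }

    // Approximate number of text characters stored in <w:t> runs
    // (XML entities are counted as written, not decoded).
    public static int countTextChars(String documentXml) {
        int total = 0;
        Matcher m = TEXT_RUN.matcher(documentXml);
        while (m.find()) {
            total += m.group(1).trim().length();
        }
        return total;
    }

    // Number of drawing/picture containers in the main document part.
    public static int countImages(String documentXml) {
        int count = 0;
        Matcher m = DRAWING.matcher(documentXml);
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("Usage: java DocxTextCheck <file.docx>");
            return;
        }
        String xml = readMainPart(Paths.get(args[0]));
        System.out.println("Text characters: " + countTextChars(xml));
        System.out.println("Images: " + countImages(xml));
    }
}
```

Running it with the path to output.docx as the only argument prints both counts. A text count of zero together with a non-zero image count would match the behavior described above.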

imran.rafique · May 25, 2018, 6:23pm

@lucka,

We were able to reproduce this behavior in our environment. An investigation has been logged under ticket ID PDFNET-44762 in our bug tracking system. We have linked your post to this ticket and will keep you informed of any available updates.