Hi,
I need to convert PDF documents to the Word (DOCX) format. Most of these documents are scanned PDFs. I use the following code for the conversion:
public String saveAsDocx() {
    String fn = parent.getNewFileName(file, "DOCX");
    DocSaveOptions saveOption = new DocSaveOptions();
    saveOption.setFormat(DocSaveOptions.DocFormat.DocX);
    saveOption.setMode(DocSaveOptions.RecognitionMode.Flow);
    saveOption.setRecognizeBullets(true);
    document.save(fn, saveOption);
    parent.addResult(fn, false);
    this.processedFile.otherFormats.add(new ProcessedFile(fn));
    return fn;
}
Here, document is of type com.aspose.pdf.Document.
The output is a Word document in which each page contains only an image of the original PDF page. I understand that this is a valid result when the text of a scanned PDF cannot be recognized reliably. However, extracting the text from the same PDF document works well. For the text extraction I use the following code:
public String saveAsText() throws Exception {
    String fn = parent.getNewFileName(file, "TXT");

    try (Writer writer = Files.newBufferedWriter(Paths.get(fn), StandardCharsets.UTF_8)) {
        for (int i = 0; i < document.getPages().size(); i++) {
            com.aspose.pdf.TextAbsorber textAbsorber = new com.aspose.pdf.TextAbsorber();
            textAbsorber.getExtractionOptions().setFormattingMode(TextExtractionOptions.TextFormattingMode.Pure);
            document.getPages().get_Item(i + 1).accept(textAbsorber);
            String extractedText = textAbsorber.getText();
            writer.write(extractedText);
            writer.write("\n ----------------- PAGE --------------- \n");
            writer.flush();
        }
    }
    parent.addResult(fn, false);
    this.processedFile.otherFormats.add(new ProcessedFile(fn));
    return fn;
}
The output of this method is a TXT file containing the text of the original PDF document, without any errors.
So my question is: how can I convert the PDF to DOCX with real text instead of page images, given that the text can be extracted correctly? I can provide the document if necessary.
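In case it helps with reproducing the issue, here is a minimal sketch of how the DOCX output could be checked for being image-only. It relies only on the fact that a DOCX file is a ZIP archive whose main part is word/document.xml, with text stored in w:t elements and pictures wrapped in w:drawing or w:pict elements. The class DocxContentCheck and its method names are hypothetical helpers of my own, not part of the Aspose API, and the regex-based counting is a rough approximation rather than a full XML parse.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical diagnostic helper (not an Aspose class): inspects the main
// document part of a DOCX file to see whether it holds text or only pictures.
final class DocxContentCheck {
    // Content of text runs: <w:t>...</w:t> or <w:t xml:space="preserve">...</w:t>.
    // Does not match <w:tab/>, <w:tbl> and similar elements.
    private static final Pattern TEXT_RUN =
            Pattern.compile("<w:t(?:\\s[^>]*)?>([^<]*)</w:t>");
    // Picture containers: DrawingML (<w:drawing>) and legacy VML (<w:pict>).
    private static final Pattern PICTURE =
            Pattern.compile("<w:(?:drawing|pict)[\\s>]");

    // Approximate number of non-whitespace-trimmed characters in all text runs.
    // XML entities such as &amp; are counted as written, so this is a rough figure.
    static int countTextChars(String documentXml) {
        int chars = 0;
        Matcher m = TEXT_RUN.matcher(documentXml);
        while (m.find()) {
            chars += m.group(1).trim().length();
        }
        return chars;
    }

    // Number of picture containers found in the document XML.
    static int countPictures(String documentXml) {
        int count = 0;
        Matcher m = PICTURE.matcher(documentXml);
        while (m.find()) {
            count++;
        }
        return count;
    }

    // Reads word/document.xml from the DOCX (ZIP) archive as a UTF-8 string.
    static String readDocumentXml(Path docx) throws IOException {
        try (ZipFile zip = new ZipFile(docx.toFile())) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml not found in " + docx);
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return new String(in.readAllBytes(), StandardCharsets.UTF_8);
            }
        }
    }
}
```

For an image-only result, I would expect countTextChars(readDocumentXml(Paths.get(fn))) to return 0 while countPictures returns a non-zero value.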
Thanks!
This Topic is created by sohail.aspose using the Email to Topic plugin.