I am using the code below to convert an OCR'd PDF to Word:
Document asposeDocument = new Document(outFilePath);
String newOutFilePath = basePath + outFileName + "." +SaveFormat.Doc;
asposeDocument.save(newOutFilePath, SaveFormat.Doc);
However, although the Word document is generated successfully, the text layer is lost.
@jeffreyjia2046
Please try the method below to convert scanned, OCR'd PDFs into DOCX, and let us know if you still face any issues:
import com.aspose.pdf.*;

public class ConvertPdfToDocx {
    public static void main(String[] args) {
        // Specify the input PDF file
        String inputFilePath = "support_2.pdf";
        String outputFilePath = "output.docx";

        // Load the PDF document
        Document pdfDocument = new Document(inputFilePath);

        // Iterate through pages of the PDF
        for (Page page : pdfDocument.getPages()) {
            // Initialize the TextFragmentAbsorber
            TextFragmentAbsorber absorber = new TextFragmentAbsorber();
            page.accept(absorber);

            // Process each found text fragment
            for (TextFragment fragment : absorber.getTextFragments()) {
                fragment.getTextState().setRenderingMode(TextRenderingMode.FillText);
                fragment.getTextState().setFont(FontRepository.findFont("Arial"));
            }

            // Clear images from the page resources
            page.getResources().getImages().clear();
        }

        // Set up the DocSaveOptions
        DocSaveOptions saveOptions = new DocSaveOptions();
        saveOptions.setFormat(DocSaveOptions.DocFormat.DocX);
        saveOptions.setMode(DocSaveOptions.RecognitionMode.EnhancedFlow);
        saveOptions.setRelativeHorizontalProximity(2.5f);
        saveOptions.setRecognizeBullets(true);

        // Save the PDF as a DOCX file
        pdfDocument.save(outputFilePath, saveOptions);
    }
}
I got the exception below:
Exception in thread "main" class com.aspose.pdf.internal.ms.System.lh: Culture Name: en-HK is not a supported culture
com.aspose.pdf.internal.l70if.lh.lt(Unknown Source)
com.aspose.pdf.internal.l70if.lh.(Unknown Source)
com.aspose.pdf.internal.l70if.lh.lI(Unknown Source)
com.aspose.pdf.internal.l72v.l1v.lk(Unknown Source)
com.aspose.pdf.internal.l70if.lh.lu(Unknown Source)
com.aspose.pdf.internal.ms.System.l10l.lf(Unknown Source)
com.aspose.pdf.internal.l8t.l1t$lI.lI(Unknown Source)
com.aspose.pdf.internal.l11n.le.lI(Unknown Source)
com.aspose.pdf.internal.l11t.l0u.(Unknown Source)
com.aspose.pdf.internal.l11t.l0u.(Unknown Source)
com.aspose.pdf.internal.l8y.lf.lb(Unknown Source)
com.aspose.pdf.internal.l11t.l0p.(Unknown Source)
com.aspose.pdf.internal.l8y.lf.le(Unknown Source)
com.aspose.pdf.internal.l3v.l1f.lI(Unknown Source)
com.aspose.pdf.internal.l3v.l1f.(Unknown Source)
com.aspose.pdf.ADocument.lf(Unknown Source)
com.aspose.pdf.ADocument.(Unknown Source)
com.aspose.pdf.Document.(Unknown Source)
ConvertPdfToDocx.main(ConvertPdfToDocx.java:10)
at com.aspose.pdf.internal.l70if.lh.lt(Unknown Source)
at com.aspose.pdf.internal.l70if.lh.(Unknown Source)
at com.aspose.pdf.internal.l70if.lh.lI(Unknown Source)
at com.aspose.pdf.internal.l72v.l1v.lk(Unknown Source)
at com.aspose.pdf.internal.l70if.lh.lu(Unknown Source)
at com.aspose.pdf.internal.ms.System.l10l.lf(Unknown Source)
at com.aspose.pdf.internal.l8t.l1t$lI.lI(Unknown Source)
at com.aspose.pdf.internal.l11n.le.lI(Unknown Source)
at com.aspose.pdf.internal.l11t.l0u.(Unknown Source)
at com.aspose.pdf.internal.l11t.l0u.(Unknown Source)
at com.aspose.pdf.internal.l8y.lf.lb(Unknown Source)
at com.aspose.pdf.internal.l11t.l0p.(Unknown Source)
at com.aspose.pdf.internal.l8y.lf.le(Unknown Source)
at com.aspose.pdf.internal.l3v.l1f.lI(Unknown Source)
at com.aspose.pdf.internal.l3v.l1f.(Unknown Source)
at com.aspose.pdf.ADocument.lf(Unknown Source)
at com.aspose.pdf.ADocument.(Unknown Source)
at com.aspose.pdf.Document.(Unknown Source)
at ConvertPdfToDocx.main(ConvertPdfToDocx.java:10)
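The message suggests the library rejected the JVM's default locale (en-HK) while the Document was being constructed. One thing that may be worth trying is switching the default locale before loading the PDF. This is only a sketch using the standard `java.util.Locale` API: the class and method names (`LocaleWorkaround`, `useFallbackLocale`) are made up for illustration, and it is an assumption, not something confirmed in this thread, that en-US is accepted by the library or that this avoids the exception.

```java
import java.util.Locale;

public class LocaleWorkaround {

    // Switches the JVM-wide default locale to en-US and returns the previous one.
    // Assumption: en-US is a culture the library accepts.
    public static Locale useFallbackLocale() {
        Locale previous = Locale.getDefault();
        Locale.setDefault(Locale.US);
        return previous;
    }

    public static void main(String[] args) {
        Locale previous = useFallbackLocale();
        System.out.println("Default locale changed from "
                + previous.toLanguageTag() + " to "
                + Locale.getDefault().toLanguageTag());
        // Only after this point load the PDF, e.g.
        // Document pdfDocument = new Document(inputFilePath);
    }
}
```

Note that `Locale.setDefault` affects the whole JVM, so date and number formatting elsewhere in the application will change as well. Starting the JVM with `-Duser.language=en -Duser.country=US` should have a similar effect without a code change.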
@jeffreyjia2046
Can you please share your sample PDF document along with the environment information where you are using the code? We will test the scenario in our environment and address it accordingly.
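For the environment information, a small program like the following could collect the basics to paste into a reply. It is a sketch that uses only standard JDK calls; the class name `EnvironmentInfo` and the choice of fields are illustrative assumptions, and the Aspose.PDF version has to be added by hand because it is not read here.

```java
import java.util.Locale;

public class EnvironmentInfo {

    // Builds a short report of the JVM, OS and locale settings.
    public static String describe() {
        StringBuilder sb = new StringBuilder();
        sb.append("Java version: ").append(System.getProperty("java.version")).append('\n');
        sb.append("Java vendor: ").append(System.getProperty("java.vendor")).append('\n');
        sb.append("OS: ").append(System.getProperty("os.name"))
          .append(' ').append(System.getProperty("os.version"))
          .append(" (").append(System.getProperty("os.arch")).append(")\n");
        // The default locale matters here because the exception names a culture (en-HK).
        sb.append("Default locale: ").append(Locale.getDefault().toLanguageTag()).append('\n');
        sb.append("File encoding: ").append(System.getProperty("file.encoding")).append('\n');
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.print(describe());
    }
}
```

Running it on the same machine and JVM that produced the exception matters, since the default locale can differ between environments.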
@jeffreyjia2046
We could not replicate the exception with version 24.10 of the API, but we noticed that the output DOCX was not valid.
We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.
Issue ID(s): PDFJAVA-44552
You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.