Converting OCR'd PDF documents to Word loses the text layer

jeffreyjia2046 · December 2, 2024, 3:37am

I am using the code below to convert an OCR'd PDF to Word:

    Document asposeDocument = new Document(outFilePath);
    String newOutFilePath = basePath + outFileName + "." + SaveFormat.Doc;
    asposeDocument.save(newOutFilePath, SaveFormat.Doc);

However, although the Word document is generated successfully, the text layer is lost.

asad.ali · December 2, 2024, 9:45am

@jeffreyjia2046

Please try the method below to convert scanned, OCR'd PDFs to DOCX, and let us know if you still face any issues:

import com.aspose.pdf.*;

public class ConvertPdfToDocx {
    public static void main(String[] args) {
        // Specify the input PDF file
        String inputFilePath = "support_2.pdf";
        String outputFilePath = "output.docx";

        // Load the PDF document
        Document pdfDocument = new Document(inputFilePath);

        // Iterate through pages of the PDF
        for (Page page : pdfDocument.getPages()) {
            // Initialize the TextFragmentAbsorber
            TextFragmentAbsorber absorber = new TextFragmentAbsorber();
            page.accept(absorber);

            // Process each found text fragment
            for (TextFragment fragment : absorber.getTextFragments()) {
                fragment.getTextState().setRenderingMode(TextRenderingMode.FillText);
                fragment.getTextState().setFont(FontRepository.findFont("Arial"));
            }

            // Clear images from the page resources
            page.getResources().getImages().clear();
        }

        // Set up the DocSaveOptions
        DocSaveOptions saveOptions = new DocSaveOptions();
        saveOptions.setFormat(DocSaveOptions.DocFormat.DocX);
        saveOptions.setMode(DocSaveOptions.RecognitionMode.EnhancedFlow);
        saveOptions.setRelativeHorizontalProximity(2.5f);
        saveOptions.setRecognizeBullets(true);

        // Save the PDF as a DOCX file
        pdfDocument.save(outputFilePath, saveOptions);
    }
}

jeffreyjia2046 · December 2, 2024, 10:20am

I got the exception below:
Exception in thread "main" class com.aspose.pdf.internal.ms.System.lh: Culture Name: en-HK is not a supported culture
at com.aspose.pdf.internal.l70if.lh.lt(Unknown Source)
at com.aspose.pdf.internal.l70if.lh.(Unknown Source)
at com.aspose.pdf.internal.l70if.lh.lI(Unknown Source)
at com.aspose.pdf.internal.l72v.l1v.lk(Unknown Source)
at com.aspose.pdf.internal.l70if.lh.lu(Unknown Source)
at com.aspose.pdf.internal.ms.System.l10l.lf(Unknown Source)
at com.aspose.pdf.internal.l8t.l1t$lI.lI(Unknown Source)
at com.aspose.pdf.internal.l11n.le.lI(Unknown Source)
at com.aspose.pdf.internal.l11t.l0u.(Unknown Source)
at com.aspose.pdf.internal.l11t.l0u.(Unknown Source)
at com.aspose.pdf.internal.l8y.lf.lb(Unknown Source)
at com.aspose.pdf.internal.l11t.l0p.(Unknown Source)
at com.aspose.pdf.internal.l8y.lf.le(Unknown Source)
at com.aspose.pdf.internal.l3v.l1f.lI(Unknown Source)
at com.aspose.pdf.internal.l3v.l1f.(Unknown Source)
at com.aspose.pdf.ADocument.lf(Unknown Source)
at com.aspose.pdf.ADocument.(Unknown Source)
at com.aspose.pdf.Document.(Unknown Source)
at ConvertPdfToDocx.main(ConvertPdfToDocx.java:10)
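
Note: the message names the culture en-HK, which suggests (but does not prove) that the JVM default locale is involved. A minimal sketch of one thing that could be tried, assuming the exception depends on the default locale: run the document-loading step under a temporary locale such as en-US and restore the original afterwards. The class `LocaleWorkaround` and the helper `callWithLocale` are hypothetical names, not part of the Aspose API, and nothing in this thread confirms that this avoids the exception.

```java
import java.util.Locale;
import java.util.function.Supplier;

public class LocaleWorkaround {

    /**
     * Runs the task with the JVM default locale temporarily set to
     * 'temporary', then restores the previous default locale.
     * Hypothetical helper for illustration only.
     */
    static <T> T callWithLocale(Locale temporary, Supplier<T> task) {
        Locale previous = Locale.getDefault();
        Locale.setDefault(temporary);
        try {
            return task.get();
        } finally {
            // Always restore, even if the task throws
            Locale.setDefault(previous);
        }
    }

    public static void main(String[] args) {
        // Assumption: in the real program the task would load the PDF, e.g.
        //   () -> new com.aspose.pdf.Document("ocred.pdf")
        // Here the task only reports which locale it observed.
        String seen = callWithLocale(Locale.US,
                () -> Locale.getDefault().toLanguageTag());

        System.out.println("Locale inside task: " + seen);
        System.out.println("Locale afterwards:  "
                + Locale.getDefault().toLanguageTag());
    }
}
```

Note that `Locale.setDefault` changes the locale for the whole JVM, so other threads running at the same time would also see the temporary value.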

asad.ali · December 2, 2024, 7:18pm

@jeffreyjia2046

Can you please share your sample PDF document along with the environment information where you are using the code? We will test the scenario in our environment and address it accordingly.
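
For the environment information, a small sketch like the following could collect the basics from the JVM that runs the conversion. It uses only standard JDK calls. The class name `EnvironmentReport` and the key names are illustrative assumptions, and the Aspose.PDF library version would still need to be added by hand.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class EnvironmentReport {

    /** Collects basic JVM and OS details in a stable, printable order. */
    static Map<String, String> collect() {
        Map<String, String> info = new LinkedHashMap<>();
        info.put("java.version", System.getProperty("java.version"));
        info.put("java.vendor", System.getProperty("java.vendor"));
        info.put("os.name", System.getProperty("os.name"));
        info.put("os.version", System.getProperty("os.version"));
        // The default locale is relevant to culture-related errors
        info.put("default.locale", Locale.getDefault().toLanguageTag());
        info.put("default.charset", Charset.defaultCharset().name());
        return info;
    }

    public static void main(String[] args) {
        collect().forEach((key, value) ->
                System.out.println(key + " = " + value));
    }
}
```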

jeffreyjia2046 · December 3, 2024, 2:08am

ocred.pdf (9.2 MB)

asad.ali · December 3, 2024, 1:08pm

@jeffreyjia2046

We could not replicate the exception with version 24.10 of the API, but we noticed that the output DOCX was not valid.

We have opened the following new ticket(s) in our internal issue tracking system and will deliver their fixes according to the terms mentioned in Free Support Policies.

Issue ID(s): PDFJAVA-44552

You can obtain Paid Support Services if you need support on a priority basis, along with direct access to our Paid Support management team.