Free Support Forum - aspose.com

Getting Empty Searchable PDF from Aspose.Words to Aspose.OCR

Hi Team,

We want to convert DOC and DOCX files to searchable PDF, but the code below produces an empty PDF. Please let us know if we are missing any settings.

    public static void wordtopdf()
    {
        AsposeOcr api = new AsposeOcr();
        License lic = new License();
        lic.SetLicense("Aspose.OCR.NET.lic");
        List<RecognitionResult> recognitionResult = new List<RecognitionResult>();

        string path = @"D:\OCR\TIFFs\1Page.docx";
        // Read the DOCX into an input stream, then prepare the PDF output stream
        using (System.IO.MemoryStream DocInputStream = new MemoryStream(File.ReadAllBytes(path)))
        {
            using (MemoryStream WritePdfStream = new MemoryStream())
            {

                Aspose.Words.Loading.LoadOptions op = new Aspose.Words.Loading.LoadOptions();
                op.Encoding = System.Text.Encoding.UTF8;
                op.LoadFormat = Aspose.Words.LoadFormat.Docx;

                var doc = new Document(DocInputStream, op);

                doc.Save(WritePdfStream, Aspose.Words.SaveFormat.Pdf);

                recognitionResult.AddRange(api.RecognizePdf(WritePdfStream, new DocumentRecognitionSettings() { Language = Language.Eng, AllowedCharacters = CharactersAllowedType.ALL, UpscaleSmallFont = true }));

                AsposeOcr.SaveMultipageDocument(@"D:\OCR\TIFFs\1DocPageSearchble.pdf", Aspose.OCR.SaveFormat.Pdf, recognitionResult);


            }
        }
    }

1DocPageSearchble.pdf (1.2 KB)
1Page.docx (12.0 KB)

@Gpatil

The attached Word document already has searchable text. Could you please share why you want to process it with Aspose.OCR to generate the PDF document? You can simply convert such files to PDF using Aspose.Words and obtain a searchable PDF, because the input Word document already contains text content rather than images:

Aspose.Words.Document doc = new Aspose.Words.Document(dataDir + "input.docx");
doc.Save(dataDir + "output.pdf", Aspose.Words.SaveFormat.Pdf);
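
A separate detail worth ruling out in the first snippet: the PDF is written into a MemoryStream and that same stream is then handed on without being rewound. After a write, a MemoryStream's Position sits at the end, so any reader that starts from the current position sees no data. Whether RecognizePdf reads from the current position is an assumption here, not something confirmed in this thread. The sketch below uses only System.IO, with placeholder bytes and a hypothetical helper name (WriteThenRead), to show the effect:

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamRewindSketch
{
    // Hypothetical helper: writes placeholder bytes into a MemoryStream,
    // optionally rewinds it, then returns how many bytes a read yields.
    public static int WriteThenRead(bool rewind)
    {
        using (var stream = new MemoryStream())
        {
            // Placeholder content standing in for the PDF that doc.Save would write.
            byte[] payload = Encoding.ASCII.GetBytes("%PDF placeholder bytes");
            stream.Write(payload, 0, payload.Length); // Position is now at the end

            if (rewind)
            {
                stream.Position = 0; // rewind so the next reader starts at the first byte
            }

            byte[] buffer = new byte[payload.Length];
            return stream.Read(buffer, 0, buffer.Length);
        }
    }

    public static void Main()
    {
        Console.WriteLine(WriteThenRead(rewind: false)); // reads nothing: prints 0
        Console.WriteLine(WriteThenRead(rewind: true));  // reads the whole payload
    }
}
```

If the consumer does read from the current position, setting the output stream's Position to 0 right after saving to it would be the corresponding change. This may not be the whole story for documents that contain images.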

Hi @asad.ali

The attached document was just a sample to start with. We receive multi-page DOCX files that contain images, and we want to convert them to searchable PDF so that any text inside those images is searchable too.
Our application also converts most images to searchable PDF, so we want that consistency across all major formats.
For example:
1PagewithImage.docx (34.7 KB)

Hi @asad.ali
As suggested, I tried this too, but I am getting an empty or incorrect PDF. It seems the MemoryStream overload is not working as expected when the stream is given as input for recognition.

    public static void wordtopdf()
    {
        AsposeOcr api = new AsposeOcr();
        License lic = new License();


        Aspose.Words.License licw = new Aspose.Words.License();

        lic.SetLicense("Aspose.Total.NET.lic");
        licw.SetLicense("Aspose.Total.NET.lic");

        List<RecognitionResult> recognitionResult = new List<RecognitionResult>();

        string path = @"D:\OCR\TIFFs\2PagewithImage.docx";
        // Read the DOCX into an input stream, then prepare the PDF output stream
        using (System.IO.MemoryStream DocInputStream = new MemoryStream(File.ReadAllBytes(path)))
        {
            using (MemoryStream WritePdfStream = new MemoryStream())
            {
                Aspose.Words.Loading.LoadOptions op = new Aspose.Words.Loading.LoadOptions();
                op.Encoding = System.Text.Encoding.UTF8;
                op.LoadFormat = Aspose.Words.LoadFormat.Docx;

                //Saving Normal PDF
                Document docDisk = new Document(DocInputStream, op);
                docDisk.Save(@"D:\OCR\TIFFs\NonSearchable_Disk.pdf", Aspose.Words.SaveFormat.Pdf); // This works, but using this PDF as input to make it searchable does not (we are more interested in the MemoryStream overload)

                DocInputStream.Position = 0;
                Document docMem = new Document(DocInputStream, op);
                docMem.Save(WritePdfStream, Aspose.Words.SaveFormat.Pdf);  // This might be working, but there is some issue

                //Making Searchable PDF
                WritePdfStream.Position = 0;
                recognitionResult.AddRange(api.RecognizePdf(WritePdfStream, new DocumentRecognitionSettings() { Language = Language.Eng, AllowedCharacters = CharactersAllowedType.ALL, UpscaleSmallFont = true }));

                AsposeOcr.SaveMultipageDocument(@"D:\OCR\TIFFs\Searchble_Memory.pdf", Aspose.OCR.SaveFormat.Pdf, recognitionResult);


            }
        }
    }

Attaching the two input documents for the code above, along with the normal and searchable PDF outputs for each:
2PagewithImage_Searchable_Memory.pdf (1.2 KB)
2PagewithImage_NonSearchable_Disk.pdf (62.2 KB)
2PagewithImage.docx (35.2 KB)
1PagewithImage_Searchable_Memory.pdf (78.6 KB)
1PagewithImage_NonSearchable_Disk.pdf (45.1 KB)
1PagewithImage.docx (34.9 KB)

@Gpatil

We are checking it and will get back to you shortly.

Hi @asad.ali

Do we have any update on this? If you want, you can split the problem statement into two parts:

  1. Process a document with images to searchable PDF
  2. Process a document without images to searchable PDF

#2 is the most common scenario. Can you provide an update on this?

@Gpatil

We are checking and investigating the task as it has been logged under the ticket ID OCRNET-620 in our issue tracking system. We will be updating you in this forum thread as soon as the ticket is resolved. Please be patient and spare us some time.

We are sorry for the inconvenience.