Getting Empty Searchable PDF from Aspose.Words to Aspose.OCR

Hi Team,

We want to convert DOC and DOCX files to searchable PDF, but the code below produces an empty PDF. Kindly let us know if we are missing any settings.

public static void wordtopdf()
    {
        AsposeOcr api = new AsposeOcr();
        License lic = new License();
        lic.SetLicense("Aspose.OCR.NET.lic");
        List<RecognitionResult> recognitionResult = new List<RecognitionResult>();

        string path = @"D:\OCR\TIFFs\1Page.docx";
        // Initialize PDF output stream
        using (System.IO.MemoryStream DocInputStream = new MemoryStream(File.ReadAllBytes(path)))
        {
            using (MemoryStream WritePdfStream = new MemoryStream())
            {

                Aspose.Words.Loading.LoadOptions op = new Aspose.Words.Loading.LoadOptions();
                op.Encoding = System.Text.Encoding.UTF8;
                op.LoadFormat = Aspose.Words.LoadFormat.Docx;

                var doc = new Document(DocInputStream, op);

                doc.Save(WritePdfStream, Aspose.Words.SaveFormat.Pdf);

                recognitionResult.AddRange(api.RecognizePdf(WritePdfStream, new DocumentRecognitionSettings() { Language = Language.Eng, AllowedCharacters = CharactersAllowedType.ALL, UpscaleSmallFont = true }));

                AsposeOcr.SaveMultipageDocument(@"D:\OCR\TIFFs\1DocPageSearchble.pdf", Aspose.OCR.SaveFormat.Pdf, recognitionResult);


            }
        }
    }

1DocPageSearchble.pdf (1.2 KB)
1Page.docx (12.0 KB)

@Gpatil

The attached Word document already contains searchable text. Would you please share why you want to process it with Aspose.OCR to generate the PDF? You can simply convert such files to PDF using Aspose.Words and obtain a searchable PDF, because the input Word document already has text content rather than images:

Aspose.Words.Document doc = new Aspose.Words.Document(dataDir + "input.docx");
doc.Save(dataDir + "output.pdf", Aspose.Words.SaveFormat.Pdf);

Hi @asad.ali

The attached document was just a sample to start with. We receive multi-page DOCX files that contain images, and we want to convert them to searchable PDF so that any text inside those images is searchable too.
Also, our application already converts most images to searchable PDF, so we want that consistency across all major formats.
Like this:
1PagewithImage.docx (34.7 KB)

Hi @asad.ali
As suggested, I tried this too, but I am getting an empty or incorrect PDF. It seems the MemoryStream overload is not working as expected when used as input to RecognizePdf.

    public static void wordtopdf()
    {
        AsposeOcr api = new AsposeOcr();
        License lic = new License();


        Aspose.Words.License licw = new Aspose.Words.License();

        lic.SetLicense("Aspose.Total.NET.lic");
        licw.SetLicense("Aspose.Total.NET.lic");

        List<RecognitionResult> recognitionResult = new List<RecognitionResult>();

        string path = @"D:\OCR\TIFFs\2PagewithImage.docx";
        // Initialize PDF output stream
        using (System.IO.MemoryStream DocInputStream = new MemoryStream(File.ReadAllBytes(path)))
        {
            using (MemoryStream WritePdfStream = new MemoryStream())
            {
                Aspose.Words.Loading.LoadOptions op = new Aspose.Words.Loading.LoadOptions();
                op.Encoding = System.Text.Encoding.UTF8;
                op.LoadFormat = Aspose.Words.LoadFormat.Docx;

                //Saving Normal PDF
                Document docDisk = new Document(DocInputStream, op);
                docDisk.Save(@"D:\OCR\TIFFs\NonSearchable_Disk.pdf", Aspose.Words.SaveFormat.Pdf); // <<=== This WORKS, but using this PDF as input to make it searchable does not (we are more interested in the MemoryStream overload)

                DocInputStream.Position = 0;
                Document docMem = new Document(DocInputStream, op);
                docMem.Save(WritePdfStream, Aspose.Words.SaveFormat.Pdf);  // <<=== This might be working, but there is some issue

                //Making Searchable PDF
                WritePdfStream.Position = 0;
                recognitionResult.AddRange(api.RecognizePdf(WritePdfStream, new DocumentRecognitionSettings() { Language = Language.Eng, AllowedCharacters = CharactersAllowedType.ALL, UpscaleSmallFont = true }));

                AsposeOcr.SaveMultipageDocument(@"D:\OCR\TIFFs\Searchble_Memory.pdf", Aspose.OCR.SaveFormat.Pdf, recognitionResult);


            }
        }
    }

Attaching the 2 input documents for the code above and their outputs (normal PDF and searchable PDF):
2PagewithImage_Searchable_Memory.pdf (1.2 KB)
2PagewithImage_NonSearchable_Disk.pdf (62.2 KB)
2PagewithImage.docx (35.2 KB)
1PagewithImage_Searchable_Memory.pdf (78.6 KB)
1PagewithImage_NonSearchable_Disk.pdf (45.1 KB)
1PagewithImage.docx (34.9 KB)
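
A note on the `DocInputStream.Position = 0;` and `WritePdfStream.Position = 0;` lines in the code above: after anything writes to or reads from a `MemoryStream`, its position is left at the end of the data, so a consumer that starts reading from the current position sees nothing unless the stream is rewound first. The sketch below illustrates only that .NET behavior. It uses a stand-in byte payload instead of a real PDF, the class and method names (`StreamRewindDemo`, `BytesReadAfterWrite`) are hypothetical, and it does not call Aspose at all. Whether a particular Aspose overload rewinds the stream itself is not stated in this thread, so this assumes it may not.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamRewindDemo
{
    // Writes a payload into a MemoryStream, then reads from the current
    // position, the way a consumer handed the stream might do.
    // Returns the number of bytes the reader actually received.
    public static int BytesReadAfterWrite(bool rewind)
    {
        using (var stream = new MemoryStream())
        {
            // Stand-in bytes; a real pipeline would have a saved PDF here.
            byte[] payload = Encoding.ASCII.GetBytes("stand-in for a saved PDF");
            stream.Write(payload, 0, payload.Length); // position is now at the end

            if (rewind)
            {
                stream.Position = 0; // move back to the first byte
            }

            var buffer = new byte[payload.Length];
            return stream.Read(buffer, 0, buffer.Length);
        }
    }

    public static void Main()
    {
        // Without rewinding, the reader starts at the end and gets 0 bytes.
        Console.WriteLine(BytesReadAfterWrite(rewind: false));

        // After rewinding, the reader receives the whole payload.
        Console.WriteLine(BytesReadAfterWrite(rewind: true));
    }
}
```

Rewinding rules out one cause of an empty result, but it is only a precondition: it does not by itself guarantee that the recognition step finds anything to recognize.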

@Gpatil

We are checking it and will get back to you shortly.

Hi @asad.ali

Do we have any update on this? If you want, you can split the problem statement into 2 parts:

  1. process doc with image to searchable pdf
  2. process doc without image to searchable pdf

#2 is the most common scenario. Can you provide any update on this?

@Gpatil

We are checking and investigating the task as it has been logged under the ticket ID OCRNET-620 in our issue tracking system. We will be updating you in this forum thread as soon as the ticket is resolved. Please be patient and spare us some time.

We are sorry for the inconvenience.

Hi @asad.ali
Judging by the issue status, it seems to be resolved, but with Aspose.Words 22.12 I still get the same result. Could you please tell me about any new settings you have added?

@Gpatil

  1. We do not specialize in extracting text from PDFs. After using Aspose.Words, you get a PDF with combined (text and image) content.

Aspose.OCR can extract images and recognize the text on them, but it cannot extract existing text. The only setting you have to specify is the number of pages:

                recognitionResult.AddRange(api.RecognizePdf(WritePdfStream, new DocumentRecognitionSettings(0,3) { Language = Language.Eng, AllowedCharacters = CharactersAllowedType.ALL, UpscaleSmallFont = true }));

Your image is placed on the second page. The resulting PDF is attached.

  2. To get one complete PDF with text and images using Aspose.OCR, we advise converting your .docx file into images and then recognizing those images. For example:
  using (System.IO.MemoryStream DocInputStream = new MemoryStream(File.ReadAllBytes(path)))
            {
                using (MemoryStream WritePdfStream = new MemoryStream())
                {
                    Aspose.Words.Loading.LoadOptions op = new Aspose.Words.Loading.LoadOptions();
                    op.Encoding = System.Text.Encoding.UTF8;
                    op.LoadFormat = Aspose.Words.LoadFormat.Docx;

                    //Saving image
                    Document docDisk = new Document(DocInputStream, op);
                    docDisk.Save(@"D:\imgs\ISSUES\NET620\res\NonSearchable_Disk.tiff", Aspose.Words.SaveFormat.Tiff); // <<=== for example TIFF is multipage format 
                    recognitionResult.AddRange(api.RecognizeTiff(@"D:\imgs\ISSUES\NET620\res\NonSearchable_Disk.tiff", new DocumentRecognitionSettings(0,3) { Language = Language.Eng, DetectAreasMode = DetectAreasMode.TABLE})); // TABLE mode is better for mixed text and table content

                    AsposeOcr.SaveMultipageDocument(@"D:\imgs\ISSUES\NET620\res\Searchble_Memory.pdf", Aspose.OCR.SaveFormat.Pdf, recognitionResult);


                }
            }

Or you can choose another approach. Just keep in mind that Aspose.OCR is an image recognition library and, for now, it cannot extract text from a PDF that is already searchable or mixed.
Searchble_Memory.pdf (110.7 KB)
Searchble_Memory1.pdf (78.9 KB)

Hi @asad.ali Thank you, I will try this soon

@Gpatil

Sure, please take your time to test the case and feel free to create a new topic in case you need further assistance.