Create Searchable PDF documents using Aspose.PDF for .NET - System.ArgumentException

Dear Team
I have a problem creating searchable PDFs for certain files. To create searchable PDFs from scanned documents, I am using hOCR and Tesseract. For some files I get the error message 'Parameter is not valid.'

Please find the detailed stack trace below:

System.ArgumentException
  HResult=0x80070057
  Message=Parameter is not valid.
  Source=System.Drawing
  StackTrace:
   at System.Drawing.Bitmap.SetResolution(Single xDpi, Single yDpi)
   at Aspose.Pdf.ImagePlacement.Save(Stream stream, ImageFormat format)
   at #=z8_WmhwQTmNzHCFQbnoC65BcGB6$20FiG1gNAyXtkorvusJDjylcMpxQ=.#=zIBSmXow=(CallBackGetHocr #=zf4y0VK0aiMHb, Document #=zF4oCFfU=)
   at Aspose.Pdf.Document.Convert(CallBackGetHocr callback)
   at AsponsePDF.Program.Main(String[] args) in d:\raj chidara backup\growing_files\research\AsponsePDF\AsponsePDF\Program.cs:line 62

Regards
Raj

@crshekharam

Thank you for contacting support.

Would you please share SSCCE code along with a sample document so that we may try to reproduce and investigate the issue in our environment? Before sharing the requested data, please make sure you are using Aspose.PDF for .NET 19.11.

static void Main(string[] args)
{
    string FileIn = @"D:\Raj Chidara Backup\growing_files\Research\PDF\3.2.R Certificate of Analysis of Vanilla Flavour with Diacetyl.pdf";
    string FileOut = @"D:\Raj Chidara Backup\growing_files\Research\PDF\3.2.R Certificate of Analysis of Vanilla Flavour with Diacetyl_cp.pdf";
    Aspose.Pdf.License license = new Aspose.Pdf.License();
    license.SetLicense("Aspose.Total.lic");
    license.Embedded = true;
    Document doc = new Document(FileIn);
    doc.Convert(CallBackGetHocr);
    doc.Save(FileOut);
}

private static string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = "D:\\Raj Chidara Backup\\growing_files\\Research\\PDF\\";
    img.Save(dir + "workingfolder\\test.jpg");
    ProcessStartInfo info = new ProcessStartInfo("d:\\Program Files\\Tesseract-OCR\\tesseract");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = "\"" + dir + "workingfolder\\test.jpg\" \"" + dir + "workingfolder\\out\" hocr";
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();
    StreamReader streamReader = new StreamReader(dir + "workingfolder\\out.hocr");
    string text = streamReader.ReadToEnd();
    streamReader.Close();
    return text;
}
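
For readers adapting the callback above, here is a minimal sketch of a more defensive way to invoke Tesseract from it. It is not part of the original report and does not address the ArgumentException, which the stack trace places inside Aspose.Pdf.ImagePlacement.Save rather than inside the callback. The class name TesseractRunner, its methods, and the temp-file naming are hypothetical; the executable path is the one used in this thread. The sketch writes each result to a unique output base, so a stale out.hocr from an earlier page is never read back, and it checks the exit code before reading the result.

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Hypothetical helper, not an Aspose or Tesseract API.
public static class TesseractRunner
{
    // Install location taken from the thread; adjust for your machine.
    public const string DefaultExe = @"d:\Program Files\Tesseract-OCR\tesseract";

    // Builds "<image> <outputBase> <format>" with both paths quoted,
    // so that paths containing spaces are passed as single arguments.
    public static string BuildArguments(string imagePath, string outputBase, string format)
    {
        return "\"" + imagePath + "\" \"" + outputBase + "\" " + format;
    }

    // Runs Tesseract on one image and returns the hOCR text it wrote.
    public static string GetHocr(string exePath, string imagePath)
    {
        // Unique output base per call; Tesseract appends ".hocr" itself.
        string outputBase = Path.Combine(Path.GetTempPath(), "hocr_" + Guid.NewGuid().ToString("N"));
        string hocrPath = outputBase + ".hocr";

        var info = new ProcessStartInfo(exePath, BuildArguments(imagePath, outputBase, "hocr"))
        {
            UseShellExecute = false,
            CreateNoWindow = true,
            RedirectStandardError = true
        };

        try
        {
            using (Process p = Process.Start(info))
            {
                // Drain stderr before waiting so a full pipe cannot block the child.
                string errors = p.StandardError.ReadToEnd();
                p.WaitForExit();
                if (p.ExitCode != 0 || !File.Exists(hocrPath))
                    throw new InvalidOperationException(
                        "Tesseract failed (exit code " + p.ExitCode + "): " + errors);
            }
            return File.ReadAllText(hocrPath);
        }
        finally
        {
            // Remove the temporary result whether or not the call succeeded.
            if (File.Exists(hocrPath)) File.Delete(hocrPath);
        }
    }
}
```

Inside CallBackGetHocr, one would save img to a uniquely named temporary file and return TesseractRunner.GetHocr(TesseractRunner.DefaultExe, thatFile). Failing loudly on a non-zero exit code makes it easier to tell an OCR problem apart from a problem raised inside Document.Convert.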

3.2.R Certificate of Analysis of Vanilla Flavour with Diacetyl.pdf (391.3 KB)

@crshekharam

We have logged a ticket with ID PDFNET-47315 in our issue management system for further investigations. The ticket ID has been linked with this thread so that you will receive notification as soon as the ticket is resolved.

We are sorry for the inconvenience.

Is there any update on this issue?

@crshekharam

We regret to share that the earlier logged issue is not yet resolved due to other high-priority tasks and issues. We will inform you as soon as we have updates regarding the resolution of the ticket. Please spare us some time.

We are sorry for the inconvenience.

I was able to generate images from the current PDF (using Aspose) and then convert the images to a searchable PDF (using Tesseract). The problem now is merging this new PDF with the original PDF. Do you have any facility to overlay the invisible text from the PDF generated by Tesseract onto the original PDF so that the original PDF becomes searchable?

@crshekharam

The code snippet shared above generates a searchable PDF, and with your document it throws an exception, for which the ticket has been logged. However, if you are able to run the above code snippet without any exception and the output is still not being generated, would you kindly share a sample console application with its source files? We will test the scenario in our environment and address it accordingly.

No, my original issue is not resolved. As a workaround, we took an altogether new approach with new code. We converted our PDF to images and then converted the images to a searchable PDF (using Tesseract). The problem now is merging this new PDF with the original PDF. Do you have any facility to overlay the invisible text from the PDF generated by Tesseract onto the original PDF so that the original PDF becomes searchable?

    public static void GenImages(String FileIn, String FileOut)
    {
        Document pdfDocument = new Document(FileIn);
        Document pdfOutDocument = new Document();
        DateTime dt = DateTime.Now;
        Console.WriteLine("Start Time:" + dt);
        for (int pageCount = 1; pageCount <= pdfDocument.Pages.Count; pageCount++)
        {
            // Render the current page to a JPEG file (overwritten on every iteration)
            using (FileStream imageStream = new FileStream(@"c:\users\raja_c\downloads\hocrimages\" + "hocrimages_out.jpg", FileMode.Create))
            {
                // Create Resolution object (150 DPI)
                Resolution resolution = new Resolution(150);

                // Create JPEG device with the given resolution and quality
                // Quality [0-100], 100 is maximum
                // JpegDevice jpegDevice = new JpegDevice(500, 700, resolution, 100);
                JpegDevice jpegDevice = new JpegDevice(resolution, 50);

                // Convert the current page and save the image to the stream;
                // the using block closes the stream afterwards
                jpegDevice.Process(pdfDocument.Pages[pageCount], imageStream);
            }

            // Run Tesseract on the image to produce a searchable single-page PDF
            ProcessStartInfo info = new ProcessStartInfo("d:\\Program Files\\Tesseract-OCR\\tesseract");
            info.WindowStyle = ProcessWindowStyle.Hidden;
            info.Arguments = @"c:\users\raja_c\downloads\hocrimages\hocrimages_out.jpg c:\users\raja_c\downloads\hocrimages\hocrimages_out pdf";
            Process p = new Process();
            p.StartInfo = info;
            p.Start();
            p.WaitForExit();

            // Append the OCR'd page to the output document
            Document pd = new Document(@"c:\users\raja_c\downloads\hocrimages\hocrimages_out.pdf");
            pdfOutDocument.Pages.Add(pd.Pages);
        }
        pdfOutDocument.Save(FileOut);
        Console.Write("Time Taken:");
        Console.WriteLine(DateTime.Now - dt);
        Console.ReadLine();
    }

@crshekharam

We are afraid that Aspose.PDF does not provide any functionality to add a text layer to a PDF to make it searchable, other than the method already shared in this forum thread. However, we have logged your concerns along with the earlier logged issue and will let you know as soon as we have definite updates regarding its resolution. Please spare us some time.

We are sorry for the inconvenience.

@Farhan.Raza, is there any update on this issue?

@crshekharam

We regret to share that the earlier logged ticket is not yet resolved. Please note that it has not been fully investigated, and we are unable to share a reliable ETA at the moment. As soon as the ticket is fully analyzed, we will share updates with you. We highly appreciate your patience and understanding in this regard. Please give us some time.

We are sorry for the inconvenience.