hOCR and searchable PDF

ThomasNk · August 31, 2016, 8:00am

Hi,

I am trying to convert a PDF into a searchable PDF (Aspose.Pdf 11.9.0) using hOCR generated by Tesseract 3.0.4.
- With HTML hOCR: nothing happens. The PDF is the same as before the conversion.
- With XHTML hOCR: the Convert method throws a FormatException.

You can reproduce the issue using the attached project.

Here is a sample of my code:

public void Save(Func<int, Stream> getStream)
{
    using (var s = getStream(0))
    {
        // hocrTesseract is the callback that returns the hOCR text for a page image.
        this.asposeDoc.Convert(hocrTesseract);
        this.asposeDoc.Save(s);
    }
}

private string hocrTesseract(System.Drawing.Image img)
{
    // "fra" selects the French language data; the data path is elided here.
    using (var ocr = new TesseractEngine(@"(...)", "fra", EngineMode.Default))
    using (var bitmap = new Bitmap(img))
    using (var page = ocr.Process(bitmap))
    {
        return page.GetHOCRText(0);
    }
}

codewarior · September 1, 2016, 10:59am

Hi Thomas,

Thanks for using our APIs.

I have tested the scenario and managed to reproduce the problem. I have logged it as PDFNET-41368 in our issue tracking system. We will look further into the details and keep you posted on the status of the fix. Please be patient and give us a little time. We are sorry for the inconvenience.

ThomasNk · December 23, 2016, 9:32am

Hi,

It is the same issue as PDFNET-41118.

I use a regex to work around it:

text = Regex.Replace(text, @"; x_wconf \d+", "");
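
For anyone who wants to reuse this, here is a minimal sketch of the workaround as a standalone helper. The class and method names (HocrCleanup, StripWordConfidence) are hypothetical and not part of Aspose.Pdf or Tesseract. It only removes the "; x_wconf NN" property from the hOCR text, using the same pattern as above.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper for illustration; not an Aspose.Pdf or Tesseract API.
public static class HocrCleanup
{
    // Matches the word-confidence property that Tesseract appends to
    // hOCR title attributes, e.g. "; x_wconf 87".
    private static readonly Regex WordConfidence = new Regex(@"; x_wconf \d+");

    // Returns the hOCR text with every "; x_wconf NN" property removed.
    // The rest of the markup (including the bbox values) is left untouched.
    public static string StripWordConfidence(string hocr)
    {
        return WordConfidence.Replace(hocr, "");
    }
}
```

In the code from my first post, this would be applied to the result of page.GetHOCRText(0) before returning it from hocrTesseract. For example, title='bbox 36 92 96 116; x_wconf 87' becomes title='bbox 36 92 96 116'.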


tilal.ahmad · December 26, 2016, 12:50am

Hi Thomas,

Thanks for sharing your findings; it is good to know that you have found a workaround. We have passed your findings on to our product team, and they will consider them while investigating and resolving the issue in Aspose.Pdf.

Best Regards,

asad.ali · August 18, 2023, 8:51pm

@ThomasNk

We have tested the case with the latest version, 23.8, and could not reproduce the issue. Please use the latest version and let us know if you face any issues.