Free Support Forum - aspose.com

Embed HOCR into PDF

Hi,

We are trying to embed hOCR content into a PDF file.
For most of our customer's files, the output PDF is generated correctly.
But for one specific layout, the text is placed at the bottom right.

We are using Aspose.Pdf 20.12.
The code we use is
using (var pdf = new Aspose.Pdf.Document(@"C:\Aspose\test.pdf"))
{
    pdf.Convert((image) =>
    {
        return Tesseract.ToHocr(image);
    });
    pdf.Save(@"C:\Aspose\test_with_hocr.pdf");
}

I’m attaching the input (test.pdf), the output (test_with_hocr.pdf), and the HOCR (hocr.html) generated by Tesseract.
You can avoid using Tesseract by returning the contents of the file “hocr.html”.
Aspose HOCR examples.zip (181.7 KB)

Thanks for looking into this.

@bensmartdoc

We were able to observe the issue in our environment. Could you please also share the version or download link of the Tesseract build you are using, with which the sample HTML file was generated? We will log an issue with the information needed for investigation and share the ticket ID with you.

We are using Tesseract v5.0.0-alpha.2020112.

@bensmartdoc

Thanks for sharing the requested information.

We have tested the scenario in our environment using Aspose.PDF for .NET 20.12 and the following code snippet. Could you please check the attached output PDF document and let us know if you notice any issue in it?

test_searchable.pdf (130.1 KB)

private static void CreateSearchablePDF(string dataDir)
{
 Document doc = new Document(dataDir + @"test.pdf");
 doc.Convert(CallBackGetHocr);
 doc.Save(dataDir + "test_searchable.pdf");
}

static string CallBackGetHocr(System.Drawing.Image img)
{
 return File.ReadAllText(@"G:\D Drive\Recent Working\Aspose\Aspose.Pdf-for-.NET-master\Examples\Data\AsposePDF\Annotations\hocr.html");
}

Your output pdf seems to be correct.
What is the difference between your code and mine?

@bensmartdoc

The code snippet is not much different. We also tried the snippet below and did not notice any issue during testing:

using (var pdf = new Aspose.Pdf.Document(dataDir + @"test.pdf"))
{
 pdf.Convert((image) =>
 {
  return File.ReadAllText(dataDir + @"hocr.html");
 });
 pdf.Save(dataDir + "test_searchable.pdf");
}

Please also note that we first tested the scenario using Tesseract and found that hocr.html was not being created at our end due to an environment-related issue.

However, the output was generated correctly when we skipped Tesseract and used the already generated file you shared. Could you please try the case in a simple, separate console application and share that application with us if the issue persists? We will then test again at our end and share our feedback with you.

Hi,

We created a console application and could not reproduce the behavior we see in our app.
In the meantime, we upgraded to Aspose.PDF 21.1.

console application:

using (var pdf = new Aspose.Pdf.Document(@"D:\HOCr\test\Laga 1.pdf"))
{
    pdf.Convert((image) =>
    {
        return File.ReadAllText(@"D:\HOCr\test\Laga 1-p1.hocr");
    });
    pdf.Save(@"D:\HOCr\test\Laga 1-converted-console.pdf");
}

in our application:

using (var pdf = new Aspose.Pdf.Document(@"D:\HOCr\test\Laga 1.pdf"))
{
    pdf.Convert((image) =>
    {
        return File.ReadAllText(@"D:\HOCr\test\Laga 1-p1.hocr");
    });
    pdf.Save(@"D:\HOCr\test\Laga 1-converted-ourprogram.pdf");
}

These are the input and result files.
Hocr second test.zip (412.8 KB)

As you can see, the same code generates a different pdf file.
Is there a specific state Aspose uses that could explain this result?
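
One way to narrow this down might be to log, in both programs, the size of the image passed to the callback next to the page box declared in the hOCR. The sketch below is only a diagnostic and rests on assumptions: `HocrCheck`, `PageBBox` and `Describe` are hypothetical names (not part of Aspose.PDF or Tesseract), and it assumes a Tesseract-style `ocr_page` element whose `title` carries `bbox x0 y0 x1 y1` after the `class` attribute. The sample string in `Main` is made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper for diagnosis only; not an Aspose.PDF or Tesseract API.
static class HocrCheck
{
    // Reads "bbox x0 y0 x1 y1" from the first ocr_page element; returns null if none is found.
    public static int[] PageBBox(string hocr)
    {
        Match m = Regex.Match(
            hocr,
            @"class=['""]ocr_page['""][^>]*?bbox\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)");
        if (!m.Success) return null;

        var box = new int[4];
        for (int i = 0; i < 4; i++) box[i] = int.Parse(m.Groups[i + 1].Value);
        return box;
    }

    // Builds one log line comparing the callback image size with the hOCR page size.
    public static string Describe(int imageWidth, int imageHeight, string hocr)
    {
        int[] b = PageBBox(hocr);
        if (b == null) return $"image {imageWidth}x{imageHeight}, no ocr_page bbox found";

        int w = b[2] - b[0];
        int h = b[3] - b[1];
        string verdict = (w == imageWidth && h == imageHeight) ? "match" : "MISMATCH";
        return $"image {imageWidth}x{imageHeight}, hOCR page {w}x{h}: {verdict}";
    }
}

static class Program
{
    static void Main()
    {
        // Made-up sample in the shape Tesseract emits; real input would be the .hocr file.
        string sample =
            "<div class='ocr_page' id='page_1' title='image \"p.png\"; bbox 0 0 2480 3508; ppageno 0'>";

        Console.WriteLine(HocrCheck.Describe(2480, 3508, sample));
        Console.WriteLine(HocrCheck.Describe(1240, 1754, sample));
    }
}
```

Inside the `Convert` callback shown above, this could be called as `HocrCheck.Describe(image.Width, image.Height, hocr)` just before returning `hocr`. If the console application and our application print different lines for the same input, the difference would lie in the image handed to the callback rather than in the hOCR file itself.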