Embed HOCR into PDF

bensmartdoc · January 11, 2021, 8:22am

Hi,

We are trying to embed hocr content into a pdf file.
For most files of our customer, the output pdf is correctly generated.
But for one specific layout, the text is put at the bottom right.

We are using Aspose.Pdf 20.12.
The code we use is
using (var pdf = new Aspose.Pdf.Document(@"C:\Aspose\test.pdf"))
{
    pdf.Convert((image) =>
    {
        return Tesseract.ToHocr(image);
    });
    pdf.Save(@"C:\Aspose\test_with_hocr.pdf");
}

I’m attaching the input (test.pdf), the output (test_with_hocr.pdf), and the HOCR (hocr.html) generated by Tesseract.
You can avoid using Tesseract by returning the contents of the file “hocr.html”.
Aspose HOCR examples.zip (181.7 KB)

Thanks for looking into this.

asad.ali · January 11, 2021, 8:40pm

@bensmartdoc

We were able to reproduce the issue in our environment. Could you please also share the version or download link of the Tesseract build you are using, with which the sample HTML file was generated? We will log an issue with the information needed for investigation and share the ticket ID with you.

bensmartdoc · January 12, 2021, 7:56am

We are using Tesseract v5.0.0-alpha.2020112.
Download

asad.ali · January 12, 2021, 5:48pm

@bensmartdoc

Thanks for sharing the requested information.

We have tested the scenario in our environment using Aspose.PDF for .NET 20.12 and the following code snippet. Would you please check the attached output PDF document and let us know if you notice any issue in it?

test_searchable.pdf (130.1 KB)

private static void CreateSearchablePDF(string dataDir)
{
 Document doc = new Document(dataDir + @"test.pdf");
 doc.Convert(CallBackGetHocr);
 doc.Save(dataDir + "test_searchable.pdf");
}

static string CallBackGetHocr(System.Drawing.Image img)
{
 return File.ReadAllText(@"G:\D Drive\Recent Working\Aspose\Aspose.Pdf-for-.NET-master\Examples\Data\AsposePDF\Annotations\hocr.html");
}

bensmartdoc · January 13, 2021, 1:46pm

Your output pdf seems to be correct.
What is the difference between your code and mine?

asad.ali · January 13, 2021, 2:04pm

@bensmartdoc

The code snippet is not much different. Also, we tried the code snippet as below and did not notice any issue during testing:

using (var pdf = new Aspose.Pdf.Document(dataDir + @"test.pdf"))
{
 pdf.Convert((image) =>
 {
  return File.ReadAllText(dataDir + @"hocr.html");
 });
 pdf.Save(dataDir + "test_searchable.pdf");
}

Please also note that we first tested the scenario using Tesseract and found that hocr.html was not being created at our end due to some environment-related issue.

However, the output was generated correctly when we skipped Tesseract and used the already generated file you shared. Could you please test the case separately in a simple console application and share that application with us if the issue persists? We will then test again at our end and share our feedback with you.

bensmartdoc · January 22, 2021, 8:22am

Hi,

We created a console application and could not reproduce the behavior we see in our app.
In the meantime, we upgraded to Aspose.PDF 21.1.

console application:

using (var pdf = new Aspose.Pdf.Document(@"D:\HOCr\test\Laga 1.pdf"))
{
    pdf.Convert((image) =>
    {
        return File.ReadAllText(@"D:\HOCr\test\Laga 1-p1.hocr");
    });
    pdf.Save(@"D:\HOCr\test\Laga 1-converted-console.pdf");
}

in our application:

using (var pdf = new Aspose.Pdf.Document(@"D:\HOCr\test\Laga 1.pdf"))
{
    pdf.Convert((image) =>
    {
        return File.ReadAllText(@"D:\HOCr\test\Laga 1-p1.hocr");
    });
    pdf.Save(@"D:\HOCr\test\Laga 1-converted-ourprogram.pdf");
}

These are the input and result files.
Hocr second test.zip (412.8 KB)

As you can see, the same code generates a different pdf file.
Is there a specific state Aspose uses that could explain this result?

bensmartdoc · January 22, 2021, 2:33pm

We did some more research.

This code with the specified input files will result in a wrong placement of the text.

using (var pdf = new Aspose.Pdf.Document(@"D:\HOCr\test\Laga 1 bis.pdf"))
{
    pdf.Convert((image) =>
    {
        return File.ReadAllText(@"D:\HOCr\test\Laga 1-p1.hocr");
    });
    pdf.Save(@"D:\HOCr\test\Laga 1 bis-console.pdf");
}

HOCR third test.zip (227.1 KB)

The source of this issue appears to be the assumption that the graphics state is pristine at the end of a page's content stream. A document creator is not required to restore it.

This could easily be solved by making the text content the first object on the page, so that the graphics state is always known.
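
Until that is changed in the library, a possible caller-side workaround is to wrap each page's existing content in a save/restore pair (the PDF `q` and `Q` operators) before calling Convert, so that whatever is appended afterwards starts from the default graphics state. The following is only a minimal sketch under assumptions: `IsolateGraphicsState` is a hypothetical helper name, it assumes that Convert appends the text after the existing page content, and it has not been verified against the attached files. It also does not help if the page content itself contains unbalanced `q`/`Q` operators.

```csharp
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Operators;

static class HocrWorkaround
{
    // Hypothetical helper: encloses the existing operators of every page
    // in q ... Q so that transformations set by the page content do not
    // leak into anything appended later.
    static void IsolateGraphicsState(Document pdf)
    {
        foreach (Page page in pdf.Pages)
        {
            if (page.Contents.Count == 0)
                continue; // nothing to isolate on an empty page

            // Operator collections in Aspose.PDF are 1-based.
            page.Contents.Insert(1, new GSave());   // q
            page.Contents.Add(new GRestore());      // Q
        }
    }

    static void Main()
    {
        using (var pdf = new Document(@"D:\HOCr\test\Laga 1 bis.pdf"))
        {
            IsolateGraphicsState(pdf);

            // Assumption: Convert writes the hOCR text after the
            // (now balanced) page content.
            pdf.Convert(image => File.ReadAllText(@"D:\HOCr\test\Laga 1-p1.hocr"));

            pdf.Save(@"D:\HOCr\test\Laga 1 bis-console.pdf");
        }
    }
}
```

If the text still lands in the wrong place with this wrapper, the leftover state is probably not the only factor, and the page content would need closer inspection.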

asad.ali · January 22, 2021, 8:56pm

@bensmartdoc

Thanks for getting back to us.

We were able to reproduce the issue in our environment and have logged it as PDFNET-49302 in our issue tracking system for the sake of further investigation. We will look into its details and keep you posted with the status of its correction. Please be patient and spare us some time.

We are sorry for the inconvenience.

asad.ali · August 18, 2023, 8:48pm

@bensmartdoc

While testing the scenario with the latest version, the issue was not reproduced. It looks like the HOCR file provided by you needs to be regenerated.