Create a searchable, (auto) tagged PDF via hOCR

Dear Aspose support team,

I create a searchable PDF in the following way with Aspose.PDF Version 21.7.0 in C#:

        const string dataDir = @"C:\Temp\_\Aspose";

        var document = new Aspose.Pdf.Document();
        var page = document.Pages.Add();

        // Page size in points (1/72 inch): 2550 x 3300 pixels at 300 DPI.
        var width = 2550.0 * 72.0 / 300.0;
        var height = 3300.0 * 72.0 / 300.0;

        page.SetPageSize(width, height);

        page.PageInfo.Margin.Bottom = 0;
        page.PageInfo.Margin.Top = 0;
        page.PageInfo.Margin.Left = 0;
        page.PageInfo.Margin.Right = 0;

        // Add the scanned image together with its hOCR text layer, covering the whole page.
        using (var stream = File.OpenRead(System.IO.Path.Combine(dataDir, "0.tif")))
        {
            var hocr = File.ReadAllText(System.IO.Path.Combine(dataDir, "0.hOCR.html"));
            page.AddImage(hocr, stream, new Aspose.Pdf.Rectangle(0, 0, width, height));
        }

        document.Convert(System.IO.Path.Combine(dataDir, "log.xml"), Aspose.Pdf.PdfFormat.PDF_A_1A, Aspose.Pdf.ConvertErrorAction.Delete);

        // File.Create truncates an existing file; File.OpenWrite would leave stale trailing bytes.
        using (var output = File.Create(System.IO.Path.Combine(dataDir, "pdf.pdf")))
        {
            document.Save(output);
        }

This works perfectly. Now I would like to add tags, preferably automatically, but I have not found a way to do so. The following Convert call did not produce the desired result:

        document.Convert(@"C:\Temp\_\Aspose\log.xml", Aspose.Pdf.PdfFormat.PDF_UA_1, Aspose.Pdf.ConvertErrorAction.Delete);

I came across the following approach:

(Document Accessibility with Aspose.PDF for .NET)

With this approach, however, the OCR text coordinates are lost. It seems to me that tagged content can only be added in “flow” layout, not at fixed coordinates. I also cannot find an option to make the text invisible, as it is in the source document.

Does Aspose.PDF for .NET provide a way to accomplish this?

@KlausH

Could you please share sample file(s) along with some screenshots of the issue that you are facing? We will test the scenario in our environment using your code snippet and address it accordingly.

Thank you for taking care of this case. The referenced files 0.tif and 0.hOCR.html are in the zip 0.zip (38.0 KB). Note that the image size has changed for this example: it is now 2479 x 3504.
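
As a worked check of the page size for the new image, here is the same pixel-to-point conversion used in the code above. It assumes the scan resolution is still 300 DPI, which is an assumption, since the resolution of the new file is not stated here:

```latex
\begin{align*}
\text{width}  &= 2479 \times \frac{72}{300} = 594.96\ \text{pt} \\
\text{height} &= 3504 \times \frac{72}{300} = 840.96\ \text{pt}
\end{align*}
```

Under that assumption, the `width` and `height` values in the snippet would be 594.96 and 840.96 instead of 612 and 792.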

This is the PDF I would like to have: a searchable PDF with tags: expected.pdf (122.1 KB)

It has tags: exptected tags.png (46.5 KB)

Please also note: if you open expected.pdf in Adobe Reader and do a Select All on the text, the coordinates of the selected letters match the letters in the scanned image.

Here is the result generated with the code above: result.pdf (147.8 KB)

It has no tags, which is expected because the code does not add any: no tags.png (11.9 KB)

Now I would like to know: what do I have to do to get a PDF like expected.pdf, that is, a searchable PDF with tags that also has the correct text coordinates (the “Select All” test)?

@KlausH

We were able to reproduce a similar issue in our environment while testing the scenario with Aspose.PDF for .NET 21.8. We have therefore logged it as PDFNET-50485 in our issue tracking system. We will look into the details and keep you posted on the status of the fix. Please be patient and allow us some time.

We are sorry for the inconvenience.

Thank you very much. Since we need this feature in the upcoming release of our software, it would be great if you could add it to Aspose.PDF as soon as possible. Could you give a rough estimate of the date and the Aspose.PDF version?

@KlausH

We are afraid that we cannot share a reliable ETA before the issue has been investigated. Please note that the issue has been logged under the free support model, where issues are investigated and resolved on a first-come, first-served basis. The resolution time depends on the complexity of the issue and the number of issues logged before it, unlike priority support, where issues have high priority and are resolved on an urgent basis.

We will inform you as soon as we have definite updates regarding the resolution of the ticket. Please allow us some time.

We are sorry for the inconvenience.

Hi Asad,

Thank you very much. Ok, then I would like to “convert” this into a priority support case. How can I do that? I still need to discuss this with my manager as well. I think we still have tickets available.

@KlausH

You can create a post in the priority support forum and include the ticket ID in it; your issue will then be escalated accordingly.