HOCR without image processing

snewby · December 20, 2017, 4:04pm

The current Java mechanism for overlaying HOCR text on documents is to use the Document.CallBackGetHocr interface with the Document.convert method. This requires decoding each image, which can be quite CPU- and memory-intensive.

If we already have the HOCR for the document, is it possible just to overlay the text without using the callback/image decoding? I simply want to add the text at the required positions.

imran.rafique · December 21, 2017, 12:11am

@snewby,

You can retrieve the rectangular position of the image, delete the image, and then add text at that rectangle's position. Please refer to these help topics: Retrieve the rectangular position of an image, Add Text with TextParagraph, and Delete Images from a PDF File.

snewby · December 21, 2017, 2:24pm

Those are for the .NET API, but I think I’ve found the Java equivalent. It seems there are more factors to consider here (transparent text, text positioning relative to the image x/y, font scaling to fit the bounding box, etc.). I thought there might be a way to simply provide the HOCR without the image decoding, but maybe not?
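
To make the positioning and scaling factors concrete, here is a minimal sketch using only the Java standard library. It assumes the hOCR convention of a title attribute containing "bbox x0 y0 x1 y1" in image pixels with a top-left origin, and it assumes the scanned image fills the whole PDF page at a known DPI. The class HocrBox and its methods are hypothetical helpers for illustration, not part of any Aspose API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: maps hOCR pixel boxes to PDF user-space rectangles. */
final class HocrBox {
    // hOCR stores each box in the title attribute as "bbox x0 y0 x1 y1".
    private static final Pattern BBOX =
            Pattern.compile("bbox\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)\\s+(\\d+)");

    /** Returns {x0, y0, x1, y1} in pixels, or null if the title has no bbox. */
    static int[] parseBbox(String title) {
        Matcher m = BBOX.matcher(title);
        if (!m.find()) {
            return null;
        }
        return new int[] {
            Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4))
        };
    }

    /**
     * Converts a pixel box (top-left origin) to {llx, lly, urx, ury} in PDF
     * points (bottom-left origin). Assumption: the image covers the full page.
     */
    static double[] toPdfRect(int[] px, double dpi, double pageHeightPt) {
        double scale = 72.0 / dpi; // 72 PDF points per inch
        return new double[] {
            px[0] * scale,
            pageHeightPt - px[3] * scale, // bottom pixel edge becomes lower y
            px[2] * scale,
            pageHeightPt - px[1] * scale  // top pixel edge becomes upper y
        };
    }

    /**
     * Font size at which a word spans the box width. widthAtUnitSize is the
     * word's width at font size 1, which must come from real font metrics.
     */
    static double fitFontSize(double boxWidthPt, double widthAtUnitSize) {
        return boxWidthPt / widthAtUnitSize;
    }

    public static void main(String[] args) {
        int[] px = parseBbox("bbox 100 200 400 260; x_wconf 95");
        double[] rect = toPdfRect(px, 300.0, 792.0);
        System.out.println(java.util.Arrays.toString(rect));
    }
}
```

If the image is placed at an offset or scaled on the page, the rectangle would also need that image's own position and size applied, which this sketch does not cover.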

imran.rafique · December 21, 2017, 11:22pm

@snewby,

Kindly send us the source PDF and expected output PDF documents. We will investigate and share our findings with you.

snewby · December 26, 2017, 4:53pm

Sure, here are the example documents (before and after HOCR):

source.pdf (24.8 KB)
expected.pdf (77.4 KB)

imran.rafique · December 27, 2017, 12:24am

@snewby,

We have logged a feature request under the ticket ID PDFJAVA-37343 in our issue tracking system to support adding HOCR-formatted text to a PDF document. Please also share the sample HOCR file that you need to add to the PDF document. We have linked your post to this ticket and will keep you informed of any updates.

snewby · January 2, 2018, 5:08pm

Thanks! I’m attaching the HOCR html file we used for this document.

hocr.zip (2.5 KB)

imran.rafique · January 3, 2018, 3:25am

@snewby,

Thank you. We have logged the sample HOCR document under the same ticket ID PDFJAVA-37343 in our issue tracking system.

imran.rafique · February 14, 2018, 5:45am

@snewby,

In reference to the linked ticket ID PDFJAVA-37343, please use the following code:
Java

import com.aspose.pdf.Document;
import java.awt.image.BufferedImage;
import java.io.FileReader;
import java.io.IOException;

final String myDir = "C:/path/";
Document doc = new Document(myDir + "source.pdf");
doc.convert(new Document.CallBackGetHocr()
{
    @Override
    public String invoke(BufferedImage bi)
    {
        // Ignore the argument "bi" and return the existing hocr.html file instead.
        StringBuilder buffer = new StringBuilder();
        // try-with-resources closes the reader even if reading fails.
        try (FileReader reader = new FileReader(myDir + "hocr.html"))
        {
            int len;
            char[] chr = new char[4096];
            while ((len = reader.read(chr)) > 0)
            {
                buffer.append(chr, 0, len);
            }
            return buffer.toString();
        }
        catch (IOException e)
        {
            // Also covers FileNotFoundException; no HOCR is returned in that case.
            e.printStackTrace();
            return null;
        }
    }
});
doc.save(myDir + "converted.hocr_out.pdf");
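
A note on reading the file: FileReader(String) decodes with the platform default charset on Java versions before 18, while HOCR files are commonly UTF-8, and the snippet above re-reads the file on every invoke call. The following is a minimal sketch, using only the standard library, of reading the file once as UTF-8. The class name HocrFile is a hypothetical helper, and whether UTF-8 is correct depends on how your HOCR file was produced.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper: loads an HOCR file into a String with an explicit charset. */
final class HocrFile {
    /** Reads the whole file and decodes it as UTF-8, regardless of platform defaults. */
    static String readUtf8(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java HocrFile <path-to-hocr.html>");
            return;
        }
        String hocr = readUtf8(Paths.get(args[0]));
        System.out.println("Read " + hocr.length() + " characters of HOCR");
    }
}
```

The resulting string could be loaded before calling doc.convert, stored in a final local variable, and returned from invoke, so the file is read only once.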

aspose.notifier · February 7, 2019, 6:00pm

The issues you have found earlier (filed as ) have been fixed in this update. This message was posted using BugNotificationTool from Downloads module by MuzammilKhan