TIF to searchable PDF

super50505 · July 25, 2017, 11:18am

From the documentation, I see that it is possible to take TIF files and create a searchable PDF. It appears that to create a searchable PDF, we have to use the Aspose OCR engine. Is it possible to create a searchable PDF from text that has already been obtained from a different OCR engine? For example, we want to take some TIF files and already-recognized text that has position information, and supply both to Aspose to create a searchable PDF. If that is possible, can you point me to the documentation on how to do that?

Thank You.

asad.ali · July 25, 2017, 9:25pm

@super50505

Thanks for contacting support.

As I understand the scenario, you want to add an image (TIFF) and text to the same PDF, so that the text can be searched later in the resulting file. Please check the following code snippet, where I have added a TIFF image and a TextStamp to the same page.

using System.Drawing;
using System.Drawing.Imaging;
using System.IO;
using Aspose.Pdf;
using Aspose.Pdf.Text;

Document pdf1 = new Document();

// Load the whole TIFF into memory; the stream must stay open while the Bitmap is in use
MemoryStream ms = new MemoryStream();
using (FileStream fs = File.OpenRead(dataDir + @"SourceTIF.tif"))
{
    fs.CopyTo(ms);
}
Bitmap myimage = new Bitmap(ms);
FrameDimension dimension = new FrameDimension(myimage.FrameDimensionsList[0]);
int frameCount = myimage.GetFrameCount(dimension);

// Add one PDF page per TIFF frame
for (int frameIdx = 0; frameIdx < frameCount; frameIdx++)
{
    Page sec = pdf1.Pages.Add();

    // Save the current frame to its own stream
    myimage.SelectActiveFrame(dimension, frameIdx);
    MemoryStream currentImage = new MemoryStream();
    myimage.Save(currentImage, ImageFormat.Tiff);
    currentImage.Position = 0;

    // Use landscape orientation when the frame is wider than it is tall
    sec.PageInfo.IsLandscape = myimage.Width > myimage.Height;

    Aspose.Pdf.Image imageht = new Aspose.Pdf.Image();
    imageht.ImageStream = currentImage;
    imageht.IsBlackWhite = true;
    sec.Paragraphs.Add(imageht);

    TextStamp stamp = new TextStamp("Windows Fax and Scan");
    stamp.Background = false;
    // Specify font name for the stamp
    stamp.TextState.Font = FontRepository.FindFont("Arial");
    // Specify font size for the stamp
    stamp.TextState.FontSize = 12;
    // Specify character spacing as 1f
    stamp.TextState.CharacterSpacing = 1f;
    // Set the XIndent and YIndent for the stamp
    stamp.XIndent = 100;
    stamp.YIndent = 500;
    // Add the textual stamp to the page
    sec.AddStamp(stamp);
}

pdf1.Save(dataDir + "SearchablePDF.pdf");

Afterwards, I extracted the text from the resulting PDF with the following code snippet, and the API returned the text as output.

Document doc = new Document(dataDir + "SearchablePDF.pdf");
TextFragmentAbsorber tfa = new TextFragmentAbsorber("Windows Fax and Scan");
doc.Pages.Accept(tfa);
foreach (TextFragment tf in tfa.TextFragments)
{
    Console.WriteLine(tf.Text);
}

For your reference, I have attached the input and output files as well. If your requirements differ from my assumptions, please share some more details along with a sample image and text, so that we can test the scenario in our environment and address it accordingly.

SearchablePDF.pdf (35.5 KB)
SourceTIF.zip (65.5 KB)

super50505 · July 27, 2017, 11:21am

Thank you for the response. This isn’t what we are looking to do. We do not want to add a visible stamp to the PDF. We have two files: the TIF, and a file that contains the text recognition results from OCR of the TIF. The text file is in our own format, which contains the text and the position of each character. When converting the TIF to a PDF, can we also provide the text information with coordinates to make the PDF searchable? This additional text would not be visible on the page, but it would be searchable and selectable within the PDF. I have attached a PDF that shows what we are looking to do. There are mistakes in the text because the image quality was so poor, but that doesn’t really matter for this example.

Thank you.

TifWithSearchableText.pdf (13.2 KB)

codewarior · July 27, 2017, 8:54pm

@super50505,

Thanks for sharing the details and sorry for the confusion caused.

If you already have the text extracted, you may consider adding the text content to the PDF file. If you need to place each object at a specific location, you need to read each individual word from the text file, get its location information, and place the word inside the PDF document using TextSegment. For more information, please visit Adding text using position information.
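One step that usually comes up when placing words by position is converting OCR coordinates to PDF coordinates. OCR engines typically report pixel boxes measured from the top-left corner of the image, while PDF positions are in points (1/72 inch) measured from the bottom-left corner of the page. The following is only a minimal sketch of that conversion in plain C#. The `OcrWord` type and the `OcrCoordinates` helper are hypothetical names, not part of Aspose.Pdf, and the sketch assumes the image fills the page exactly, with no margins or scaling.

```csharp
using System;

// Hypothetical record for one OCR result: a pixel box measured
// from the top-left corner of the source image.
public struct OcrWord
{
    public string Text;
    public int Left, Top, Width, Height; // pixels
}

public static class OcrCoordinates
{
    // Converts a pixel box (origin top-left, y grows downward) into the
    // position of its lower-left corner in PDF points (origin bottom-left,
    // y grows upward). Assumes the image covers the whole page.
    public static void ToPdfPoints(OcrWord word, double dpiX, double dpiY,
                                   double pageHeightPoints,
                                   out double x, out double y)
    {
        // 72 points per inch, dpi pixels per inch
        x = word.Left * 72.0 / dpiX;
        double bottomPx = word.Top + word.Height;
        y = pageHeightPoints - bottomPx * 72.0 / dpiY;
    }
}
```

For example, on a 300 DPI scan of a letter-size page (792 points high), a word with Left = 300, Top = 600 and Height = 60 maps to roughly x = 72 and y = 633.6. If the image is placed with margins or scaled on the page, those offsets and the scale factor would need to be applied as well.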

Besides this, you may consider first converting the TIFF images to PDF format using Aspose.Pdf for .NET, and then using Aspose.Pdf in collaboration with another OCR application that supports the hOCR standard. The free Google Tesseract OCR can be used. In this way, a non-searchable PDF can be converted to a searchable PDF document, as described below. You can install Google Tesseract OCR on your computer from tesseract-ocr · GitHub, after which you will have the tesseract.exe console application.

With the approach above, you do not need to remember or save the position information for each character. Below is a usage example:

[C#]

using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

public void Main()
{
    Document doc = new Document("Input.pdf");
    // Aspose.Pdf passes page images to the callback and uses the hOCR it returns
    doc.Convert(CallBackGetHocr);
    doc.Save("output.pdf");
}

private string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"c:\PdfTest\";

    // Write the image to disk so that tesseract.exe can read it
    img.Save(dir + "test.jpg");

    // Run: tesseract <input image> <output base name> hocr
    ProcessStartInfo info = new ProcessStartInfo(@"tesseract");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = dir + "test.jpg " + dir + "out hocr";

    using (Process p = new Process())
    {
        p.StartInfo = info;
        p.Start();
        p.WaitForExit();
    }

    // Read the hOCR result; newer Tesseract releases write out.hocr instead of out.html
    using (StreamReader streamReader = new StreamReader(dir + "out.html"))
    {
        return streamReader.ReadToEnd();
    }
}
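
Since the callback only has to return an hOCR string, one option when the recognized text and positions already exist is to build that string from the existing results instead of running Tesseract. The following is a rough sketch in plain C#, not a confirmed Aspose workflow. `RecognizedWord` and `HocrWriter` are hypothetical names standing in for your own format and helper. Whether Aspose.Pdf's hOCR parser accepts this reduced structure is an assumption that needs to be verified; Tesseract's own output nests the words inside ocr_carea, ocr_par and ocr_line elements.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text;

// Hypothetical shape of one entry read from an existing OCR result file.
public class RecognizedWord
{
    public string Text;
    public int Left, Top, Right, Bottom; // pixel box, origin at the image's top-left
}

public static class HocrWriter
{
    // Builds a minimal single-page hOCR document with one ocrx_word span per word.
    // hOCR bounding boxes are "bbox x0 y0 x1 y1" in image pixels.
    public static string Build(IEnumerable<RecognizedWord> words,
                               int imageWidth, int imageHeight)
    {
        var sb = new StringBuilder();
        sb.AppendLine("<html><body>");
        sb.AppendLine($"<div class='ocr_page' title='bbox 0 0 {imageWidth} {imageHeight}'>");
        foreach (RecognizedWord w in words)
        {
            // HTML-encode the text so characters such as & and < stay valid markup
            sb.AppendLine(
                $"<span class='ocrx_word' title='bbox {w.Left} {w.Top} {w.Right} {w.Bottom}'>"
                + WebUtility.HtmlEncode(w.Text) + "</span>");
        }
        sb.AppendLine("</div>");
        sb.AppendLine("</body></html>");
        return sb.ToString();
    }
}
```

A callback written this way would look up the words for the current page and return HocrWriter.Build(words, img.Width, img.Height) in place of the Tesseract call.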