Highlighting TIFF converted OCR PDF does not work

skvasant · November 15, 2013, 4:51pm

Hi There,

I've been trying to highlight keywords in a TIFF that was converted to an OCR PDF, and it does not work as expected. It adds a patch over the text, which defeats the point of highlighting, since I can't see the text underneath. I've attached the PDF files so you can see what I mean.

Any help on this will be appreciated.

Kind Regards,

Vasanth

codewarior · November 16, 2013, 6:49am

Hi Vasanth,

Thanks for contacting support.

As I understand it, you highlighted some text in the TIFF image and then converted it to PDF format. When you use Adobe Acrobat, the PDF file is generated properly, whereas with Aspose.Pdf for .NET the yellow highlight is visible but the text beneath it cannot be seen. Could you please share the source TIFF image so that we can test the scenario at our end? We are sorry for the inconvenience.

skvasant · November 16, 2013, 7:16am

Not exactly. The TIFF image was converted to an OCR PDF, and I then used Aspose.Pdf for .NET to highlight the text in yellow programmatically. That's when it adds the highlight color and hides the text beneath.

Uploaded the tiff image as zip. Please see attachment.

codewarior · November 16, 2013, 7:45am

Hi Vasanth,

Thanks for sharing the resource files.

I have tested the scenario by first converting the TIFF image to PDF format using the Aspose.Pdf.Generator object and then using the Aspose.Pdf.Facades.PdfContentEditor class to add a highlight annotation to the PDF file. As per my observations, the contents of the TIFF image are properly visible. For your reference, I have also attached the resultant PDF file generated at my end using Aspose.Pdf for .NET 8.5.0.

[C#]

// requires: using System.IO; using Aspose.Pdf.Generator; using Aspose.Pdf.Facades;

// instantiate PDF object
Pdf pdf = new Pdf();
// create Section instance
Section CurrentPage = pdf.Sections.Add();
// create Image object and add it to the section
Aspose.Pdf.Generator.Image img = new Aspose.Pdf.Generator.Image();
CurrentPage.Paragraphs.Add(img);
// set source image file path
img.ImageInfo.File = @"C:\test.tiff";
// set the image file type
img.ImageInfo.ImageFileType = ImageFileType.Tiff;
// add all frames of tiff image to PDF file
img.ImageInfo.TiffFrame = -1;

// save the generated PDF to a memory stream
MemoryStream stream = new MemoryStream();
pdf.Save(stream);

// create content editor object
Aspose.Pdf.Facades.PdfContentEditor editor = new PdfContentEditor();
// bind the generated PDF
editor.BindPdf(stream);
// create highlight annotation on page 1
editor.CreateMarkup(new System.Drawing.Rectangle(303, 645, 30, 10), "", 0, 1, System.Drawing.Color.Yellow);
// save updated file
editor.Save(@"C:\test_Highlight.pdf");
// close stream object
stream.Close();

skvasant · November 16, 2013, 7:52am

I appreciate your help. However, at this point the PDF is no longer searchable for text, right? Can this be overcome? Also, you have hardcoded the highlight using coordinates; in my case I need to search for keywords across multiple pages and highlight them, and ideally the text should remain searchable in the PDF. I know it's a lot to expect, but that would be the true solution to the problem we are facing at work.

Any help on this will be appreciated.

codewarior · November 16, 2013, 9:12am

Hi Vasanth,

Thanks for sharing the details.

Because the source TIFF image is placed in the PDF as an image, the text is not searchable in the PDF file. To accomplish your requirement, you may first perform OCR on the source TIFF image and then add the recognized text to the PDF file. Once the text is in the PDF file, you can easily search it and highlight the text fragments.

In my earlier post, I hardcoded the coordinates of the text inside the PDF file because the text could not be searched inside the PDF. However, once the text has been added to the PDF file, you can search for it and perform the required operations. For further details, please visit

If you have any further queries, please feel free to contact us.

skvasant · November 16, 2013, 9:39am

I've looked at the links you mentioned before starting this support thread.

In my initial post, that's exactly what I did to highlight the text, and that is when the TextFragment highlight masked the original text. I attached the OCR PDF in the initial post and am attaching it again so that you can take a look at it.

If you are able to highlight the text in this PDF without masking it, please let me know how you did it. "test" is the word I highlighted, using a regex so that it would match both "Test" and "test".

Again, I appreciate your help.

skvasant · November 16, 2013, 12:56pm

Is there a way we could do WebEx or call? Since we have had quite a few back and forth on this discussion. I wanted to make sure you understood the issue we are having for which we are trying to implement a solution.

codewarior · November 17, 2013, 12:36am

Hi Vasanth,

Thanks for sharing the details.

I have gone through your requirements and have managed to reproduce the problem: when a background color is set on a text fragment, it covers the contents and the text fragment is not visible. Under the current circumstances, creating a highlight markup annotation is a viable solution. Please take a look at the following code snippet, where I first search for the text fragment, extract its rectangular coordinates, and then use those values to create the markup annotation.

I am afraid that with the approach below, the markup annotation comes out much larger than the coordinates specified. I have logged this issue as PDFNEWNET-36059 in our issue tracking system. We will look further into the details of this problem and keep you updated on the status of the correction. Please be patient and allow us a little time. We are sorry for this inconvenience.

[C#]

// requires: using System; using Aspose.Pdf; using Aspose.Pdf.Text; using Aspose.Pdf.Facades;

// open document
Document pdfDocument = new Document("c:/pdftest/test_from-graphic_ocr (1).pdf");
// create TextFragmentAbsorber object to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("Test");
// accept the absorber for the first page
pdfDocument.Pages[1].Accept(textFragmentAbsorber);
// get the extracted text fragments
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.TextFragments;
// create Rectangle object to hold TextFragment rectangular coordinates
System.Drawing.Rectangle text_rect = new System.Drawing.Rectangle();

// create content editor object and bind the source PDF document once
Aspose.Pdf.Facades.PdfContentEditor editor = new PdfContentEditor();
editor.BindPdf(pdfDocument);

// loop through the fragments
foreach (TextFragment textFragment in textFragmentCollection)
{
    foreach (TextSegment textSegment in textFragment.Segments)
    {
        // print text segment and its position
        Console.WriteLine("Text : {0} ", textSegment.Text);
        Console.WriteLine("XIndent : {0} ", textSegment.Position.XIndent);
        Console.WriteLine("YIndent : {0} ", textSegment.Position.YIndent);
    }

    // create Rectangle object based on TextFragment rectangle coordinates
    // NOTE: the last two arguments are treated as width and height, not URX/URY,
    // which is why the markup comes out too large (corrected later in this thread)
    text_rect = new System.Drawing.Rectangle((int)textFragment.Rectangle.LLX, (int)textFragment.Rectangle.LLY, (int)textFragment.Rectangle.URX, (int)textFragment.Rectangle.URY);
    // print rectangular coordinates of text fragment
    Console.WriteLine("LLX = " + (int)textFragment.Rectangle.LLX + " LLY = " + (int)textFragment.Rectangle.LLY + " URX = " + (int)textFragment.Rectangle.URX + " URY = " + (int)textFragment.Rectangle.URY);

    // create markup for this fragment on page 1
    editor.CreateMarkup(text_rect, "", 0, 1, System.Drawing.Color.Yellow);
}

// save updated file
editor.Save(@"C:\pdftest\test_Highlight.pdf");

As a workaround, please try using the following code line instead. For your reference, I have also attached the resultant PDF generated with this workaround.

[C#]

text_rect = new System.Drawing.Rectangle((int)textFragment.Rectangle.LLX,
(int)textFragment.Rectangle.LLY, (int)textFragment.Rectangle.Width, (int)textFragment.Rectangle.Height);

codewarior · November 17, 2013, 12:43am

skvasant:

Is there a way we could do WebEx or call? Since we have had quite a few back and forth on this discussion. I wanted to make sure you understood the issue we are having for which we are trying to implement a solution.

Hi Vasanth,

Please note that we do not provide technical support by phone or email. We do offer sales support by phone, but technical support is provided through the forums. Nonetheless, we do our best to respond to customer queries within 24 hours.

aspose.notifier · December 5, 2013, 2:09pm

The issues you have found earlier (filed as PDFNEWNET-36059) have been fixed in Aspose.Pdf for .NET 8.7.0.

This message was posted using Notification2Forum from Downloads module by Aspose Notifier.

codewarior · February 27, 2014, 12:42am

Hi Vasanth,

Thanks for your patience.

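I just wanted to share that during our investigation of PDFNEWNET-36059, we noticed a mistake in the code snippet shared earlier. Please note that the constructor accepts four parameters, System.Drawing.Rectangle(x, y, width, height), so the third and fourth arguments must be a width and a height rather than the URX and URY coordinates.

Please use the following code line: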
text_rect = new System.Drawing.Rectangle(
(int)textFragment.Rectangle.LLX,
(int)textFragment.Rectangle.LLY,
(int)textFragment.Rectangle.URX - (int)textFragment.Rectangle.LLX,
(int)textFragment.Rectangle.URY - (int)textFragment.Rectangle.LLY);
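
For anyone following along, the conversion above can be sketched as a small self-contained helper. This is only an illustration: `MarkupRectangle` and `FromCorners` are hypothetical names and not part of Aspose.Pdf, and the corner values mentioned below are sample numbers rather than values read from the attached PDF. The helper casts each corner to int before subtracting, the same way the code line above does.

```csharp
using System.Drawing;

public static class MarkupRectangle
{
    // Hypothetical helper (not an Aspose.Pdf API): converts corner
    // coordinates (lower-left and upper-right) into the
    // (x, y, width, height) form that System.Drawing.Rectangle expects.
    public static Rectangle FromCorners(double llx, double lly, double urx, double ury)
    {
        int x = (int)llx;
        int y = (int)lly;
        // width and height are differences between the corners,
        // not the upper-right coordinates themselves
        int width = (int)urx - x;
        int height = (int)ury - y;
        return new Rectangle(x, y, width, height);
    }
}
```

With sample corners LLX = 303, LLY = 645, URX = 333, URY = 655, `MarkupRectangle.FromCorners(303, 645, 333, 655)` yields a rectangle 30 wide and 10 high, whereas `new Rectangle(303, 645, 333, 655)` is 333 wide and 655 high, which is consistent with the oversized markup reported earlier in this thread.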