PDF to TIF Conversion Issue

Hi,
I have a requirement to convert an Image PDF to a Text PDF (searchable text).
I am thinking of using Aspose PDF Kit to convert the PDF to a TIF image and then using OCR software to convert it to a searchable PDF.

I am using the following code to save a PDF document as TIF image.
---------------------Code------------------
Aspose.Pdf.Kit.PdfConverter converter = new Aspose.Pdf.Kit.PdfConverter();
converter.BindPdf(@"1448.pdf");
converter.Resolution = 300;
converter.DoConvert();
// This is the typical way
converter.SavaAsTIFF(@"1448.tif");
---------------------------------------------------------
The SavaAsTIFF() call takes a long time to execute (around 6 minutes),
and the output is an empty TIF file (0 KB).

I am attaching the PDF that caused this problem.

Please let me know your thoughts.

Thanks!



Hi Sudheer,

I have reproduced this issue and logged it as PDFKITNET-9469 in our issue tracking system. Our team will look into the matter, and you'll be updated via this forum once the issue is resolved.

I would also like to add two more points for future reference:

1. The default value of the Resolution property is 150. The higher the resolution, the slower the conversion will be.

2. In your code you have used the SavaAsTiff method, which is now obsolete. From now on, please use the SaveAsTiff method. The problem occurs with both methods, but only SaveAsTiff will be supported in future.

We're sorry for the inconvenience.

Regards,

Thanks.
I found that converting the attached PDF to TIF with SaveAsTiff() works when the resolution is set to 150.
I am OK with 150.

Is there a way to check programmatically whether a PDF is a Text or Image PDF?
I found this example:
Aspose.Total Product Family

But extractor.GetText() returns a few characters even for an Image PDF.
Is there a more reliable method for identifying Text PDFs?





Hi Sudheer,

The article you mentioned was written to provide a workaround for finding out whether a PDF file contains text or only images. I'm afraid there is no direct method for this. However, with this approach, if the file doesn't contain any text, the extraction shouldn't return any text in the output. Can you please share the sample PDF with us, so we can test it at our end and help you out?

Regards,

I am attaching a PDF that just contains an image.
So in this case, I shouldn’t get any text at all after the extraction.

extractor.ExtractText();
// Write the extracted text to the MemoryStream (ms)
extractor.GetText(ms);
// Check if the MemoryStream length is greater than or equal to 1
if (ms.Length >= 1)

Here I am getting ms.Length = 1.
So I can’t reliably distinguish between image and text PDFs, which I need to do in order to convert an image PDF to a text PDF only when required.

Please let me know your thoughts.

Thanks!

Hi Sudheer,

I have checked the code and the file as well. I have found that although the file doesn't contain any text, it does contain two line feeds and a carriage return, i.e. \n\r\n. So, there are two empty lines in the PDF which are being treated as text when extracted. You can check it using the following code:

ms.Flush();
ms.Position = 0;
System.IO.StreamReader sr = new System.IO.StreamReader(ms);
string s = sr.ReadToEnd();

I would suggest checking the extracted output in your own code to see whether the PDF contains any meaningful text. I'm afraid Aspose.Pdf.Kit doesn't provide a method for this check.
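As a rough illustration of such a check, here is a minimal sketch that takes the string read from the MemoryStream (s in the snippet above) and treats the PDF as a text PDF only if it contains at least a given number of letters or digits. The class ExtractedTextCheck, the method LooksLikeTextPdf and the minChars threshold are hypothetical names used for illustration; they are not part of Aspose.Pdf.Kit, and the threshold is an assumption you would need to tune for your own documents.

```csharp
using System;
using System.Linq;

// Hypothetical helper, NOT part of Aspose.Pdf.Kit.
public static class ExtractedTextCheck
{
    // Returns true when the extracted text contains at least minChars
    // letters or digits. Whitespace-only output such as "\n\r\n" counts
    // as "no text". minChars is an assumed threshold, not a library setting.
    public static bool LooksLikeTextPdf(string extracted, int minChars = 1)
    {
        if (string.IsNullOrEmpty(extracted))
        {
            return false;
        }

        // Count only letters and digits so that line breaks, spaces and
        // other stray control characters are ignored.
        int meaningful = extracted.Count(c => char.IsLetterOrDigit(c));
        return meaningful >= minChars;
    }
}
```

With the string s from the snippet above, ExtractedTextCheck.LooksLikeTextPdf(s) would be false for a file that yields only line breaks. A higher minChars value may help if an image PDF still yields a few stray characters.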

We're sorry for the inconvenience. If you have any further questions, please do let us know.

Regards,

The issues you have found earlier (filed as 9469) have been fixed in this update.

