Converting to Word / Excel - some pre-sale questions

Hi

I am currently using Spire.PDF, which is coming up for renewal but doesn’t meet all my requirements, so I am thinking of changing.

I do have some questions, however, which sales said I could ask here:

  1. Can you convert PDF files with scanned content to Word and / or Excel? Sample attached (OCR_Test.pdf).

The next questions really depend on the answer to question 1, which I am assuming may be no (unlike some products), meaning that I have to OCR the scanned PDF first to get it to text.

I have attached same document as 1 but this time OCR’d (OCR_Test_converted_to_text.pdf) -

  2. Using Spire.PDF, if I convert it to XLSX then it seems to find an image layer and a text layer and converts both, and in the resulting XLSX file (again attached - OCR_Test.XLSX) I have to remove the image to reveal the text underneath.

  3. Again using Spire.PDF, if I convert it to DOCX then it only finds an image layer, so the converted file is useless as it just contains an image and no text (again attached the sample for you - OCR_Test.docx).

Would Aspose.PDF do any better for 3 and would it allow me to only convert the text for 2?

Obviously this is a huge investment for me financially (I am a sole trader) and there is no point in me changing if the product can’t do a better job than what I currently use.

files.7z (8.3 MB)

@wingers999

Please check the code snippets below, which we used to convert the OCR’d PDF into DOCX and Excel (text content only).

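// Requires: using Aspose.Pdf; using Aspose.Pdf.Text;
Document pdfDocument = new Document(dataDir + @"OCR_Test_converted_to_text.pdf");

foreach (Page page in pdfDocument.Pages)
{
 // Collect every text fragment on the page
 TextFragmentAbsorber absorber = new TextFragmentAbsorber();
 absorber.Visit(page);
 foreach (TextFragment fragment in absorber.TextFragments)
 {
  // OCR text layers are usually invisible; render the text as normal filled text
  fragment.TextState.RenderingMode = TextRenderingMode.FillText;
 }
 // Drop the scanned page image so that only the text is converted
 page.Resources.Images.Clear();
}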

ExcelSaveOptions excelSaveOptions = new ExcelSaveOptions();
excelSaveOptions.Format = ExcelSaveOptions.ExcelFormat.XLSX;
pdfDocument.Save(dataDir + "output.xlsx", excelSaveOptions);
// Uncomment below lines for DOCX conversion
//DocSaveOptions saveOptions = new DocSaveOptions();
//saveOptions.Format = DocSaveOptions.DocFormat.DocX;
//saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
//saveOptions.RelativeHorizontalProximity = 2.5f;
//saveOptions.RecognizeBullets = true;
//pdfDocument.Save(dataDir + @"output_flow.docx", saveOptions);

Aspose Working.zip (79.1 KB)

Please also check the attached output files and let us know in case they are not as per your expectations.

Thank you.

So I am assuming the answer to question 1 was no, then, and that I would have to OCR first.

Does Aspose.PDF give ability to OCR the PDF as well - or would I need another Aspose product to do that?

The quality of the output to Word is pretty poor (inaccurate) to be honest. I suspect this is due to the quality of the OCR, but it is certainly not good enough to provide as an output in a product. I know it is only a test PDF I chose, but it is still not what I expected.

@wingers999

There is no property in the API that can simply be set to make a PDF searchable. However, you can convert a non-searchable PDF into a searchable PDF document by using the following code snippet.

// Requires: using System.IO; using Aspose.Pdf;
private static void CreateSearchablePDF(string dataDir)
{
 Document doc = new Document(@"C:\Users\Home\Downloads\test.pdf");
 // Convert calls CallBackGetHocr for each image and adds the recognized text to the PDF
 doc.Convert(CallBackGetHocr);
 doc.Save("E:/Data/test_searchable.pdf");
}

static string CallBackGetHocr(System.Drawing.Image img)
{
 // Save the image to disk so that tesseract.exe can read it
 string dir = @"E:\Data\";
 img.Save(dir + "ocrtest.jpg");
 // Tesseract v3.02
 System.Diagnostics.ProcessStartInfo info = new System.Diagnostics.ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
 info.WindowStyle = System.Diagnostics.ProcessWindowStyle.Hidden;
 // Arguments: input image, output base name, "hocr" config (writes out.html)
 info.Arguments = @"E:\Data\ocrtest.jpg E:\Data\out hocr";
 System.Diagnostics.Process p = new System.Diagnostics.Process();
 p.StartInfo = info;
 p.Start();
 p.WaitForExit();
 // Read the hOCR markup back and return it to Aspose.PDF
 using (StreamReader streamReader = new StreamReader(@"E:\Data\out.html"))
 {
  return streamReader.ReadToEnd();
 }
}

The logic above recognizes text in the images of the PDF. For recognition you may use any external OCR engine that supports the hOCR standard (http://en.wikipedia.org/wiki/HOCR). We used the free Google Tesseract OCR in the code snippet above. Please install it on your computer from http://code.google.com/p/tesseract-ocr/downloads/list ; after that you will have the tesseract.exe console application.
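If the fixed E:\Data paths are a problem, the callback can be written a little more defensively. The following is only a sketch under some assumptions, not an official sample: the class name HocrCallbackSketch and the helper BuildArguments are made up for illustration, the tesseract.exe location must be adjusted to your installation, and the fallback between out.hocr and out.html assumes that Tesseract releases newer than 3.02 write a .hocr file instead of .html. It uses a temporary folder per call, quotes the paths, and checks the exit code.

```csharp
using System;
using System.Diagnostics;
using System.Drawing;
using System.Drawing.Imaging;
using System.IO;

static class HocrCallbackSketch
{
    // Assumption: change this to wherever tesseract.exe is installed.
    const string TesseractPath = @"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe";

    // Hypothetical helper: quotes both paths so folders with spaces work.
    public static string BuildArguments(string imagePath, string outputBase)
    {
        return "\"" + imagePath + "\" \"" + outputBase + "\" hocr";
    }

    // Same shape as the callback above: image in, hOCR markup out.
    public static string GetHocr(Image img)
    {
        // Separate scratch folder per call, so calls never share files.
        string workDir = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(workDir);
        try
        {
            string imagePath = Path.Combine(workDir, "page.png");
            string outputBase = Path.Combine(workDir, "out");
            img.Save(imagePath, ImageFormat.Png);

            var info = new ProcessStartInfo(TesseractPath)
            {
                Arguments = BuildArguments(imagePath, outputBase),
                UseShellExecute = false,
                CreateNoWindow = true
            };
            using (Process p = Process.Start(info))
            {
                p.WaitForExit();
                if (p.ExitCode != 0)
                    throw new InvalidOperationException("tesseract exited with code " + p.ExitCode);
            }

            // Assumed: newer Tesseract writes out.hocr, 3.02 writes out.html.
            string hocrPath = outputBase + ".hocr";
            if (!File.Exists(hocrPath))
                hocrPath = outputBase + ".html";
            return File.ReadAllText(hocrPath);
        }
        finally
        {
            // Remove the scratch folder even if OCR failed.
            Directory.Delete(workDir, true);
        }
    }
}
```

It would be passed in the same way as before, i.e. doc.Convert(HocrCallbackSketch.GetHocr);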