Converting to Word / Excel - some pre-sale questions

wingers999 · July 10, 2023, 12:48pm

Hi

I am currently using Spire.PDF, which is coming up for renewal, but it doesn’t meet all my requirements, so I am thinking of changing.

I do have some questions, however, which sales said I could ask here:

1. Can you convert PDF files with scanned content to Word and / or Excel? Sample attached (OCR_Test.pdf).

The next questions really depend on the answer to question 1, which I am assuming may be no (unlike some products), meaning I would have to OCR the scanned PDF first to get it to text.

I have attached the same document as in 1, but this time OCR’d (OCR_Test_converted_to_text.pdf):

2. Using Spire.PDF, if I convert it to XLSX, it seems to find an image layer and a text layer and converts both, so in the resulting XLSX file (attached - OCR_Test.XLSX) I have to remove the image to reveal the text underneath.
3. Again using Spire.PDF, if I convert it to DOCX, it only finds an image layer, so the converted file is useless as it just contains an image and no text (sample attached - OCR_Test.docx).

Would Aspose.PDF do any better for 3 and would it allow me to only convert the text for 2?

Obviously this is a huge investment for me financially (I am a sole trader), and there is no point in me changing if the product can’t do a better job than what I currently use.

files.7z (8.3 MB)

asad.ali · July 10, 2023, 7:57pm

@wingers999

Please check the code snippet below, which we used to convert the OCR’d PDF into DOCX and Excel (text content only).

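// Requires: using Aspose.Pdf; using Aspose.Pdf.Text;
Document pdfDocument = new Document(dataDir + @"OCR_Test_converted_to_text.pdf");

foreach (Page page in pdfDocument.Pages)
{
 // Collect every text fragment on the page
 TextFragmentAbsorber absorber = new TextFragmentAbsorber();
 absorber.Visit(page);
 foreach (TextFragment fragment in absorber.TextFragments)
 {
  // Make the text visible (OCR text layers are usually stored with an invisible rendering mode)
  fragment.TextState.RenderingMode = TextRenderingMode.FillText;
 }
 // Drop the scanned page images so only the text is converted
 page.Resources.Images.Clear();
}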

ExcelSaveOptions excelSaveOptions = new ExcelSaveOptions();
excelSaveOptions.Format = ExcelSaveOptions.ExcelFormat.XLSX;
pdfDocument.Save(dataDir + "output.xlsx", excelSaveOptions);
// Uncomment the lines below for DOCX conversion
//DocSaveOptions saveOptions = new DocSaveOptions();
//saveOptions.Format = DocSaveOptions.DocFormat.DocX;
//saveOptions.Mode = DocSaveOptions.RecognitionMode.Flow;
//saveOptions.RelativeHorizontalProximity = 2.5f;
//saveOptions.RecognizeBullets = true;
//pdfDocument.Save(dataDir + @"output_flow.docx", saveOptions);

Aspose Working.zip (79.1 KB)

Please also check the attached output files and let us know in case they are not as per your expectations.

wingers999 · July 11, 2023, 10:08am

Thank you.

So I assume the answer to question 1 was no, and I would have to OCR first.

Does Aspose.PDF give the ability to OCR the PDF as well - or would I need another Aspose product to do that?

The quality of the Word output is pretty poor (inaccurate) to be honest. I suspect that is down to the quality of the OCR, but it is certainly not good enough to provide as an output in a product. I know it is only a test PDF I chose, but it is still not what I expected.

asad.ali · July 11, 2023, 6:20pm

@wingers999

There is no property you can set through the API to make a PDF searchable directly. However, you can convert a non-searchable PDF into a searchable PDF document by using the following code snippet.

// Requires: using System.IO; using Aspose.Pdf;
private static void CreateSearchablePDF(string dataDir)
{
 Document doc = new Document(@"C:\Users\Home\Downloads\test.pdf");
 // Convert() passes images from the PDF to the callback and uses the returned hOCR text
 doc.Convert(CallBackGetHocr);
 doc.Save("E:/Data/test_searchable.pdf");
}

static string CallBackGetHocr(System.Drawing.Image img)
{
 string dir = @"E:\Data\";
 // Write the image to disk so that tesseract.exe can read it
 img.Save(dir + "ocrtest.jpg");
 // Tesseract V3.02
 System.Diagnostics.ProcessStartInfo info = new System.Diagnostics.ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
 info.WindowStyle = System.Diagnostics.ProcessWindowStyle.Hidden;
 // Arguments: <input image> <output base name> hocr
 info.Arguments = @"E:\data\ocrtest.jpg E:\data\out hocr";
 System.Diagnostics.Process p = new System.Diagnostics.Process();
 p.StartInfo = info;
 p.Start();
 p.WaitForExit();
 // V3.02 writes the hOCR result to out.html (later Tesseract releases may use out.hocr instead)
 using (StreamReader streamReader = new StreamReader(@"E:\data\out.html"))
 {
  return streamReader.ReadToEnd();
 }
}

The above logic recognizes text in the images of a PDF. For recognition, you may use any external OCR engine that supports the hOCR standard (http://en.wikipedia.org/wiki/HOCR ). We have used the free Google Tesseract OCR in the above code snippet. Please install it on your computer from http://code.google.com/p/tesseract-ocr/downloads/list ; after that, you will have the tesseract.exe console application.
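
The callback above writes to fixed paths and expects Tesseract to produce out.html. As a rough sketch only (not part of the Aspose API), the Tesseract step could be moved into a small helper that uses a unique temporary output name and accepts either output extension. The names HocrRunner, ResolveHocrPath and RunTesseract are hypothetical, and the assumption that newer Tesseract releases write a .hocr file instead of .html should be checked against the installed version.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class HocrRunner
{
    // Hypothetical helper: picks the hOCR file produced for an output base name,
    // trying the assumed newer ".hocr" extension before the older ".html" one.
    public static string ResolveHocrPath(string outputBase, Func<string, bool> exists)
    {
        foreach (string ext in new[] { ".hocr", ".html" })
        {
            string candidate = outputBase + ext;
            if (exists(candidate)) return candidate;
        }
        throw new FileNotFoundException("No hOCR output found for " + outputBase);
    }

    // Hypothetical helper: runs tesseract.exe on one image file and returns the hOCR markup.
    public static string RunTesseract(string tesseractExe, string imagePath)
    {
        // Unique output base so repeated calls do not overwrite each other's results.
        string outputBase = Path.Combine(Path.GetTempPath(), "hocr_" + Guid.NewGuid().ToString("N"));
        var info = new ProcessStartInfo(tesseractExe)
        {
            // Same argument order as the snippet above: <input image> <output base> hocr
            Arguments = "\"" + imagePath + "\" \"" + outputBase + "\" hocr",
            UseShellExecute = false,
            CreateNoWindow = true
        };
        using (Process p = Process.Start(info))
        {
            p.WaitForExit();
            if (p.ExitCode != 0)
                throw new InvalidOperationException("tesseract exited with code " + p.ExitCode);
        }
        string hocrPath = ResolveHocrPath(outputBase, File.Exists);
        try { return File.ReadAllText(hocrPath); }
        finally { File.Delete(hocrPath); }
    }
}
```

Inside CallBackGetHocr, the image would still be saved to a file first (for example with img.Save on a temporary path), and the method would then return HocrRunner.RunTesseract(pathToTesseractExe, thatImagePath).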