Converting PDF to text-searchable PDF generates error

robert.strahner · February 13, 2020, 8:27am

Hello there,
I’m interested in your Barcode, PDF, and OCR PlugIns because of the following situation:
I’ve got scanned .tif(f) documents and want to convert them to readable PDF documents.
In your forums I found relevant articles, but unfortunately following them didn’t produce the desired result.

My code:
public static string ConvertPDFToSearchable(string file)
{
    Aspose.Pdf.Document doc = new Aspose.Pdf.Document(file);
    doc.Convert(CallBackGetHocr);
    doc.Save(file + "-Version4.pdf");
    return file + "-Version4.pdf";
}
static string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"C:\Temp\";
    img.Save(dir + "ocrtest.jpg");
    ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files\Tesseract-OCR\tesseract.exe");
    info.WorkingDirectory = @"C:\Program Files\Tesseract-OCR";
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = @"C:\Temp\ocrtest.jpg C:\Temp\out hocr";
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();
    StreamReader streamReader = new StreamReader(@"C:\Temp\out.txt");
    string text = streamReader.ReadToEnd();
    streamReader.Close();
    return text;
}

After the callback function returns, I get the error message “System.Xml.XmlException: 'Ungültige Daten auf Stammebene. Zeile 1, Position 1.'”. The message is in German because I’m using an Austrian/German environment; in English it reads roughly “Data at the root level is invalid. Line 1, position 1.”, meaning the returned data is not valid XML.
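One quick way to confirm that reading of the message is to check whether the string returned from the callback parses as XML at all. The sketch below is only a rough pre-check under stated assumptions: the class name HocrCheck and the method LooksLikeXml are hypothetical and not part of Aspose.PDF or Tesseract, and any DOCTYPE in the input is simply ignored rather than validated.

```csharp
using System;
using System.IO;
using System.Xml;

public static class HocrCheck
{
    // Hypothetical helper: returns true if the text parses as well-formed XML.
    // Plain text (for example a .txt OCR result) fails on the first character,
    // which matches the "line 1, position 1" error described above.
    public static bool LooksLikeXml(string text)
    {
        var settings = new XmlReaderSettings
        {
            DtdProcessing = DtdProcessing.Ignore, // skip any DOCTYPE instead of resolving it
            XmlResolver = null                    // never fetch external resources
        };
        try
        {
            using (var reader = XmlReader.Create(new StringReader(text), settings))
            {
                while (reader.Read()) { } // walk the whole document
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }
}
```

Calling HocrCheck.LooksLikeXml(text) just before "return text;" in the callback would show whether the file content is the problem, independent of licensing or locale.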

Does this error occur because of licensing problems (I’m currently using a trial version, because I’m still evaluating which product to choose), or could there be problems with the German environment?
I’ve tried different documents, but the error remained.

Thank you in advance for your investigation and answers.
Kind regards,
Robert

Adnan.Ahmad · February 13, 2020, 4:28pm

@robert.strahner,

Could you please share the source files along with the Aspose.PDF version you are using? We will investigate this further on our end to help you with this issue.

robert.strahner · February 17, 2020, 9:36am

I think I’ve found the problem.
When running the process (p.Start() in the example above), tesseract doesn’t produce the out.hocr file but a file called out.txt (which is plain text, not XML).
When I call the same command from the command line, everything works OK.
Do you have any recommendations for particular ProcessStartInfo or Process.Start settings, or calling conventions?
I’ll keep trying myself and will post a solution if I’m able to find one.
Robert
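For narrowing down a difference like this between Process.Start and the command line, a minimal sketch along the following lines may help. It starts tesseract without the shell, captures standard error, and fails loudly if the expected .hocr file was not written. The names TesseractRunner, BuildHocrArguments, and RunAndReadHocr are hypothetical, the .hocr extension is taken from the post above, and this is a diagnostic aid under those assumptions rather than a confirmed fix.

```csharp
using System;
using System.Diagnostics;
using System.IO;

public static class TesseractRunner
{
    // Hypothetical helper: builds "<image> <outputBase> hocr" with both paths quoted,
    // so that paths containing spaces are passed as single arguments.
    public static string BuildHocrArguments(string imagePath, string outputBase)
    {
        return "\"" + imagePath + "\" \"" + outputBase + "\" hocr";
    }

    // Runs tesseract and returns the content of <outputBase>.hocr.
    // Throws with the captured stderr text if the file was not produced.
    public static string RunAndReadHocr(string tesseractExe, string imagePath, string outputBase)
    {
        string hocrPath = outputBase + ".hocr"; // assumed extension, as expected in the post above
        if (File.Exists(hocrPath))
        {
            File.Delete(hocrPath); // avoid reading a stale result from an earlier run
        }

        var info = new ProcessStartInfo(tesseractExe)
        {
            Arguments = BuildHocrArguments(imagePath, outputBase),
            WorkingDirectory = Path.GetDirectoryName(tesseractExe) ?? string.Empty,
            UseShellExecute = false,      // required for stream redirection
            RedirectStandardError = true, // capture tesseract's diagnostics
            CreateNoWindow = true
        };

        using (Process p = Process.Start(info))
        {
            string stderr = p.StandardError.ReadToEnd(); // read before WaitForExit
            p.WaitForExit();
            if (p.ExitCode != 0 || !File.Exists(hocrPath))
            {
                throw new InvalidOperationException(
                    "tesseract did not produce " + hocrPath + ". stderr: " + stderr);
            }
        }
        return File.ReadAllText(hocrPath);
    }
}
```

In the callback above, this would be called as TesseractRunner.RunAndReadHocr(@"C:\Program Files\Tesseract-OCR\tesseract.exe", @"C:\Temp\ocrtest.jpg", @"C:\Temp\out"). The captured stderr text should indicate why the hocr output was not written, if tesseract reports a reason.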

Adnan.Ahmad · February 17, 2020, 10:08pm

@robert.strahner,

Could you please share the source file with us? We will investigate this further on our end and respond to you accordingly.