I’m new to Aspose and I’m looking for working code examples that can pull the Sender Fax number and Sender Name from a fax saved as a PDF.
I’m looking for 2 scenarios to use…
Hi Scott,
Thanks for sharing the resource file.
To accomplish this, you first need to convert the non-searchable (scanned) PDF file to a searchable document, and then extract the page contents (the fax number and the sender name) from the recognized text. Please try the following code snippet to convert the scanned PDF file to a searchable PDF document, and then follow the instructions in the documentation topic "Extract Text from Pages using Text Device".
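C# code
using System.Diagnostics;
using System.IO;
using Aspose.Pdf; // Document here is Aspose.Pdf.Document (Aspose.Pdf.dll)

public void Main()
{
    // Load the scanned (image-only) PDF.
    Document doc = new Document("Input.pdf");
    // Aspose.Pdf hands the scanned image to the callback and uses the returned hOCR
    // to add a searchable text layer.
    doc.Convert(CallBackGetHocr);
    doc.Save("output.pdf");
}

private string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"c:\PdfTest";
    // Path.Combine adds the separator that dir + "test.jpg" would have dropped.
    string imagePath = Path.Combine(dir, "test.jpg");
    string outputBase = Path.Combine(dir, "out");
    img.Save(imagePath);

    // tesseract must be on PATH, otherwise give the full path to tesseract.exe.
    ProcessStartInfo info = new ProcessStartInfo(@"tesseract");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    // Arguments: input image, output base name (no extension), output format.
    info.Arguments = "\"" + imagePath + "\" \"" + outputBase + "\" hocr";
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();

    // Read the hOCR file that Tesseract wrote and return it to Aspose.Pdf.
    using (StreamReader streamReader = new StreamReader(outputBase + ".html"))
    {
        return streamReader.ReadToEnd();
    }
}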
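Once the PDF is searchable and its text has been extracted, the sender name and fax number still have to be located in that text. The following is only a minimal sketch of that last step. It assumes the fax header contains labeled lines such as "From: ..." and "Fax: ..."; those labels, the class name FaxHeaderParser and both method names are hypothetical and not part of any Aspose API, so the patterns will need adjusting to the actual cover sheet layout.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FaxHeaderParser
{
    // Assumed label "From:" followed by the sender name on the same line.
    private static readonly Regex FromLine = new Regex(
        @"^[ \t]*From[ \t]*:[ \t]*(?<name>[^\r\n]+?)[ \t]*\r?$",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // Assumed label "Fax:", "Fax #:", "Fax No:" or "Fax Number:" followed by digits.
    private static readonly Regex FaxLine = new Regex(
        @"^[ \t]*Fax[ \t]*(?:#|No\.?|Number)?[ \t]*:[ \t]*(?<fax>[+(]?\d[\d \t().\-]*\d)",
        RegexOptions.IgnoreCase | RegexOptions.Multiline);

    // Returns the sender name, or null when no "From:" line is present.
    public static string GetSenderName(string text)
    {
        Match m = FromLine.Match(text);
        return m.Success ? m.Groups["name"].Value : null;
    }

    // Returns the fax number as written in the text, or null when none is found.
    public static string GetSenderFax(string text)
    {
        Match m = FaxLine.Match(text);
        return m.Success ? m.Groups["fax"].Value : null;
    }
}
```

OCR output is rarely clean, so it is worth treating a null result as "not recognized" rather than as an error, and checking a few real faxes before relying on the patterns.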
In the C# code snippet you are using:
Document doc = new Document("Input.pdf");
doc.Convert(CallBackGetHocr);
What class are you using to get the doc.Convert() function?
I’m not finding it in any Windows reference assemblies …
I also didn’t see it in the Aspose.Words.dll either …
Thanks!
Hi Scott,
ScottK:
1) Where are you getting or calling tesseract from? An Aspose library?
Please try using the code snippet; during our testing, we were able to perform OCR on the PDF document shared earlier.
ScottK:
2) I’ll try to attach a fresh PDF file for you to look at. I want to capture the From name and the From fax number.
To accomplish this requirement, you first need to convert the image to PDF, then convert the non-searchable (image) PDF file to a searchable PDF, and then extract the page contents.
ScottK:
3) Is there another way to convert the JPEG to searchable text?
The code snippet shared earlier is working fine in our environment.
ScottK:
4) If you can give me some C# code that works right away that would be awesome…
Howdy,
ScottK:
- Does Aspose have the Google OCR built into one of their DLLs by chance? The Google tesseract OCR is not on our company’s Approved Software List.
Hi Scott,
I am afraid we do not have any API which internally uses Google OCR. So if you require it, you need to use it separately.
ScottK:
- Does Aspose have an OCR function like Google’s tesseract? I would assume Aspose would have their own OCR functionality since it’s a PDF processing software package. I did see we have Aspose.OCR.dll in our installation of Aspose. Will that not work?
Aspose.OCR for .NET can perform OCR on image files, but the contents are extracted in raw format (without preserving the formatting information). So for your requirements, you can convert the PDF pages to JPEG images and then perform OCR on each image. In case you encounter any issue, please do let us know.
Hi Nayyer,
Hi Dave,
Here, out is the base name of the HTML file (out.html) that Tesseract generates in hOCR mode. During my testing with the latest release, Aspose.Pdf for .NET 11.8.0, I was unable to notice any issue, and the searchable PDF file with text is generated properly.
For your reference, I have also attached the output generated on my end. Please take a look.
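[C#]
using System.Diagnostics;
using System.IO;
using Aspose.Pdf; // Document here is Aspose.Pdf.Document (Aspose.Pdf.dll)

public void Main()
{
    Document doc = new Document("c:/pdftest/VWDDBZFAXA002_1607151628193096.PDF");
    // The callback supplies the hOCR that Aspose.Pdf uses to add a searchable text layer.
    doc.Convert(CallBackGetHocr);
    doc.Save("c:/pdftest/VWDDBZFAXA002_1607151628193096_output.pdf");
}

private string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"c:\Pdftest\";
    img.Save(dir + "VWDDBZFAXA002_1607151628193096_test.jpg");
    // Full path to tesseract.exe, so the launch does not depend on PATH.
    ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    // Arguments: input image, output base name (no extension), output format.
    info.Arguments = @"c:\pdftest\VWDDBZFAXA002_1607151628193096_test.jpg c:\pdftest\out hocr";
    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();
    // Read the hOCR file that Tesseract wrote and return it to Aspose.Pdf.
    StreamReader streamReader = new StreamReader(@"c:\pdftest\out.html");
    string text = streamReader.ReadToEnd();
    streamReader.Close();
    return text;
}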
I am trying to do the exact same thing, but your responses are not clear. I was getting the same exception as @Stibbedevelopers. The error was "System.ComponentModel.Win32Exception: The system cannot find the file specified."
In response to @Stibbedevelopers you are using
also the question still stands
Hi Laksh,
Laksh:
@Nayyer Shahbaz
I am trying to do the exact same thing, but your responses are not clear. I was getting the same exception as @Stibbedevelopers. The error was “System.ComponentModel.Win32Exception: The system cannot find the file specified.”
In response to @Stibbedevelopers you are using
ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
So I installed it using the link provided, and when I used your second code sample
I now get the error Could not find file ‘c:\pdftest\out.html’. at the line
StreamReader streamReader = new StreamReader(@"c:\pdftest\out.html");
Where is the out.html file?
Hi Laxmikant,
To resolve this problem, you may consider creating a sample (dummy) out.html file at the path that is passed as the argument to the StreamReader object.
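Another possible cause of the "Could not find file" error is worth checking: as far as I know, Tesseract releases newer than 3.02 write the hOCR result as out.hocr rather than out.html. The following is a minimal sketch, under that assumption, of a reader that accepts either extension. The class name HocrOutput and the method name Read are hypothetical helpers, not part of Aspose.Pdf or Tesseract.

```csharp
using System;
using System.IO;

public static class HocrOutput
{
    // Reads the hOCR text written for the given output base name,
    // for example c:\pdftest\out, trying both known extensions.
    public static string Read(string outputBase)
    {
        foreach (string extension in new[] { ".hocr", ".html" })
        {
            string candidate = outputBase + extension;
            if (File.Exists(candidate))
            {
                return File.ReadAllText(candidate);
            }
        }
        // Neither file exists: Tesseract most likely failed or wrote elsewhere.
        throw new FileNotFoundException(
            "No hOCR output found for base path: " + outputBase);
    }
}
```

Inside the callback, the StreamReader lines could then be replaced by return HocrOutput.Read(@"c:\pdftest\out");. If both files are present from earlier runs, this sketch returns the .hocr one, so deleting stale output before each run is a sensible precaution.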
also the question still stands
Laksh:
1> Do we have to use tesseract along with Aspose.Pdf in order to make an image searchable? Can Aspose alone convert an image into a searchable PDF?
Tesseract is required to perform OCR on the image, and its result is used by Aspose.Pdf for .NET to create the searchable PDF file.
Laksh:
2> Of the 2 code samples you provided, which one is correct?
3> The link you provided to download tesseract actually installs PDF Splitter Pro at location “C:\Program Files (x86)\CoolUtils\PDF Splitter Pro”
Please try getting the Tesseract executable from this link and try using the code snippet recently shared in 758532.
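Since the Win32Exception reported in this thread comes from starting a process whose executable cannot be found, it may help to confirm where tesseract.exe actually is before calling Process.Start. The following is a minimal sketch of such a check. The class name TesseractLocator and the method name Find are hypothetical, and the install folder in the usage note is only the one mentioned in this thread.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class TesseractLocator
{
    // Searches the given folders first, then every folder on PATH.
    // Returns the full path of the executable, or null when it is not found.
    public static string Find(string exeName, params string[] extraDirs)
    {
        var dirs = new List<string>(extraDirs);
        string pathVar = Environment.GetEnvironmentVariable("PATH") ?? "";
        dirs.AddRange(pathVar.Split(
            new[] { Path.PathSeparator }, StringSplitOptions.RemoveEmptyEntries));

        foreach (string dir in dirs)
        {
            try
            {
                string candidate = Path.Combine(dir.Trim(), exeName);
                if (File.Exists(candidate))
                {
                    return candidate;
                }
            }
            catch (ArgumentException)
            {
                // Skip PATH entries that contain invalid characters.
            }
        }
        return null;
    }
}
```

A call such as TesseractLocator.Find("tesseract.exe", @"C:\Program Files (x86)\Tesseract-OCR") returns a path that can be passed to ProcessStartInfo, and a null result points to a missing or wrongly installed Tesseract before any OCR is attempted.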
Hi,
Stibbedevelopers:
Hi Laksh,
My problem is solved by installing tesseract-ocr-setup-3.02.02.exe from link below.
Can you please share the resource PDF documents which are causing this problem, so that we can test the scenario in our environment? We are sorry for this inconvenience.
Stibbedevelopers:
P.S. @Nayyer: Unfortunately the Convert method is unable to make all kinds of PDF files searchable using tesseract. It seems the problem has to do with how Aspose.Pdf extracts images from the PDF file. I can send you a sample, but need some time. Aspose OCR does not seem to be doing the job either. Do you have any suggestions?
ok I also resolved my issue by installing tesseract from
Hi Laxmikant,
Laksh:
ok I also resolved my issue by installing tesseract from
https://sourceforge.net/projects/tesseract-ocr-alt/files/
Thanks @Stibbedevelopers