Trying to extract Sender Fax # and Sender Fax name from Fax header sheet or title

I’m new to Aspose and I’m looking for working code examples that can pull the Sender Fax number and Sender Name from a fax saved as a PDF.


I’ve been trying the examples on the Aspose website, but have had no luck extracting data from the fax PDF document.

Thoughts?

Thanks,

Hi Scott,


Thank you for contacting support. Please share a sample fax PDF document so that we can be more specific. We’ll check it and provide source code to extract the required fields.

I’m looking for 2 scenarios to use…


1) Pull the first line of data on page 1. Looking for the Sender Fax # and Sender fax name.

2) Pull all data and search through it for the Sender Fax # and Sender fax name.

See attached Fax example…

Thanks!

Hi Scott,

Thanks for sharing the resource file.

To meet your requirements, first convert the non-searchable (scanned) PDF file into a searchable document, then extract the page contents (including the Fax # and Sender Fax name) from the resulting text. Please try the following code snippet to convert the scanned PDF file into a searchable PDF document, and then follow the instructions in the Extract Text from Pages using Text Device topic to extract the text.

C# code

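using System.Diagnostics;
using System.IO;
using Aspose.Pdf;

public class Program
{
    public static void Main()
    {
        // Load the scanned PDF, run OCR through the callback, and save the searchable result.
        Document doc = new Document("Input.pdf");
        doc.Convert(CallBackGetHocr);
        doc.Save("output.pdf");
    }

    // Callback passed to doc.Convert: receives an image and returns its hOCR markup.
    private static string CallBackGetHocr(System.Drawing.Image img)
    {
        string dir = @"c:\pdftest\";   // trailing backslash so the image is saved inside the folder
        img.Save(dir + "test.jpg");

        // Run the external Tesseract executable (it must be on PATH, or give its full path here).
        ProcessStartInfo info = new ProcessStartInfo(@"tesseract");
        info.WindowStyle = ProcessWindowStyle.Hidden;
        info.Arguments = @"c:\pdftest\test.jpg c:\pdftest\out hocr";
        Process p = new Process();
        p.StartInfo = info;
        p.Start();
        p.WaitForExit();

        // Read back the hOCR file that Tesseract wrote for the output base name "out".
        using (StreamReader streamReader = new StreamReader(@"c:\pdftest\out.html"))
        {
            return streamReader.ReadToEnd();
        }
    }
}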
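
Once the PDF is searchable and its page text has been extracted, both scenarios (the first line of page 1, or a search through all of the text) come down to string matching. The sketch below is only an illustration. The header layout "From: <name>  Fax: <number>", the FaxHeaderParser class, and its method names are all assumptions, not part of Aspose.Pdf or taken from the attached fax, so the pattern would need adjusting to the real header.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: not an Aspose API.
public static class FaxHeaderParser
{
    // ASSUMPTION: the header looks like "From: <name>  Fax: <number>".
    // Real fax headers vary by device, so adapt this pattern to your documents.
    private static readonly Regex HeaderPattern = new Regex(
        @"From:\s*(?<name>.+?)\s+Fax:\s*(?<number>\+?[\d\-\(\)\. ]{7,}\d)",
        RegexOptions.IgnoreCase);

    // Scenario 1: return the first non-empty line of the extracted page text.
    public static string FirstLine(string pageText)
    {
        if (pageText == null) return "";
        foreach (string line in pageText.Split('\n'))
        {
            string trimmed = line.Trim();
            if (trimmed.Length > 0) return trimmed;
        }
        return "";
    }

    // Scenario 2: search any block of text for the sender name and fax number.
    public static bool TryParse(string text, out string name, out string number)
    {
        name = null;
        number = null;
        if (string.IsNullOrEmpty(text)) return false;

        Match m = HeaderPattern.Match(text);
        if (!m.Success) return false;

        name = m.Groups["name"].Value.Trim();
        number = m.Groups["number"].Value.Trim();
        return true;
    }
}
```

For scenario 1, pass FaxHeaderParser.FirstLine(pageText) to TryParse; for scenario 2, pass the whole extracted text. OCR output is often noisy, so a stricter or looser pattern may be needed in practice.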

In the C# code snippet you are using:

Document doc = new Document("Input.pdf");
doc.Convert(CallBackGetHocr);

What class are you using to get the doc.Convert() function?

I’m not finding it in any Windows reference assemblies …

I also didn’t see it in the Aspose.Words.dll either …

Thanks!

Hi Scott,


Thanks for contacting support.

The Document class used above belongs to the Aspose.Pdf namespace; its fully qualified name is Aspose.Pdf.Document.

Howdy,


Got a few questions for you… I’m not able to get this example to work…

1) Where are you getting or calling tesseract from ? An Aspose library?
2) I’ll try and attach a fresh PDF file for you to look at. I want to capture the From name and the From Fax Number.
3) Is there another way to convert the Jpeg to searchable text?
4) If you can give me some C# code that works right away that would be awesome…

Thanks!

ScottK:
1) Where are you getting or calling tesseract from ? An Aspose library?
Hi Scott,

Thanks for sharing the details.

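Tesseract is an open-source OCR engine whose development has been sponsored by Google; it is a separate tool and not part of Aspose. You can download its executable from this link.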
ScottK:
2) I’ll try and attach a fresh PDF file for you to look at. I want to capture the From name and the From Fax Number.
Please try the code snippet shared earlier; during our testing, we were able to perform OCR on the PDF document you shared before.
ScottK:
3) Is there another way to convert the Jpeg to searchable text?
In order to accomplish this requirement, first you need to Convert an Image to PDF, convert Non-Searchable (image PDF) file to Searchable PDF and then extract page contents.
ScottK:
4) If you can give me some C# code that works right away that would be awesome…
The earlier shared code snippet is working fine in our environment.

Howdy,


Got a few more questions for you…

1) Does Aspose have the Google OCR built into one of its DLLs by chance? Google’s Tesseract OCR is not on our company’s Approved Software List.

2) Does Aspose have an OCR function like Google’s Tesseract? I would assume Aspose has its own OCR functionality, since it’s a PDF processing software package. I did see that we have Aspose.OCR.dll in our installation of Aspose. Will that not work?

Thoughts?

Thanks!

ScottK:

1) Does Aspose have the Google OCR built into one of its DLLs by chance? Google’s Tesseract OCR is not on our company’s Approved Software List.

Hi Scott,

I am afraid we do not have any API that internally uses Google OCR. If you need it, you will have to use it separately.

ScottK:

2) Does Aspose have an OCR function like Google’s Tesseract? I would assume Aspose has its own OCR functionality, since it’s a PDF processing software package. I did see that we have Aspose.OCR.dll in our installation of Aspose. Will that not work?

Aspose.OCR for .NET can perform OCR on image files, but the contents are extracted in raw format (without preserving the formatting information). So for your requirements, you can Convert PDF Pages to JPEG Images and then try Performing OCR on an Image. If you encounter any issue, please let us know.

Hi Nayyer,


I need to achieve exactly the same thing. I am trying your code but getting this error: System.ComponentModel.Win32Exception: The system cannot find the file specified.

How did you reference Tesseract in this sample project? What is "out hocr" in this line of code?
info.Arguments = @"c:\pdftest\test.jpg c:\pdftest\out hocr";

Regards,
Dave

Hi Dave,

Here, out is the base name of the HTML file that Tesseract generates in hOCR mode (the hocr argument selects that output format). In my testing with the latest release, Aspose.Pdf for .NET 11.8.0, I could not reproduce any issue: the searchable PDF file with text is generated properly.

For your reference, I have also attached the output generated over my end. Please take a look.

[C#]

Document doc = new Document("c:/pdftest/VWDDBZFAXA002_1607151628193096.PDF");
doc.Convert(CallBackGetHocr);
doc.Save("c:/pdftest/VWDDBZFAXA002_1607151628193096_output.pdf");

private string CallBackGetHocr(System.Drawing.Image img)
{
    string dir = @"c:\Pdftest\";
    img.Save(dir + "VWDDBZFAXA002_1607151628193096_test.jpg");

    ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");
    info.WindowStyle = ProcessWindowStyle.Hidden;
    info.Arguments = @"c:\pdftest\VWDDBZFAXA002_1607151628193096_test.jpg c:\pdftest\out hocr";

    Process p = new Process();
    p.StartInfo = info;
    p.Start();
    p.WaitForExit();

    StreamReader streamReader = new StreamReader(@"c:\pdftest\out.html");
    string text = streamReader.ReadToEnd();
    streamReader.Close();
    return text;
}
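
For readers puzzled by the "out hocr" part of the arguments, here is a minimal sketch of how that command line is put together and where the result can be looked for. The TesseractOutput class and its method names are hypothetical helpers, not part of Aspose.Pdf or Tesseract. The sketch assumes that Tesseract appends the extension to the output base name itself, and that, to my understanding, some newer Tesseract builds write out.hocr rather than out.html, so both are checked.

```csharp
using System;
using System.IO;

// Hypothetical helper: not an Aspose or Tesseract API.
public static class TesseractOutput
{
    // Builds the command-line arguments: "<image>" "<outputBase>" hocr
    // Quoting both paths keeps the command valid when a path contains spaces.
    public static string BuildArguments(string imagePath, string outputBase)
    {
        return "\"" + imagePath + "\" \"" + outputBase + "\" hocr";
    }

    // Tesseract adds the file extension to the output base name on its own.
    // ASSUMPTION: depending on the Tesseract version, the hOCR file is
    // "<outputBase>.html" or "<outputBase>.hocr". Returns the first one
    // that exists, or null if neither was written.
    public static string FindHocrFile(string outputBase)
    {
        foreach (string extension in new[] { ".html", ".hocr" })
        {
            string candidate = outputBase + extension;
            if (File.Exists(candidate)) return candidate;
        }
        return null;
    }
}
```

In the callback, BuildArguments would replace the hard-coded info.Arguments string, and a null result from FindHocrFile after p.WaitForExit() would indicate that Tesseract did not produce hOCR output at all.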
@Nayyer Shahbaz
I am trying to do the exact same thing, but your responses are not clear. I was getting the same exception as @Stibbedevelopers. The error was "System.ComponentModel.Win32Exception: The system cannot find the file specified."

In response to @Stibbedevelopers you are using

ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");

So I installed it using the link provided and when I used your second code sample

now I get error Could not find file 'c:\pdftest\out.html'. at line
StreamReader streamReader = new StreamReader(@"c:\pdftest\out.html");

Where is out.html file?


also the question still stands

1) Do we have to use Tesseract along with Aspose.Pdf in order to make an image searchable? Can Aspose alone convert an image into a searchable PDF?

2) Out of the two code samples you provided, which one is correct?
3) The link you provided to download Tesseract actually installs PDF Splitter Pro at location "C:\Program Files (x86)\CoolUtils\PDF Splitter Pro"

Hi Laksh,


My problem is solved by installing tesseract-ocr-setup-3.02.02.exe from link below.

tesseract-ocr alternative download - Browse Files at SourceForge.net

Did you already do this?

Regards,
Stibbedevelopers


P.S.
@Nayyer: Unfortunately, the Convert method is unable to make all kinds of PDF files searchable using Tesseract. The problem seems to be related to how Aspose.Pdf extracts images from the PDF file.
I can send you a sample, but I need some time. Aspose OCR does not seem to be doing the job either. Do you have any suggestions?


I installed the exe from the provided link, which installs PDF Splitter Pro at location "C:\Program Files (x86)\CoolUtils\PDF Splitter Pro". PDF Splitter Pro includes "tesseract.exe".
So I pass that location to ProcessStartInfo constructor.

ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\CoolUtils\PDF Splitter Pro\tesseract.exe");

However, it does not create "out.html"; instead, I see that "out.txt" is created.

So I change the line

StreamReader streamReader = new StreamReader(@"c:\pdftest\out.html");

to

StreamReader streamReader = new StreamReader(@"c:\pdftest\out.txt");

But now I get error
An unhandled exception of type 'System.Xml.XmlException' occurred in System.Xml.dll
Additional information: Data at the root level is invalid. Line 1, position 1.
My installation directory looks like this: "C:\Program Files (x86)\Tesseract-OCR\tesseract.exe".
It seems you installed a third-party tool, CoolUtils, which uses Tesseract for OCR. Tesseract needs its tessdata folder, which may not exist in that third-party tool's directory.
Install Tesseract from the link below and your error will go away.

link to download:
https://sourceforge.net/projects/tesseract-ocr-alt/files/latest/download?source=files
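
A quick pre-flight check along these lines can tell the two failures in this thread apart (missing executable versus missing tessdata) before the process is started. This is a sketch under assumptions: TesseractInstallCheck is a hypothetical helper, it assumes the tessdata folder sits next to tesseract.exe (a TESSDATA_PREFIX environment variable can point elsewhere), and it assumes the hOCR config file is tessdata\configs\hocr, as in the 3.02 installer layout.

```csharp
using System;
using System.IO;

// Hypothetical helper: not an Aspose or Tesseract API.
public static class TesseractInstallCheck
{
    // Returns null when the install looks usable, otherwise a short
    // description of the first thing that appears to be missing.
    public static string Describe(string exePath)
    {
        if (!File.Exists(exePath))
            return "tesseract.exe not found at " + exePath;

        // ASSUMPTION: tessdata lives in the same folder as the executable.
        string installDir = Path.GetDirectoryName(exePath) ?? "";
        string tessdata = Path.Combine(installDir, "tessdata");
        if (!Directory.Exists(tessdata))
            return "tessdata folder missing: " + tessdata;

        // ASSUMPTION: the "hocr" argument refers to this config file.
        string hocrConfig = Path.Combine(tessdata, "configs", "hocr");
        if (!File.Exists(hocrConfig))
            return "hocr config missing: " + hocrConfig;

        return null;
    }
}
```

Calling Describe(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe") before p.Start() and logging any non-null result would give a clearer message than the Win32Exception or the missing out.html error.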

Dave

Laksh:

@Nayyer Shahbaz
I am trying to do the exact same thing, but your responses are not clear. I was getting the same exception as @Stibbedevelopers. The error was “System.ComponentModel.Win32Exception: The system cannot find the file specified.”

In response to @Stibbedevelopers you are using

ProcessStartInfo info = new ProcessStartInfo(@"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe");

So I installed it using the link provided and when I used your second code sample

now I get error Could not find file 'c:\pdftest\out.html'. at line

StreamReader streamReader = new StreamReader(@"c:\pdftest\out.html");

Where is out.html file?
Hi Laxmikant,

To resolve this problem, you may consider creating a sample (dummy) out.html file at the path passed as an argument to the StreamReader object.

also the question still stands

Laksh:

1) Do we have to use Tesseract along with Aspose.Pdf in order to make an image searchable? Can Aspose alone convert an image into a searchable PDF?
Tesseract is required to perform OCR on the image, and Aspose.Pdf for .NET uses the result of that OCR to create the searchable PDF file.

Laksh:

2) Out of the two code samples you provided, which one is correct?

3) The link you provided to download Tesseract actually installs PDF Splitter Pro at location “C:\Program Files (x86)\CoolUtils\PDF Splitter Pro”

Please try getting Tesseract executable from this link and try using the code snippet recently shared in 758532.

Stibbedevelopers:
Hi Laksh,

My problem is solved by installing tesseract-ocr-setup-3.02.02.exe from link below.


Hi,

Thanks for the acknowledgement. We are glad to hear that you have managed to run the solution.

Stibbedevelopers:
P.S.
@Nayyer: Unfortunately, the Convert method is unable to make all kinds of PDF files searchable using Tesseract. The problem seems to be related to how Aspose.Pdf extracts images from the PDF file.
I can send you a sample, but I need some time. Aspose OCR does not seem to be doing the job either. Do you have any suggestions?
Can you please share the PDF documents that cause this problem, so that we can test the scenario in our environment? We are sorry for the inconvenience.

ok I also resolved my issue by installing tesseract from

tesseract-ocr alternative download - Browse Files at SourceForge.net

Thanks @Stibbedevelopers

Laksh:
ok I also resolved my issue by installing tesseract from
https://sourceforge.net/projects/tesseract-ocr-alt/files/

Thanks @Stibbedevelopers
Hi Laxmikant,

Thanks for the acknowledgement.

We are glad to hear that your problem is resolved. Please continue using our APIs, and if you have any further queries, please feel free to contact us.

PS,
Hi Stibbedevelopers,

Thanks for the cooperation.