How to find text after "Invoice #:"

jon_elster_i3intel_com · June 1, 2022, 3:17pm

Hi…
I can find the text “Invoice #”, but how do I get the text after “Invoice #”? The next 10 characters, for example?
thx

asad.ali · June 2, 2022, 12:32am

You can use a regular expression like the one below with the TextFragmentAbsorber class:

[invoice #]+[0-9]{0,10}

Please feel free to let us know in case you face any issues.

jon_elster_i3intel_com · June 2, 2022, 1:49pm

Like this ??? I get 700+ fragments ?

TextFragmentAbsorber textAbsorber = new TextFragmentAbsorber(new System.Text.RegularExpressions.Regex(@"[Invoice:]+[0-9]{0,10}"));
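
A note on the pattern itself, independent of the PDF API: in .NET regular expressions, square brackets define a character class, so [Invoice:]+ matches any run of the characters I, n, v, o, i, c, e and ':' rather than the literal word, and [0-9]{0,10} also allows zero digits. That would explain a large number of fragments. The sketch below uses only System.Text.RegularExpressions on a plain string; the class name, method names and sample text are hypothetical, and it does not show how TextFragmentAbsorber itself handles the pattern.

```csharp
using System;
using System.Text.RegularExpressions;

public static class InvoiceRegexDemo
{
    // Character-class version: counts every run made only of the
    // characters I, n, v, o, i, c, e, ':' (optionally followed by digits).
    public static int CountCharClassMatches(string text)
    {
        return Regex.Matches(text, @"[Invoice:]+[0-9]{0,10}").Count;
    }

    // Literal version: the word "Invoice:" followed by optional whitespace
    // and 1 to 10 digits, which are captured in group 1.
    // Returns an empty string when nothing is found.
    public static string FindInvoiceNumber(string text)
    {
        Match m = Regex.Match(text, @"Invoice:\s*(\d{1,10})");
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

On a string such as "Invoice: 12345 received on time", the character-class pattern matches several unrelated pieces of ordinary words, while the literal pattern captures only 12345.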

jon_elster_i3intel_com · June 2, 2022, 2:38pm

PLS HELP…

asad.ali · June 2, 2022, 3:58pm

@jon_elster_i3intel_com

Please share the sample PDF for our reference so that we can further proceed to assist you.

jon_elster_i3intel_com · June 2, 2022, 4:23pm

how do I share privately?

jon_elster_i3intel_com · June 2, 2022, 3:33pm

Text absorber not picking up Text. I have absorber.text = "Invoice: "
But my PDF has “Invoice: 12345”

any ideas? thx

asad.ali · June 2, 2022, 4:51pm

@jon_elster_i3intel_com

A private message has been sent to you for you to share the file privately. You can reply to it while attaching your file.

asad.ali · June 2, 2022, 8:59pm

@jon_elster_i3intel_com

We tried the regex [CERTIFICATE NUMBER:]+\s[0-9]{0,10} to find the text in your PDF, but the API was unable to extract it. We also checked the regular expression at https://www.regextester.com/ and found that it worked there. It therefore seems that the API is not processing this kind of regular expression. An issue, PDFNET-51887, has been logged in our issue tracking system for further investigation. We will look into it and keep you posted on the status of its correction. Please be patient and spare us some time.

We are sorry for the inconvenience.

jon.elster · June 2, 2022, 10:26pm

If I get the text… from the TextAbsorber, I can’t even use a regex to find “CERTIFICATE”?

asad.ali · June 3, 2022, 10:48am

@jon_elster_i3intel_com

How are you trying to find the text after using TextAbsorber? Could you please share the code snippet?

jon_elster_i3intel_com · June 3, 2022, 12:25pm

Hi

I’m looking for an 8-digit number like this, but some 8-digit numbers are not the certificate number.

Also, the certificate number does not appear on the same line in the textAbsorber output.

So none of this works

Regex expression = new Regex(@"(?<!\d)\d{8}(?!\d)", RegexOptions.Multiline);

var results = expression.Matches(textAbsorber.Text.Replace("\n", " ").Replace("\r", " "));

thx
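
One possible way to tie the 8-digit match to its label, sketched on a plain string: \s in a .NET regex also matches line breaks, so a pattern can start at the label and allow any whitespace (including newlines) before the digits, without replacing \r and \n first. The label text "CERTIFICATE NUMBER:" is taken from this thread; the class name, method name and sample text are hypothetical, and the sketch assumes that only whitespace separates the label from the number in the extracted text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CertificateNumberFinder
{
    // Label, optional colon, any whitespace (spaces, tabs, \r, \n),
    // then exactly 8 digits that are not followed by another digit.
    private static readonly Regex LabelThenNumber = new Regex(
        @"CERTIFICATE\s+NUMBER:?\s*(\d{8})(?!\d)",
        RegexOptions.IgnoreCase);

    // Returns the 8 digits that follow the label,
    // or an empty string when the label/number pair is not found.
    public static string Find(string extractedText)
    {
        Match m = LabelThenNumber.Match(extractedText);
        return m.Success ? m.Groups[1].Value : string.Empty;
    }
}
```

Calling CertificateNumberFinder.Find(textAbsorber.Text) would then skip other 8-digit numbers that are not preceded by the label. If the extracted text places other content between the label and the number, this pattern will not match and a different approach is needed.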

asad.ali · June 3, 2022, 6:58pm

@jon_elster_i3intel_com

While testing the case with version 22.5 of the API and the code snippet below, we obtained 3 matches from this regular expression:

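// Requires: using Aspose.Pdf; using Aspose.Pdf.Text; using System.Text.RegularExpressions;
// dataDir is the folder that contains the PDF.
Document pdfDocument = new Document(dataDir + "110403(2 of 5) 419.pdf");

// Extract the text of all pages.
TextAbsorber tabsorber = new TextAbsorber();
pdfDocument.Pages.Accept(tabsorber);
string wholetext = tabsorber.Text;

// Match runs of exactly 8 digits that are not part of a longer digit sequence.
Regex expression = new Regex(@"(?<!\d)\d{8}(?!\d)", RegexOptions.Multiline);

var results = expression.Matches(wholetext);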

Could you please try the latest version of the API and let us know if you still notice any issues?

jon.elster · June 3, 2022, 10:01pm

Thx… but we were targeting 2 – Certificate Number only
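
When several 8-digit numbers match but only the one belonging to a label is wanted, one option is to locate the label first and then take the first 8-digit match that starts after it. The sketch below works on a plain string with System.Text.RegularExpressions only. The class name, method name and sample text are hypothetical, and it assumes that the wanted number is the first 8-digit run after the label in the extracted text, which may not hold for every PDF layout.

```csharp
using System;
using System.Text.RegularExpressions;

public static class NearestNumberPicker
{
    // Runs of exactly 8 digits that are not part of a longer digit sequence.
    private static readonly Regex EightDigits = new Regex(@"(?<!\d)\d{8}(?!\d)");

    // Returns the first 8-digit number found after the given label,
    // or an empty string when the label or a following number is missing.
    public static string FirstEightDigitsAfter(string text, string label)
    {
        int labelPos = text.IndexOf(label, StringComparison.OrdinalIgnoreCase);
        if (labelPos < 0)
        {
            return string.Empty;
        }

        // Start searching right after the label so earlier numbers are ignored.
        Match m = EightDigits.Match(text, labelPos + label.Length);
        return m.Success ? m.Value : string.Empty;
    }
}
```

For example, NearestNumberPicker.FirstEightDigitsAfter(wholetext, "CERTIFICATE NUMBER") ignores any 8-digit values that occur before the label and tolerates other words between the label and the number.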

asad.ali · June 4, 2022, 10:59pm

@jon_elster_i3intel_com

Please extract the text using TextAbsorber, copy all of it into a .txt file, and share that file with us. We will compare it with the results we are getting in our environment and proceed further to assist you.