API example of this - https://products.aspose.app/email/extractor

tim330i · October 31, 2023, 8:45pm

I want to take an arbitrary PDF, pass it to an API, and get the email address(es) it contains in return. Can you direct me to the right API, documentation, or example code, please?

Thanks,
Tim

alexey.noskov · November 1, 2023, 6:08am

@tim330i You can use Aspose.Words and its Find/Replace functionality to achieve this. For example, see the following code, which uses a regular expression to find e-mail addresses in the document:

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;
using Aspose.Words;
using Aspose.Words.Replacing;

Document doc = new Document(@"C:\Temp\in.pdf");

// The callback only records each match; no text is actually replaced.
EmailsCollector collector = new EmailsCollector();
FindReplaceOptions opt = new FindReplaceOptions();
opt.ReplacingCallback = collector;

// \w{2,} (rather than {2,3}) so that longer TLDs such as ".info" are not truncated.
doc.Range.Replace(new Regex(@"([\w\.\-]+)@([\w\-]+)((\.\w{2,})+)"), "", opt);

foreach (string email in collector.EMails)
    Console.WriteLine(email);

private class EmailsCollector : IReplacingCallback
{
    public ReplaceAction Replacing(ReplacingArgs args)
    {
        // Store each distinct address once and leave the document unchanged.
        string email = args.Match.Value;
        if (!mEMails.Contains(email))
            mEMails.Add(email);
        return ReplaceAction.Skip;
    }

    public List<string> EMails
    {
        get { return mEMails; }
    }

    private readonly List<string> mEMails = new List<string>();
}

Please note that Aspose.Words is designed primarily to work with MS Word documents, and loading PDF documents is supported only in the .NET and Python versions of Aspose.Words.

Aspose.PDF is the product designed to work with PDF documents. My colleagues from the Aspose.PDF team will shortly explain how to achieve the same using Aspose.PDF.

asad.ali · November 2, 2023, 12:44am

@alexey.noskov

With Aspose.PDF for .NET, you can use the sample code below to extract email addresses from a PDF:

using System;
using System.Text.RegularExpressions;
using Aspose.Pdf;
using Aspose.Pdf.Text;

// Load the PDF document
Document pdfDocument = new Document("input.pdf");

// Regular expression pattern to match email addresses
// ({2,} rather than {2,4} so that longer TLDs are not rejected)
string pattern = @"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b";
Regex regex = new Regex(pattern);

// Create a TextFragmentAbsorber with the pattern as its search phrase
TextFragmentAbsorber absorber = new TextFragmentAbsorber(pattern);

// Tell the absorber to treat the search phrase as a regular expression
absorber.TextSearchOptions = new TextSearchOptions(true);

// Search all pages of the PDF for email addresses
pdfDocument.Pages.Accept(absorber);

// Print the email addresses that were found
foreach (TextFragment textFragment in absorber.TextFragments)
{
    // Defensive re-check of the fragment text against the same pattern
    if (regex.IsMatch(textFragment.Text))
    {
        Console.WriteLine("Email Address: " + textFragment.Text);
    }
}
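
If it helps to try the pattern before wiring it to a PDF, here is a minimal sketch that uses only the .NET base class library on a plain string. The `EmailRegexDemo` class and its `ExtractEmails` helper are hypothetical names made up for this illustration and are not part of any Aspose API. The pattern has the same shape as the one above, with the TLD length assumed to be open-ended (`{2,}`), and duplicates are dropped case-insensitively, which is also an assumption about what you want.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class EmailRegexDemo
{
    // Assumption: same shape as the pattern in this thread, with an open-ended TLD length.
    private static readonly Regex EmailPattern =
        new Regex(@"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b");

    // Hypothetical helper: returns each distinct address once, in order of first
    // appearance, treating addresses that differ only by case as duplicates.
    public static List<string> ExtractEmails(string text)
    {
        var seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);
        var result = new List<string>();
        foreach (Match match in EmailPattern.Matches(text))
        {
            if (seen.Add(match.Value))
                result.Add(match.Value);
        }
        return result;
    }

    public static void Main()
    {
        string sample =
            "Contact sales@example.com or SALES@example.com; support: help.desk@example.info.";
        foreach (string email in ExtractEmails(sample))
            Console.WriteLine(email);
    }
}
```

The same helper could be applied to any text you have already extracted from a document, so the regular expression can be tuned separately from the PDF handling.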

tim330i · November 2, 2023, 9:13pm

Awesome! Thank you for two great suggestions.

I planned to do this through calls to an API. Is there an option to do that?

Tim

asad.ali · November 2, 2023, 9:27pm

This topic has been moved to the related forum: API example of this - https://products.aspose.app/email/extractor - Free Support Forum - aspose.cloud
