How to Find Date from Word document with Regex using Java

Hi, I am evaluating whether Aspose works for my use case. Thank you in advance for your help!
I want to:

  1. identify a date in a Word document using regex
  2. convert the date into a data format

This is a sample of the document I work with:


MyAddress
MyAddress
MyAddress
2EE

09 November 2020

Reference: 1234

RE: Lorem ipsum

TO WHOM IT MAY CONCERN

This is the regex to extract the date:
(?<=2EE\n\n)(\b\d.{0,2}\s\w*\s\d{4})
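As a quick sanity check, here is a minimal sketch of how this pattern behaves in plain Java, with no Aspose involved. The class and method names are made up for illustration, and the sample string assumes the paragraphs are separated by `\n`; the text returned by Aspose may use different paragraph-break characters, in which case the lookbehind would need adjusting.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DateRegexDemo {
    // Same pattern as above; every regex backslash is doubled inside a Java string literal.
    static final Pattern DATE_AFTER_POSTCODE =
            Pattern.compile("(?<=2EE\\n\\n)(\\b\\d.{0,2}\\s\\w*\\s\\d{4})");

    /** Returns the first date-like match, or null when nothing matches. */
    static String findDate(String text) {
        Matcher matcher = DATE_AFTER_POSTCODE.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for the document text shown above.
        String sample = "MyAddress\n2EE\n\n09 November 2020\n\nReference: 1234\n";
        System.out.println(findDate(sample)); // prints "09 November 2020"
    }
}
```

The lookbehind ties the match to the line that follows the postcode fragment, so a date elsewhere in the text is ignored.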

So far I have managed to extract the date and replace it with another string, following examples online.
What I want to do however is to get the date string and parse it into a date.

This is my code so far:

    import com.aspose.words.*;
    import java.util.regex.Pattern;

    FindReplaceOptions options = new FindReplaceOptions();
    options.setReplacingCallback(new ReplaceWithHtmlEvaluator(options));
    // Regex backslashes must be doubled inside a Java string literal.
    Pattern pattern = Pattern.compile("(?<=2EE\\n\\n)(\\b\\d.{0,2}\\s\\w*\\s\\d{4})");

    document.getRange().replace(pattern, "", options);

    private class ReplaceWithHtmlEvaluator implements IReplacingCallback {
        private final FindReplaceOptions mOptions;

        ReplaceWithHtmlEvaluator(final FindReplaceOptions options) {
            mOptions = options;
        }

        @Override
        public int replacing(final ReplacingArgs e) throws Exception {
            // Replace every match with a fixed string.
            e.setReplacement("hello!");
            return ReplaceAction.REPLACE;
        }
    }

@arthurg85

In your code, you are replacing the date with the text ‘hello!’.

Could you please share some more detail about your requirement?

The date may be plain text or a date field in the MS Word document. You can find the date (as text) in the document and replace it with the desired content using the Range.Replace method.

However, if your date is a field, we suggest using the Range.Fields property to get the field collection. Iterate over this collection and find the date field using the Field.Type property. You can then move the cursor to the date field, insert your desired content, and remove the date field.

Could you please ZIP and attach your input and expected output Word documents? We will then provide more information about your query.

Thank you for your reply.

Apologies, I made a typo in my question. I meant:
I want to grab a date string from my Word document using regex, and convert that string in Java into a DATE format.

I have attached an example input document as requested.

At the moment the only working solution I found works this way:

  1. get all text from document as a string

    String fullText = document.getRange().getText();

  2. create a reusable method to extract text with regex (not using Aspose, just plain Java)

     import java.util.regex.Matcher;
     import java.util.regex.Pattern;

     public class TextExtractor {
         /** Returns the first match of regex in text, or null if there is none. */
         public String extract(String text, String regex) {
             Matcher matcher = Pattern.compile(regex).matcher(text);
             // Only the first match is needed, so a single find() is enough.
             return matcher.find() ? matcher.group() : null;
         }
     }

  3. create a method that takes my date string and converts it to a Date

     import java.text.ParseException;
     import java.text.SimpleDateFormat;
     import java.util.Date;
     import java.util.Locale;

     public class ConvertDate {
         public Date convert(String dateString) throws ParseException {
             TextExtractor textExtractor = new TextExtractor();

             // Pull out day, month name, and year separately; this also drops
             // anything between the day digits and the month (e.g. "9th").
             String day = textExtractor.extract(dateString, "\\b\\d{1,2}(?=[^0-9]{1})");
             String month = textExtractor.extract(dateString, "\\b[A-Z]+[a-z]+");
             String year = textExtractor.extract(dateString, "\\b\\d{4}\\b");

             String normalized = day + " " + month + " " + year;
             // Locale.ENGLISH keeps month-name parsing independent of the JVM default locale.
             return new SimpleDateFormat("dd MMMM yyyy", Locale.ENGLISH).parse(normalized);
         }
     }

I am fairly sure there must be a better way with Aspose. Let me know if you have any suggestions,
thank you!
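Independent of Aspose, steps 2 and 3 could probably be collapsed into a single parse with `java.time`. The sketch below is only an illustration under a few assumptions: the class name `DateStringParser` is hypothetical, the date is written in English as day, month name, year, and a `LocalDate` is an acceptable result type instead of `java.util.Date`.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

class DateStringParser {
    // "d" accepts one- or two-digit days; Locale.ENGLISH fixes the month names.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("d MMMM uuuu", Locale.ENGLISH);

    /** Parses strings such as "09 November 2020" or "9th November 2020". */
    static LocalDate parse(String dateString) {
        // Drop an ordinal suffix (st, nd, rd, th) that directly follows a digit.
        String cleaned = dateString.trim()
                .replaceFirst("(?<=\\d)(?:st|nd|rd|th)\\b", "");
        return LocalDate.parse(cleaned, FORMAT);
    }

    public static void main(String[] args) {
        System.out.println(parse("09 November 2020")); // prints "2020-11-09"
    }
}
```

`LocalDate.parse` throws a `DateTimeParseException` when the string does not fit the pattern, so a bad regex match fails loudly rather than producing a wrong date.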

@arthurg85

In your case, we suggest checking the code example in the following article. It finds the text and highlights it.

How to Find and Highlight Text

In that code example, replace the following snippet with your own code, and use the Run.getText method to get the text of the matched Run nodes.

// Now highlight all runs in the sequence.
for (Run run : (Iterable<Run>) runs)
    run.getFont().setHighlightColor(Color.YELLOW);