I have a program that extracts headings from Word documents and checks whether every string from a database exists in the Word document. My problem is that one heading contains an apostrophe (quotation mark), and the extracted apostrophe is not the same character as the one in the database.
So the string in the database = schema’s
and the string extracted from the document = schema’s, but the apostrophe is just a little bit different.
Please ZIP and attach your input Word document and text from MySQL (in notepad file) here for testing. We will investigate the issue and provide you more information on it.
On page 9 of the Word document in the ZIP you will find the heading I am trying to check for (schema’s).
This works with all the other headings but not this one.
When I extracted the heading “schema’s” from the Word document into a RichTextBox and copied “schema’s” from the database below it into the same RichTextBox, I could see a small difference in the apostrophe character.
I hope you can work with the information I have given.
If you are comparing an apostrophe and a single quote, you can use the String.Replace method to replace one with the other.
Moreover, the .NET API returns true when these strings are compared:
Console.WriteLine("’ and '" == "’ and '");
Console.WriteLine("’ and '".Equals("’ and '"));
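The mismatch can happen because the apostrophe is represented by different Unicode characters. For example, Word's smart-quotes feature replaces the straight apostrophe (U+0027) with the typographic right single quotation mark (U+2019), while the database may hold the other form.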
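To confirm which characters are actually involved, it can help to print their code points. The following is a minimal diagnostic sketch in Python; the `describe` helper is a hypothetical name, and the two sample strings are assumptions standing in for the database and document values.

```python
import unicodedata

def describe(text):
    """Return one 'U+XXXX NAME' entry per non-ASCII character in text."""
    return [
        f"U+{ord(ch):04X} {unicodedata.name(ch, 'UNKNOWN')}"
        for ch in text
        if ord(ch) > 127
    ]

# Assumed sample values: one typographic and one straight apostrophe.
print(describe("schema\u2019s"))  # ['U+2019 RIGHT SINGLE QUOTATION MARK']
print(describe("schema's"))       # []
```

If the two strings report different code points, the comparison fails because of the characters themselves rather than because of the extraction code.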
To fix this issue, you can try the following approaches:
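Use a text normalization or cleaning function to replace different types of apostrophes with a standard one before comparing. In Python, the standard unicodedata module handles general Unicode normalization, but its NFKC form does not turn the typographic apostrophe (U+2019) into the straight one (U+0027), so an explicit replacement table is still needed.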
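A minimal sketch of this normalization approach, assuming the set of apostrophe variants listed in the table below is enough for your data. The names `APOSTROPHE_VARIANTS` and `normalize_apostrophes` are hypothetical, and the sample strings are assumptions.

```python
import unicodedata

# Assumed set of apostrophe look-alikes; extend it for your own data.
APOSTROPHE_VARIANTS = {
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK (Word's "smart" apostrophe)
    "\u201B": "'",  # SINGLE HIGH-REVERSED-9 QUOTATION MARK
    "\u02BC": "'",  # MODIFIER LETTER APOSTROPHE
}
_TABLE = str.maketrans(APOSTROPHE_VARIANTS)

def normalize_apostrophes(text):
    """Map apostrophe variants to U+0027, then apply NFKC normalization."""
    return unicodedata.normalize("NFKC", text.translate(_TABLE))

database_string = "schema\u2019s"  # assumed: typographic apostrophe
extracted_string = "schema's"      # assumed: straight apostrophe

if normalize_apostrophes(database_string) == normalize_apostrophes(extracted_string):
    print("Strings match!")
```

Applying the same function to both sides before the comparison keeps the match exact, so unrelated headings cannot match by accident.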
Instead of relying on exact string matching, you can use fuzzy matching algorithms to compare the similarity between strings. Fuzzy matching algorithms can handle slight variations in characters. The fuzzywuzzy library in Python provides several fuzzy matching functions you can use.
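from fuzzywuzzy import fuzz

# Assumed for illustration: which side holds which apostrophe is not known.
database_string = "schema\u2019s"  # typographic apostrophe (U+2019)
extracted_string = "schema's"      # straight apostrophe (U+0027)
similarity_ratio = fuzz.ratio(database_string, extracted_string)
# One differing character out of eight scores about 88, so a threshold of 90 would miss it.
if similarity_ratio >= 85:  # Adjust the threshold as needed
    print("Strings match!")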
These approaches should help you handle minor differences in character representation and successfully compare the strings extracted from the Word document with the ones stored in the database.