Apostrophe (quotation mark) extracted from Word document is not the same as the one in the MySQL database

kits · January 4, 2021, 1:10pm

I have a program that extracts headings from Word documents and checks whether every string stored in a database exists in the Word document. My problem is that one heading contains an apostrophe (quotation mark), and the extracted apostrophe is not the same character as the one in the database.

So the string in the database = schema’s
and the string extracted from the document = schema’s, but the apostrophe is just a little bit different.

I hope someone knows how to fix this.

tahir.manzoor · January 4, 2021, 6:01pm

@kits

Please ZIP and attach your input Word document and text from MySQL (in notepad file) here for testing. We will investigate the issue and provide you more information on it.

kits · January 4, 2021, 6:43pm

On page 9 of the Word document in the ZIP you will find the heading I am trying to check for (schema’s).
The check works for all the other headings but not this one.

When I extracted the heading “schema’s” from the Word document into a RichTextBox and copied “schema’s” from the database below it in the same RichTextBox, I could see a small difference in the apostrophe character.

I hope you can work with the information I have given.

apostrophe.zip (703.7 KB)

tahir.manzoor · January 5, 2021, 7:36am

@kits

Please check the following text. In the Word document, the apostrophe character looks as shown below:

Schema’s

The Node.ToString(SaveFormat.Text) method extracts the text correctly. Please check the attached image.
apostrophe.png (15.3 KB)

If you are comparing text in a MySQL query, please escape the apostrophe by writing it as two single quotes. Please read
How to escape apostrophe (') in MySQL?

If you are comparing an apostrophe and a single quote, you can use the String.Replace method to replace one with the other before comparing.
apostrophe and single quote.png (1.5 KB)

Moreover, the .NET API returns true for these characters when you compare them.

Console.WriteLine("’ and '" == "’ and '");
Console.WriteLine("’ and '".Equals("’ and '"));
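
To confirm which characters are actually involved, it can help to print the code point of every differing character. The following is a minimal Python sketch (Python being the language used further down in this thread), not part of any library. The helper names `describe` and `first_difference` are hypothetical, and the two sample strings are assumptions: one with the straight apostrophe U+0027 and one with the typographic apostrophe U+2019.

```python
import unicodedata

def describe(text):
    """Return (character, code point, Unicode name) for each character."""
    return [(ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN")) for ch in text]

def first_difference(a, b):
    """Return the index of the first differing character, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # No mismatch in the common prefix: the strings differ only if lengths differ.
    return None if len(a) == len(b) else min(len(a), len(b))

# Assumed sample values; substitute the real database and document strings.
database_string = "schema's"        # straight apostrophe
extracted_string = "schema\u2019s"  # typographic apostrophe

i = first_difference(database_string, extracted_string)
if i is not None:
    print(describe(database_string[i]), describe(extracted_string[i]))
```

If the two printed code points differ, the strings really do contain different characters and an exact comparison will fail until one of them is replaced.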

VictoriaSanderson · June 2, 2023, 12:51pm

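This can happen because of different encodings or because the apostrophe is represented by different Unicode characters, for example the straight apostrophe (U+0027) and the typographic right single quotation mark (U+2019) that Word's smart quotes feature typically inserts.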

To fix this issue, you can try the following approaches:

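Use a text normalization or cleaning function to replace the different apostrophe characters with a standard one. Note that Unicode normalization on its own (for example unicodedata.normalize with 'NFKC' in Python) leaves the typographic apostrophe U+2019 unchanged, so the quote characters have to be mapped explicitly. Here’s an example:

import unicodedata

# Map typographic single quotes to the straight apostrophe (U+0027).
QUOTE_MAP = str.maketrans({"\u2018": "'", "\u2019": "'"})

def normalize(text):
    # NFKC folds other compatibility characters; translate() handles the quotes.
    return unicodedata.normalize("NFKC", text).translate(QUOTE_MAP)

database_string = "schema's"        # straight apostrophe (assumed)
extracted_string = "schema\u2019s"  # typographic apostrophe (assumed)

if normalize(database_string) == normalize(extracted_string):
    print("Strings match!")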

Instead of relying on exact string matching, you can use fuzzy matching algorithms to compare the similarity between strings. Fuzzy matching algorithms can handle slight variations in characters. The fuzzywuzzy library in Python provides several fuzzy matching functions you can use.

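from fuzzywuzzy import fuzz

database_string = "schema's"        # straight apostrophe, U+0027 (assumed)
extracted_string = "schema\u2019s"  # typographic apostrophe, U+2019 (assumed)

similarity_ratio = fuzz.ratio(database_string, extracted_string)

# In a short string a single differing character lowers the score noticeably,
# so adjust the threshold as needed.
if similarity_ratio >= 85:
    print("Strings match!")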

These approaches should help you handle minor differences in character representation and successfully compare the strings extracted from the Word document with the ones stored in the database.
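
Applied to the original task of checking that every database string appears among the extracted headings, the explicit quote mapping could be sketched as follows. The names `canonical` and `missing_headings`, the sample lists, and the choice to ignore letter case and surrounding whitespace are assumptions for illustration, not part of any library.

```python
# Map typographic single quotes to the straight apostrophe (U+0027).
QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'"})

def canonical(text):
    """Normalize quotes, surrounding whitespace, and case (case folding is an assumption)."""
    return text.translate(QUOTES).strip().casefold()

def missing_headings(database_strings, document_headings):
    """Return the database strings that have no matching heading in the document."""
    found = {canonical(h) for h in document_headings}
    return [s for s in database_strings if canonical(s) not in found]

# Hypothetical sample data standing in for the MySQL rows and the extracted headings.
database_strings = ["schema's", "introduction"]
document_headings = ["Schema\u2019s", "Introduction", "Appendix"]

print(missing_headings(database_strings, document_headings))
```

Unlike fuzzy matching, this only treats strings as equal when they differ in the listed quote characters, case, or surrounding whitespace, so it should not produce false matches between genuinely different headings.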