Fuzzywuzzy: String Similarity Matching in Python
Having issues finding some of the duplicates in your dataset? Are you treating “Donald Trump” and “Donald J. Trump” as two different strings? Look no further than Fuzzywuzzy!
Fuzzywuzzy is a Python library originally developed by SeatGeek that helps identify similar strings. The library grew out of the lack of labeling consistency across the internet: SeatGeek wanted to compile event tickets from all around the web, but oftentimes the same event would be listed with minor differences such as extra details, a hyphen, spacing, spelling errors, etc.
Building on another Python library named difflib, SeatGeek developed and open sourced the functions within Fuzzywuzzy that we will explore below. The foundation of Fuzzywuzzy is the Levenshtein Distance.
Levenshtein Distance
Vladimir Levenshtein, a Russian mathematician, introduced this metric back in 1965. The Levenshtein Distance tells us the “distance” between two strings. In other words, it measures the minimum number of operations (changes) required to turn one string into the other.
These operations include:
- Addition
- Substitution
- Deletion
Each of the above operations is weighted equally: adding a letter, substituting a letter, and deleting a letter each add one to the total count of operations. This applies to any difference, including punctuation, capitalization, etc. Let’s make this a bit easier and look at a specific example.
The Levenshtein distance between the words ‘Honda’ and ‘Hyundai’ is equal to 3: a ‘Y’ is added to the string, the ‘O’ is converted to a ‘U’, and a final ‘I’ is added to the end. Regardless of which letters you choose to add or substitute, you cannot get there in fewer than 3 operations.
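To make the metric concrete, here’s a minimal pure-Python sketch of the classic dynamic-programming calculation. (This helper isn’t part of Fuzzywuzzy; the name levenshtein is ours, just for illustration.)
def levenshtein(s1, s2):
    # Minimum number of additions, substitutions, and deletions
    # needed to turn s1 into s2, computed one row at a time.
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # addition
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

levenshtein("Honda", "Hyundai")
>>> 3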
Let’s get started!
A simple install command gets the ball rolling. (Optionally installing the python-Levenshtein package alongside it speeds up the calculations considerably.)
pip install fuzzywuzzy
Now, let’s import some modules.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
The functions below each return an integer from 0 to 100 representing a similarity ratio between two strings.
Simple Ratio
Let’s see what happens when you run the simple ratio function.
String1 = "Hello World"
String2 = "Hello World!"fuzz.ratio("Hello World", "Hello World!")>>> 96String1 = "Hello world"
String2 = "Hello World!"fuzz.ratio("Hello world", "Hello World!")>>> 84
So as we can see, the Simple Ratio is pretty straightforward. In the first calculation, the only difference is the “!”.
In the second calculation, there is now a difference in the capitalization of the letter “w” as well as the missing “!”.
In both calculations, the ratio’s denominator is the total number of characters across the two strings, so the score reflects how much of the combined text lines up.
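Under the hood, Fuzzywuzzy leans on difflib’s SequenceMatcher (or the faster python-Levenshtein when it’s installed). Assuming the pure difflib backend, you can reproduce the first score yourself: ratio() returns 2*M/T, where M is the number of matching characters and T is the total number of characters in both strings.
from difflib import SequenceMatcher

# 11 matching characters out of 23 total: 2 * 11 / 23 ≈ 0.96
round(SequenceMatcher(None, "Hello World", "Hello World!").ratio() * 100)
>>> 96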
Partial Ratio
The first function may seem a bit strict, allowing no room for mistakes in punctuation, capitalization, or missing characters. Our next function gives us a bit more breathing room: the Partial Ratio compares the two strings but returns a score based on the best-matching substring.
String1 = "Hello World"
String2 = "Hello"fuzz.partial_ratio("Hello World", "Hello")>>> 100String1 = "Hello World"
String2 = "World Hello"fuzz.partial_ratio("Hello World", "World Hello")>>> 62
In the first calculation, since “Hello” is a substring of “Hello World”, we receive a score of 100. In the second, the same words appear in a different order, so no single substring lines up and the score drops to 62.
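Conceptually, partial_ratio slides the shorter string along the longer one and keeps the best Simple Ratio it finds. Here’s a rough, simplified re-implementation (the function name partial_ratio_sketch is ours, and the real function uses difflib’s matching blocks to pick its candidate windows, so treat this as a sketch):
def partial_ratio_sketch(shorter, longer):
    # Compare the shorter string to every same-length substring
    # of the longer string and keep the best score.
    window = len(shorter)
    return max(fuzz.ratio(shorter, longer[i:i + window])
               for i in range(len(longer) - window + 1))

partial_ratio_sketch("Hello", "Hello World")
>>> 100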
The Partial Ratio can be a bit too simplistic at times and is easily foiled: simply rearranging the words of a matching substring quickly spoils the score, as we just saw. No need to worry, though, as Fuzzywuzzy has us covered.
Token Sort
The Token Sort method will split the string into tokens, sort them alphabetically, and rejoin them into a string again. During the tokenization process, words are converted to lower case and all punctuation is removed. After this process is done, the strings are once again compared with the aforementioned Simple Ratio function.
String1 = "Hello World"
String2 = "World Hello"fuzz.token_sort_ratio("Hello World", "World Hello")>>> 100
Token Set
The final approach is very similar to the previous one in that it takes the strings provided and once again tokenizes them. This time, however, instead of sorting and comparing right away, we organize the tokens into two groups: the intersection and each string’s remainder. Now, we compare.
t0 = sorted_intersection
t1 = sorted_intersection + remainder_string1
t2 = sorted_intersection + remainder_string2
The token_set_ratio() function runs the ratio() function on each of the above values and returns the MAX value.
#Compare all values to one another
fuzz.ratio(t0, t1)
fuzz.ratio(t0, t2)
fuzz.ratio(t1, t2)
Let’s see an example with words.
string1 = "yankees vs dodgers"
string2 = "new york yankees at los angeles dodgers"t0 = "dodgers yankees"
t1 = "dodgers yankees vs"
t2 = "dodgers yankees angeles at los new york"
fuzz.ratio(t0, t1) = 91
fuzz.ratio(t0, t2) = 50
fuzz.ratio(t1, t2) = 61
fuzz.token_set_ratio(string1, string2) = 91  # equal to the max of the three values above
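Here’s one way to build those three values yourself with plain Python sets (reusing string1 and string2 from above), so you can verify the max:
tokens1 = set(string1.split())
tokens2 = set(string2.split())
intersection = sorted(tokens1 & tokens2)

t0 = " ".join(intersection)                              # "dodgers yankees"
t1 = " ".join(intersection + sorted(tokens1 - tokens2))  # "dodgers yankees vs"
t2 = " ".join(intersection + sorted(tokens2 - tokens1))  # "dodgers yankees angeles at los new york"

fuzz.token_set_ratio(string1, string2)
>>> 91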
Conclusion
The Fuzzywuzzy library offers many tools for calculating the similarity between two strings, and the four functions above are each useful in their own way when scoring string similarities within your dataset. Make sure to understand your data and its quirks before choosing between the four!
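One last tip: the process module we imported at the start applies these scorers across a whole list of choices, which is usually what you want when deduplicating a dataset. For example (the scores shown are illustrative; the exact numbers depend on the default scorer):
choices = ["Donald Trump", "Donald J. Trump", "Barack Obama"]

# Rank every choice against the query; limit caps how many results come back.
process.extract("donald trump", choices, limit=2)
>>> [('Donald Trump', 100), ('Donald J. Trump', 95)]

# Or grab just the single best match, with a scorer of your choosing.
process.extractOne("donald trump", choices, scorer=fuzz.token_set_ratio)
>>> ('Donald Trump', 100)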