Fuzzy Match with Power Query
A few years ago, I wrote a post about how to perform fuzzy lookups using the “Fuzzy Lookup Add-In for Excel.” However, this capability is now available in Power Query!! Wait, what? Yes! Performing a fuzzy match is so much easier and far more intuitive in Power Query. This post walks through the basics … and a huge THANK YOU to the Microsoft developers for this feature 🙂
Before we get to the mechanics, let’s back up and understand what we are trying to do and what a fuzzy match is.
Generally speaking, Excel lookup functions expect that the lookup values match on both lists. The case can be different (eg, “Microsoft” and “microsoft” would match), but generally, the idea is that the text strings need to be the same. There are some exceptions to this as noted below, but generally, the idea is that the lookup values are the same between both lists.
When the lookup values are different, for example ABC Company and ABC Company, Inc., traditional lookup functions aren’t always reliable. Now, to us humans, these two names represent the same underlying company. But to Excel, these represents different lookup values. The idea with a fuzzy lookup is to enable Excel to match them.
For example, here is a table (named CompanyName) with some company names:
We would like Excel to match them to this table (named CompanyList):
You’ll notice there are some differences: ABC Company vs ABC Company, Inc., Micro soft vs Microsoft, and Xeon vs Xeon Products.
With a traditional lookup function such as VLOOKUP, they don’t match:
With Power Query’s fuzzy merge option, they do:
Now, let’s get to the mechanics.
Note: some lookup functions provide an option for “approximate” match. For example, when you set the 4th argument of VLOOKUP to TRUE, or the 3rd argument of MATCH to 1 or -1, it enables us to perform range lookups. And, depending on the data set, lookup values, and sort order, this may or may not provide the expected approximate match desired. Additionally, some lookup functions support the use of a wildcard which has helped in some situations as well. But, we can now use the built-in fuzzy merge option in Power Query to perform a true fuzzy match 🙂
We will walk through this process in three steps:
- Import List 1
- Import List 2
- Fuzzy merge
Let’s do this thing.
Note: depending on when you are reading this and your version of Excel, you may not have the fuzzy merge option, which was recently released via O365 subscription (Office Insiders update channel).
Import List 1
We begin by getting our first list (CompanyName) into Power Query by selecting the table (or other data source) and using the Data > From Table/Range command. A preview is displayed in the Power Query Editor:
Next, we Home > Close & Load To … a connection only query … and no need to load it into the data model:
Now, it is time to get our second list.
Import List 2
We basically rinse and repeat to import the lookup table (CompanyList). We again Close & Load To … and create a connection only query.
At this point, we should see both queries listed in the Queries & Connections pane on the right side of the Excel window:
With our first and second tables in Power Query, we can now perform the fuzzy merge.
To start this step, from within Excel we select Data > Get Data > Combine Queries > Merge. The Merge dialog appears, and we select our first list (CompanyName) and our second list (CompanyList). We also identify the lookup columns by clicking the column header from both, as shown below:
You’ll notice the bottom of the screenshot says, “The selection matches 0 of 3 rows from the first table.” This is because ABC Company is not the same as ABC Company, Inc., and so on.
But … and here is the good part … you’ll notice a checkbox called Use fuzzy matching to perform the merge.
Check it and … interesting! We now have 2 of 3 rows that match:
Wow! Now let’s see which items matched by clicking OK.
Note: don’t worry, we’ll explore Fuzzy merge options shortly.
Back in the Power Query Editor, we Expand the CompanyList Table field and we can see the matches below:
Our first two companies match … awesome!
Can we get our third company, Xeon, to match as well? Time to explore a few fuzzy merge options. Click the gear icon on the Source applied step and expand the Fuzzy merge options:
- The Similarity threshold field allows us to specify how similar the two values must be to make a match. The range is 0 to 1, and you can set as desired based on your data.
- The Ignore case is checked by default (so MICROSOFT will match microsoft).
- The Match by combining text parts is checked by default (so Micro soft would match Microsoft).
- The Maximum number of matches allows us to set how many matches are returned for each lookup value.
- The Transformation table allows us to specify a separate “mapping table” that increases reliability for lookups we perform frequently (more on this below).
Let’s change the Similarity threshold to 0.5, and bam … 3 of 3 items now match:
We click OK and … yes … it worked!
We Close & Load To … a Table, and the results are now back in Excel:
And that, my friends, is a fuzzy lookup in Excel!
Now, one last note. If we perform a similar lookup on similar lists over and over, we can increase the reliability of our match by incorporating a transformation “mapping” table.
To create a Transformation Table, you create a new table with a From and To column, as shown below:
Then, you pull it into Power Query (Data > From Table/Range) and save the query as a connection-only query as we did previously.
Now, we open the Merge dialog by editing the fuzzy merge query (double-click the merge query in the Queries & Connections pane). Once again, click the gear icon in the Source applied step. Expand Fuzzy merge options and select your table in the Transformation table drop-down:
Think of this table as providing a reliable match for known differences, and if a value can’t be found in it, then Power Query uses the fuzzy match logic.
Well, for me, this is a very exciting capability in Power Query as it solves a common issue that has historically been hard to solve. Now that it is built-in to Power Query, it is easy to use and will really help in practice. Again, thanks Microsoft for this gift 🙂
Sample File: FuzzyPQ.xlsx
Excel is not what it used to be.
You need the Excel Proficiency Roadmap now. Includes 6 steps for a successful journey, 3 things to avoid, and weekly Excel tips.
Want to learn Excel?
Access all Undergrad and Masters lessons with a Campus Pass or CPE Pass. Includes on-demand training plus live office hours.
Hi Jeff, This is great to know. Not having to change case in a column to make a match with another column is one less thing to do! I am guessing that the fuzzy logic works on more than one column match? I. E. Matching 2 or more columns in table 1 vs 2 or more columns in table 2. Eg Name and address.
Lastly is there a quick way of showing which rows were a match because of fuzzy logic being applied. (you could do a separate normal table merge match and then compare… )
I’m currently on the latest version of office 365 ProPlus and i don’t see the fuzzy option appear when i follow these steps, is there some setting or add on I’m missing?
Thanks in advance.
O365 has a few different update channels, and these determine when you receive updates. So, if you want to receive updates and new features quickly, update your channel to office insiders 🙂
This is very helpful! What if I have only one table and I want to identify all the similar results in only one column (for example Company Name column and I want to merge all the similar names, or I want to see all the possible matches for each of the entries in my column).
Using the Fuzzy Lookup Excel add-in was one of the outstanding breakthroughs in my data-matching struggles way back in 2013 +/- !!
I use the add-in quite often and almost can’t do without it in my work!
So… seeing this option in Power Query was a thing of joy! 🙂
However, I have noticed a couple glitches / shortcomings in the PQ version of fuzzy matching. I have yet to figure out whether it’s something I don’t know yet, or whether it’s a Power Query issue.
(i) In Power Query, there doesn’t seem to be a way to limit the number of matches to 1 from BOTH tables as it seems to do in the Excel add-in.
When I keep the max count of matches as 1 in a Left Outer merge, it does restrict the number of records returned from the Left table to 1, but it returns a multiple count of records from the Right table — often with “false positive” matches.
In the Excel add-in, the same setting returns excellent results and hardly any false positives! (Of course, with the same similarity threshold.)
(ii) In the Excel add-in, there is an option to return the probability of a match in the “Similarity” column. This gives us a very good way to sort the output table by descending order of the similarity percentage.
I haven’t noticed such an option in Power Query.
However, despite these apparent shortcomings, I can still see myself using Fuzzy Matching in Power Query more often simply because the of the ease of setting up the queries and refreshing the data with a couple of clicks!
Beware of false positives. Using a 0.9 fuzzy rating (where 1 is an “exact” match) on a dataset where we had a list of names displayed in a variety of ways, PQ pulled “Titan” for “Twitch” and “SAVE” for “Pave” but couldn’t find the Inc. matches or LLCs. So by having to drop down to ~0.5 to get it to recognize Inc., etc., the number of false positives would be far too many, especially considering the number of false positives at 0.9 was already at an unacceptable level.
It is helpful in certain cases, I imagine; however, I’ve never found a scenario where I don’t need confidence that my data is completely accurate. In the case above, using the optional parameters in XLOOKUP actually made the lookup much easier. Thankfully the dataset was only a few hundred rows long.