This post explores Excel’s lookup functions, approximate matches, fuzzy lookups, and exact matches. The built-in Excel lookup functions, such as VLOOKUP, are amazing. When implemented in the right way for special projects or in recurring use workbooks, they are able to save a ton of time. The VLOOKUP function alone has saved countless hours in my recurring use workbooks. However, the VLOOKUP function, similar to Excel’s other lookup functions such as HLOOKUP and MATCH, is built to perform an exact match or a range lookup. Both of these are quite different from an approximate match or a fuzzy lookup. This post discusses the details of these ideas, and demonstrates how to perform a fuzzy lookup in Excel 2010 and later.

### Understanding Built-In Lookup Functions

The built-in Excel lookup functions, such as VLOOKUP, HLOOKUP, and MATCH, work with similar lookup logic. To simplify this post, we’ll use just one as the example. Since the VLOOKUP function is probably the most used and most familiar lookup function, we’ll use it as we explore these ideas.

The basic idea of an Excel lookup function is to look for a value in a list. For example, we could ask Excel to find “ABC Company” in a list of customer names. That is the basic idea, but the application of lookup functions are numerous and the implementations can become quite sophisticated and powerful.

For this post, I’d like to split the tasks that a lookup function performs into two steps. I’ll call step one *the match*, and step two *the return*. In the first step, the match, Excel must find the matching value. You tell Excel the value to find, such as “ABC Company” and you tell Excel where to look, such as in a range of cells. You are asking Excel to find the lookup value in the lookup range.

Step two, the return, is the function’s result. That is, what value the function should return to the cell. Some lookup functions, such as the MATCH function, tell Excel to return the position number. Other lookup functions, such as the VLOOKUP function, tell Excel to return a related value. So, based on which lookup function you select, and which function argument values you enter, Excel knows what to return once it finds its match. So far so good?

Let’s do a quick example at this point.

I would like to find a specific customer name “ABC Company” in a list of customers, and if found, I would like Excel to return the customer id which is found in the next column.

I would use a VLOOKUP function, and I would ask it to find “ABC Company” in the Customer Table, and return the ID. Assuming the customer name was entered in C7, and the customers were stored in a Table named Table1, then the following function would do the trick:

**=VLOOKUP(C7, Table1, 2, FALSE)**

**Where**:

**C7**is the value to find**Table1**is the lookup range**2**is the column that has the value we wish to return**FALSE**means we are not performing a range lookup

This function is entered in C8 in the screenshot below.

As you can see, the ID AC100 was successfully returned to the formula cell C8. And that my friend is the basic idea of the VLOOKUP function. Find a value (the match) and compute the result (the return).

It is important to note that the lookup value, the text string “ABC Company” must be found in the lookup range. Except for case (upper and lower), the two values must match exactly. “ABC Company” would not match “ABC Company, Inc.”, “ABC Co”, or “ABC Company “. No leading spaces, no trailing spaces, no extra abbreviations or characters. They must be the same. This is called an exact match. If the value is not the same, the function will not match it, and you’ll get an error, as shown in the screenshot below.

Now that we have covered the basics, it is time to explore the VLOOKUP’s fourth argument.

### The Truth about the VLOOKUP Fourth Argument

The fourth argument of the VLOOKUP function is officially named: range_lookup. It is a boolean argument, meaning you can pass it a value of TRUE or FALSE, or any other representation of TRUE or FALSE. The thing that tends to mislead Excel users is the description that Microsoft used for these options. Excel describes the TRUE value as “Approximate Match” and FALSE as “Exact Match.” A clearer description would have been something like TRUE “You are doing a range lookup” and FALSE “You are not doing a range lookup” but in any event, the descriptions are what they are.

When you select TRUE (Approximate Match) you are not asking Excel to match values that are approximately the same as each other. The description Approximate Match would tend to imply that the function would match “ABC Company” and “ABC Company, Inc.” since they are approximately the same name. In some cases and in some data sets, this idea would work. But this idea does not work in all cases, and thus, can’t be relied upon in our workbooks. For example, in the screenshot below, the function did not find a match between “ABC Company” and “ABC Company, Inc.” as evidenced by the incorrect ID returned in C8:

In the following screenshot however, the function did find a match between “ABC Company” and “ABC Co” as evidenced by the expected ID returned to C8:

The way that the function actually works when TRUE is selected is this: it walks down the list row by row, and ultimately stops on the row that is less than the value and where the next row is greater than the value. This is why the lookup range must be sorted in ascending order for the function to return an accurate result when the fourth argument is TRUE.

This idea can be confusing when thinking about text strings, but makes more sense when thinking about numbers. For example, when trying to find the correct commission rate based on the sales value. In this case, you want to perform a range lookup. You want to look up a value from within a range. This is illustrated in the screenshot below.

The function walks down row by row trying to determine which row to stop on. It continues down until it finds a row that is greater than the lookup value, and then it stops on the previous row. It stops on the row that is less than the value, and where the next row is greater than the lookup value. This is pretty easy to understand when thinking about numbers, but can be harder to visualize when thinking about text strings. The key to understanding this function argument however, is to realize that the logic is identical when operating on text strings and numbers. This is why “ABC Company” does not match “ABC Company, Inc.”, because “ABC Company Inc.” is *greater than* ABC Company. This is why “ABC Company” will match “ABC Co”, because “ABC Co” is *less than* “ABC Company.” As you can see, this is not what we have in mind when thinking about approximate match.

### What is a Fuzzy Lookup aka Approximate Match

An approximate match, to us, means that two text strings that are about the same, but not necessarily identical, should match. For example, “ABC Company” should match “ABC Company, Inc.,” “ABC Co,” and “ABC Company .” We think about an approximate match as kind of fuzzy, where some of the characters match but not all.

The idea of a fuzzy lookup is that the values are not a clear match, they are not identical. But that they are likely a match, there is a probability that they are a match. They likely represent the same underlying entity.

Now that we realize the VLOOKUP function does not truly perform approximate match logic, at least, not in the way we want it, what do we do?

### Add-In

When you hit a wall, go around it. Since the built-in lookup functions do not perform fuzzy logic when performing the match, we hit a built-in limitation of Excel. Microsoft has offered a way to work around this limitation by offering a free add-in.

Microsoft offers a free add-in that enables Excel to perform fuzzy lookups. It is called “Fuzzy Lookup Add-In for Excel” and is available at the time of this post at the link below:

http://www.microsoft.com/en-us/download/details.aspx?id=15011

Once installed, this add-in performs fuzzy lookups. It does not change the behavior of any of the built-in lookup functions. It does not enable your VLOOKUP functions to perform fuzzy lookups. It is an add-in which basically processes two lists and computes the probability of a match.

You specify the two tables, and within each table the columns to inspect. Basically, you define step one *the match*. You then define step two by identifying which columns from the tables should be included in the result. You can also specify the probability threshold. You hit go, and the add-in performs its work, and then outputs the resulting table starting at the active cell. It basically generates a static report based on the settings you select.

Here is a screenshot of the output, showing that it successfully matched “ABC Company” and “ABC Company, Inc.” in the same data set that caused our VLOOKUP function to fail.

For more information about the fuzzy lookup add-in, and more detail on how to use it, please visit the Microsoft link above. The add-in comes with instructions, a sample Excel file, and a pdf file with background and the logic it uses to do its magic. There is some extremely interesting computer science and math working behind the scenes, including Jaccard similarity, tokenization of records, and transformations. Pretty heavy mathematics in there.

Thanks Microsoft Research for this add-in!!

Jeff,

This is one of the most impressive informational and detailed instruction posts I’ve ever see on the web. Short and to the point, with wonderful examples, and well thought out. GREAT job!

Thanks

Do you believe this add-in is able to fuzzy lookup on about 70 columns?

Luciano – I’ve not tried it on a data set of 70 columns, so I’m not sure…but I’d love for you to let me know what you find out when you give it a try!

Thanks

Jeff

man you are amazing. great article,, its very useful,, thanks a lot for your work ,, also thanks microsoft

I thought that this was a wonderful add on for excel. With my job i find i need to find partial matches in the same column of a table and identify them. This seemed to answer my prayers. However i am finding a big issue, which i am hoping is fixable.

When i do a fuzzylookup on two columns and one has “apple” and the other “apple:” it brings back a perfect match not a partial match. The system is viewing this as a perfect match, it is ignoring the “:”

This makes the add on useless as this sort of partial match is required to be identified is required as well as “colour” vs “color”. Any thoughts on how this maybe fixed? If it is, this would be perfect. Are there any other free add ons that i could use that would work.

Thank you

Donnchadh

Donnchadh,

I’m unaware of add-in settings or options that would allow you to modify the built-in behavior to meet your needs. I modified the Similarity Threshold setting and various others, but none seemed to produce the desired result. Thus, you may need to solve it in two steps. The first step is using the add-in to generate the results table and the second step is using a formula to let you know if the two items in the results table are identical. For example, assuming that the apple/apple: results were in cells C10 and D10, you could write a formula such as:

=C10=D10

You could fill this formula down and it would provide TRUE for rows that were an exact match and FALSE otherwise.

Thanks,

Jeff

Would it be possible to have the fuzzy lookup match up using two columns wherein one column is fuzzy matching but the other one is a strict match? I’m doing it for name and birthdate where name doesn’t have to completely match but birthdate must be a perfect match. I’ve played around with the column configuration settings but the lookup is still returning records where the birthdate isn’t matching. Thanks.

Dennis,

I believe the add-in allows you to provide a “Transformation Threshold” for individual columns via the Column Configuration options. The default Column Configuration options are Default, PhoneNumber, SSN, and ZipCode. You can actually add a new custom configuration. On the bottom of the Add-In panel, you’ll see a Configure… button which opens the Configure dialog. In the Column Configurations tab, you can click the Add button to create a new configuration, for example, BDate. Set the EditTransformationThreshold value to 1 which means exact match. Then, close the Configuration dialog. In the Fuzzy Lookup panel, you want to select the two Name columns and then click the match icon to push the selection down into the Match Columns list box. Set the configuration for that one to say Default, which is a fuzzy match. Then, select the BDate columns from both tables and click the match icon to push the selection down into the Match Columns list box. Set the configuration to say BDate, the custom configuration you just made that uses 1 for exact match. Then, delete any other Match Column entries that may have been added automatically, so that you only have the two we added manually.

I believe this will provide the result you are seeking…thanks!

Thanks

Jeff

Thank you very much

Great post, informative, to the point, well written, and all around very helpful.

Thanks.

Hi Jeff

Your post is very informative, thank you for that, I am trying to match a string of 9 numbers from a list to a list that contains text and the numbers. What function would you suggest to match items in the cell and ignore the rest? Not sure if this exists but considering my two tables have over 120k lines each, this could save me enormous amount of time.

Thank you

Pat

Patrice,

One approach to tell if the 9-number string appears in the list that contains text and numbers is with the COUNTIFS function. Since the COUNTIFS function supports the use of wildcards, it is a fairly easy task to determine if there is a matching value or not.

For example, if your list of 9-number values are in column A, and the text/number list is in column G, you could write the following formula in Cell B1 and fill it down:

=COUNTIFS(G:G,”*”&A1&”*”)

This tells Excel to count the number of cells in G that contain the 9-number string in A1 even if it includes preceding or trailing text values. For rows where the function returns 0, it means the value can’t be found; any rows with 1 or more means it can be found in that number of cells. This approach assumes that the 9-numbers are sequential in the text/number string, for example 123456789 and AA123456789BB. It won’t work if the letters are in between the numbers, such as 1234A5678B9C.

If you need to return a related value, rather than just determine if the value exists on the other list, then you could use VLOOKUP along with wildcards instead. The following blog post explains more about using VLOOKUP with wildcards:

http://www.excel-university.com/vlookup-and-wildcards/

Hope it helps!

Thanks

Jeff