Make Fuzzy Data Safe by Modeling your Matching Rules

We tie data to dimensions using matching rules that are either deterministic or probabilistic.

Deterministic matching is what we use to find an ‘exact’ match. A system used by Human Resources might have an employee ‘Aerith Gainsborough’ who matches ‘Aerith Gainsborough’ in our dimension. In this case we looked for the exact value in the dimension to determine a match. Unfortunately, our work is hardly ever this easy: we often have to define complex business rules and implement them in our transform logic.
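As a sketch, a deterministic match is just an exact lookup against the dimension (the dimension contents and surrogate keys below are hypothetical):

```python
# Hypothetical employee dimension, keyed on the exact source value.
employee_dim = {
    "Aerith Gainsborough": 101,
    "Cloud Strife": 102,
}

def deterministic_match(name):
    """Return the dimension key only on an exact match, else None."""
    return employee_dim.get(name)

print(deterministic_match("Aerith Gainsborough"))  # exact hit -> 101
print(deterministic_match("Aeris Gainsborough"))   # miss -> None
```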

Probabilistic matching is sometimes called ‘fuzzy’ matching. If HR has entries for “Aerith Gainsborough” and also entries for “Aeris Gainsborough”, a probabilistic algorithm might indicate that both entries refer to the same person.
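A minimal fuzzy-match sketch using the standard library’s difflib; the 0.9 threshold is an arbitrary assumption, and real implementations often use dedicated algorithms such as Jaro-Winkler:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Similarity ratio in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()

score = similarity("Aerith Gainsborough", "Aeris Gainsborough")
# A threshold decides whether to treat the two entries as the same person.
print(score >= 0.9)  # high similarity despite the spelling difference
```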

The deterministic approach is usually preferred, but sometimes the source data is very poor quality or so dissimilar from our conformed dimension that it yields a very low matching rate. For example, we might only be able to match 40% of employees in HR to our dimension.

When this happens we are sometimes pushed to take a probabilistic approach.

This is dangerous if the data is going to be used for accounting purposes. We don’t want to pay an employee for hours that are ‘probably’ theirs, regardless of how much it improves their morale (remember that another employee somewhere would be on the losing end!).

On the other hand, if we are analyzing some kind of generic employee behavior, such as “Are over-worked employees more likely to quit?”, then exact matches may not be a critical requirement. But if the data is there, the danger still exists that someone may start using it for financial purposes. After all, how would they know that the data was matched ‘probabilistically’ instead of ‘deterministically’?

A way to mitigate this risk, and still allow both kinds of analysis, is to include your matching rules as a dimension in your model. This way, if someone is using the data for a financial purpose, they can slice on ‘exact matches only’, while the data scientists can take a broader scope of data.

The first step is to define the matching rules and put them in a dimension table:
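A minimal sketch of what this dimension might contain; the rule names and surrogate keys are hypothetical, and a ‘No match’ member is included as a catch-all for rows no rule can tie:

```python
# Hypothetical matching-rule dimension, ordered from most deterministic
# to most probabilistic, plus a catch-all member for unmatched rows.
match_rule_dim = [
    {"match_rule_key": 1,  "rule_name": "Exact name",               "rule_type": "Deterministic"},
    {"match_rule_key": 2,  "rule_name": "Exact name, case-folded",  "rule_type": "Deterministic"},
    {"match_rule_key": 3,  "rule_name": "Fuzzy name, ratio >= 0.9", "rule_type": "Probabilistic"},
    {"match_rule_key": -1, "rule_name": "No match",                 "rule_type": "None"},
]

# The rule_type column is what lets consumers slice by match quality.
deterministic_keys = {r["match_rule_key"] for r in match_rule_dim
                      if r["rule_type"] == "Deterministic"}
print(deterministic_keys)
```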

The matching rules will be applied during the transform process when we tie the source data to the employee dimension (or whatever your particular situation is). Because the most deterministic matches have the most analytical possibilities, we want to prefer those matches. We also prefer them because the fuzzy matches cast a ‘wider net’: if applied first, they would claim rows that could have matched deterministically, making the deterministic matching rate appear so low that it looks useless.

So in the transform logic, apply the best and most deterministic matches first, followed by progressively more probabilistic matches. The probabilistic matches may also be applied in ‘best-to-worst’ order if some of them have a higher perceived value than others.
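A sketch of that ordered transform; the dimension contents, rule keys, and the 0.9 threshold are all hypothetical, and each rule falls through to the next only on a miss:

```python
from difflib import SequenceMatcher

# Hypothetical employee dimension: name -> surrogate key.
employee_dim = {"Aerith Gainsborough": 101, "Cloud Strife": 102}

def match_employee(name):
    """Apply rules best-first; return (employee_key, match_rule_key)."""
    # Rule 1 (deterministic): exact name.
    if name in employee_dim:
        return employee_dim[name], 1
    # Rule 2 (deterministic): exact name after case-folding.
    folded = {k.casefold(): v for k, v in employee_dim.items()}
    if name.casefold() in folded:
        return folded[name.casefold()], 2
    # Rule 3 (probabilistic): best fuzzy candidate above a threshold.
    ratio = lambda k: SequenceMatcher(None, name.casefold(), k.casefold()).ratio()
    best = max(employee_dim, key=ratio)
    if ratio(best) >= 0.9:
        return employee_dim[best], 3
    # Catch-all: the row stays unmatched.
    return None, -1

print(match_employee("Aerith Gainsborough"))  # rule 1
print(match_employee("aerith gainsborough"))  # rule 2
print(match_employee("Aeris Gainsborough"))   # rule 3 (fuzzy)
print(match_employee("Sephiroth"))            # no match
```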

In the end, each row in your fact table will be conformed to your dimension and will also record the matching rule that was used.

Now we can slice our data by the type of matches:
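With each fact row carrying its match-rule key, the ‘exact matches only’ slice is just a filter; the row shape and rule keys below are hypothetical:

```python
# Hypothetical fact rows: (employee_key, hours_worked, match_rule_key).
fact_rows = [
    (101, 40, 1),    # deterministic match
    (102, 38, 1),    # deterministic match
    (101, 12, 3),    # fuzzy match
    (None, 8, -1),   # unmatched
]

deterministic_keys = {1, 2}

# Financial users slice on 'exact matches only'.
financial_slice = [r for r in fact_rows if r[2] in deterministic_keys]

# Data scientists can take the broader scope: anything that matched.
behavioral_slice = [r for r in fact_rows if r[2] != -1]

print(len(financial_slice), len(behavioral_slice))
```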

We can also measure the hit rate of each type of match:
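Counting fact rows per rule gives each match type’s share of the total; the row shape and rule keys are hypothetical:

```python
from collections import Counter

# Hypothetical fact rows: (employee_key, match_rule_key).
fact_rows = [(101, 1), (102, 1), (101, 3), (None, -1)]

counts = Counter(rule_key for _, rule_key in fact_rows)
total = len(fact_rows)
for rule_key in sorted(counts):
    share = counts[rule_key] / total
    print(f"rule {rule_key}: {counts[rule_key]} rows ({share:.0%})")
```

A low share for the deterministic rules (and a large share for the catch-all) is the signal that source data quality, or the rules themselves, need attention.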

This gives us the best of both worlds: exact matches that can be used where accuracy matters, and less exact matches for cases where it is not so important.