Compare All Attributes of Two Tables

Recently I had to compare lots of sets of data. Determining if there are any differences at all, or if there are rows that exist in one but not the other, is easy. But I needed to compare every column, calculate match percentages, and report every difference.

I was able to take what would have been a lot of tedious query writing and create a standard way to do it. All you have to do is a little bit of prep on your two data sets, and these queries will compare, quantify, and report all mismatches.

Note: The sql code here is written for Netezza, but should be adaptable to other database platforms. You might have to remove the Damerau-Levenshtein Edit Distance function or add it in from an outside library.

For this article suppose two separate applications contain employee contact information. Much of the information is similar, but because each system’s data is entered and maintained separately, the attributes don’t all match.

create temp table T_CONTACT_A as
-- REGION Create T_CONTACT_A
	select	88 as EMPLOYEE_ID, 'Ryan' as NAME_1, 'Howard' as NAME_2, '56 Dunder Lane' as ADDRESS union
	select	23, 'Jan', 'Levinson', '88 Sandals Drive' union
	select	54, 'Toby', 'Flenderson', '909 Getty Street' union
	select	123, 'Stanley', 'Hudson', '2342 Papermill Road' union
	select	43, 'Kevin', 'Malone', '55 Scranton Street' union
	select	74, 'Karen', 'Filippelli', '878 Utica Place' union
	select	14, 'Todd', 'Packer', '70 Mifflin Meadow Road' union 	
	select	99, 'Holly', 'Flax', '4542 Schrute Drive';
-- END REGION
			
create temp table T_CONTACT_B as
-- REGION Create T_CONTACT_B
	select	88 as ID, 'Ryan' as FIRST_NAME, 'Howard' as LAST_NAME, '56 Dunder Ln.' as STREET_ADDRESS union
	select	23, 'Janet', 'Levinson', '88 Sandals Drive' union
	select	54, 'Toby', 'Flenderson', '909 Getty Street' union
	select	123, 'Stanley', 'Hudson', '2342 Papermill Road' union
	select	43, 'Kevin', 'Malone', '69 Scranton Street' union
	select	74, 'Karen', 'Filippelli', '878 Utica Place' union
	select	14, 'David', 'Wallace', '55 Easy Street';
-- END REGION

System A Contact Data

System A contact data

 

System B Contact Data

System B contact data

Step one is to determine how to tie the objects in each system together. They don’t have to have a match for every object, this will simply compare the objects that do have a match. In these systems the key for a contact is their employee number. Employee ‘88’ in system A should be ‘88’ in system B. Some systems may take a bit of research and analysis before you can figure out if there is a way to match the objects.

Next, pivot the attributes into key-value pairs. Notice I have standardized the attribute names and cleaned the values a little. It is up to you whether an upper/lower mismatch or trailing spaces matter for the comparison. (check your database’s collation!)

create temp table T_OBJECT_KEY_VALUE_SET_A as
-- REGION Create key-value pairs for set A
	select	EMPLOYEE_ID as OBJECT_KEY, 
		'FIRST NAME' as ATTRIBUTE, 
		upper(trim(NAME_1)) as VALUE 
	from 	T_CONTACT_A 
	union all
	select	EMPLOYEE_ID as OBJECT_KEY, 
		'LAST NAME' as ATTRIBUTE, 
		upper(trim(NAME_2)) as VALUE 
	from 	T_CONTACT_A 
	union all
	select	EMPLOYEE_ID as OBJECT_KEY, 
		'ADDRESS' as ATTRIBUTE, 
		upper(trim(ADDRESS)) as VALUE 
	from 	T_CONTACT_A;
-- END REGION

create temp table T_OBJECT_KEY_VALUE_SET_B as
-- REGION Create key-value pairs for set A
	select	ID as OBJECT_KEY, 
		'FIRST NAME' as ATTRIBUTE, 
		upper(trim(FIRST_NAME)) as VALUE 
	from 	T_CONTACT_B 
	union all
	select	ID as OBJECT_KEY, 
		'LAST NAME' as ATTRIBUTE, 
		upper(trim(LAST_NAME)) as VALUE 
	from 	T_CONTACT_B 
	union all
	select	ID as OBJECT_KEY, 
		'ADDRESS' as ATTRIBUTE, 
		upper(trim(STREET_ADDRESS)) as VALUE 
	from 	T_CONTACT_B;
-- END REGION

select	*
from	T_OBJECT_KEY_VALUE_SET_A;

select	*
from	T_OBJECT_KEY_VALUE_SET_B;

After this your work is done! The next steps use the object-key-value tables to complete the analysis, all you had to do is fit your data into these structures.

We’ll walk through the next steps to make sure they are understood.

This step creates a complete list of all the objects and their attributes to compare.

create temp table T_OBJECT_ATTRIBUTE_INDEX as
-- REGION Create join index
	select	distinct
		OBJECT_KEY,
		ATTRIBUTE
	from	T_OBJECT_KEY_VALUE_SET_A
	union
	select	distinct
		OBJECT_KEY,
		ATTRIBUTE
	from	T_OBJECT_KEY_VALUE_SET_B;
-- END REGION

This creates a count of the distinct entities so that we can calculate the total match percentage:

create temp table T_COMMON_OBJECT_COUNT as
-- REGION Get count of terminals that exist in both sets	
	select	count(distinct A.OBJECT_KEY) as COMMON_OBJECT_COUNT,
		count(*) as TOTAL_OBJECT_MATCH
	from	T_OBJECT_KEY_VALUE_SET_A A
		inner join T_OBJECT_KEY_VALUE_SET_B B
			on B.OBJECT_KEY = A.OBJECT_KEY;
-- END REGION

This calculates the match percentage of each attribute:

-- REGION Get matching % by attribute
	select	I.ATTRIBUTE,
		round(count(*) / (select cast(COMMON_OBJECT_COUNT as numeric(12,2)) from T_COMMON_OBJECT_COUNT) * 100, 2) as MATCH_PERCENTAGE
	from	T_OBJECT_ATTRIBUTE_INDEX I
		left join T_OBJECT_KEY_VALUE_SET_A A
			on A.OBJECT_KEY = I.OBJECT_KEY
			and A.ATTRIBUTE = I.ATTRIBUTE
		left join T_OBJECT_KEY_VALUE_SET_B B 
			on B.OBJECT_KEY = I.OBJECT_KEY
			and B.ATTRIBUTE = I.ATTRIBUTE
	where	A.VALUE = B.VALUE
	group by
		1;
-- END REGION

match percentage result

Match Result

And finally, this reports all the mismatched attributes.

-- REGION Report mismatched attributes.
	select	I.OBJECT_KEY,
		I.ATTRIBUTE,
		A.VALUE as SYSTEM_A_VALUE,
		B.VALUE as SYSTEM_B_VALUE,
		dle_dst(A.VALUE, B.VALUE) as DAMERAU_LEVENSHTEIN_EDIT_DISTANCE,
		100 - round(DAMERAU_LEVENSHTEIN_EDIT_DISTANCE / 
		(case when length(A.VALUE) > length(B.VALUE) then length(A.VALUE) else length(B.VALUE) end)::double * 100, 2) as SIMILARITY_PERCENTAGE
	from	T_OBJECT_ATTRIBUTE_INDEX I
		left join T_OBJECT_KEY_VALUE_SET_A A
			on A.OBJECT_KEY = I.OBJECT_KEY
			and A.ATTRIBUTE = I.ATTRIBUTE
		left join T_OBJECT_KEY_VALUE_SET_B B 
			on B.OBJECT_KEY = I.OBJECT_KEY
			and B.ATTRIBUTE = I.ATTRIBUTE
	where	A.VALUE != B.VALUE
	order by
		1, 3;
-- END REGION

Mismatch Attribute Result

Attribute mismatch details

Here I’m showing how similar the values are by comparing the max length to the Damerau-Levenshtein edit distance (dle_dst function). It’s not a complex analysis, just a simple check to help guide decisions. For example if one set is considered ‘authoritive’ then you might try overwriting values in the other set where a certain percentage is met, and researching the rest with the data or system owners.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s