Sunday, April 27, 2014

Merge Excel sheets ?

No problem!

A friend of mine had a problem, she had to merge two [huge] Excel workbooks, by matching names in one, with the names in another. Something like:

WorkBook1
+---------+---------+----------+----------+
| Name1   |  field1 |  field2  |  ...     |
+---------+---------+----------+----------+
|  ...    |         |          |          |
+---------+---------+----------+----------+
|  ...    |         |          |          |
+---------+---------+----------+----------+

WorkBook2
+---------+---------+----------+----------+
| field2_1| field2_2|  Name2   |  ...     |
+---------+---------+----------+----------+
|  ...    |         |          |          |
+---------+---------+----------+----------+
|  ...    |         |          |          |
+---------+---------+----------+----------+


If they were Database tables, then we could have done something on the lines of:

SELECT * FROM WorkBook1,WorkBook2 WHERE Name1=Name2;

But, these are excel sheets, and worse yet, the names in WorkBook1 are in format:
"FirstName LastName"
and names in WorkBook2 are in format:
"LASTNAME, FIRSTNAME"

(i.e, uppercase, and with comma) Duh! and there will be many names with 3-4 words - imagine the permutations.

Excel experts could say this can be solved with some cool macros, or maybe VB scripts!, but I am old-school Unix text guy!. I think in terms of text filters only!

To solve the problem of name matching, Take this [hypothetical] name:

Shankaran Ganesh Krishnan

the permutations will be:

Shankaran Krishnan Ganesh
Krishnan Shankaran Ganesh
Krishnan Ganesh Shankaran
Shankaran Ganesh Krishnan
Ganesh Shankaran Krishnan
Ganesh Krishnan Shankaran

Some names can also contain the initials [with a period], like:

Ganesh K. Shankaran

So, how can we do the name matching ? ... for a moment I thought of using a permuter
and then saving all permutations (stupid!), but that's not required!

Lets say we do the following,
 - Remove dots and commas
 - Change to lowercase (trim spaces too)
 - Sort the names by words

If you had "Shankaran, Ganesh Krishnan" in WorkBook1, and "GANESH, SHANKARAN KRISHNAN" in WorkBook2, then both will become: "ganesh krishnan shankaran"

Now, the only problem that remains, is to save the .xls as .csv, so that I can load it to Perl (Parse::CSV). Unfortunately, Excel doesn't have an option to save all the sheets in the workbook
at once to CSVs, I had to do that manually for each sheet and then merge. Other than that, its pretty straight forward.

If you are about to say: Show me teh codez!
here you go ...
What good is coding skills, if you cannot put it to use, at the right time, to help friends!? :-)


No comments: