Sunday, April 27, 2014

Merge Excel sheets ?

No problem!

A friend of mine had a problem, she had to merge two [huge] Excel workbooks, by matching names in one, with the names in another. Something like:

WorkBook1
+---------+---------+----------+----------+
| Name1   |  field1 |  field2  |  ...     |
+---------+---------+----------+----------+
|  ...    |         |          |          |
+---------+---------+----------+----------+
|  ...    |         |          |          |
+---------+---------+----------+----------+

WorkBook2
+---------+---------+----------+----------+
| field2_1| field2_2|  Name2   |  ...     |
+---------+---------+----------+----------+
|  ...    |         |          |          |
+---------+---------+----------+----------+
|  ...    |         |          |          |
+---------+---------+----------+----------+


If they were Database tables, then we could have done something on the lines of:

SELECT * FROM WorkBook1,WorkBook2 WHERE Name1=Name2;

But, these are excel sheets, and worse yet, the names in WorkBook1 are in format:
"FirstName LastName"
and names in WorkBook2 are in format:
"LASTNAME, FIRSTNAME"

(i.e, uppercase, and with comma) Duh! and there will be many names with 3-4 words - imagine the permutations.

Excel experts could say this can be solved with some cool macros, or maybe VB scripts!, but I am old-school Unix text guy!. I think in terms of text filters only!

To solve the problem of name matching, Take this [hypothetical] name:

Shankaran Ganesh Krishnan

the permutations will be:

Shankaran Krishnan Ganesh
Krishnan Shankaran Ganesh
Krishnan Ganesh Shankaran
Shankaran Ganesh Krishnan
Ganesh Shankaran Krishnan
Ganesh Krishnan Shankaran

Some names can also contain the initials [with a period], like:

Ganesh K. Shankaran

So, how can we do the name matching ? ... for a moment I thought of using a permuter
and then saving all permutations (stupid!), but that's not required!

Lets say we do the following,
 - Remove dots and commas
 - Change to lowercase (trim spaces too)
 - Sort the names by words

If you had "Shankaran, Ganesh Krishnan" in WorkBook1, and "GANESH, SHANKARAN KRISHNAN" in WorkBook2, then both will become: "ganesh krishnan shankaran"

Now, the only problem that remains, is to save the .xls as .csv, so that I can load it to Perl (Parse::CSV). Unfortunately, Excel doesn't have an option to save all the sheets in the workbook
at once to CSVs, I had to do that manually for each sheet and then merge. Other than that, its pretty straight forward.

If you are about to say: Show me teh codez!
here you go ...
#!/usr/bin/perl
use Parse::CSV;
use strict;
my %wb2_table = ();
my $workbook2_data = Parse::CSV->new ( file => 'merged_workbook2.csv' );
# Note: the header lines are removed in the CSV files for simplicity
# Parse one workbook and load it to a hash
while (my $ref = $workbook2_data->fetch) {
my @line = @$ref;
my $name = $line[5]; # 5th field is the name here
$name =~ s/[.,]/ /g; # cleanup
next if $name =~ /^$/;
my @nsplits = split(/ /, lc($name)); # lowecase
my $cn = join(' ', sort @nsplits); # canonicalize (sort n join)
$cn =~ s/^\s+|\s+$//g; # trim
$wb2_table{$cn} = $ref;
}
# Now load the second workbook and match in the hash
my $workbook1_data = Parse::CSV->new( file => ($ARGV[0] or 'merged_workbook1.csv') );
while (my $ref = $workbook1_data->fetch ) {
my @line = @$ref;
my $name = $line[1]; # field 1 is the name here
$name =~ s/[.,]/ /g; # cleanup
my @nsplits = split(/ /, lc($name)); # lowercase it and split
my $cn = join (' ', sort @nsplits); # canonicalize (sort n join)
$cn =~ s/^\s+|\s+$//g; # trim
if (exists($wb2_table{$cn})) { # try matching in the hash
my $ar = $wb2_table{$cn};
my @line_wb2 = @$ar;
$" = '","'; # Ensure quoting is correct
print "\"@line\",\""; # the current CSV row
print "\"@line_wb2\",\""; # CSV row of the earlier workbook
print "\"\n";
}
}
view raw csv-merge.pl hosted with ❤ by GitHub
What good is coding skills, if you cannot put it to use, at the right time, to help friends!? :-)


No comments: