No problem!
A friend of mine had a problem, she had to merge two [huge] Excel workbooks, by matching names in one, with the names in another. Something like:
WorkBook1
+---------+---------+----------+----------+
| Name1 | field1 | field2 | ... |
+---------+---------+----------+----------+
| ... | | | |
+---------+---------+----------+----------+
| ... | | | |
+---------+---------+----------+----------+
WorkBook2
+---------+---------+----------+----------+
| field2_1| field2_2| Name2 | ... |
+---------+---------+----------+----------+
| ... | | | |
+---------+---------+----------+----------+
| ... | | | |
+---------+---------+----------+----------+
If they were Database tables, then we could have done something on the lines of:
SELECT * FROM WorkBook1,WorkBook2 WHERE Name1=Name2;
But, these are excel sheets, and worse yet, the names in WorkBook1 are in format:
"FirstName LastName"
and names in WorkBook2 are in format:
"LASTNAME, FIRSTNAME"
(i.e, uppercase, and with comma) Duh! and there will be many names with 3-4 words - imagine the permutations.
Excel experts could say this can be solved with some cool macros, or maybe VB scripts!, but I am old-school Unix text guy!. I think in terms of text filters only!
To solve the problem of name matching, Take this [hypothetical] name:
Shankaran Ganesh Krishnan
the permutations will be:
Shankaran Krishnan Ganesh
Krishnan Shankaran Ganesh
Krishnan Ganesh Shankaran
Shankaran Ganesh Krishnan
Ganesh Shankaran Krishnan
Ganesh Krishnan Shankaran
Some names can also contain the initials [with a period], like:
Ganesh K. Shankaran
So, how can we do the name matching ? ... for a moment I thought of using a permuter
and then saving all permutations (stupid!), but that's not required!
Lets say we do the following,
- Remove dots and commas
- Change to lowercase (trim spaces too)
- Sort the names by words
If you had "Shankaran, Ganesh Krishnan" in WorkBook1, and "GANESH, SHANKARAN KRISHNAN" in WorkBook2, then both will become: "ganesh krishnan shankaran"
Now, the only problem that remains, is to save the .xls as .csv, so that I can load it to Perl (Parse::CSV). Unfortunately, Excel doesn't have an option to save all the sheets in the workbook
at once to CSVs, I had to do that manually for each sheet and then merge. Other than that, its pretty straight forward.
If you are about to say: Show me teh codez!
here you go ...
What good is coding skills, if you cannot put it to use, at the right time, to help friends!? :-)
A friend of mine had a problem, she had to merge two [huge] Excel workbooks, by matching names in one, with the names in another. Something like:
WorkBook1
+---------+---------+----------+----------+
| Name1 | field1 | field2 | ... |
+---------+---------+----------+----------+
| ... | | | |
+---------+---------+----------+----------+
| ... | | | |
+---------+---------+----------+----------+
WorkBook2
+---------+---------+----------+----------+
| field2_1| field2_2| Name2 | ... |
+---------+---------+----------+----------+
| ... | | | |
+---------+---------+----------+----------+
| ... | | | |
+---------+---------+----------+----------+
If they were Database tables, then we could have done something on the lines of:
SELECT * FROM WorkBook1,WorkBook2 WHERE Name1=Name2;
But, these are excel sheets, and worse yet, the names in WorkBook1 are in format:
"FirstName LastName"
and names in WorkBook2 are in format:
"LASTNAME, FIRSTNAME"
(i.e, uppercase, and with comma) Duh! and there will be many names with 3-4 words - imagine the permutations.
Excel experts could say this can be solved with some cool macros, or maybe VB scripts!, but I am old-school Unix text guy!. I think in terms of text filters only!
Shankaran Ganesh Krishnan
the permutations will be:
Shankaran Krishnan Ganesh
Krishnan Shankaran Ganesh
Krishnan Ganesh Shankaran
Shankaran Ganesh Krishnan
Ganesh Shankaran Krishnan
Ganesh Krishnan Shankaran
Some names can also contain the initials [with a period], like:
Ganesh K. Shankaran
So, how can we do the name matching ? ... for a moment I thought of using a permuter
and then saving all permutations (stupid!), but that's not required!
Lets say we do the following,
- Remove dots and commas
- Change to lowercase (trim spaces too)
- Sort the names by words
If you had "Shankaran, Ganesh Krishnan" in WorkBook1, and "GANESH, SHANKARAN KRISHNAN" in WorkBook2, then both will become: "ganesh krishnan shankaran"
Now, the only problem that remains, is to save the .xls as .csv, so that I can load it to Perl (Parse::CSV). Unfortunately, Excel doesn't have an option to save all the sheets in the workbook
at once to CSVs, I had to do that manually for each sheet and then merge. Other than that, its pretty straight forward.
If you are about to say: Show me teh codez!
here you go ...
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
#!/usr/bin/perl | |
use Parse::CSV; | |
use strict; | |
my %wb2_table = (); | |
my $workbook2_data = Parse::CSV->new ( file => 'merged_workbook2.csv' ); | |
# Note: the header lines are removed in the CSV files for simplicity | |
# Parse one workbook and load it to a hash | |
while (my $ref = $workbook2_data->fetch) { | |
my @line = @$ref; | |
my $name = $line[5]; # 5th field is the name here | |
$name =~ s/[.,]/ /g; # cleanup | |
next if $name =~ /^$/; | |
my @nsplits = split(/ /, lc($name)); # lowecase | |
my $cn = join(' ', sort @nsplits); # canonicalize (sort n join) | |
$cn =~ s/^\s+|\s+$//g; # trim | |
$wb2_table{$cn} = $ref; | |
} | |
# Now load the second workbook and match in the hash | |
my $workbook1_data = Parse::CSV->new( file => ($ARGV[0] or 'merged_workbook1.csv') ); | |
while (my $ref = $workbook1_data->fetch ) { | |
my @line = @$ref; | |
my $name = $line[1]; # field 1 is the name here | |
$name =~ s/[.,]/ /g; # cleanup | |
my @nsplits = split(/ /, lc($name)); # lowercase it and split | |
my $cn = join (' ', sort @nsplits); # canonicalize (sort n join) | |
$cn =~ s/^\s+|\s+$//g; # trim | |
if (exists($wb2_table{$cn})) { # try matching in the hash | |
my $ar = $wb2_table{$cn}; | |
my @line_wb2 = @$ar; | |
$" = '","'; # Ensure quoting is correct | |
print "\"@line\",\""; # the current CSV row | |
print "\"@line_wb2\",\""; # CSV row of the earlier workbook | |
print "\"\n"; | |
} | |
} |