Sunday, April 27, 2014

Merge Excel sheets ?

No problem!

A friend of mine had a problem, she had to merge two [huge] Excel workbooks, by matching names in one, with the names in another. Something like:

WorkBook1
+---------+---------+----------+----------+
| Name1   |  field1 |  field2  |  ...     |
+---------+---------+----------+----------+
|  ...    |         |          |          |
+---------+---------+----------+----------+
|  ...    |         |          |          |
+---------+---------+----------+----------+

WorkBook2
+---------+---------+----------+----------+
| field2_1| field2_2|  Name2   |  ...     |
+---------+---------+----------+----------+
|  ...    |         |          |          |
+---------+---------+----------+----------+
|  ...    |         |          |          |
+---------+---------+----------+----------+


If they were Database tables, then we could have done something on the lines of:

SELECT * FROM WorkBook1,WorkBook2 WHERE Name1=Name2;

But, these are excel sheets, and worse yet, the names in WorkBook1 are in format:
"FirstName LastName"
and names in WorkBook2 are in format:
"LASTNAME, FIRSTNAME"

(i.e, uppercase, and with comma) Duh! and there will be many names with 3-4 words - imagine the permutations.

Excel experts could say this can be solved with some cool macros, or maybe VB scripts!, but I am old-school Unix text guy!. I think in terms of text filters only!

To solve the problem of name matching, Take this [hypothetical] name:

Shankaran Ganesh Krishnan

the permutations will be:

Shankaran Krishnan Ganesh
Krishnan Shankaran Ganesh
Krishnan Ganesh Shankaran
Shankaran Ganesh Krishnan
Ganesh Shankaran Krishnan
Ganesh Krishnan Shankaran

Some names can also contain the initials [with a period], like:

Ganesh K. Shankaran

So, how can we do the name matching ? ... for a moment I thought of using a permuter
and then saving all permutations (stupid!), but that's not required!

Lets say we do the following,
 - Remove dots and commas
 - Change to lowercase (trim spaces too)
 - Sort the names by words

If you had "Shankaran, Ganesh Krishnan" in WorkBook1, and "GANESH, SHANKARAN KRISHNAN" in WorkBook2, then both will become: "ganesh krishnan shankaran"

Now, the only problem that remains, is to save the .xls as .csv, so that I can load it to Perl (Parse::CSV). Unfortunately, Excel doesn't have an option to save all the sheets in the workbook
at once to CSVs, I had to do that manually for each sheet and then merge. Other than that, its pretty straight forward.

If you are about to say: Show me teh codez!
here you go ...
What good is coding skills, if you cannot put it to use, at the right time, to help friends!? :-)


Wednesday, April 16, 2014

Recognize, and transform text


Many a times, I see some text, which is not in any known format. But, it looks vaguely familiar, or simple enough to transform. The reason I would want to transform, is of course to work with it, to load it into my scripting environment, to analyze, consume or apply some complex programmatic logic to it.

Here, I give some examples, and show the conversions. This could help in recognizing
raw text,  and transforming them to their closest [known] cousins.

Case 1

Lets start with something simple. If you see a file something like this:

Alice:
sal=20000
age=23 
role=engineer

Bob:
sal=21000
age=28           
role=engineer

and you want to load this into your preferred programming environment (like a
Python dict, Lua table, or a Perl hash) to work with it. As it stands, it is not in a format that is directly usable!, but if we can make a small change in the data, say change the line containing colon in the end, to [line].

[Alice]
sal=20000
age=23 
role=engineer
  
[Bob]   
sal=21000
age=28           
role=engineer

Now, this is a valid .ini file format (popular in the Windows world). And, there are libraries
for most languages to load and work with INI files!

What you need, is a little Perl, or sed regex to convert from the former to the latter!. And
dont think about Jamie's popular quote and be afraid (for such simple cases, regex is a
good fit, but make sure you really understand regexes to weild one when needed)

Case 2

If you have seen some router configs (like JUNOS config), or some BibTeX entries, then the following
will be faimilar:

interface {
    eth0 {
      ip4  10.1.1.2;
      bia  aa:11:22:11:00:11;
    }
}

Again, this may not be directly loadable into your environment, but see this again, doesnt it look
close to JSON ?, all you need to do is to ensure that the keys and values are quoted correctly

As JSON:

{
 "interface" : {
    "eth0" : {
      "ip4" : "10.1.1.2",
      "bia" : "aa:11:22:11:00:11"
    }
 }
}

Or, Lua table:

interface = {
    eth0 = {
      ip4 = '10.1.1.2',
      bia = 'aa:11:22:11:00:11'
    }
}

again, both of these can be achieved with minimal changes.

Case 3

This might look very similar to Case 1, but observe the nesting and a richer data set!

[Alice]
sal=20000
age=23
role=[current=engineer;previous=DevOps,TAC]

[Bob]  
sal=21000
age=28          
role=[current=engineer;previous=]

Now, converting this to .ini doesn't seem to fit!, can we convert it to something else? say, I do this:

Alice:
  sal: 20000
  age: 23
  role:
        current: engineer
        previous:
                 - DevOps
                 - TAC

Bob:
  sal: 21000
  age: 28
  role:
        current: engineer
        previous:

Aha, now this is valid YAML! YAML, like JSON, is also fat-free-XML! and you have libraries
for all languages to load and work with YAML.

Case 4

We all know CSV, if you have seen simple spread-sheet data (think MS Excel), that's valid CSV. Also, spread-sheet editors give you an option to save it as plain CSV.

But what if the data were like this:

 a:b:c:"d"
or
 a|b|"c"|d

isn't it simple to change the delimiter to comma (',')? so that, you can work with CSV libraries.
Bonus - if you have to send the data to a suit, just attach it and they can open in a spreadsheet-editor! you know, suits frown on plain text attachments! :-/

Note: the regex should be careful enough to handle quoting! (that applies to all cases listed above)

To summarize, you don't need complicated parser to load text into your favorite language, to analyze it, or to apply programmatic transformations to it. All you need, is to recognize the format, and check which is the closest known format to which you can convert it to, so that you can conveniently work with it. The following table might make it easier to remember:

       
Text Easily converted to
Delimited (line oriented) CSV
Grouped, and simple key-value INI
Indented, multi level, with lists YAML
Brace nested, and key-value JSON/Py-dict/Lua-table

Tuesday, April 08, 2014

FIGlets ?

I had used Unix banner many times, but I had never bothered to check, how other cool looking typefaces were generated. Most often, on starting up some open source server/daemon, you'd come across a banner like:

                 ____                                   
 _ __ ___  _   _|  _ \  __ _  ___ _ __ ___   ___  _ __  
| '_ ` _ \| | | | | | |/ _` |/ _ \ '_ ` _ \ / _ \| '_ \ 
| | | | | | |_| | |_| | (_| |  __/ | | | | | (_) | | | |
|_| |_| |_|\__, |____/ \__,_|\___|_| |_| |_|\___/|_| |_|
           |___/                                        

Though I was sure these were not manually typed on an editor!, I never probed much. For some reason, today I wanted to put one such banner for my daemon, at start-up. So, on some Google digging, I found the source - FIGlet fonts. 

But no need to install the figlet utility, instead, try this web app - TAAG (Text to ASCII Art Generator). And, if you are working on an application - add a FIGlet banner ;-)

--EDIT-- (30-apr)

After playing around and having fun with FIGlets, I learnt about TOIlet :) (now, wait, hold your imagination), its FIGlet + filters, and how colorful!.

Look at the project page, its much more than just colorful banners!