Wednesday, April 16, 2014

Recognize, and transform text


Many a times, I see some text, which is not in any known format. But, it looks vaguely familiar, or simple enough to transform. The reason I would want to transform, is of course to work with it, to load it into my scripting environment, to analyze, consume or apply some complex programmatic logic to it.

Here, I give some examples, and show the conversions. This could help in recognizing
raw text,  and transforming them to their closest [known] cousins.

Case 1

Lets start with something simple. If you see a file something like this:

Alice:
sal=20000
age=23 
role=engineer

Bob:
sal=21000
age=28           
role=engineer

and you want to load this into your preferred programming environment (like a
Python dict, Lua table, or a Perl hash) to work with it. As it stands, it is not in a format that is directly usable!, but if we can make a small change in the data, say change the line containing colon in the end, to [line].

[Alice]
sal=20000
age=23 
role=engineer
  
[Bob]   
sal=21000
age=28           
role=engineer

Now, this is a valid .ini file format (popular in the Windows world). And, there are libraries
for most languages to load and work with INI files!

What you need, is a little Perl, or sed regex to convert from the former to the latter!. And
dont think about Jamie's popular quote and be afraid (for such simple cases, regex is a
good fit, but make sure you really understand regexes to weild one when needed)

Case 2

If you have seen some router configs (like JUNOS config), or some BibTeX entries, then the following
will be faimilar:

interface {
    eth0 {
      ip4  10.1.1.2;
      bia  aa:11:22:11:00:11;
    }
}

Again, this may not be directly loadable into your environment, but see this again, doesnt it look
close to JSON ?, all you need to do is to ensure that the keys and values are quoted correctly

As JSON:

{
 "interface" : {
    "eth0" : {
      "ip4" : "10.1.1.2",
      "bia" : "aa:11:22:11:00:11"
    }
 }
}

Or, Lua table:

interface = {
    eth0 = {
      ip4 = '10.1.1.2',
      bia = 'aa:11:22:11:00:11'
    }
}

again, both of these can be achieved with minimal changes.

Case 3

This might look very similar to Case 1, but observe the nesting and a richer data set!

[Alice]
sal=20000
age=23
role=[current=engineer;previous=DevOps,TAC]

[Bob]  
sal=21000
age=28          
role=[current=engineer;previous=]

Now, converting this to .ini doesn't seem to fit!, can we convert it to something else? say, I do this:

Alice:
  sal: 20000
  age: 23
  role:
        current: engineer
        previous:
                 - DevOps
                 - TAC

Bob:
  sal: 21000
  age: 28
  role:
        current: engineer
        previous:

Aha, now this is valid YAML! YAML, like JSON, is also fat-free-XML! and you have libraries
for all languages to load and work with YAML.

Case 4

We all know CSV, if you have seen simple spread-sheet data (think MS Excel), that's valid CSV. Also, spread-sheet editors give you an option to save it as plain CSV.

But what if the data were like this:

 a:b:c:"d"
or
 a|b|"c"|d

isn't it simple to change the delimiter to comma (',')? so that, you can work with CSV libraries.
Bonus - if you have to send the data to a suit, just attach it and they can open in a spreadsheet-editor! you know, suits frown on plain text attachments! :-/

Note: the regex should be careful enough to handle quoting! (that applies to all cases listed above)

To summarize, you don't need complicated parser to load text into your favorite language, to analyze it, or to apply programmatic transformations to it. All you need, is to recognize the format, and check which is the closest known format to which you can convert it to, so that you can conveniently work with it. The following table might make it easier to remember:

       
Text Easily converted to
Delimited (line oriented) CSV
Grouped, and simple key-value INI
Indented, multi level, with lists YAML
Brace nested, and key-value JSON/Py-dict/Lua-table

No comments: