Search

Tuesday, 31 January 2012

Transforming Data Pt 2

In my last post I talked about a problem I had after I upgraded to 64 bit Windows and needed to archive a text based report output by an old DOS program which was now hosted in a virtual machine. I ended the post by giving an example of the report, which I replicate here:

CattleCode Popular
Date: monday 9 january 1912 Time: 10:49
Attentions: Previous balance

Cow Grp Feed 1 Feed 2 Feed 3
no. no. tot. bal. prev. tot. bal. prev. tot. bal. prev.
kg kg kg kg kg kg kg kg kg
96 0 3.7 3.8 0.2 0.0 0.0 0.0 0.0 0.0 0.0
113 0 3.1 3.6 2.4 0.0 0.0 0.0 0.0 0.0 0.0
359 0 2.5 2.6 0.1 0.0 0.0 0.0 0.0 0.0 0.0
497 0 5.1 5.6 0.8 0.0 0.0 0.0 0.0 0.0 0.0
590 0 1.0 0.9 0.3 0.0 0.0 0.0 0.0 0.0 0.0
764 0 2.8 2.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0
858 0 6.6 4.4 0.1 0.0 0.0 0.0 0.0 0.0 0.0
5105 0 1.9 2.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0
5181 0 2.4 1.6 0.1 0.0 0.0 0.0 0.0 0.0 0.0
5234 0 4.6 5.1 1.5 0.0 0.0 0.0 0.0 0.0 0.0
5275 0 6.0 4.4 0.6 0.0 0.0 0.0 0.0 0.0 0.0

I decided that I wanted an easy way to compare previous runs of the file, ideally by providing a side by side comparison of the results from different days that the report had been run and also keep that data for as long as I wanted. It's the kind of problem that a spreadsheet such as Microsoft Excel can easily solve with the data being displayed in different columns in the grid which can be added to as needed.

So having decided to store the data in Excel format there was an obvious problem, the key data needed to be extracted from the report before it could be stored in the Excel file. I've highlighted the data that I'm interested in in the example above. The values in red identify the cow, while the values in blue show the total ration for the period, the ration quantity for the current period and what was left over in the previous period.

To get at that data I decided to use a technology called Regular Expressions. This is a technology that allows you to build text based expressions to parse text based data. Here is the expression that I used:
\n\s*(\d{1,4})\s+\d\s*([\d\.]{1,4})\s*([\d\.]{1,4})\s*([\d\.]{1,4})
it looks like utter nonsense but each part serves a purpose to identify the required data.

The first part \n\s* tells the regular expression parser to look for a new line with zero or more spaces. Which will be followed by 1 to 4 digits which is what this means (\d{1,4}), \d indicates we're looking for a number and {1,4} indicates the minimum and maximum number of digits to expect. The round brackets around the expression tell the parser that that qualifying text should be treated as a numbered group, given it's position it will be group 1.

The following \s+ indicates that one or more spaces will follow. The next part, \d\s* tells the parser to expect a number followed by more spaces. We then have 3 expressions which are essentially the same ([\d\.]{1,4})\s* what this means is that the parser should look for a digit or a fullstop for the next 1 to 4 characters, in other words we're looking for an upto 4 character decimal value. This is followed by zero or more spaces. Each of the decimal value expressions is contained within round brackets and in total give us another 3 numbered groups.

In practice this expression is passed into a parser along with the report text and it outputs a set of matches, each match having 4 groups which represent the 4 fields that we are interested in. The first group identifies the cow, the second provides the ration for the period, the third the current balance for the period and the fourth the left overs from the previous period.

With that information extracted the remaining problem is how to store it in Excel, I'll discuss that next time.

Comments
To leave a comment please login
Register