Transforming Data Pt 2
In my last post I talked about a problem I had after I upgraded to 64 bit Windows and needed to archive a text based report output by an old DOS program which was now hosted in a virtual machine. I ended the post by giving an example of the report, which I replicate here:
CattleCode Popular
Date: monday 9 january 1912 Time: 10:49
Attentions: Previous balance
Cow Grp Feed 1 Feed 2 Feed 3
no. no. tot. bal. prev. tot. bal. prev. tot. bal. prev.
kg kg kg kg kg kg kg kg kg
96 0 3.7 3.8 0.2 0.0 0.0 0.0 0.0 0.0 0.0
113 0 3.1 3.6 2.4 0.0 0.0 0.0 0.0 0.0 0.0
359 0 2.5 2.6 0.1 0.0 0.0 0.0 0.0 0.0 0.0
497 0 5.1 5.6 0.8 0.0 0.0 0.0 0.0 0.0 0.0
590 0 1.0 0.9 0.3 0.0 0.0 0.0 0.0 0.0 0.0
764 0 2.8 2.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0
858 0 6.6 4.4 0.1 0.0 0.0 0.0 0.0 0.0 0.0
5105 0 1.9 2.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0
5181 0 2.4 1.6 0.1 0.0 0.0 0.0 0.0 0.0 0.0
5234 0 4.6 5.1 1.5 0.0 0.0 0.0 0.0 0.0 0.0
5275 0 6.0 4.4 0.6 0.0 0.0 0.0 0.0 0.0 0.0
I decided that I wanted an easy way to compare previous runs of the file, ideally by providing a side by side comparison of the results from different days that the report had been run and also keep that data for as long as I wanted. It's the kind of problem that a spreadsheet such as Microsoft Excel can easily solve with the data being displayed in different columns in the grid which can be added to as needed.
So having decided to store the data in Excel format there was an obvious problem, the key data needed to be extracted from the report before it could be stored in the Excel file. I've highlighted the data that I'm interested in in the example above. The values in red identify the cow, while the values in blue show the total ration for the period, the ration quantity for the current period and what was left over in the previous period.
To get at that data I decided to use a technology called Regular Expressions. This is a technology that allows you to build text based expressions to parse text based data. Here is the expression that I used:
\n\s*(\d{1,4})\s+\d\s*([\d\.]{1,4})\s*([\d\.]{1,4})\s*([\d\.]{1,4})
it looks like utter nonsense but each part serves a purpose to identify the required data.
The first part \n\s* tells the regular expression parser to look for a new line with zero or more spaces. Which will be followed by 1 to 4 digits which is what this means (\d{1,4}), \d indicates we're looking for a number and {1,4} indicates the minimum and maximum number of digits to expect. The round brackets around the expression tell the parser that that qualifying text should be treated as a numbered group, given it's position it will be group 1.
The following \s+ indicates that one or more spaces will follow. The next part, \d\s* tells the parser to expect a number followed by more spaces. We then have 3 expressions which are essentially the same ([\d\.]{1,4})\s* what this means is that the parser should look for a digit or a fullstop for the next 1 to 4 characters, in other words we're looking for an upto 4 character decimal value. This is followed by zero or more spaces. Each of the decimal value expressions is contained within round brackets and in total give us another 3 numbered groups.
In practice this expression is passed into a parser along with the report text and it outputs a set of matches, each match having 4 groups which represent the 4 fields that we are interested in. The first group identifies the cow, the second provides the ration for the period, the third the current balance for the period and the fourth the left overs from the previous period.
With that information extracted the remaining problem is how to store it in Excel, I'll discuss that next time.
Posted by
Richard Isaac
on
31/01/2012
Transforming Data Pt 1
Just over a year ago I had a problem, I'd just bought a new PC and since it had 8Gb RAM it had to run 64bit Windows (see this post for an explanation on what that's all about). But 64 bit Windows doesn't support DOS based programs, now I'm happy for that to be the case as DOS is a very old technology and at some point I think that modern systems need to leave the past behind.
However, for the past 18 years I've been using a DOS program to monitor and update a computer that controls out of parlour feeders which feed our dairy herd. Unfortunately, despite several requests I've not been able to find a Windows version of the program and I've never found the time to try and reverse engineer the serial port based communications. So I'm stuck with this DOS program, but it was easy enough to overcome the lack of DOS support by running the program in a DOS virtual machine.
While this program has many reports, there's only ever been one which has been of any interest to me, this report tells me which cows didn't eat their full ration in the previous period and how much they had left over. Since I had an old printer plugged into my parallel printer port I was happy enough print the report on paper that I recycled for the purpose many time over (by putting in upside down or back to front I got about 4 passes out of each page, the report isn't usually that long). I was quite happy with that and could easily compare previous reports.
But at about the same time I ran out of clean space on the recycled paper I was using, so rather than use more paper I decided to go paper free and 'print' the report to a text file. The simplest way to achieve that was to have the VM direct parallel port output to a text file instead of my parallel printer port. Unfortunately, the VM reuses the same file everytime, offering to either append new data or overwrite the file so I had to come up with a way of archiving old reports.
That left me with a couple options, I could either archive each file after I'd 'printed' the report or, since it was a text file, transform the data into something else. I chose the latter as in the long run it was much less fuss and I could also engineer it so that I could have easy side by side comparisons.
In the next part I'll look at how the file could be parsed to get the key data out.
Here's an example of the file:
CattleCode Popular
Date: monday 9 january 1912 Time: 10:49
Attentions: Previous balance
Cow Grp Feed 1 Feed 2 Feed 3
no. no. tot. bal. prev. tot. bal. prev. tot. bal. prev.
kg kg kg kg kg kg kg kg kg
96 0 3.7 3.8 0.2 0.0 0.0 0.0 0.0 0.0 0.0
113 0 3.1 3.6 2.4 0.0 0.0 0.0 0.0 0.0 0.0
359 0 2.5 2.6 0.1 0.0 0.0 0.0 0.0 0.0 0.0
497 0 5.1 5.6 0.8 0.0 0.0 0.0 0.0 0.0 0.0
590 0 1.0 0.9 0.3 0.0 0.0 0.0 0.0 0.0 0.0
764 0 2.8 2.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0
858 0 6.6 4.4 0.1 0.0 0.0 0.0 0.0 0.0 0.0
5105 0 1.9 2.0 0.1 0.0 0.0 0.0 0.0 0.0 0.0
5181 0 2.4 1.6 0.1 0.0 0.0 0.0 0.0 0.0 0.0
5234 0 4.6 5.1 1.5 0.0 0.0 0.0 0.0 0.0 0.0
5275 0 6.0 4.4 0.6 0.0 0.0 0.0 0.0 0.0 0.0
Posted by
Richard Isaac
on
11/01/2012