Adventures in Machine Learning — Part 1
How does one get started writing a Blog?
Flavio suggested to me recently, “You should write a blog about ECL to give the community an additional resource to learn more about it.” So I said, “OK, I know quite a bit about ECL, so what specifically are you suggesting I write about?” And he replied, “Machine Learning.”
Well, writing about Machine Learning would be great, if I actually knew anything about it, but I don’t. So here’s what I propose to do: post a series of articles chronicling my “adventures” in learning to do Machine Learning in ECL.
First off, I needed a resource to learn from, so I bought my first book on Machine Learning:
“Machine Learning for Hackers” by Drew Conway and John Myles White (O’Reilly)
Copyright 2012 Drew Conway and John Myles White
ISBN: 1449303714
This excellent (and highly recommended) book presents the topic from a programmer’s perspective that I can understand — explaining what Machine Learning is used for and why, by way of a ton of example code that shows you exactly how to get it all done. My only “problem” is that all the code examples in the book are in a language called R that I am completely unfamiliar with.
The authors said that R was developed by statisticians for statisticians, and I am definitely NOT a statistician, so my plan is to translate their R example code into ECL and run it against their example data to see if I can duplicate their results. Then, in each article, I’ll discuss how the ECL version accomplishes the same task as the authors’ R code, except that it runs on our massively parallel processing platform (HPCC) instead of the single Windows box their R code runs on when I try it out for myself.
The ultimate purpose of this blog is to teach ECL in a practical manner to everybody, so I will reference the R example code by chapter and page number but will not be quoting it from the book. To anyone already familiar with R, these articles should help you to learn ECL — I suggest you buy a copy of the book to see exactly the R code I’m translating. To everybody else — I suggest you buy a copy of the book to read the discussion of the Machine Learning techniques the code demonstrates. Either way, it is well worth the investment.
Chapter One
This chapter revolves mostly around introducing R to anyone who is unfamiliar with it, so there are no specific Machine Learning techniques used. However, it does provide an interesting introduction to the R-to-ECL translation process. The authors have structured the book so that their introduction of R also introduces a number of standard database techniques (such as data standardization, cleansing, and exploration), which is one of the major reasons I find their approach so intuitive: my programming background brings me to Machine Learning from a database application development direction.
The example data used in this chapter is UFO sighting data that the authors acquired from the Infochimps.com website. I’m using the file that came with the R code download that I got from the OREILLY.COM website. It is a CSV-type file using the tab character (“\t”) as the field delimiter.
Spraying the Data
The first bit of example code on page 14 simply reads the data file and displays the records. In ECL, before you can use a file you must first spray it onto your cluster and then define it. I sprayed the UFO data file to my cluster using the Spray CSV page in ECL Watch, specifying \t as the Separator and giving it the logical name ‘~ml::ufodata’, then defined the file like this:
EXPORT File_UFO := MODULE
  EXPORT Layout := RECORD
    STRING DateOccurred;
    STRING DateReported;
    STRING Location;
    STRING ShortDescription;
    STRING Duration;
    STRING LongDescription{MAXLENGTH(25000)};
  END;
  EXPORT File := DATASET('~ml::ufodata',Layout,CSV(SEPARATOR('\t')));
END;
I put this code in my “ML_Blog” directory and named the code file “File_UFO” (the file name of the .ECL file storing the code must always be the same as the name of the EXPORT definition contained in that file, in this case “File_UFO”) so I can easily reference it later. I used the MODULE structure to organize the RECORD structure and the DATASET definition in the same code file.
Each field is defined as a variable-length STRING to begin with, because we don’t yet know how much data might be in any given field. Before spraying, I opened the file in a text editor to look at the data. I scrolled through the file to see how long the longest record was, which is how I determined the MAXLENGTH to apply to the final field (I could see that the rest of the fields were fairly short). I need the MAXLENGTH explicitly defined to override the 4K default that would otherwise apply. For the field names, I simply duplicated what the authors used in the second bit of example code at the bottom of page 14.
Next, I departed a bit from the script the authors proposed and did some standard ECL-style data exploration to determine exactly what size each field should be. I opened a new builder window and ran this code:
IMPORT ML_Blog;
ML_Blog.File_UFO.File;
COUNT(ML_Blog.File_UFO.File); //61393
MAX(ML_Blog.File_UFO.File,LENGTH(TRIM(DateOccurred))); //8
MAX(ML_Blog.File_UFO.File,LENGTH(TRIM(DateReported))); //8
COUNT(ML_Blog.File_UFO.File( LENGTH(DateOccurred) <> 8 OR
                             LENGTH(DateReported) <> 8)); //254
OUTPUT(ML_Blog.File_UFO.File( LENGTH(DateOccurred) <> 8 OR
                              LENGTH(DateReported) <> 8),ALL);
MAX(ML_Blog.File_UFO.File,LENGTH(TRIM(Location))); //70
MAX(ML_Blog.File_UFO.File,LENGTH(TRIM(ShortDescription))); //954
MAX(ML_Blog.File_UFO.File,LENGTH(TRIM(Duration))); //31
MAX(ML_Blog.File_UFO.File,LENGTH(TRIM(LongDescription))); //24679
The IMPORT ML_Blog makes all my EXPORTed definitions in the ML_Blog directory available for use. The next line shows me the first 100 records in the file, then the COUNT tells me the total number of records (61,393 in the file I downloaded).
I wrote the two MAX functions followed by COUNT and OUTPUT because the authors alerted me to a possible data corruption error, based on the error message they got when they first tried to format the DateOccurred field into a standard R date field (page 15). The MAX functions both returned 8, so the file I downloaded must have been updated since the authors worked with it. However, the COUNT function tells me there are 254 records that do not contain a standard YYYYMMDD date string, so the OUTPUT shows me those records.
The ALL option on my OUTPUT allows me to see all the records (not just the first 100), so I can see that the data corruption is the presence of “0000” as the DateOccurred in 254 records. Since the number of “bad” records I found does not match the number the authors cited, I can only assume that the file download I got is a newer/better file than the authors worked with.
The rest of the MAX functions tell me how much data is actually contained in each field. Using the information gained, I can now edit the file definition to this:
EXPORT File_UFO := MODULE
  EXPORT Layout := RECORD
    STRING8 DateOccurred;
    STRING8 DateReported;
    STRING70 Location;
    STRING ShortDescription{MAXLENGTH(1000)};
    STRING31 Duration;
    STRING LongDescription{MAXLENGTH(25000)};
  END;
  EXPORT File := DATASET('~ml::ufodata',Layout,CSV(SEPARATOR('\t')))(DateOccurred <> '0000');
END;
I left the two description fields as variable-length because that will allow the system to make the most efficient use of storage. There is no problem with mixing fixed and variable-length fields in any of the file formats (flat-file, CSV, or XML). I also added a filter to the DATASET definition to eliminate the records with bad dates.
Date Conversion and Filtering Bad Records
This next section deals with data cleansing and standardization. These are the kind of standard data operations that always need to be done in any/every data shop. This is one of the first steps taken in any operational database to ensure that you’re not working with “garbage” data, and that the format of your data is the same in each record.
Since the authors are simply filtering out the corrupt records (data cleansing), I can do the same.
To re-format the date strings into the ISO basic format (%Y%m%d) that the authors use on page 15 (data standardization), I need to create a new recordset with the fields converted to an explicit date field format. For that, I need to use ECL’s Standard Date Library functions, like this:
IMPORT $, STD;
ds := $.File_UFO.File;
Layout := RECORD
  STD.Date.Date_t DateOccurred := STD.Date.FromString(ds.DateOccurred,'%Y%m%d');
  STD.Date.Date_t DateReported := STD.Date.FromString(ds.DateReported,'%Y%m%d');
  ds.Location;
  ds.ShortDescription;
  ds.Duration;
  ds.LongDescription;
END;
EXPORT CleanedUFO := TABLE(ds,Layout);
This code is stored in a file named “CleanedUFO” (again, the file name of the .ECL file storing the code must always be the same as the name of the EXPORT definition contained in that file).
The re-definition of “$.File_UFO.File” as “ds” is done simply to make the rest of the code a little easier to read. The STD.Date.Date_t data type used here is just a re-definition of UNSIGNED4, indicating that the contents of the binary field will be an integer representing a YYYYMMDD date. The STD.Date.FromString() function converts a YYYYMMDD date string to that UNSIGNED4 binary value, halving the storage requirement, so that a “20120101” date string becomes the integer value 20,120,101.
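To see the conversion in isolation, a quick builder-window check might look like this (just a sketch; the literal date is only an illustrative value):
IMPORT STD;
d := STD.Date.FromString('20120101','%Y%m%d'); // same call used in CleanedUFO
OUTPUT(d);                                     // 20120101, held in an UNSIGNED4
OUTPUT(STD.Date.Year(d));                      // 2012 -- the Standard Library can work with the value directly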
I ran a quick test of the CleanedUFO code, but did nothing more with it because I can pretty easily combine this step with the next two and get all the “work” done in one job.
Parsing the Location Field
The authors want to split the data in the Location field into separate City and State fields (data standardization again). There are a number of ways in ECL to accomplish that, but I chose to write a simple FUNCTION structure to handle that job and add it into the CleanedUFO code from above, like this:
IMPORT $, STD;
ds := $.File_UFO.File;
SplitLocation(STRING Loc) := FUNCTION
  LocLen := LENGTH(Loc)+1;
  CommaPos := LocLen - STD.Str.Find(STD.Str.Reverse(Loc),',',1);
  RetLoc := MODULE
    EXPORT STRING70 City := IF(CommaPos=LocLen,Loc,Loc[1..CommaPos-1]);
    EXPORT STRING10 State := IF(CommaPos=LocLen,'',TRIM(Loc[CommaPos+1..],LEFT));
  END;
  RETURN RetLoc;
END;
Layout := RECORD
  STD.Date.Date_t DateOccurred := STD.Date.FromString(ds.DateOccurred,'%Y%m%d');
  STD.Date.Date_t DateReported := STD.Date.FromString(ds.DateReported,'%Y%m%d');
  ds.Location;
  ds.ShortDescription;
  ds.Duration;
  ds.LongDescription;
  STRING70 City := SplitLocation(ds.Location).City;
  STRING10 State := SplitLocation(ds.Location).State;
END;
EXPORT CleanedUFO := TABLE(ds,Layout);
In looking at the data, I noticed that the State value always comes last and is delimited from the City value by a comma. I also noticed that some Location field values had multiple commas, so I wrote my SplitLocation() function to specifically find the last comma and split the text at that point. I could have used the STD.Str.SplitWords() function from the ECL Standard Library to accomplish this part, but I decided that writing my own FUNCTION would provide a better teaching example for this Blog.
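For comparison, here is roughly what the STD.Str.SplitWords() route might look like (just a sketch using a made-up Location value, not the code I actually used). It splits on every comma, so the last element is easy to grab as the State, but a City that itself contains a comma would then have to be re-assembled:
IMPORT STD;
parts := STD.Str.SplitWords('Dallas, Fort Worth, TX',','); // one element per comma
OUTPUT(COUNT(parts));                                      // 3
OUTPUT(TRIM(parts[COUNT(parts)],LEFT));                    // 'TX' -- the last element is the State
// the City ('Dallas, Fort Worth') is now spread across the earlier elements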
To find the last comma, my SplitLocation() function determines the length of the passed string and adds 1 to that. I’m adding 1 because I need to get the inverse value to determine the actual position of the last comma in the string. I’m reversing the string text using STD.Str.Reverse(), then using the STD.Str.Find() function to find the position of the first comma in the reversed string. Subtract that position from the length + 1 and voila: there’s the actual position of the last comma.
The next “trick” I’m using here is making the FUNCTION RETURN a MODULE structure, which allows it to return multiple values. Usually, functions return only a single value, but this “trick” makes it possible to have as many return values as you need (in this case, two: City and State). Then, within the MODULE structure, I’m using the IF() function to determine the actual return values. In both cases, my IF condition is CommaPos=LocLen, which, if true, indicates that no comma was found in the string. If no comma was found, the City field gets the entire input string back and the State returns blank.
I’m then using my SplitLocation() function to populate two additional fields in my result recordset, just as the authors have done in their R code.
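To see the “trick” on its own, here is a minimal standalone test you can paste into a builder window (the SplitLocation code is copied from CleanedUFO above; the location strings are just made-up examples):
IMPORT STD;
SplitLocation(STRING Loc) := FUNCTION
  LocLen := LENGTH(Loc)+1;
  CommaPos := LocLen - STD.Str.Find(STD.Str.Reverse(Loc),',',1);
  RetLoc := MODULE
    EXPORT STRING70 City := IF(CommaPos=LocLen,Loc,Loc[1..CommaPos-1]);
    EXPORT STRING10 State := IF(CommaPos=LocLen,'',TRIM(Loc[CommaPos+1..],LEFT));
  END;
  RETURN RetLoc;
END;
OUTPUT(SplitLocation('Dallas, Fort Worth, TX').City);  // 'Dallas, Fort Worth' -- only the last comma splits
OUTPUT(SplitLocation('Dallas, Fort Worth, TX').State); // 'TX'
OUTPUT(SplitLocation('Somewhere in Iowa').City);       // no comma, so the whole string comes back as City
OUTPUT(SplitLocation('Somewhere in Iowa').State);      // blank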
Limiting Data to US States
The last bit of cleaning and standardization is in the example code on page 18, where the authors define the set of valid US states and then filter the records so that only those with valid State field values are included in the final result. Here’s how I did that, once again simply expanding on the CleanedUFO TABLE definition, like this:
IMPORT $, STD;
ds := $.File_UFO.File;
SplitLocation(STRING Loc) := FUNCTION
  LocLen := LENGTH(Loc)+1;
  CommaPos := LocLen - STD.Str.Find(STD.Str.Reverse(Loc),',',1);
  RetLoc := MODULE
    EXPORT STRING70 City := IF(CommaPos=LocLen,Loc,Loc[1..CommaPos-1]);
    EXPORT STRING10 State := IF(CommaPos=LocLen,'',TRIM(Loc[CommaPos+1..],LEFT));
  END;
  RETURN RetLoc;
END;
Layout := RECORD
  STD.Date.Date_t DateOccurred := STD.Date.FromString(ds.DateOccurred,'%Y%m%d');
  STD.Date.Date_t DateReported := STD.Date.FromString(ds.DateReported,'%Y%m%d');
  ds.Location;
  ds.ShortDescription;
  ds.Duration;
  ds.LongDescription;
  STRING70 City := SplitLocation(ds.Location).City;
  STRING10 State := SplitLocation(ds.Location).State;
END;
USstates := ['AK','AL','AR','AZ','CA','CO','CT','DE','FL','GA','HI','IA','ID','IL',
             'IN','KS','KY','LA','MA','MD','ME','MI','MN','MO','MS','MT','NC','ND',
             'NE','NH','NJ','NM','NV','NY','OH','OK','OR','PA','RI','SC','SD','TN',
             'TX','UT','VA','VT','WA','WI','WV','WY'];
EXPORT CleanedUFO := TABLE(ds,Layout)(State IN USstates, DateOccurred <> 0) : PERSIST('PERSIST::CleanedUFOdata');
I just added the USstates SET definition, then appended a filter to the end of the TABLE function to always limit the CleanedUFO recordset to the valid records. The PERSIST Workflow Service on the TABLE definition ensures that the work is done only once; subsequent jobs that use CleanedUFO simply read the stored result instead of rebuilding it.
So to see the result, I ran this code in a separate builder window:
IMPORT ML_Blog;
ML_Blog.CleanedUFO;
The result looks very similar to the authors’ result, shown on page 18.
Final Thoughts
That’s enough for this article. We’ll continue with the rest of this chapter in the next post. That’s when we’ll put this data to use, by exploring what we have, doing some data analysis, and creating visual representations of the results.