Fri Aug 17, 2018 11:00 pm
Login Register Lost Password? Contact Us


Automatic ECL Record Generation from sprayed file

Comments or questions on structuring and organizing your data

Fri Aug 03, 2012 7:56 pm Change Time Zone

Is there a way to automatically generate a Record Layout from analyzing a sprayed file?
Using some fuzzy logic to get as close as possible? And then have a reviewer step through and make sure that all the automated data types are correct?
jamesb
 
Posts: 6
Joined: Fri Aug 03, 2012 7:51 pm

Sun Aug 05, 2012 2:05 am Change Time Zone

here is a perl script which is kind of a proof of concept of what I am asking about here.


Code: Select all
#!/usr/bin/perl

use strict;
use warnings;

use Data::Dumper;

open FILE, "<", "csv.txt" or die $!;

my @lines = <FILE>;


#could check for errors in the CVS file here. make sure that there are fields on the top
#of the file as well as that there are same amount of columns as there are row entry fields
#etc. more testing could be done before hand.

my @fields;
my %field_data;
my %args;
my @all_tokens;
my %possible_types;
foreach my $line (@lines) {
   
    unless($args{fields_flag}) {
        chomp $line;
        $line =~ s`"``ig; #remove any of the quotes around the fields.
        @fields = split(',',$line);
        $args{fields_flag} = 1; #set to 1 for true
        #print Dumper(@fields);
    }
    else {
        my @tokens = split(',',$line);
        chomp @tokens;
        for( my $index = 0; $index < @fields; ++$index) {
            $field_data{$fields[$index]} .= $tokens[$index].", ";
        }
    }
}

my @type_identifier_subs = (
    #sub {return $_[0] =~ m`\s*\d+\s*` ?  'arbitrary_length_number' : ''; }, #arbitray length number
    #example : 5.95
    sub {return $_[0] =~ m`\s*\d+\.\d+\s*` ?  'single_decimal_number' : ''; }, #decimal
    #example : 1985/01/21
    sub {return $_[0] =~ m`\s*(?:\")?\s*\d+\/\d+\/\d+\s*(?:\")?\s*` ?  'three_slash_delimited_number' : ''; }, #possible date
    #example : Douglas Adams
    sub {return $_[0] =~ m`\s*[A-Z][a-z]+\s+[A-Z][a-z]+\s*` ?  'formal_first_last_name' : ''; }, #formal first and last name
    #example : 0465026567
    sub {return $_[0] =~ m`\s*\d{10}\s*` ?  'ISBN_size_number' : ''; }, #ISBN
);

#idenfity which type of data each column could possibly be.
#build a list of possiblities with weighted amounts per type
#identified.
foreach my $field (keys %field_data) {
    my @data_tokens = split(',',$field_data{$field});
    foreach my $data (@data_tokens) {
        #print $data ."\n";
        foreach my $type_identifier (@type_identifier_subs) {
            #print $cleaner->($data) ."\n";
            my $type = $type_identifier->($data);
            $possible_types{$field}->{$type} += 1 if($type);
        }
    }
}

#find the best weighted type out of all the filtered types
#identified.
my %suggested_ecl_type;
foreach my $field (@fields) {
    my $max_weight = 0;
    my $best_type;
    foreach my $type ( sort keys %{$possible_types{$field}}) {
        if($max_weight < $possible_types{$field}{$type}){
            $max_weight = $possible_types{$field}{$type};
            $best_type = $type;
        }
    }
    $suggested_ecl_type{$field} = $best_type;
}
#print out the potential Record Layout:

my %ecl_types = (
    'ISBN_size_number' => 'INTEGER10',
    'formal_first_last_name' => 'STRING30',
    'three_slash_delimited_number' => 'STRING8',
    'single_decimal_number' => 'DECIMAL2_2'
);

my $file_name = "csv";
my $dataset_name = '~CLASS::BMF::AdvECL::';
print_layout();
sub print_layout {
    open FILE, ">", "eclcode.ecl" or die $!;
   
    print FILE "EXPORT " . $file_name . " := MODULE\n\n";
    print FILE "EXPORT Layout := RECORD\n";
    foreach my $field (@fields) {
        #make sure the fields are lower cased and that spaces are underscored.
        #or make sure words are in camelcase depending on the specs. We can pull this
        #information from the project's coding standards configureation file.
        print FILE "\t " . $ecl_types{$suggested_ecl_type{$field}} . " " . lc$field . "\n";
    }
   
    print FILE "END;\n\n";
    print FILE "EXPORT File := DATASET(" . $dataset_name . $file_name .",Layout,CSV)\n";
    close FILE;
}


print Dumper(\%possible_types) . "\n";

open FILE, "<", "eclcode.ecl" or die $!;

print $_ while(<FILE>);

close FILE;



Here is the CVS file needed by this script:
"REVIEW_DATE","AUTHOR","ISBN","DISCOUNTED_PRICE"
"1985/01/21","Douglas Adams",0345391802,5.95
"1990/01/12","Douglas Hofstadter",0465026567,9.95
"1998/07/15","Timothy ""The Parser"" Campbell",0968411304,18.99
"1999/12/03","Richard Friedman",0060630353,5.95
"2001/09/19","Karen Armstrong",0345384563,9.95
"2002/06/23","David Jones",0198504691,9.95
"2002/06/23","Julian Jaynes",0618057072,12.50
"2003/09/30","Scott Adams",0740721909,4.95
"2004/10/04","Benjamin Radcliff",0804818088,4.95
"2004/10/04","Randel Helms",0879755725,4.50



Here is the ECL code which is generated:

Code: Select all
EXPORT csv := MODULE

EXPORT Layout := RECORD
    STRING20 REVIEW_DATE
    STRING30 AUTHOR
    INTEGER10 ISBN
    DECIMAL2_2 DISCOUNTED_PRICE
END;

EXPORT File := DATASET(~CLASS::BMF::AdvECL::csv,Layout,CSV)
jamesb
 
Posts: 6
Joined: Fri Aug 03, 2012 7:51 pm

Mon Aug 06, 2012 7:27 pm Change Time Zone

James,

Good work! That kind of structure generation is reasonably straight-forward with CSV files, a bit more complicated with XML files, but a LOT more difficult with binary flat files (fixed or variable-length).

One thing I noticed is that your code generates an INTEGER10 data type for your ISBN field -- you should modify it to make that a DECIMAL10 instead, since INTEGER10 is not a legal data type (the '10' portion defines the number of bytes the field occupies, not the number of digits in the number, so the range of valid values is only 1 through 8).

My first blog posting (http://hpccsystems.com/blog/adventures-machine-learning-part-1) contains some ECL code to help define the field sizes of CSV data. Since CSV is an inherently text-based format, there's no reason not to simply make each field a STRING.

HTH,

Richard
rtaylor
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1370
Joined: Wed Oct 26, 2011 7:40 pm

Tue Aug 07, 2012 5:03 pm Change Time Zone

Thanks for the help Richard! I am curious if there has been any work on auto generating ECL code from files. Has anyone else worked on this or is there any future plans on developing out this feature?
jamesb
 
Posts: 6
Joined: Fri Aug 03, 2012 7:51 pm

Tue Aug 07, 2012 5:49 pm Change Time Zone

James,

Not that I am aware of. We've done some exploration of this issue in the past, which is why I know CSV files are not too hard to handle (especially if you have a first rec with column headings in it, which is not always the case) and XML files can be parsed to produce a reasonable starting-point RECORD structure, too.

But binary flat files (comprising a major portion of the data receive) are very difficult to do this with, because there's nothing there but the data itself to try to work with. Just deciding what a field might be is an interesting exercise -- for example, is a byte containing ASCII 65 an integer 65 or a capital "A"? And where are the field boundaries -- is that ASCII 0 byte a null terminator on a string, or one of the bytes of a 4-byte integer, or ... ?

Richard
rtaylor
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1370
Joined: Wed Oct 26, 2011 7:40 pm

Tue Aug 07, 2012 5:56 pm Change Time Zone

Ah that makes sense why there hasn't been much done for binary files. So if you were to give the 3 categories a percentage for how frequently each type is dealt with what would your spread be? That is how often do we see binary files, CSV, and XML files?
jamesb
 
Posts: 6
Joined: Fri Aug 03, 2012 7:51 pm

Tue Aug 07, 2012 7:40 pm Change Time Zone

Ah yes -- that would be a question about our actual products and the data they're built from. Sorry, but my area is the ECL language and HPCC platform (keeping to the "theoretical" side of things and away from "production" stuff) -- IOW, I don't know. :)

Richard
rtaylor
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 1370
Joined: Wed Oct 26, 2011 7:40 pm

Tue Aug 07, 2012 7:54 pm Change Time Zone

Ok no problem, the only reason I ask is to get a fuller view of how valuable in time saving this feature might be and to help weight whether it is worth investing time in developing out.

Can you give me contacts of people who have already worked on some of these ideas? I would love to hear why the feature was not completed out or chosen not to be implemented.

I would like to learn about some of the reasoning and decision making processes that go on while deciding on features like this one.

Thanks for all the replies Richard!
jamesb
 
Posts: 6
Joined: Fri Aug 03, 2012 7:51 pm


Return to Data Modeling

Who is online

Users browsing this forum: No registered users and 1 guest

cron