Sun Apr 22, 2018 7:13 am
Login Register Lost Password? Contact Us


simple data profiling code

Questions around writing code and queries

Tue Jul 19, 2011 6:34 pm Change Time Zone

I'm running into issues (likely user error) with a simple data profiling task.

The first file I'm working with is a 26 field tab delimited file. I sprayed it using CSV (\t delimiter) and the output looks correct for the 100 record sample.

I wanted to find the min/max field lengths so I used a PROJECT/TRANSFORM. The code looks approximately like this:
Code: Select all
m_Name := 'somepath::somefile';
m_Format := RECORD
   string f1;
   string f2;
   string fn;
END;

m_Dataset := DATASET(m_Name, m_Format,   CSV(HEADING(1))   );

m_FormatLength GetLengths(m_Format L) := TRANSFORM
   SELF.f1 := LENGTH(L.f1);
   SELF.f2 := LENGTH(L.f2);
   SELF.fn := LENGTH(L.fn);
END;

m_Lengths := PROJECT(m_Dataset,GetLengths(LEFT));

output(m_Dataset);
output(m_Lengths);

This also looked correct for the 100 record sample.

For the next step I want to produce 1 row listing the max length for fields f1 through fn. I tried another PROJECT/TRANSFORM but that listed every row instead of just 1. Next I tried a RECORD/TABLE and here is the interesting part; it seems that forcing the iteration over the entire file causes "Result 1" (m_Dataset) to cram everything into field 1. Before I go off the deep end investigating the file format, is this a common user-error with a common fix?

Thanks in advance!
aintnomyth
 
Posts: 86
Joined: Wed Jul 13, 2011 7:40 pm

Tue Jul 19, 2011 6:57 pm Change Time Zone

PROJECT and TABLE (in its normal form) generate an output row for each input row.

A maximum for a single field could be calculated using

output(max(m_Lengths, f1));

To calculate them all at the same time you want to use the aggregating form of TABLE.

m_maxLengths := TABLE(m_Lengths, { maxf1 := MAX(group, f1), maxf2 := MAX(group, f2), maxfn := MAX(group, fn) });

output(m_maxLengths);
ghalliday
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 178
Joined: Wed May 18, 2011 9:48 am

Tue Jul 19, 2011 8:18 pm Change Time Zone

Awesome, thanks for the help and the quick reply.
aintnomyth
 
Posts: 86
Joined: Wed Jul 13, 2011 7:40 pm

Wed Jul 20, 2011 8:44 am Change Time Zone

For your information, the efficiency of

output(m_Lengths); or output(m_maxLengths);

when there is no output(m_Dataset)

is improved (when coming from a csv file) in the next (3.1) release of the platform.
ghalliday
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 178
Joined: Wed May 18, 2011 9:48 am


Return to Programming

Who is online

Users browsing this forum: No registered users and 1 guest