

Iterating Data Analysis

Topics specific to using ECL from a Data Analyst standpoint

Wed Jun 06, 2012 6:02 pm

Greetings, I am relatively new to this board, so I am not sure I am posting in the correct forum. However, I have a large data set for a population, and I need to compute the hazard rate over a fixed amount of time and output the survival rate as a function of time, so I can plot it and compare it to other populations.

Now I am very new to programming in ECL so all advice is welcome, but the algorithm I need to use boils down to:

Hazard_Rate(t) = COUNT($.pop(time = t)) / COUNT($.pop(time >= t))

I need to be able to assign a range 1:60 to t and output the results to the user.

Thanks in advance for the help.
pyrannis
 
Posts: 6
Joined: Wed Jun 06, 2012 5:50 pm

Wed Jun 06, 2012 7:17 pm

Essentially, what you need to do is use NORMALIZE, which gives you the ability to iterate a fixed number of times. Here's an inline example:

baseRec := RECORD
   INTEGER1 Time;
   INTEGER  Casualty;
END;

BaseFile := DATASET([{1,10},{2,30},{3,50},{4,55}],baseRec);
         
OutRec    := RECORD
   INTEGER1 Time;
   REAL4    Hazard_Rate;
END;                        

OutRec DoHazards(BaseFile Le, INTEGER t) := TRANSFORM
   SELF.Time := Le.Time;
   SELF.Hazard_Rate := COUNT(BaseFile(Time = t)) / COUNT(BaseFile(Time >= t));
END;

//I need to be able to assign a range 1:60
//to t and output the results to the user.

normout := NORMALIZE(BaseFile,60,DoHazards(LEFT,COUNTER));
normout;


Regards,

Bob
bforeman
Community Advisory Board Member
 
Posts: 975
Joined: Wed Jun 29, 2011 7:13 pm

Wed Jun 06, 2012 8:31 pm

Thanks for the quick reply bforeman,

However, I believe your code performs the hazard calculation on each record 60 times. I need to perform that calculation once at each time step across every record in my data set, since all I have is a list of patient data and the time that each patient visited the doctor.

If I were to write this in C++ it would look something like

double Hazard_Function[61];   // 1-based indexing; slot 0 is unused

for (int i = 1; i <= 60; ++i)
{
    Hazard_Function[i] = Patients_This_Month(i) / Patients_Waiting(i);
}

Patients_This_Month(i): a function that goes through my list of patient data and counts the number of patients that went to the doctor in that month.

Patients_Waiting(i): a function that goes through my list of patient data and counts the patients that are going to the doctor's office this month plus all patients that still have not gone to the doctor's office.
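The loop above is language-agnostic, so here is a small Python sketch of the same math. Note the survival curve as the running product of (1 - hazard) is my reading of the standard discrete-time definition, since the original post asks for survival as a function of time but does not spell out the formula:

```python
def hazard_and_survival(visit_months, horizon=60):
    """Discrete hazard h(t) = (# events at t) / (# at risk at t),
    survival S(t) = product over i <= t of (1 - h(i))."""
    hazard, survival = {}, {}
    s = 1.0
    for t in range(1, horizon + 1):
        events = sum(1 for m in visit_months if m == t)    # Patients_This_Month(t)
        at_risk = sum(1 for m in visit_months if m >= t)   # Patients_Waiting(t)
        h = events / at_risk if at_risk else 0.0
        s *= (1.0 - h)
        hazard[t], survival[t] = h, s
    return hazard, survival
```

This mirrors the C++ loop one-for-one; the `at_risk` guard just avoids dividing by zero once every patient has been seen.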
pyrannis
 
Posts: 6
Joined: Wed Jun 06, 2012 5:50 pm

Wed Jun 06, 2012 8:56 pm

OK, so if you want to iterate exactly 60 times, then your code should be something like this:
baseRec := RECORD
   INTEGER1 Month;
   INTEGER  Casualty;
END;

BaseFile := DATASET([{1,10},{2,30},{3,50},{4,55}],baseRec);

OutRec    := RECORD
   INTEGER1 Time;
   REAL4    Hazard_Rate;
END;                       
BlankDS := DATASET([{0,0}],OutRec);         

OutRec DoHazards(INTEGER t) := TRANSFORM
  SELF.Time := t;
  SELF.Hazard_Rate := COUNT(BaseFile(Month = t))/COUNT(BaseFile(Month >= t));
END;

normout := NORMALIZE(BlankDS,60,DoHazards(COUNTER));
normout;

Note the use of the single-record "BlankDS" as the first parameter to NORMALIZE. This allows you to set the exact number of iterations.

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1369
Joined: Wed Oct 26, 2011 7:40 pm

Thu Jun 07, 2012 3:55 pm

Thank you Richard,

That is exactly what I need to get started and running. However, if you or anyone else has a second, I could use some help understanding the difference between the two sections of code I was given, namely the NORMALIZE calls: I do not really understand why the first one calls the code 60 times for each record while Richard's computes each step off the full record set.

Also, this is an abrupt change of topic, but is there a good tutorial anywhere on how to use the LEFT and RIGHT pointers? Something tells me I do not really want to create a new data set every time I need to compute something, because the next version of this algorithm will probably take a couple of iterations, and without using LEFT and RIGHT that could take up a lot of space.

Thanks in advance.
pyrannis
 
Posts: 6
Joined: Wed Jun 06, 2012 5:50 pm

Thu Jun 07, 2012 9:04 pm

pyrannis,
That is exactly what I need to get started and running. However, if you or anyone else has a second, I could use some help understanding the difference between the two sections of code I was given, namely the NORMALIZE calls: I do not really understand why the first one calls the code 60 times for each record while Richard's computes each step off the full record set.
The first parameter to Bob's NORMALIZE was his BaseData dataset, which contained several records, while the first parameter to my NORMALIZE was the BlankDS dataset, which contained exactly one record. NORMALIZE always calls the TRANSFORM function the number of times specified in its second parameter (in this case, 60) for each record in the dataset specified as its first parameter. That's why my code called the TRANSFORM exactly 60 times.
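To make the call-count difference concrete, here is a tiny Python model of NORMALIZE's iteration behavior (a hypothetical helper for illustration, not ECL itself):

```python
def normalize(dataset, count, transform):
    # Simplified model of ECL's NORMALIZE: the transform runs
    # `count` times (COUNTER = 1..count) for EVERY record in the input.
    return [transform(rec, c) for rec in dataset for c in range(1, count + 1)]

base_file = [(1, 10), (2, 30), (3, 50), (4, 55)]   # 4 records, like BaseFile
blank_ds  = [(0, 0)]                               # 1 record, like BlankDS

calls_bob     = len(normalize(base_file, 60, lambda rec, c: c))  # 4 * 60 = 240
calls_richard = len(normalize(blank_ds, 60, lambda rec, c: c))   # 1 * 60 = 60
```

So driving NORMALIZE from a single-record dataset is exactly what pins the transform to 60 invocations total.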

Also, this is an abrupt change of topic, but is there a good tutorial anywhere on how to use the LEFT and RIGHT pointers? Something tells me I do not really want to create a new data set every time I need to compute something, because the next version of this algorithm will probably take a couple of iterations, and without using LEFT and RIGHT that could take up a lot of space.
First off, there is no such thing in ECL as a "pointer" -- LEFT and RIGHT are simply "disambiguators" that are used in circumstances where you are operating on a pair of records (or datasets) and need to qualify which record the specific field is from. For example:

d := DEDUP(ds,LEFT.Field1 = RIGHT.Field2);
This code defines a deduped recordset, where "duplicates" are any records where the Field1 value in the first record matches the Field2 value in the second. The first rec (the LEFT) will be compared to the second (the RIGHT), and if the values match, the second rec will be thrown away...

Secondly, when you say, "I do not really want to create a new data set every time I need to compute something ... that could take up a lot of space" you are showing me that you are thinking about ECL procedurally (a bad mistake to make). ECL is a DECLARATIVE, NON-PROCEDURAL language. That means that all your ECL code ever does is define what you want, not how the job gets done -- therefore you are never writing "executable code" when you write ECL. A definition is just that -- a definition. The executable code that actually does the work is generated for you by the compiler. These are fundamental concepts that we drill into students when they come to our ECL training classes (highly recommended).

OK, given that clarification :) , if you would just fully describe the problem you're trying to solve, then we can make some suggestions as to the best "ECLish" approach to take.

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1369
Joined: Wed Oct 26, 2011 7:40 pm

Fri Jun 08, 2012 12:50 pm

So, the full problem I am trying to solve goes as follows:
*Disclaimer: I will be as up front about all the steps as I can be, but I have not figured out the fine details for some of these steps yet.

I have a large collection of patient data that contains information about their cancer and the eventual outcome over a five year period. The goal is to take the patient data and extract information about survival and incident rates.

The prototype version I have is in R and it takes the following steps:
1) Group the patients with the characteristics you want to compare
2) Compute the hazard rate and the survival rate over the 5 year period

From here on out I am still figuring out what the prototype is doing so it will get a little vague.
3) Compute the distance between the different clusters
4) Hierarchically cluster the various groups to see which groups are similar and which groups are different

Right now I am looking solely at ECL, but at the end of this project there will be a front-end user interface where the user can determine which statistics they wish to compare, so I need to keep in mind that the total number of clusters being compared is not fixed every time this code is called.

Thanks in advance,

Stetson
pyrannis
 
Posts: 6
Joined: Wed Jun 06, 2012 5:50 pm

Fri Jun 08, 2012 1:39 pm

Stetson,
The prototype version I have is in R and it takes the following steps:
1) Group the patients with the characteristics you want to compare
2) Compute the hazard rate and the survival rate over the 5 year period
I have a book on Machine Learning I'm reading now that uses R for all its example code, so I understand that R is a language specifically created for doing statistical analysis.

OK, to me your step 1 simply means filtering the patient records to the set that you want to work with. Step 2 is what the NORMALIZE code previously posted can do.
From here on out I am still figuring out what the prototype is doing so it will get a little vague.
3) Compute the distance between the different clusters
4) Hierarchically cluster the various groups to see which groups are similar and which groups are different
Now here you're starting to get into what R is all about -- Machine Learning. I'm afraid that I'm a neophyte in that area (that's why I'm reading the book), so my best suggestion is to take a look at our Machine Learning resources, starting here http://hpccsystems.com/ml

Perhaps someone with more Machine Learning experience in ECL than I can chime in at this point and teach us both. :)

HTH,

Richard
rtaylor
Community Advisory Board Member
 
Posts: 1369
Joined: Wed Oct 26, 2011 7:40 pm

Fri Jun 08, 2012 3:43 pm

I am still not 100% clear on the spec. However - the ECL way is to work bottom up from the data. So - firstly - suppose I want to know how many people turned up to the doctor in a given month...

Visits := TABLE(patient,{month_id,Cnt := COUNT(GROUP)},month_id,FEW);

Now - supposing you want to know when a patient's first & last visits are:

Firsts := TABLE(patient,{patient_id,Frst := MIN(GROUP,month_id),Lst := MAX(GROUP,month_id)},patient_id);

Now - suppose you want to annotate the patient data with whether or not this was the first or last visit:

PatientPlusRec := RECORD
   RECORDOF(patient);
   BOOLEAN IsFirst;
   BOOLEAN IsLast;
END;

PatientPlusRec TakeExtrema(patient le, Firsts ri) := TRANSFORM
   SELF.IsFirst := le.month_id = ri.Frst;
   SELF.IsLast  := le.month_id = ri.Lst;
   SELF := le;
END;

Patient_Plus := JOIN(patient,Firsts,LEFT.patient_id=RIGHT.patient_id,TakeExtrema(LEFT,RIGHT));

I believe that following this methodology you will get the information you want in a format that is usable (and it will run with full parallelism).
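For readers who want to trace the three definitions above procedurally (visit counts per month, first/last visit per patient, annotated records), here is a Python sketch of the same bottom-up steps; the sample data is made up for illustration:

```python
from collections import Counter

patients = [  # (patient_id, month_id) visit records -- hypothetical sample
    (1, 2), (1, 5), (2, 3), (2, 3), (3, 1), (3, 6),
]

# Visits := TABLE(patient, {month_id, COUNT(GROUP)}, month_id)
visits = Counter(month for _, month in patients)

# Firsts := min/max month_id per patient
firsts = {}
for pid, month in patients:
    lo, hi = firsts.get(pid, (month, month))
    firsts[pid] = (min(lo, month), max(hi, month))

# Patient_Plus := each visit annotated with is_first / is_last flags
patient_plus = [
    (pid, month, month == firsts[pid][0], month == firsts[pid][1])
    for pid, month in patients
]
```

The ECL version expresses the same three steps declaratively, which is what lets the platform run them in parallel across the cluster.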

DAB
dabayliss
Community Advisory Board Member
Community Advisory Board Member
 
Posts: 109
Joined: Fri Apr 29, 2011 1:35 pm

Fri Jun 08, 2012 5:16 pm

Thanks for the response, dabayliss,

I will attempt to clarify the spec for you, because while the code you have posted is useful, it is not what I need to do.

To start at the beginning, I have a set of patient data that concerns a type of cancer. This data contains no personal or identifying information, but it does contain relevant medical data like the size of the tumor, how well-defined the tumor was, what stage the cancer was in, and so on and so forth.

To get what I need out of this there are several steps.
1) I need to group the patients by the information I wish to compare, whether it be the size of the tumor, the stage of cancer the patient was in, some other piece of information that is in the file, or some combination of information, so I can look at refined data sets for step 2.

2) I need to compute the incidence rates and overall survival rate for each group I create in step 1. For more information I recommend looking up hazard ratio and survival rate on Wikipedia, because those are the algorithms I need to implement for this step and I do not wish to make this post overly long.

3) I need to compute the distance between the clusters so I can cluster them hierarchically, so people can look at how closely related two groups are and then look at how their survival rates and hazard ratios differ over time.

4) This is more a problem for the future after I get steps 1-3 working, but I need to figure out how to do this for an arbitrary number of unique patient groups.
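One simple choice for step 3 (an assumption on my part -- the prototype's actual metric is not specified here) is to treat each group's survival curve as a vector sampled at the same months and use Euclidean distance between curves as the inter-group distance fed into hierarchical clustering:

```python
import math

def curve_distance(s1, s2):
    # Euclidean distance between two survival curves sampled at the
    # same set of months; one candidate inter-group distance for a
    # subsequent hierarchical clustering step.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(s1, s2)))
```

Other metrics (e.g. a log-rank statistic) may be closer to what the R prototype does.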

I hope this clarified things a little bit and if you have further questions please feel free to ask.

Stetson
pyrannis
 
Posts: 6
Joined: Wed Jun 06, 2012 5:50 pm
