Regression tutorial

Topics related to the set of Machine Learning libraries and Matrix processing algorithms

Mon Jan 15, 2018 8:30 pm

Dear Awesome HPCC Team,

I was in the middle of writing a tutorial on how to use the ML library for modeling (regression analysis) when something turned up.
Now, I'm not even remotely an authority on either SAS or modeling/regression analysis, so I'm porting over a SAS(TM) tutorial instead.
Here's the link:
https://stats.idre.ucla.edu/sas/webbook ... egression/

My goal is 2-fold:
    - Show a concrete tutorial with multiple techniques (methods) to understand data and create a model
    - Compare HPCC ML vs. SAS (and show how to replicate what SAS does in ECL)

My problem is that I'm having a hard time replicating SAS results.
I do get the same results in simple regression between api00 and enroll (using elemapi2, the cleaned up data file) for example.
But the coefficient for acs_k3 (average class size in kindergarten through 3rd grade) is negative in SAS (-0.71), while I get a strongly positive one with ML (~16.7) when fitting the model "api00 = acs_k3 meals full".

My concern is this: if I let ML run a stepwise regression to find the best model, how do I know HPCC will find the best one (best in the sense that SAS would find it, not best in an absolute sense)?
I tried it using ML.StepRegression.ForwardRegression, and SAS and HPCC select two different sets of independent variables for api00.

Has anyone done a comparison between SAS and ML? This would help me understand the differences (I don't know much about SAS, and its documentation is pretty opaque too).
Is there any existing work out there similar to what I'm trying to do here?


Thanks!
lpezet
 
Posts: 51
Joined: Wed Sep 10, 2014 3:14 am

Tue Jan 16, 2018 1:15 pm

I have not compared the version in the ecl-ml repository to SAS. I would expect some differences in a stepwise procedure, because there are several possible criteria for selecting the best variable to add at each step, and there are several different approaches to stepwise selection itself. The attributes in ML/StepRegression use AIC to select the best variable.
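
To make the AIC criterion concrete, here is a minimal sketch of forward selection driven by AIC, in plain Python. This is not the ecl-ml implementation: the function names (ols_rss, aic, forward_aic) are made up for illustration, the AIC is the usual Gaussian form up to an additive constant, and the stopping rule (stop when no candidate lowers the AIC) is an assumption.

```python
import math

def ols_rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus the given columns."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix
    A = [[sum(X[i][r] * X[i][c] for i in range(n)) for c in range(p)]
         + [sum(X[i][r] * y[i] for i in range(n))] for r in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for k in range(p):
        pivot = max(range(k, p), key=lambda r: abs(A[r][k]))
        A[k], A[pivot] = A[pivot], A[k]
        for r in range(p):
            if r != k:
                f = A[r][k] / A[k][k]
                A[r] = [a - f * b for a, b in zip(A[r], A[k])]
    beta = [A[r][p] / A[r][r] for r in range(p)]
    return sum((y[i] - sum(b * x for b, x in zip(beta, X[i]))) ** 2 for i in range(n))

def aic(rss, n, k):
    # Gaussian AIC up to an additive constant; k counts the fitted coefficients
    return n * math.log(rss / n) + 2 * k

def forward_aic(candidates, y):
    """candidates: dict of name -> list of values. Returns names in the order they were added."""
    n = len(y)
    chosen = []
    best = aic(ols_rss([], y), n, 1)  # start from the intercept-only model
    while True:
        trials = [(aic(ols_rss([candidates[c] for c in chosen + [name]], y),
                       n, len(chosen) + 2), name)
                  for name in candidates if name not in chosen]
        if not trials:
            break
        score, name = min(trials)
        if score >= best:  # assumed stopping rule: no candidate improves the AIC
            break
        chosen.append(name)
        best = score
    return chosen
```

Calling forward_aic({"x1": [...], "x2": [...]}, y) returns the variable names in the order they entered the model. A tool that ranks candidates by a different criterion can add variables in a different order or stop at a different model on the same data.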

There are several analytic techniques that have been re-built (such as Multiple Linear Regression and Random Forests) as supported bundles. The bundle documentation describes which external implementations were used in a validation role and in a performance comparison role.

Assuming that the SAS implementation of Stepwise is using AIC, and your SAS model is using one of the three forms of step-wise that are provided in the ecl-ml repository, I would be interested in looking at the issue. Please create a Jira report describing what you are finding and please provide a link to the data.

Thanks.
john holt
Community Advisory Board Member
 
Posts: 22
Joined: Mon Jun 25, 2012 12:43 pm

Sat Feb 17, 2018 3:03 am

Hi John!

So I checked, and it looks like SAS calculates the F statistic for each independent variable at each step when using the forward method. So it's a different approach from the AIC-based selection in ML/StepRegression.
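
For reference, here is a small sketch of what an F-based entry step could look like, in plain Python. It is only an illustration of the partial F statistic, not SAS's code or anything from ecl-ml: the function names are hypothetical, and the residual sums of squares (RSS) are assumed to come from whatever regression routine is already in use. Turning F into a p-value needs the F distribution CDF, which is not in the Python standard library, so that part is left out.

```python
def partial_f(rss_reduced, rss_full, n, p_full, added=1):
    """F statistic for adding `added` terms to a model.

    rss_reduced: RSS of the current model
    rss_full:    RSS of the model with the extra term(s)
    n:           number of observations
    p_full:      number of coefficients in the larger model, intercept included
    """
    return ((rss_reduced - rss_full) / added) / (rss_full / (n - p_full))

def best_candidate_by_f(rss_current, rss_with, n, p_current):
    """Pick the candidate with the largest partial F.

    rss_with:  dict of candidate name -> RSS of (current model + that candidate)
    p_current: number of coefficients in the current model, intercept included
    """
    scores = {name: partial_f(rss_current, rss, n, p_current + 1)
              for name, rss in rss_with.items()}
    name = max(scores, key=scores.get)
    return name, scores[name]
```

A forward procedure of this kind adds the returned candidate only if its F statistic (or the corresponding p-value) clears an entry threshold, then repeats with the enlarged model.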

I found another example of stepwise regression and wrote it up here (it includes ECL code to reproduce it):
https://github.com/lpezet/hpcc_vs_sas/t ... Prediction

Now I'm inclined to look into ForwardRegression.ecl and implement the p-value way just for fun :)

Thanks.
lpezet
 

