## Regression tutorial

Dear Awesome HPCC Team,

I was in the middle of writing a tutorial on how to use the ML library for modeling (regression analysis) when something turned up.

Now I'm not even remotely close to being an authority on either SAS or modeling/regression analysis, so I'm porting over a SAS(TM) tutorial instead.

Here's the link:

https://stats.idre.ucla.edu/sas/webbook ... egression/

My goal is 2-fold:

- Show a concrete tutorial with multiple techniques (methods) to understand data and create a model

- Compare HPCC ML vs. SAS (and show how to replicate what SAS does in ECL)

My problem is that I'm having a hard time replicating SAS results.

I do get the same results in simple regression between api00 and enroll (using elemapi2, the cleaned up data file) for example.

But the coefficient for acs_k3 (average class size in kindergarten through 3rd grade) is negative in SAS (-0.71), yet I end up with a strongly positive one in ML (~16.7) when running the model "api00 = acs_k3 meals full".

My concern here is that, if I let ML run a stepwise regression to find the best model, how do I know HPCC will find the best one (meaning the one SAS would find, for example, not the best in an absolute sense)?

I tried it using `ML.StepRegression.ForwardRegression`.

Has anyone done some comparison between SAS and ML? This would help me understand the differences (I don't know much about SAS and documentation is pretty opaque too).

Is there any work similar to what I'm trying to do here already out there?

Thanks!

- lpezet
**Posts:** 56 **Joined:** Wed Sep 10, 2014 3:14 am

I have not compared the version in the ecl-ml repository to SAS. I would expect some differences with something like a step-wise procedure because there are several choices for the criteria to use for selecting the best variable to add in a step and there are several different approaches to stepwise. The attributes in ML/StepRegression use AIC to select the best variable.
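To make the AIC-based selection concrete, here is a minimal sketch of forward selection in plain Python (not ECL). It is not the code in ML/StepRegression: the function names, the toy data, and the Gaussian form of AIC used here, `n * ln(RSS / n) + 2k`, are all assumptions for illustration, and the ecl-ml attributes may compute AIC differently.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus the given columns."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2 for row, yi in zip(X, y))

def aic(columns, y):
    """Gaussian AIC up to an additive constant (assumed form, see lead-in)."""
    n = len(y)
    k = len(columns) + 1  # slopes plus intercept
    return n * math.log(rss(columns, y) / n) + 2 * k

def forward_aic(data, y):
    """Add, one at a time, the variable that lowers AIC most; stop when none does."""
    chosen = []
    best = aic([], y)
    while len(chosen) < len(data):
        scores = {name: aic([data[c] for c in chosen + [name]], y)
                  for name in data if name not in chosen}
        name = min(scores, key=scores.get)
        if scores[name] >= best:
            break
        chosen.append(name)
        best = scores[name]
    return chosen

# Toy data (made up): y is roughly 2 * x1 plus small noise, x2 is unrelated.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
noise = [0.5, -0.5, 0.3, -0.3, 0.2, -0.2, 0.4, -0.4]
toy = {"x1": x1, "x2": [3, 1, 4, 1, 5, 9, 2, 6]}
y = [2 * a + e for a, e in zip(x1, noise)]
print(forward_aic(toy, y))
```

Because the stopping rule and the ranking both depend on the criterion, a tool that ranks candidates by something other than AIC can legitimately end up with a different final model on the same data.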

There are several analytic techniques that have been re-built (such as Multiple Linear Regression and Random Forests) as supported bundles. The bundle documentation describes which external implementations were used in a validation role and in a performance comparison role.

Assuming that the SAS implementation of Stepwise is using AIC, and your SAS model is using one of the three forms of step-wise that are provided in the ecl-ml repository, I would be interested in looking at the issue. Please create a Jira report describing what you are finding and please provide a link to the data.

Thanks.

- john holt
- Community Advisory Board Member
**Posts:** 22 **Joined:** Mon Jun 25, 2012 12:43 pm

Hi John!

So I checked, and it looks like SAS calculates the F statistic for each independent variable at each step when using the forward method. So it's a different approach.
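For comparison, here is a rough sketch of what an F-statistic-driven forward step could look like, in plain Python rather than ECL. It is not SAS's implementation and not ForwardRegression.ecl: the function names, the toy data, and the fixed `f_enter` cutoff are assumptions. A p-value version would compare the F distribution's tail probability against a significance level instead of using a raw F cutoff.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def rss(columns, y):
    """Residual sum of squares of an OLS fit of y on an intercept plus the given columns."""
    X = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(X[0])
    xtx = [[sum(row[j] * row[k] for row in X) for k in range(p)] for j in range(p)]
    xty = [sum(row[j] * yi for row, yi in zip(X, y)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2 for row, yi in zip(X, y))

def partial_f(current, candidate, y):
    """F statistic for adding one candidate column to the current model."""
    rss_old = rss(current, y)
    rss_new = rss(current + [candidate], y)
    df_resid = len(y) - (len(current) + 2)  # n minus (slopes + new slope + intercept)
    return (rss_old - rss_new) / (rss_new / df_resid)

def forward_f(data, y, f_enter=4.0):
    """Add the candidate with the largest partial F until none exceeds f_enter.

    f_enter is a hypothetical cutoff chosen for illustration only.
    """
    chosen = []
    while len(chosen) < len(data):
        current = [data[c] for c in chosen]
        scores = {name: partial_f(current, data[name], y)
                  for name in data if name not in chosen}
        name = max(scores, key=scores.get)
        if scores[name] < f_enter:
            break
        chosen.append(name)
    return chosen

# Toy data (made up): y is roughly 2 * x1 plus small noise, x2 is unrelated.
x1 = [1, 2, 3, 4, 5, 6, 7, 8]
noise = [0.5, -0.5, 0.3, -0.3, 0.2, -0.2, 0.4, -0.4]
toy = {"x1": x1, "x2": [3, 1, 4, 1, 5, 9, 2, 6]}
y = [2 * a + e for a, e in zip(x1, noise)]
print(forward_f(toy, y))
```

The entry threshold plays the role that "AIC must decrease" plays in an AIC-driven search, so the two procedures can stop at different models even when they agree on the first variable.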

I found another example of stepwise regression and reported it here (it also contains ECL code to reproduce):

https://github.com/lpezet/hpcc_vs_sas/t ... Prediction

Now I'm inclined to look into ForwardRegression.ecl and implement the p-value approach just for fun.

Thanks.

- lpezet
**Posts:** 56 **Joined:** Wed Sep 10, 2014 3:14 am
