

Random Seed Persistence

Topics related to the set of Machine Learning libraries and Matrix processing algorithms

Thu Oct 12, 2017 3:07 pm

While testing the Machine Learning functionality of HPCC, I ran into an issue one might term “The Persistence of Random Seeds.” The usual problem is how to initialize random seeds so that results are reproducible from run to run; I have the opposite problem: within a given WorkUnit, all of the random generation calls give me the same results.

For example, the following code is intended to create 3 standard normal random samples of 100 observations each (x1, x2, and x3). However, I find that all three of these results are in fact identical. (They do differ from run to run – just not among themselves within a given WorkUnit.)

IMPORT * FROM ML;
IMPORT * FROM ML.Cluster;
IMPORT * FROM ML.Types;

a1 := ML.Distribution.Normal(0.0,1,10000);

x1 := ML.Distribution.GenData(100,a1,1);
x2 := ML.Distribution.GenData(100,a1,1);
x3 := ML.Distribution.GenData(100,a1,1);

x1;
x2;
x3;

ave(x1,value);
ave(x2,value);
ave(x3,value);

The issue appears to be fundamental: if I instead select three different mean and standard deviation pairs, x1, x2, and x3 are precisely translations/rescalings of one another, i.e., they still apparently rely on a common underlying random number stream.

Any help is appreciated.
tlitherland
 
Posts: 1
Joined: Thu Oct 12, 2017 3:02 pm

Thu Oct 12, 2017 9:17 pm

The random seed was not reused; rather, only one dataset of random values was computed instead of the three datasets you wanted.

It is useful to recall that ECL statements are definitions, and that it is the action statements that cause the computation to occur.

If you examine the graph, you will see that the 3 "GenData" definitions have been recognized as the same definition. Since the x1, x2, and x3 definitions are the same, the x1, x2, and x3 actions (and the AVE actions) can all use the one common definition.

When I look at the graph generated on a 6.2.22 version platform, I see two sub-graphs. The first has activities 2-9 and corresponds to the Distribution definition. This sub-graph produces 2 values, a 10,000 record dataset and a single number, and these are used by the GenData definitions.

The second sub-graph has activities 11-30, and includes the 6 output actions.

If I change the definitions to be:
x1 := ML.Distribution.GenData(100,a1,1);
x2 := ML.Distribution.GenData(101,a1,1);
x3 := ML.Distribution.GenData(102,a1,1);
I get 4 sub-graphs. The first sub-graph is the same, and the second sub-graph has activities 11-26, which write the x1 dataset and the x1 dataset average. The 3rd and 4th sub-graphs are for x2 and x3 respectively.

Note that you can use CHOOSEN(...) to make the definitions produce three datasets with 100 records each:
x1 := ML.Distribution.GenData(100,a1,1);
x2 := CHOOSEN(ML.Distribution.GenData(101,a1,1), 100);
x3 := CHOOSEN(ML.Distribution.GenData(102,a1,1), 100);


Now the "GenData" definitions are all different, so you will get 3 different random sequences.
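
Putting the pieces together, a complete workunit might look like the sketch below. It reuses the definitions from the original post with the CHOOSEN change applied. The explicit OUTPUT actions and their NAMED labels are illustrative additions rather than part of the original code, and the sketch assumes the same ML library version as the original post.

```ecl
IMPORT * FROM ML;

// Distribution definition, unchanged from the original post.
a1 := ML.Distribution.Normal(0.0,1,10000);

// Each GenData call asks for a different record count, so the three
// definitions are no longer identical and are not merged into one.
// CHOOSEN trims x2 and x3 back to 100 records each.
x1 := ML.Distribution.GenData(100,a1,1);
x2 := CHOOSEN(ML.Distribution.GenData(101,a1,1), 100);
x3 := CHOOSEN(ML.Distribution.GenData(102,a1,1), 100);

// Actions: these are what cause the definitions above to be evaluated.
// The NAMED labels are hypothetical and only make the results easier to find.
OUTPUT(x1, NAMED('x1'));
OUTPUT(x2, NAMED('x2'));
OUTPUT(x3, NAMED('x3'));

OUTPUT(AVE(x1,value), NAMED('ave_x1'));
OUTPUT(AVE(x2,value), NAMED('ave_x2'));
OUTPUT(AVE(x3,value), NAMED('ave_x3'));
```

If the three averages still match, checking the workunit graph for separate GenData sub-graphs is a quick way to confirm whether the definitions were kept distinct.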

Best,
john holt
Community Advisory Board Member
 
Posts: 22
Joined: Mon Jun 25, 2012 12:43 pm

Thu May 17, 2018 10:51 am

Thank you for sharing this useful thread.
JamesHolmes
 
Posts: 1
Joined: Thu May 17, 2018 10:35 am

