Sun Dec 08, 2019 10:25 am

## Matrix Multiply Embedded Python and GPU

Topics related to the set of Machine Learning libraries and Matrix processing algorithms
Does anyone know a better data structure for matrices being passed to embedded python where a GPU will multiply them? By better I mean the conversion to a GPU format will take the least amount of time.

In the following implementation, it takes the embedded python code approximately 21 seconds to convert to GPU format the 2 passed matrices that are 10,000 by 20,000 and 20,000 by 10,000. But, it only takes the GPU approximately 3 1/3 seconds to multiply the 2 matrices once the matrices are converted.

In the following implementation, the embedded python code computes the amount of time is takes to a) convert the 2 matrices to GPU format; b) do matrix multiplication, and c) convert the resultant matrix to a ecl dataset format.
The embedded python code puts the times in a file called TimeOfMatrixMultiply.txt. The directory where the file is stored is /var/lib/HPCCSystms/myeclagent.

(NOTE: All the ecl code I show below was ran on hthor.)

NOTE:I deployed my hpcc system on AWS using a g2.2xlarge instance type, which comes with a GPU. Also in region us-west-2, I used the AMI, ami-638c1e03, which comes with all the software you will need, except hpcc.

Code: Select all
`import python;matrec := RECORD SET OF REAL arow;END;DATASET(matrec) MatrixMultiply(DATASET(matrec) A, unsigned nrowsA, unsigned ncolsA,DATASET(matrec) B, unsigned nrowsB, unsigned ncolsB) := embed(Python)    import numpy as np    import re    import cudamat as cm    import gnumpy as gp    import sys    sys.stdout=open('TimeOfMatrixMultiply.txt','w')        def ECLDataset2NPArray(s,r,c):      zarray=np.empty([r,c],dtype=float)      i=0      for row in s:        zarray[i]=np.asarray(row)        i+=1      return zarray        from timeit import default_timer as timer    start1 = timer()    Aarray=ECLDataset2NPArray(A,nrowsA,ncolsA)    Barray=ECLDataset2NPArray(B,nrowsB,ncolsB)    cm.cublas_init()    cm.CUDAMatrix.init_random()    A_cm = gp.garray(Aarray)    B_cm = gp.garray(Barray)    ecl2cm_time=timer() - start1    print("Time to just convert 2 ecl matrices (%d X %d) and (%d X %d) to cm was %f seconds" % (nrowsA,ncolsA,nrowsB,ncolsB,ecl2cm_time))    time2=timer()    C_cm = A_cm.dot(B_cm)    mm_time=timer() - time2    print("Just GPU matrix multiple of 2 matrices (%d X %d) and (%d X %d)  took %f seconds" % (nrowsA,ncolsA,nrowsB,ncolsB,mm_time))    time3=timer()    cm.cublas_shutdown()    Carray=gp.as_numpy_array(C_cm)    ecl_Carray=Carray.tolist()    cm2ecl_time=timer() - time3    print("Time to just convert 2 cm matrices (%d X %d) and (%d X %d) to ecl was %f seconds" % (nrowsA,ncolsA,nrowsB,ncolsB,cm2ecl_time))    return ecl_Carrayendembed;//The A and B matrices (datasets) were created by "make2MatricesAndStoreOnDisk_b.ecl".A:=DATASET('~hthor::tlh::AMatrix',matrec,THOR);B:=DATASET('~hthor::tlh::BMatrix',matrec,THOR);NRowsA:=COUNT(A);NColsA:=COUNT(A.arow);NRowsB:=COUNT(B);NColsB:=COUNT(B.arow);MM:=MatrixMultiply(A,NRowsA,NColsA,B,NRowsB,NColsB);OUTPUT(MM,,'tlh::AMatrixAndBMatrixProduct',OVERWRITE)`

The following embedded python will get the timings.
Code: Select all
`import python;timings := RECORD  STRING line;END;DATASET(timings) readTimings() := EMBED(python)  timings = open('/var/lib/HPCCSystems/myeclagent/TimeOfMatrixMultiply.txt').readlines()  return timingsENDEMBED;OUTPUT(readTimings());`

The following ecl code makes the 2 matrices that are multiplied.
Code: Select all
`matrec := RECORD SET OF REAL arow;END;MakeMatrix(UNSIGNED NRows, UNSIGNED NCols) := FUNCTION    REAL RealRandom() := FUNCTION  r:= RANDOM()/4294967295;  return r;  END;  rec0 := RECORD    REAL cell;  END;     rec1 := RECORD    DATASET(rec0) arow;  END;  rec1 genMatrix(rec1 r, UNSIGNED row_num, UNSIGNED ncols) := TRANSFORM   SELF.arow:=NORMALIZE(DATASET([{0}],rec0), ncols, TRANSFORM(rec0,SELF.cell:=RealRandom()));  END;  rec1DS:=NORMALIZE(DATASET([{DATASET([{0}],rec0)}],rec1),NRows,genMatrix(LEFT,COUNTER,NCols));  matrec Dataset2Set(rec1 L) := TRANSFORM    SELF.arow := SET(L.arow,cell);  END;    RETURN PROJECT(rec1DS,Dataset2Set(LEFT));END;NRows1:= 10000;NCols1:= 20000;A:=MakeMatrix(NRows1, NCols1);OUTPUT(A,,'tlh::AMatrix',OVERWRITE);NRows2:= NCols1;NCols2:= NRows1;B:=MakeMatrix(NRows2, NCols2);OUTPUT(B,,'tlh::BMatrix',OVERWRITE);`
tlhumphrey2

Posts: 258
Joined: Mon May 07, 2012 6:23 pm

Tim,

I ran your MakeMatrix() code (changing to 10 rows and 20 columns) and saw that each row ends up with exactly the same data. Is that intentional?

HTH,

Richard
rtaylor Posts: 1488
Joined: Wed Oct 26, 2011 7:40 pm

Richard,

No, numbers in different rows should not be the same. I'm surprised they are.

Could the problem be the NORMALIZE I have in the transform, genMatrix, of a 2nd NORMALIZE? Plus, maybe I'm thinking procedurally instead of declaratively???

Tim
tlhumphrey2

Posts: 258
Joined: Mon May 07, 2012 6:23 pm

Tim,

I'm not entirely sure why your code was getting the same row repeated each time, but here's my version of your code that DOES generate a different row for each record.
Code: Select all
`MakeMatrix(UNSIGNED NRows, UNSIGNED NCols) := FUNCTION  rec0 := {REAL cell};  matrec := {SET OF REAL arow};  REAL RealRandom :=  RANDOM()/4294967295;  RealDS := DATASET(NRows*NCols,TRANSFORM(rec0,SELF.cell := RealRandom));  MatRec DS2Set(INTEGER C) := TRANSFORM    EndRec := C*NCols;     BegRec := (EndRec - NCols) + 1;     SELF.arow := SET(RealDS[BegRec..EndRec],cell);  END;  RETURN DATASET(NRows,DS2Set(COUNTER));END;NRows1:= 10;NCols1:= 20;NRows2:= NCols1;NCols2:= NRows1;A:=MakeMatrix(NRows1, NCols1);OUTPUT(A,,'tlh::AMatrix',OVERWRITE);B:=MakeMatrix(NRows2, NCols2);OUTPUT(B,,'tlh::BMatrix',OVERWRITE);`
Note that I'm just generating all the cells in one pass, then just splitting them into their separate recs.

HTH,

Richard
rtaylor Posts: 1488
Joined: Wed Oct 26, 2011 7:40 pm

Richard,

I like your code. It does everything I wanted to do inside MakeMatrix. And, it is much shorter than my code.

I did notice one odd behavior of the code -- matrices A and B have the same values. Again, I don't understand this behavior. I entered a jira asking why (HPCC-19025).

Tim
tlhumphrey2

Posts: 258
Joined: Mon May 07, 2012 6:23 pm