Tue Oct 15, 2019 3:46 am

Matrix Multiply Embedded Python and GPU

Topics related to the set of Machine Learning libraries and Matrix processing algorithms
Does anyone know a better data structure for matrices being passed to embedded python where a GPU will multiply them? By better I mean the conversion to a GPU format will take the least amount of time.

In the following implementation, it takes the embedded python code approximately 21 seconds to convert to GPU format the 2 passed matrices that are 10,000 by 20,000 and 20,000 by 10,000. But, it only takes the GPU approximately 3 1/3 seconds to multiply the 2 matrices once the matrices are converted.

In the following implementation, the embedded python code computes the amount of time is takes to a) convert the 2 matrices to GPU format; b) do matrix multiplication, and c) convert the resultant matrix to a ecl dataset format.
The embedded python code puts the times in a file called TimeOfMatrixMultiply.txt. The directory where the file is stored is /var/lib/HPCCSystms/myeclagent.

(NOTE: All the ecl code I show below was ran on hthor.)

NOTE:I deployed my hpcc system on AWS using a g2.2xlarge instance type, which comes with a GPU. Also in region us-west-2, I used the AMI, ami-638c1e03, which comes with all the software you will need, except hpcc.

Code: Select all
import python;

matrec := RECORD
SET OF REAL arow;
END;

DATASET(matrec) MatrixMultiply(DATASET(matrec) A, unsigned nrowsA, unsigned ncolsA,DATASET(matrec) B, unsigned nrowsB, unsigned ncolsB) := embed(Python)
import numpy as np
import re
import cudamat as cm
import gnumpy as gp

import sys
sys.stdout=open('TimeOfMatrixMultiply.txt','w')

def ECLDataset2NPArray(s,r,c):
zarray=np.empty([r,c],dtype=float)
i=0
for row in s:
zarray[i]=np.asarray(row)
i+=1
return zarray

from timeit import default_timer as timer

start1 = timer()
Aarray=ECLDataset2NPArray(A,nrowsA,ncolsA)
Barray=ECLDataset2NPArray(B,nrowsB,ncolsB)
cm.cublas_init()
cm.CUDAMatrix.init_random()
A_cm = gp.garray(Aarray)
B_cm = gp.garray(Barray)
ecl2cm_time=timer() - start1
print("Time to just convert 2 ecl matrices (%d X %d) and (%d X %d) to cm was %f seconds" % (nrowsA,ncolsA,nrowsB,ncolsB,ecl2cm_time))

time2=timer()
C_cm = A_cm.dot(B_cm)
mm_time=timer() - time2
print("Just GPU matrix multiple of 2 matrices (%d X %d) and (%d X %d)  took %f seconds" % (nrowsA,ncolsA,nrowsB,ncolsB,mm_time))

time3=timer()
cm.cublas_shutdown()
Carray=gp.as_numpy_array(C_cm)
ecl_Carray=Carray.tolist()
cm2ecl_time=timer() - time3
print("Time to just convert 2 cm matrices (%d X %d) and (%d X %d) to ecl was %f seconds" % (nrowsA,ncolsA,nrowsB,ncolsB,cm2ecl_time))
return ecl_Carray

endembed;

//The A and B matrices (datasets) were created by "make2MatricesAndStoreOnDisk_b.ecl".
A:=DATASET('~hthor::tlh::AMatrix',matrec,THOR);
B:=DATASET('~hthor::tlh::BMatrix',matrec,THOR);
NRowsA:=COUNT(A);
NColsA:=COUNT(A.arow);
NRowsB:=COUNT(B);
NColsB:=COUNT(B.arow);

MM:=MatrixMultiply(A,NRowsA,NColsA,B,NRowsB,NColsB);

OUTPUT(MM,,'tlh::AMatrixAndBMatrixProduct',OVERWRITE)

The following embedded python will get the timings.
Code: Select all
import python;

timings := RECORD
STRING line;
END;

DATASET(timings) readTimings() := EMBED(python)
return timings
ENDEMBED;

The following ecl code makes the 2 matrices that are multiplied.
Code: Select all
matrec := RECORD
SET OF REAL arow;
END;

MakeMatrix(UNSIGNED NRows, UNSIGNED NCols) := FUNCTION

REAL RealRandom() := FUNCTION
r:= RANDOM()/4294967295;
return r;
END;

rec0 := RECORD
REAL cell;
END;

rec1 := RECORD
DATASET(rec0) arow;
END;

rec1 genMatrix(rec1 r, UNSIGNED row_num, UNSIGNED ncols) := TRANSFORM
SELF.arow:=NORMALIZE(DATASET([{0}],rec0), ncols, TRANSFORM(rec0,SELF.cell:=RealRandom()));
END;

rec1DS:=NORMALIZE(DATASET([{DATASET([{0}],rec0)}],rec1),NRows,genMatrix(LEFT,COUNTER,NCols));

matrec Dataset2Set(rec1 L) := TRANSFORM
SELF.arow := SET(L.arow,cell);
END;

RETURN PROJECT(rec1DS,Dataset2Set(LEFT));
END;

NRows1:= 10000;
NCols1:= 20000;
A:=MakeMatrix(NRows1, NCols1);
OUTPUT(A,,'tlh::AMatrix',OVERWRITE);

NRows2:= NCols1;
NCols2:= NRows1;
B:=MakeMatrix(NRows2, NCols2);
OUTPUT(B,,'tlh::BMatrix',OVERWRITE);
tlhumphrey2

Posts: 257
Joined: Mon May 07, 2012 6:23 pm

Tim,

I ran your MakeMatrix() code (changing to 10 rows and 20 columns) and saw that each row ends up with exactly the same data. Is that intentional?

HTH,

Richard
rtaylor
Community Advisory Board Member Posts: 1481
Joined: Wed Oct 26, 2011 7:40 pm

Richard,

No, numbers in different rows should not be the same. I'm surprised they are.

Could the problem be the NORMALIZE I have in the transform, genMatrix, of a 2nd NORMALIZE? Plus, maybe I'm thinking procedurally instead of declaratively???

Tim
tlhumphrey2

Posts: 257
Joined: Mon May 07, 2012 6:23 pm

Tim,

I'm not entirely sure why your code was getting the same row repeated each time, but here's my version of your code that DOES generate a different row for each record.
Code: Select all
MakeMatrix(UNSIGNED NRows, UNSIGNED NCols) := FUNCTION
rec0 := {REAL cell};
matrec := {SET OF REAL arow};
REAL RealRandom :=  RANDOM()/4294967295;
RealDS := DATASET(NRows*NCols,TRANSFORM(rec0,SELF.cell := RealRandom));

MatRec DS2Set(INTEGER C) := TRANSFORM
EndRec := C*NCols;
BegRec := (EndRec - NCols) + 1;
SELF.arow := SET(RealDS[BegRec..EndRec],cell);
END;
RETURN DATASET(NRows,DS2Set(COUNTER));
END;

NRows1:= 10;
NCols1:= 20;
NRows2:= NCols1;
NCols2:= NRows1;

A:=MakeMatrix(NRows1, NCols1);
OUTPUT(A,,'tlh::AMatrix',OVERWRITE);

B:=MakeMatrix(NRows2, NCols2);
OUTPUT(B,,'tlh::BMatrix',OVERWRITE);
Note that I'm just generating all the cells in one pass, then just splitting them into their separate recs.

HTH,

Richard
rtaylor
Community Advisory Board Member Posts: 1481
Joined: Wed Oct 26, 2011 7:40 pm

Richard,

I like your code. It does everything I wanted to do inside MakeMatrix. And, it is much shorter than my code.

I did notice one odd behavior of the code -- matrices A and B have the same values. Again, I don't understand this behavior. I entered a jira asking why (HPCC-19025).

Tim
tlhumphrey2

Posts: 257
Joined: Mon May 07, 2012 6:23 pm