Monday, February 17, 2020

Recursive Matrix and parallel matrix multiplication using crossbeam and const generics

This is the project I planned and posted about before.
Basically, in order to evaluate Rust's claim of zero-cost abstraction and the effectiveness of its parallel processing, I implemented a recursive matrix type and parallel matrix multiplication.

This recursive matrix multiplication has probably been implemented somewhere, but this is my original idea. The idea is to reinterpret a matrix as a matrix of its submatrices. For instance, a 16x16 matrix can be interpreted as a 2x2 matrix of 8x8 matrices. Using this, we can reduce the matrix dimension (e.g., from 16 to 8 or 2). This is particularly helpful for large matrices: a very large dimension can cause space problems, stack-based arrays have rather small size limits, and GPUs work with fixed (or maximum, e.g., 512) matrix sizes.
This reduction can be applied recursively to the submatrices themselves, so we can bring the matrix size under any given limit if we want.

This approach should be implemented at a high level of abstraction (Ring, Algebra, Matrix) using generic traits and functions; moreover, the dimension should be part of the generic parameters.

So the main idea is expressed in the lift (and lower) functions:

pub fn lift<R: Ring, const DIM: usize, const DIM0: usize, const DIM1: usize>(
    m: &ArrayMatrix<R, DIM>,
) -> ArrayMatrix<ArrayMatrix<R, DIM1>, DIM0>;

pub fn lower<R: Ring, const DIM: usize, const DIM0: usize, const DIM1: usize>(
    m: &ArrayMatrix<ArrayMatrix<R, DIM1>, DIM0>,
) -> ArrayMatrix<R, DIM>;

The lift function transforms a matrix of dimension DIM = DIM0*DIM1 into a DIM0-dimensional matrix whose entries are DIM1-dimensional matrices.
The lower function goes in the reverse direction.
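
As a rough illustration, a lift/lower pair could look like the sketch below, assuming ArrayMatrix<R, DIM> is just a wrapper around a DIM x DIM array. To keep the sketch self-contained it uses Copy + Default bounds instead of the actual Ring trait, and it checks the DIM = DIM0*DIM1 relation at run time rather than in the type system.

#[derive(Clone, Copy)]
pub struct ArrayMatrix<R, const DIM: usize>(pub [[R; DIM]; DIM]);

pub fn lift<R: Copy + Default, const DIM: usize, const DIM0: usize, const DIM1: usize>(
    m: &ArrayMatrix<R, DIM>,
) -> ArrayMatrix<ArrayMatrix<R, DIM1>, DIM0> {
    assert_eq!(DIM, DIM0 * DIM1);
    let mut out = ArrayMatrix([[ArrayMatrix([[R::default(); DIM1]; DIM1]); DIM0]; DIM0]);
    for bi in 0..DIM0 {
        for bj in 0..DIM0 {
            for i in 0..DIM1 {
                for j in 0..DIM1 {
                    // block (bi, bj), local entry (i, j) <- global entry (bi*DIM1+i, bj*DIM1+j)
                    out.0[bi][bj].0[i][j] = m.0[bi * DIM1 + i][bj * DIM1 + j];
                }
            }
        }
    }
    out
}

pub fn lower<R: Copy + Default, const DIM: usize, const DIM0: usize, const DIM1: usize>(
    m: &ArrayMatrix<ArrayMatrix<R, DIM1>, DIM0>,
) -> ArrayMatrix<R, DIM> {
    assert_eq!(DIM, DIM0 * DIM1);
    let mut out = ArrayMatrix([[R::default(); DIM]; DIM]);
    for bi in 0..DIM0 {
        for bj in 0..DIM0 {
            for i in 0..DIM1 {
                for j in 0..DIM1 {
                    // copy each block entry back to its global position
                    out.0[bi * DIM1 + i][bj * DIM1 + j] = m.0[bi][bj].0[i][j];
                }
            }
        }
    }
    out
}

For example, lift::<f64, 16, 2, 8> views a 16x16 matrix as a 2x2 matrix of 8x8 blocks, and lower puts the entries back.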

This generic matrix is built on the coefficient ring's operators. Since a DIM1 matrix is itself a ring, we can use the same generic functions to implement the nested matrix.
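
A minimal sketch of this, with placeholder names (Ring, zero, add, mul) rather than the project's actual API, might look like the following. The key point is that the impl only requires R: Ring, so ArrayMatrix<ArrayMatrix<R, DIM1>, DIM0> is multiplied by exactly the same code as ArrayMatrix<f64, DIM>.

pub trait Ring: Copy {
    fn zero() -> Self;
    fn add(self, rhs: Self) -> Self;
    fn mul(self, rhs: Self) -> Self;
}

impl Ring for f64 {
    fn zero() -> Self { 0.0 }
    fn add(self, rhs: Self) -> Self { self + rhs }
    fn mul(self, rhs: Self) -> Self { self * rhs }
}

// Same plain wrapper as in the lift sketch above, repeated so this sketch stands alone.
#[derive(Clone, Copy)]
pub struct ArrayMatrix<R, const DIM: usize>(pub [[R; DIM]; DIM]);

// Because ArrayMatrix<R, DIM> is itself a Ring, a nested
// ArrayMatrix<ArrayMatrix<R, DIM1>, DIM0> gets multiplication for free.
impl<R: Ring, const DIM: usize> Ring for ArrayMatrix<R, DIM> {
    fn zero() -> Self {
        ArrayMatrix([[R::zero(); DIM]; DIM])
    }
    fn add(self, rhs: Self) -> Self {
        let mut out = self;
        for i in 0..DIM {
            for j in 0..DIM {
                out.0[i][j] = self.0[i][j].add(rhs.0[i][j]);
            }
        }
        out
    }
    fn mul(self, rhs: Self) -> Self {
        // ordinary O(DIM^3) multiplication over an arbitrary coefficient ring
        let mut out = Self::zero();
        for i in 0..DIM {
            for j in 0..DIM {
                for k in 0..DIM {
                    out.0[i][j] = out.0[i][j].add(self.0[i][k].mul(rhs.0[k][j]));
                }
            }
        }
        out
    }
}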

For the parallel computation of the matrix product, we create cleanly separated jobs based on this nested matrix multiplication.
For instance, a 256-dimensional matrix can be expressed as a 2x2 matrix of 128x128 matrices.
Let M = A*B; then M[i][j] = A[i][0]*B[0][j] + A[i][1]*B[1][j], so we can create 4 jobs to calculate the nested matrix entries. These jobs are sent to a channel, any number of consumer threads can pick them up, and the consumers post their results to another channel.
If there are only 4 jobs, at most 4 threads can be used, so in order to maximize CPU core usage we should instead use a 4x4 matrix of 64x64 matrices; in this case up to 16 CPU threads can be used.
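
A rough sketch of this producer/consumer scheme with crossbeam's channels and scoped threads is below. To keep it short the blocks are plain Vec<Vec<f64>> rather than the nested ArrayMatrix type, and the names (Block, mul_add, parallel_block_mul) are only illustrative.

use crossbeam::channel::unbounded;
use crossbeam::thread;

type Block = Vec<Vec<f64>>;

// Accumulate a * b into acc (all blocks are square and the same size).
fn mul_add(acc: &mut Block, a: &Block, b: &Block) {
    let n = acc.len();
    for i in 0..n {
        for k in 0..n {
            for j in 0..n {
                acc[i][j] += a[i][k] * b[k][j];
            }
        }
    }
}

// Each job (i, j) computes one block entry M[i][j] = sum over k of A[i][k] * B[k][j].
fn parallel_block_mul(a: &[Vec<Block>], b: &[Vec<Block>], workers: usize) -> Vec<Vec<Block>> {
    let n = a.len();          // blocks per side, e.g. 2 or 4
    let dim = a[0][0].len();  // dimension of each block, e.g. 128 or 64
    let (job_tx, job_rx) = unbounded::<(usize, usize)>();
    let (res_tx, res_rx) = unbounded::<(usize, usize, Block)>();

    // Producer: one job per output block.
    for i in 0..n {
        for j in 0..n {
            job_tx.send((i, j)).unwrap();
        }
    }
    drop(job_tx); // close the job channel so workers stop when it drains

    // Consumers: any number of worker threads pull jobs and push results.
    thread::scope(|s| {
        for _ in 0..workers {
            let job_rx = job_rx.clone();
            let res_tx = res_tx.clone();
            s.spawn(move |_| {
                while let Ok((i, j)) = job_rx.recv() {
                    let mut acc = vec![vec![0.0; dim]; dim];
                    for k in 0..n {
                        mul_add(&mut acc, &a[i][k], &b[k][j]);
                    }
                    res_tx.send((i, j, acc)).unwrap();
                }
            });
        }
    })
    .unwrap();
    drop(res_tx);

    // Collect the results into the output block matrix.
    let mut m = vec![vec![vec![vec![0.0; dim]; dim]; n]; n];
    for (i, j, block) in res_rx.iter() {
        m[i][j] = block;
    }
    m
}

With a 4x4 block matrix there are 16 independent jobs, so up to 16 worker threads can be kept busy, matching the description above.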

We will see such an improvement in these cases, although it will not be 8 times faster even when using 8 threads on an 8-thread CPU.


