High Performance Linear Algebra
This course covers topics in linear algebra, focusing on how to implement them on modern computer architectures and clusters. It addresses both the theoretical aspects underlying the routines we want to implement and the implementation aspects needed to achieve high performance, while making precise what we actually mean by "high performance".
Main topics
- Vector and Matrix Products
  - Inner product
  - Outer product
  - Matrix-vector product
  - Matrix-matrix product
  - Row, column, and submatrix partitioning
- LU Factorization
  - ijk forms of Gaussian elimination
  - Row, column, and submatrix partitioning
  - Partial pivoting and alternatives
- Cholesky Factorization
  - ijk forms of Cholesky factorization
  - Memory access patterns
  - Data dependences
- Triangular, Band, and Tridiagonal Systems
  - Row vs. column partitioning
  - Fan-in and fan-out algorithms
  - Wavefront algorithms
  - Cyclic algorithms
  - Cyclic reduction
- Sparse BLAS
  - Matrix-vector product
  - Matrix-matrix product
  - Matrix storage formats
  - Implementation of Krylov iterative solvers
- Distributed Sparse and Dense BLAS
  - Data distribution
  - Handling of the parallel environment
  - Programming models
- How to write efficient threaded and GPU-accelerated implementations of the different linear algebra routines
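As a taste of the "ijk forms" theme that recurs throughout the course, here is a minimal Python sketch (function names are illustrative; the course itself works with compiled, parallel code). The two functions compute the same matrix-matrix product, but reorder the loops: with row-major storage, the ikj form streams over contiguous rows of B and C in its inner loop, which is the cache-friendlier access pattern.

```python
def matmul_ijk(A, B):
    """Classic ijk ordering: the inner k-loop accumulates the dot product
    of row i of A with column j of B (strided access into B)."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0.0
            for k in range(m):
                s += A[i][k] * B[k][j]
            C[i][j] = s
    return C

def matmul_ikj(A, B):
    """ikj ordering: the inner j-loop performs a saxpy-style update,
    sweeping contiguously along row k of B and row i of C."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0.0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            a = A[i][k]
            for j in range(p):
                C[i][j] += a * B[k][j]
    return C
```

Both orderings produce identical results; they differ only in memory access pattern, which is exactly what makes one form faster than another on a real memory hierarchy.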
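The "row vs. column partitioning" distinction for triangular systems can also be sketched in a few lines (again, names are illustrative). Both functions solve the lower-triangular system L x = b by forward substitution; the row-oriented form computes each unknown as an inner product, while the column-oriented (saxpy) form broadcasts each computed unknown into the remaining right-hand side, which is the variant underlying fan-out parallel algorithms.

```python
def forward_row(L, b):
    """Row-oriented (inner-product) forward substitution for L x = b."""
    n = len(b)
    x = [0.0] * n
    for i in range(n):
        s = b[i]
        for j in range(i):          # dot product with already-computed x
            s -= L[i][j] * x[j]
        x[i] = s / L[i][i]
    return x

def forward_col(L, b):
    """Column-oriented (saxpy) forward substitution: once x[j] is known,
    its contribution is subtracted from all remaining entries of b."""
    n = len(b)
    b = list(b)                     # work on a copy; b becomes x in place
    for j in range(n):
        b[j] /= L[j][j]
        for i in range(j + 1, n):
            b[i] -= L[i][j] * b[j]
    return b
```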
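Finally, to illustrate the "matrix storage formats" item under Sparse BLAS, here is a sketch of a sparse matrix-vector product using the standard Compressed Sparse Row (CSR) format: the nonzero values are stored contiguously in `val`, their column indices in `col_ind`, and `row_ptr[i]:row_ptr[i+1]` delimits the nonzeros of row i.

```python
def csr_matvec(val, col_ind, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR format."""
    n = len(row_ptr) - 1            # number of rows
    y = [0.0] * n
    for i in range(n):
        # loop over the nonzeros of row i only
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += val[k] * x[col_ind[k]]
    return y
```

This kernel touches only the stored nonzeros, which is why storage format choice dominates the performance of the Krylov iterative solvers covered later in the course.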
The final exam consists of the implementation and testing of a scalable linear algebra algorithm, or of its discussion and use within an application.