CS267 Final Project
CS267 Final Project
Final project: SVD on the T3E
Group members: Ketan Patel, Dane Morgan, Karin Lin
Advisors: Horst Simon, Osni Marques
Project Description
We are aiming to perform SVD on large sparse matrices containing
seismological data about the outer crust of the earth. The matrices
are approximately 10^6 by 2*10^5, and have about 3*10^7 nonzero
elements of 8 bytes each. We are interested in obtaining the largest
300 to 3000 singular values and their associated orthogonal vectors.
We will approach this problem using the Lanczos iterative algorithm for
finding the eigenvalues and eigenvectors of a symmetric matrix. Since
our matrix of data A is neither square nor symmetric, we will apply the
algorithm to the matrix M = (A^T)*(A). From the eigenvalues of M,
extracting the singular values of A should be a simple task.
The most time consuming step in the Lanczos algorithm is expected to
be the matrix multiplication required to construct the next vector in
the series of Lanczos vectors. Dr. Marques presently has a high
performance FORTRAN (HPF) code for executing this step in parallel,
but it fails for unknown reasons for large matrices and has
inefficiencies in its algorithm. We have developed what we believe
will be a more efficient algorithm for performing the matrix
multiplication step in parallel. We plan to implement this algorithm
on the T3E, using MPI, within the LANSO code. The LANSO code is a
state of the art sequential implementation of the Lanczos algorithm
written by B. N. Parlett and Z. Alex Liu. It should be possible to
interface with the LANSO code by writing an INIT and OP routine, for
initializing the matrix and performing the parallel matrix-vector
multiplication, respectively. Few modifications to the original LANSO
code will be necessary. These routines will then be portable to other
SVD codes, which will allow Dr. Marques to use them easily. If we
have time, we will attempt to parallelize other portions of the LANSO
code.
Implementation
Our primary variables are as follows:
- A_row[], A_col[], NZA[]: three parallel arrays containing respectively
the row index, column index, and value of the nonzero elements of A
- X[]: a trial eigenvector of M = (A^T)*(A)
- NP: the total number of processors
Description of the subroutine INIT
This routine performs the MPI initialization steps as well as distributing
the matrix A. Key procedures are:
- Perform MPI initialization.
- Determine number of nonzeroes in A.
- Let each processor read in its portion of A (which is stored in row
major order). Note that rows of A may be broken across processors.
The storage of A_row, A_col, and NZA requires a total of 3*(3*10^7)*(8 bytes)
= 720MB.
Description of the subroutine OP
This routine performs the matrix-matrix-vector multiplication (A^T)*A*X.
Key procedures are:
- Broadcast the trial eigenvector X to each processor.
- Perform vector-vector multiplication of the rows of A and X in parallel,
storing partial sums locally (partial sums are necessary since rows may be
broken across processors).
- Accumulate partial sums and communicate results back to the relevant
processors.
- Perform vector-vector multiplication of the rows of A^T (which are the
columns of A) and A*X. Store partial sums locally.
- Accumulate partial sums and send the resulting vector back to processor 0.
A detailed description of these procedures, as well as some communication
estimates, can be found here.
Schedule
Our schedule is as follows:
Initial phase: deciding groups, exact specifications of project, getting
codes and data. Completed.
4/25-5/2 Phase 1: Get the LANSO code running on the T3E and start to code
INIT and OP subroutines.
5/2-5/16 Phase 2: Finish coding INIT and OP routines and interface them
with LANSO. Start performance measurements.
5/16-5/23 Phase 3: Finish performance measurements. Perform further
parallelization of LANSO code. Prepare presentation and final report.
5/23 Phase 4: Turn in report and give presentation.