This document presents our writeup for assignment #4. We got the code working and implemented several optimizations. Overall, we found that the communication inefficiencies were the easiest to eliminate, although overlapping communication with computation caused some difficulties in implementation.
We chose to implement the conjugate gradient algorithm using MPI. Our general approach to solving the problem was to follow Steve's plan from the last assignment of first getting a simulated parallel system to work on a single node, where we simulate the message passing by using buffers. This allowed us to debug the code easily on a single node.
We attempted to create an efficient conjugate gradient algorithm with two sets of optimizations. First, to have efficient serial code, we utilized the sparse matrix representation discussed in section. We based our inner matrix multiply loop on Eun-Jin Im's code, and attempted to use the compiler to further improve the serial code. The second set of optimizations was to improve the communication. Here, we focused on reducing the amount of data and the number of messages being sent by exploiting the sparsity structure, but spent some effort on overlapping communication and computation and on reordering matrices when they were not very diagonal (to attempt to improve locality). We also attempted to perform some load balancing.
We implemented the sparse matrix representation discussed in section as a C++ class, and added some useful member functions to make some of our coding tasks easier, including creating random square symmetric matrices.
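As a rough illustration of the kind of representation involved, a compressed-row layout with a matrix-vector product might look like the sketch below. The names `SparseMatrix`, `rowStart`, `colIndex`, `values` and `multiply` are assumptions made for this example and are not taken from our actual class; the sketch also assumes both triangles of the symmetric matrix are stored.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical compressed-row sparse matrix; all names are illustrative.
struct SparseMatrix {
    std::size_t n = 0;                  // number of rows (square matrix)
    std::vector<std::size_t> rowStart;  // size n + 1; row i is [rowStart[i], rowStart[i+1])
    std::vector<std::size_t> colIndex;  // column of each stored nonzero
    std::vector<double> values;         // value of each stored nonzero

    // y = A * x, touching only the stored nonzeros.
    std::vector<double> multiply(const std::vector<double>& x) const {
        assert(x.size() == n && rowStart.size() == n + 1);
        std::vector<double> y(n, 0.0);
        for (std::size_t i = 0; i < n; ++i) {
            double sum = 0.0;
            for (std::size_t k = rowStart[i]; k < rowStart[i + 1]; ++k) {
                sum += values[k] * x[colIndex[k]];
            }
            y[i] = sum;
        }
        return y;
    }
};
```

The inner loop does work proportional to the number of stored nonzeros, which is the property the later load balancing relies on.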
We chose to focus on creating code that provided the most speedup for the largest matrices (namely, the one large matrix we could read in, bcsstk17.rsa). We also did some testing on random matrices of various sizes. With more time it would have been good to improve the speedup for the smaller matrices as well.
We first discuss our basic parallel code and then show its performance. Similarly, we report on the performance of the system after each successive optimization. For nearly all of our tests, we used the large matrix in the class directory that we were able to get to work (we were not able to get the .dat files to work properly). We also used some randomly created symmetric matrices with our later optimizations.
To implement the basic parallel model, we introduced the Node Class, which stores the relevant information about a node, including which pieces of the matrix it is supposed to have and the general state of a given processor.
The communication model we used for this basic version was to have each processor store a block of rows of the A matrix and the corresponding block of the p (result) column vector. In this version, to communicate the p vector, we do a vector gather (using the MPI function MPI_Allgatherv()). Each processor does its SAXPYs locally, and the dot products are done by having each processor compute its part of the sum and then doing a reduce (sum).
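To make the structure of one iteration concrete, here is a serial sketch of conjugate gradient with comments marking where the gather and the reductions would sit in the parallel version. It is not our MPI code: the matrix is dense only to keep the example short, and the names `dot`, `matVec` and `conjugateGradient` are assumptions made for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Dot product; in the parallel version each processor would sum its own
// block and the partial sums would be combined with a reduce (sum).
double dot(const Vec& a, const Vec& b) {
    assert(a.size() == b.size());
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Dense row-major matrix-vector product (dense only to keep the sketch short).
Vec matVec(const std::vector<Vec>& A, const Vec& p) {
    Vec q(A.size(), 0.0);
    for (std::size_t i = 0; i < A.size(); ++i) q[i] = dot(A[i], p);
    return q;
}

// Solve A x = b for symmetric positive definite A, starting from x = 0.
Vec conjugateGradient(const std::vector<Vec>& A, const Vec& b,
                      int maxIter, double tol) {
    Vec x(b.size(), 0.0), r = b, p = b;
    double rr = dot(r, r);
    for (int it = 0; it < maxIter && std::sqrt(rr) > tol; ++it) {
        Vec q = matVec(A, p);           // needs all of p: the gather goes here
        double alpha = rr / dot(p, q);  // reduce (sum)
        for (std::size_t i = 0; i < x.size(); ++i) {
            x[i] += alpha * p[i];       // SAXPY, purely local
            r[i] -= alpha * q[i];       // SAXPY, purely local
        }
        double rrNew = dot(r, r);       // reduce (sum)
        double beta = rrNew / rr;
        for (std::size_t i = 0; i < p.size(); ++i) p[i] = r[i] + beta * p[i];
        rr = rrNew;
    }
    return x;
}
```

The only step that needs remote vector data is the matrix-vector product, which is why the later optimizations concentrate on how p is communicated.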
This was straightforward to get going with our own randomly created matrices on our uni-processor simulated parallelism model. Once it was working on that model, it only took about an hour to put in the MPI code and have it working on multiple processors on the NOW. Getting the HB files to read in correctly took more hacking, and required realizing that although the arrays were 0-based, all of the values stored in them (row starts and colIndexes) were offset by 1.
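The off-by-one fix amounts to shifting every stored index down by one after reading the file. A minimal sketch follows; the helper name `toZeroBased` is hypothetical and is not part of the iohb code.

```cpp
#include <cassert>
#include <vector>

// Shift 1-based file indices down to 0-based, in place.
void toZeroBased(std::vector<int>& indices) {
    for (int& v : indices) {
        assert(v >= 1);  // a value below 1 would mean the data was not 1-based
        --v;
    }
}
```

The same helper would be applied to both the row-start array and the column-index array.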
This code wastes a large amount of time sending portions of the p vector to processors that have no need for them due to their sparsity structure and is therefore inefficient in the amount of data being sent.
The preceding figure shows a speedup and timing curve for this version.
The preceding figure shows the relative speedup for the computation, communication, and overall part of the code.
We noticed from the last set of runs that the amount of time spent waiting for other nodes to complete seemed to be quite large. The gather times in the graph above are one sign of this. We looked at the number of nonZeros per processor, and found that it varied by as much as a factor of two. We then implemented a routine to cause the number of nonZeros to be roughly equal on the different processors by changing the number of rows allocated to each processor. Our scheme caused the number of nonZeros per processor to be nearly equal.
Given that the amount of time spent in the matrix vector multiply is proportional to the number of nonZeros, it makes sense to even out the nonZeros, although it does create some imbalance in the various dot products. However, this should be considerably less of the computational effort.
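One simple way to pick such row boundaries, given a compressed-row `rowStart` array, is to cut at the first row whose starting offset reaches each k/P fraction of the nonzeros. This is a sketch of that idea, not necessarily the routine we used; the function name `balanceByNonzeros` and the `bounds` layout are assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Returns P + 1 row boundaries; processor k owns rows [bounds[k], bounds[k+1]).
// rowStart is the usual compressed-row array of size n + 1 (rowStart[n] = nnz).
std::vector<std::size_t> balanceByNonzeros(const std::vector<std::size_t>& rowStart,
                                           std::size_t P) {
    assert(P > 0 && !rowStart.empty());
    const std::size_t n = rowStart.size() - 1;
    const std::size_t nnz = rowStart[n];
    std::vector<std::size_t> bounds(P + 1, n);
    bounds[0] = 0;
    for (std::size_t k = 1; k < P; ++k) {
        // First row whose starting offset reaches k/P of the nonzeros.
        const std::size_t target = (nnz * k) / P;
        auto it = std::lower_bound(rowStart.begin(), rowStart.end(), target);
        bounds[k] = static_cast<std::size_t>(it - rowStart.begin());
    }
    return bounds;
}
```

For a matrix whose rows hold 4, 1, 1, 1 and 1 nonzeros, two processors would get one row and four rows respectively, with four nonzeros each.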
The preceding figure shows the speedup and timing curves for this version.
The preceding figure shows the relative time spent in each area of the code.
The previous versions of the code wasted a considerable amount of time in communication (in the Gather part of the code in the graphs above) due to the fact that they sent unnecessary data to each node. In fact, much of the time, entire vectors of data (each hundreds if not thousands of doubles) were being sent to processors that would never use them (due to the sparsity structure of most of the benchmarks). Thus, we decided to change from using a Vector Gather operation to a set of selected send and receives consisting only of the parts of the vectors that mattered.
To do this, we precomputed the information required to compact (and expand) the p vectors going to and coming from each of the nodes. Thus, at the start of the loop, each processor would send the appropriate piece of its part of the p vector to each other node. Each processor would then wait for all of the pieces being sent to it. Upon receiving them, it would expand them into its copy of the p vector and would then perform the matrix vector multiply. Thus, we did not overlap any communication with computation at this point and had to spend some time expanding and compressing the p vector. However, this optimization allowed us to achieve a huge reduction in the communication time, as expected.
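The precomputation and the compact/expand steps can be sketched as follows. This is an illustration under assumptions rather than our actual code: the names `neededColumns`, `pack` and `unpack` are hypothetical, the matrix is assumed to be in compressed-row form, indices are global, and `bounds[k]..bounds[k+1]` is assumed to be the block of p owned by processor k.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// For the rows [rowLo, rowHi) owned by one processor, list the distinct global
// column indices it reads from each owner's block of p. Computed once, before
// the CG loop starts.
std::vector<std::vector<std::size_t>> neededColumns(
    const std::vector<std::size_t>& rowStart,
    const std::vector<std::size_t>& colIndex,
    const std::vector<std::size_t>& bounds,
    std::size_t rowLo, std::size_t rowHi) {
    const std::size_t P = bounds.size() - 1;
    std::vector<std::set<std::size_t>> seen(P);
    for (std::size_t k = rowStart[rowLo]; k < rowStart[rowHi]; ++k) {
        std::size_t owner = 0;
        while (colIndex[k] >= bounds[owner + 1]) ++owner;  // linear scan; fine for a sketch
        seen[owner].insert(colIndex[k]);
    }
    std::vector<std::vector<std::size_t>> needed(P);
    for (std::size_t q = 0; q < P; ++q) needed[q].assign(seen[q].begin(), seen[q].end());
    return needed;
}

// Sender side: copy just the requested entries of p into a compact message.
std::vector<double> pack(const std::vector<double>& p,
                         const std::vector<std::size_t>& idx) {
    std::vector<double> msg;
    msg.reserve(idx.size());
    for (std::size_t i : idx) msg.push_back(p[i]);
    return msg;
}

// Receiver side: scatter a compact message back into the full-length p.
void unpack(const std::vector<double>& msg, const std::vector<std::size_t>& idx,
            std::vector<double>& p) {
    assert(msg.size() == idx.size());
    for (std::size_t j = 0; j < idx.size(); ++j) p[idx[j]] = msg[j];
}
```

Each message then carries only the entries of p that the receiver's rows actually reference, instead of the sender's whole block.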
The previous figure shows the amount of time spent in each area of the code.
The previous figure shows the overall speedup/timing curves for this version.
The next thing that we tried was to overlap communication and computation by performing computations as soon as the relevant parts of the p vector were available. The idea is that each processor does a non-blocking receive for its portions that need to come in from the other nodes, then does non-blocking sends of all of its portions to the appropriate other nodes. Then, the processor does the computation that it can do with its own part of the p vector. This can often be the majority of its computation, especially when the matrix is largely diagonal (as many of the benchmark cases are). After doing its own computation, each processor does an MPI_Waitany() for the messages coming from the other processors, and upon receiving one, does the computation that it can do with it.
In order to make this change, we had to change the way that we did the matrix vector multiply, as we no longer had the entire p vector available when we did a part of the matrix multiply, but instead had to do only a portion of the matrix multiply and then sum up the results from each portion. This caused a fairly significant bit of rewriting and subsequent debugging. Another issue is that the new method of doing the matrix vector multiply introduces some extra overhead when the number of nonzeros in a portion is small. This caused a bit of a slowdown in computation times, but a reduction in communication times. The benefit from the communication times was apparent when using more than approximately 7 nodes.
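The partial multiply can be pictured as accumulating, into the same result vector, the contribution of one column block of the matrix at a time. The sketch below shows this for a compressed-row matrix; the name `multiplyColumnBlock` and its signature are assumptions, and filtering by column range is used only to keep the example short.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Accumulate into y the contribution of columns [colLo, colHi) only.
// A real implementation would pre-split the nonzeros by column block so
// that no entries are skipped; filtering here just keeps the sketch short.
void multiplyColumnBlock(const std::vector<std::size_t>& rowStart,
                         const std::vector<std::size_t>& colIndex,
                         const std::vector<double>& values,
                         const std::vector<double>& x,
                         std::size_t colLo, std::size_t colHi,
                         std::vector<double>& y) {
    assert(y.size() + 1 == rowStart.size());
    for (std::size_t i = 0; i < y.size(); ++i) {
        for (std::size_t k = rowStart[i]; k < rowStart[i + 1]; ++k) {
            if (colIndex[k] >= colLo && colIndex[k] < colHi) {
                y[i] += values[k] * x[colIndex[k]];
            }
        }
    }
}
```

A processor would call this once for its own block while messages are in flight, then once per arriving message, in whatever order the messages complete. Because the partial results are simply summed, the order does not affect the final vector beyond floating-point rounding.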
Also, we modified the order in which half the processors performed their sends. With no reordering, all but one of the nodes would send messages to the zeroth node first, forcing computation to wait until all the nodes had finished sending. We examined this issue by first setting up the sends so that half the processors first sent to processor 0, and the other half first sent to processor N-1.
The following graphs show results using this half and half approach.
This graph shows the speedup for this version. Note that it is achieving better speedup than the previous version.
The above graph shows the break down of relative time spent in each portion of the code.
The above graph shows the percentage of communication time spent in each part of the code.
We noticed that the change in the sending orders seemed to improve the overall performance, so we continued the idea by having each processor send first to a different processor, so that every processor has something to do as quickly as possible. Clearly, the sparsity structure of the matrix affects the optimal ordering of the sends, and we did not attempt to dynamically adapt the reordering to the sparsity structure. This change reduced the amount of time spent waiting and the overall amount of communication time, as shown in the graph below.
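One simple ordering with this property is a rotation, in which each processor starts with its right-hand neighbour and wraps around. The sketch below assumes that rotation; the text above does not pin down the exact ordering we used, and the name `sendOrder` is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Order in which processor `rank` issues its sends: rank+1, rank+2, ...,
// wrapping around, so that no two processors target the same node first.
std::vector<int> sendOrder(int rank, int numProcs) {
    assert(numProcs > 0 && rank >= 0 && rank < numProcs);
    std::vector<int> order;
    for (int step = 1; step < numProcs; ++step) {
        order.push_back((rank + step) % numProcs);
    }
    return order;
}
```

With four processors, processor 2 would send to 3, then 0, then 1, while processor 0 would send to 1, then 2, then 3, so at every step each node is the target of exactly one sender.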
For comparison of the relative importances of the different approaches, here are comparison plots of speedup and time for each of the methods tried so far:
These tests are underway, but have been hampered by the overuse of our cluster on the NOW.
Communication is only necessary when vector elements belonging to one processor are required for the computation of result vector values belonging to another processor. Therefore, communication can be avoided if all matrix elements lie in blocks along the diagonal. While in general it is not possible to move all matrix elements into such blocks, there can be a lot of room for improvement, and any movement of elements from outside to inside such diagonal blocks will reduce the amount of communication necessary.
To this end, we developed a matrix (or linear system) reordering routine which would take a matrix and a number of processors and attempt to move as many elements as possible into blocks along the diagonal. Since the matrices here are symmetric, it was necessary to swap rows and columns simultaneously. With a simple scoring metric - number of elements inside appropriate blocks minus number outside - reasonably good reordering was obtained. For example, for a randomly generated sparse matrix with 1000 nonzero elements, typically around 100-200 elements could be moved from outside to inside diagonal blocks. For most of the provided matrices, which already have elements highly concentrated along the diagonal, reordering would not be expected to help.
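The scoring metric can be sketched as below for a candidate symmetric permutation. This is an illustration only: the name `blockScore`, the coordinate-list input, and the use of equal-sized blocks of ceil(n / P) rows are assumptions, and the search that proposes swaps is not shown.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Score = (nonzeros inside the diagonal blocks) - (nonzeros outside).
// entries holds the (row, col) position of every nonzero; perm[i] is the new
// index of original row/column i, applied to both at once to keep symmetry.
// Equal-sized blocks of ceil(n / P) rows are an assumption of this sketch.
long blockScore(const std::vector<std::pair<std::size_t, std::size_t>>& entries,
                const std::vector<std::size_t>& perm, std::size_t P) {
    assert(P > 0 && !perm.empty());
    const std::size_t blockSize = (perm.size() + P - 1) / P;
    long score = 0;
    for (const auto& e : entries) {
        const bool inside = perm[e.first] / blockSize == perm[e.second] / blockSize;
        score += inside ? 1 : -1;
    }
    return score;
}
```

A reordering pass would try swapping two indices in `perm` and keep the swap whenever the score increases. For example, on a 4x4 matrix with a full diagonal plus entries at (0,3) and (3,0), swapping indices 1 and 3 brings both off-diagonal entries into the first block and raises the score from 2 to 6 on two processors.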
Reordering proved to be a time consuming task - prohibitively so for the largest of matrices. It did result in some speedup, but not the dramatic effect that might have been expected.
Typically for larger matrices, it was possible to move around 10% of the elements from bad locations in the matrix to good ones; this resulted in an overall speedup of also around 10% in most cases. Below are speedup results with and without reordering on a randomly generated matrix. The density of these matrices is significantly higher than that of the sample matrices, as these gave the best results for reordering. We expect that the effect of reordering would be at least as significant for very large, sparser matrices, but the up-front cost of reordering makes this difficult to test.
The binary for our code can be found here.
A tarred and gzipped version of our code can be found here.
We had some difficulty in reading in the .dat files, and thus were limited to a small set of the available matrices. Thus, we used the .rsa matrices in the class directory as well as random matrices of our own creation. One potential problem with the matrices in the class directory is that they are quite diagonal, which may not be representative of all sparse matrix problems. We thus used random matrices that were more uniformly distributed.
We are grateful to Hanna Pasula for finding and fixing the bug in the iohb.c code -- this greatly helped our ability to read in the matrices. We used the iohb code that was provided off of the class home page, as well as the matrices presented there.