Matrix Multiply Writeup

David Andre and Steven Czerwinski

See here for more information on this project.

We started this assignment by calculating the largest convenient block size that would fit into the cache. We settled on 24 and coded up a simple tiling scheme that broke the matrices into square blocks and passed them to a routine that multiplied them. We got surprisingly little speedup from this on everything but the largest cases. In fact, when we tried a block size of 256, we got better performance than with 24x24 blocks. We then found some better compiler optimization flags, including -xO5 and -xrestrict. We also experimented with the pipeloop pragma, which seemed to have some beneficial effect. Then we tried reordering the matrices so that each block was stored contiguously, and so that the B and C matrices were column major instead of row major. This had some effect, but not as much as we had hoped.
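The simple tiling scheme from the start of this paragraph can be sketched roughly as follows. This is an illustration only, not our submitted code: the names tiled_multiply, block_multiply, and min_int are made up for the example, and all three matrices are assumed to be square, n x n, and row major.

```c
#include <stddef.h>

/* Hypothetical block edge; the writeup first tried 24 and later 256. */
#define BLOCK 24

static int min_int(int a, int b) { return a < b ? a : b; }

/* Multiply one tile: rows [i0,i1) of A times columns [j0,j1) of B,
   over the inner range [k0,k1), accumulated into C.
   All matrices are n x n and row major (an assumption of this sketch). */
static void block_multiply(int n, int i0, int i1, int j0, int j1,
                           int k0, int k1,
                           const double *A, const double *B, double *C)
{
    for (int i = i0; i < i1; i++) {
        for (int j = j0; j < j1; j++) {
            double sum = C[(size_t)i * n + j];
            for (int k = k0; k < k1; k++)
                sum += A[(size_t)i * n + k] * B[(size_t)k * n + j];
            C[(size_t)i * n + j] = sum;
        }
    }
}

/* C += A * B, walking the matrices in BLOCK x BLOCK tiles.
   Tiles on the right and bottom edges are clipped to the matrix size. */
void tiled_multiply(int n, const double *A, const double *B, double *C)
{
    for (int i0 = 0; i0 < n; i0 += BLOCK)
        for (int j0 = 0; j0 < n; j0 += BLOCK)
            for (int k0 = 0; k0 < n; k0 += BLOCK)
                block_multiply(n,
                               i0, min_int(i0 + BLOCK, n),
                               j0, min_int(j0 + BLOCK, n),
                               k0, min_int(k0 + BLOCK, n),
                               A, B, C);
}
```

Changing BLOCK is all it takes to compare tile sizes such as 24 and 256.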


So, then we realized that because the Ultras have a direct-mapped cache, unless we rearranged the blocks in memory, the sub-block from A and the sub-block from B would often collide in the cache and knock one another out. Thus, we allocated numblocks*numblocks * 16K bytes of memory. Each of these 16K regions basically mimics the cache for us. We divide each 16K region so that an A block goes in the top third, a B block in the middle third, and a C block in the bottom third. In this way, no conflicts can occur between the A, B, and C sub-blocks. This took us a little higher, but still not as high as expected.
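To make the arithmetic concrete: a 24x24 block of doubles is 576 * 8 = 4608 bytes, and a third of 16K is 5461 bytes, so one block fits in each third. A minimal sketch of such a layout follows. The names cache_image and pack_block are hypothetical, the sketch assumes 8-byte doubles, and it copies blocks in row-major order for simplicity, whereas our code stored B and C column major.

```c
#include <stddef.h>
#include <string.h>

#define CACHE_BYTES  (16 * 1024)                       /* 16K region from the writeup */
#define SLOT_DOUBLES (CACHE_BYTES / 3 / (int)sizeof(double))   /* 682 with 8-byte doubles */
#define PAD_DOUBLES  (CACHE_BYTES / (int)sizeof(double) - 3 * SLOT_DOUBLES)
#define BLOCK        24                                /* 24 * 24 = 576 doubles per block */

/* One cache-sized image: A, B and C sub-blocks occupy disjoint thirds.
   Any contiguous 16K span covers each index of a 16K direct-mapped cache
   exactly once, so the three slots cannot evict one another. */
typedef struct {
    double a[SLOT_DOUBLES];
    double b[SLOT_DOUBLES];
    double c[SLOT_DOUBLES];
    double pad[PAD_DOUBLES];   /* rounds the image up to exactly 16K */
} cache_image;

/* Copy a rows x cols sub-block starting at (r0, c0) of the row-major
   n x n matrix M into a slot, storing it contiguously. */
void pack_block(int n, const double *M, int r0, int c0,
                int rows, int cols, double *slot)
{
    for (int r = 0; r < rows; r++)
        memcpy(slot + (size_t)r * cols,
               M + (size_t)(r0 + r) * n + c0,
               (size_t)cols * sizeof(double));
}
```

An array of numblocks*numblocks such images gives the total allocation described above.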

Then we realized that we had to pay some attention to the registers: at that point, our assembly code was filled with loads and was often stalling on the cache. So we implemented a second level of tiling, with 4x4 blocks inside our outer blocks. We moved to a 16x16 outer block at this point, with a 4x4 inner block. With this code, we got close to 100 MFLOPS, still not quite where we wanted to be.

We then attempted to hand-optimize the 16x16 code with the 4x4 inner blocks, and this proved difficult. The compiler was already scheduling the code about as well as we could by moving things around ourselves. We made some improvements by restructuring our code slightly and by adding the -xrestrict flag, so that our code moved up to the 100-120 MFLOPS range.


We then experimented with rectangular blocks. The inspiration was that we only use a single column of B's block at a time (it is multiplied by a whole square from A). We thought we could simply extend the number of columns arbitrarily to get better performance, but this did not work well.

We then tried using smaller inner blocks, and this seemed to work. We moved back to a 24x24 outer block, with 3x3 inner blocks. This made a large difference, and moved us to the 150-190 MFLOPS range for matrix sizes divisible by 24.
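A 3x3 inner block inside a 24x24 outer block might look like the sketch below. It is not our actual routine: the name multiply_24x24_3x3 is made up, and it assumes A is stored row major while B and C are stored column major, all as packed 24x24 blocks. The inner loop keeps nine accumulators and six operands in local variables, so each step does six loads for nine multiply-adds.

```c
#define BLOCK 24

/* C += A * B for packed 24x24 blocks.
   A is row major; B and C are column major (assumptions of this sketch). */
void multiply_24x24_3x3(const double *A, const double *B, double *C)
{
    for (int i = 0; i < BLOCK; i += 3) {
        const double *a0 = A + (i + 0) * BLOCK;   /* three rows of A */
        const double *a1 = A + (i + 1) * BLOCK;
        const double *a2 = A + (i + 2) * BLOCK;
        for (int j = 0; j < BLOCK; j += 3) {
            const double *b0 = B + (j + 0) * BLOCK;   /* three columns of B */
            const double *b1 = B + (j + 1) * BLOCK;
            const double *b2 = B + (j + 2) * BLOCK;
            /* cXY accumulates row i+X of A times column j+Y of B. */
            double c00 = 0.0, c01 = 0.0, c02 = 0.0;
            double c10 = 0.0, c11 = 0.0, c12 = 0.0;
            double c20 = 0.0, c21 = 0.0, c22 = 0.0;
            for (int k = 0; k < BLOCK; k++) {
                const double x0 = a0[k], x1 = a1[k], x2 = a2[k];
                const double y0 = b0[k], y1 = b1[k], y2 = b2[k];
                c00 += x0 * y0; c01 += x0 * y1; c02 += x0 * y2;
                c10 += x1 * y0; c11 += x1 * y1; c12 += x1 * y2;
                c20 += x2 * y0; c21 += x2 * y1; c22 += x2 * y2;
            }
            /* Write the 3x3 result back into column-major C. */
            double *d0 = C + (j + 0) * BLOCK + i;
            double *d1 = C + (j + 1) * BLOCK + i;
            double *d2 = C + (j + 2) * BLOCK + i;
            d0[0] += c00; d0[1] += c10; d0[2] += c20;
            d1[0] += c01; d1[1] += c11; d1[2] += c21;
            d2[0] += c02; d2[1] += c12; d2[2] += c22;
        }
    }
}
```

A 4x4 version of the same loop would need sixteen accumulators and eight operands, which may be part of why the smaller inner block behaved better for us.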


We then realized that for small matrices, converting to column major and copying into our cache-sized buffers was overly expensive, especially for matrices that would fit into the cache anyway. So we added a separate function for matrices smaller than 24, which works on the original row-major data without any conversion. This, together with the -xunroll=24 option, moved us to a surprising 294 MFLOPS for size 24 matrices. In fact, as we found later, the memory reorganization costs enough that it does not benefit the program until the matrices reach about 50 x 50. To get around this, we added code that processes matrices between 24 x 24 and 50 x 50 using the original memory layout that the matrices are passed in with. This gave us gains of 20 MFLOPS or so, and the code was simple to implement.
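The resulting three-way split by matrix size can be sketched as a small selector. The enum and function names here are hypothetical, and whether each boundary is inclusive or exclusive is a guess; only the thresholds 24 and 50 come from the text above.

```c
/* Which code path to use for an n x n multiply (names are illustrative). */
typedef enum {
    PATH_SMALL_DIRECT,    /* below 24: row-major data, no conversion        */
    PATH_MEDIUM_DIRECT,   /* 24 up to 50: blocked, but on the original data */
    PATH_REPACKED         /* 50 and up: copy into cache-sized buffers first */
} multiply_path;

/* Pick a path from the matrix size. The boundary handling is an
   assumption of this sketch, not taken from the submitted code. */
multiply_path choose_path(int n)
{
    if (n < 24)
        return PATH_SMALL_DIRECT;
    if (n < 50)
        return PATH_MEDIUM_DIRECT;
    return PATH_REPACKED;
}
```

The top-level multiply routine would call choose_path once and branch to the matching implementation.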

At first, when we implemented the 3x3 inner multiply for the larger 24x24 blocks, we used a very general routine that could handle blocks that were not actually 24x24 (i.e., blocks on the edges of the matrix). At some point, we decided to replace this general routine with ones that handle specific cases. When we covered the major cases, we saw large improvements. For example, when we hard-coded a routine that assumed both blocks were 24x24 (the most common case in large matrices), we got improvements of 70 MFLOPS or so. This is probably because, with the sizes hard-coded into the routine, the compiler can organize the code better and knows which loops to unroll, and so on.

We took this strategy to heart and decided to hard-code most of the possibilities. We generated routines for 1x24, 2x24, 3x24, and so on, and also for 24x1, 24x2, and so on, since both the width and the length of the block matter. Of course, this led to many new procedures, but since they were all essentially the same apart from changes in a few variables, we decided to try it anyway. Unfortunately, we did not get as much performance from this as we had hoped. Instead of the 70 MFLOPS gain we saw earlier, we got only a 15 to 20 MFLOPS improvement from these very specialized procedures. But they weren't hurting performance either, so we left them in.
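One way to produce such a family of near-identical routines is a macro that stamps out one function per block shape, so every loop bound is a compile-time constant. The sketch below is illustrative and is not how we actually generated ours: DEFINE_EDGE_KERNEL, kernel_fn, and pick_kernel are made-up names, the inner dimension is fixed at 24 for simplicity, and the blocks are assumed packed with A row major and B and C column major.

```c
#include <stddef.h>

#define BLOCK 24

/* Generate multiply_<ROWS>x<COLS>: C += A * B where the A block has ROWS
   rows and the B block has COLS columns. Blocks are packed with a leading
   dimension of BLOCK (assumption of this sketch). */
#define DEFINE_EDGE_KERNEL(ROWS, COLS)                                   \
static void multiply_##ROWS##x##COLS(const double *A, const double *B,   \
                                     double *C)                          \
{                                                                        \
    for (int i = 0; i < ROWS; i++)                                       \
        for (int j = 0; j < COLS; j++) {                                 \
            double sum = 0.0;                                            \
            for (int k = 0; k < BLOCK; k++)                              \
                sum += A[i * BLOCK + k] * B[j * BLOCK + k];              \
            C[j * BLOCK + i] += sum;                                     \
        }                                                                \
}

/* A few representative shapes; the full set would cover 1..24 each way. */
DEFINE_EDGE_KERNEL(1, 24)
DEFINE_EDGE_KERNEL(2, 24)
DEFINE_EDGE_KERNEL(24, 1)
DEFINE_EDGE_KERNEL(24, 24)

typedef void (*kernel_fn)(const double *, const double *, double *);

/* Return the specialized routine for a block shape, or NULL when none
   exists and the caller should fall back to the general routine. */
kernel_fn pick_kernel(int rows, int cols)
{
    if (rows == 24 && cols == 24) return multiply_24x24;
    if (rows == 1  && cols == 24) return multiply_1x24;
    if (rows == 2  && cols == 24) return multiply_2x24;
    if (rows == 24 && cols == 1)  return multiply_24x1;
    return NULL;
}
```

Each generated function is a separate copy of the loop nest, so the code size grows with every shape that is added.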


We have to add one note on that, though. As we added more and more procedures, we noticed that the variance in our performance increased. We attribute this to the fact that the program now has more instructions competing for cache space, which upsets the cache behavior of our code. This caused such things as the 16x16 case dropping from 250 MFLOPS to 120 on the next run because of minor coding changes. We didn't really know how to get around this, so we simply tried to be careful about not adding too many procedures.


Overall, our algorithm consists of the following features: 24x24 outer blocks, register-efficient 3x3 inner blocks, hard-coded routines for the common block sizes, memory reorganization to align sub-blocks appropriately, and heavy use of compiler optimizations.

Things that worked:


Results:
-------
Checking for correctness on sizes: 214 188 183 234 159 17 180 134 147 101

Size MFLOPS
16 230
32 186
64 178
128 198
256 188

23 282
47 174
99 201
121 202
181 179
231 181
251 184

We ran our own tests to make sure our 24x24 blocking was performing as expected. Just to let you know how it performs:

Size MFLOPS
16 250
24 303
64 161
128 201
256 186
23 263
47 147
48 161
121 207
181 186
231 186
251 186