Hi everyone,
Some of those who attended the lab session this morning might have experienced a little bit of confusion concerning the last part of the lab (Task 7.1.1). I do agree that it was a bit confusing! I was also confused, and maybe I made the situation more complicated as I was trying to clarify the task. Sorry about that!
However, please note that to implement a simple vectorized version of matrix multiplication (matmul_sse()), as the lab manual suggests, all we need to do is unroll the inner loop of the reference implementation (matmul_ref()) four times. We unroll four times because an SSE vector (128 bits) holds four packed floats (4 x 32 bits). In other words, we translate matmul_ref() directly into a vectorized version, operation by operation, taking the loop unrolling into account. The reference implementation employs no optimizations such as matrix transposition, so you do not need any such optimization either. Just focus on what happens when you unroll the innermost loop (loop index j) four times, e.g. which operands should be duplicated. Since we are not doing any special optimization, you should be able to implement it using the basic "_mm_add_ps" and "_mm_mul_ps" instructions.
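To make the unrolling idea concrete, here is a minimal sketch. It is not the lab's official solution: the loop order of matmul_ref(), the constant SIZE and the function signatures are assumptions, since the lab code is not reproduced in this post. It also uses "_mm_set1_ps", "_mm_loadu_ps" and "_mm_storeu_ps" to move data in and out of vectors; your own version may handle that differently.

```c
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_add_ps, _mm_mul_ps, ... */

#define SIZE 8  /* hypothetical matrix dimension; assumed to be a multiple of 4 */

/* Assumed shape of the reference version: j is the innermost loop. */
void matmul_ref(float a[SIZE][SIZE], float b[SIZE][SIZE], float c[SIZE][SIZE])
{
    for (size_t i = 0; i < SIZE; i++)
        for (size_t k = 0; k < SIZE; k++)
            for (size_t j = 0; j < SIZE; j++)
                c[i][j] += a[i][k] * b[k][j];
}

/* Same computation with the j loop unrolled four times, one SSE vector per step. */
void matmul_sse(float a[SIZE][SIZE], float b[SIZE][SIZE], float c[SIZE][SIZE])
{
    for (size_t i = 0; i < SIZE; i++) {
        for (size_t k = 0; k < SIZE; k++) {
            /* a[i][k] does not depend on j, so it is duplicated into all four lanes. */
            __m128 a_ik = _mm_set1_ps(a[i][k]);
            for (size_t j = 0; j < SIZE; j += 4) {
                __m128 b_vec = _mm_loadu_ps(&b[k][j]);  /* b[k][j..j+3] */
                __m128 c_vec = _mm_loadu_ps(&c[i][j]);  /* c[i][j..j+3] */
                c_vec = _mm_add_ps(c_vec, _mm_mul_ps(a_ik, b_vec));
                _mm_storeu_ps(&c[i][j], c_vec);         /* write four results back */
            }
        }
    }
}
```

Both functions accumulate into c, so c should be zeroed before the call. The unaligned load/store variants are used here only so the sketch does not depend on how the matrices are allocated.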
Also, please note that if the vector instruction you are using needs operands of type "__m128", you cannot pass it operands of type "float"! You need to convert them first. One way is to take the address of your operand and cast that pointer to the correct type (i.e. use pointer type casting). However, you might not need such a cast at all: your implementation might work well without it!
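As an illustration of the two options, here is a small sketch. The function names add_one_cast() and add_one_load() are made up for this example and are not part of the lab code. Note the assumption in the first one: dereferencing a pointer cast to "__m128 *" requires the address to be 16-byte aligned.

```c
#include <xmmintrin.h>

/* Option 1: pointer type casting.
 * Assumption: p is 16-byte aligned and points to at least four floats;
 * on an unaligned address this access is invalid and may crash. */
void add_one_cast(float *p)
{
    __m128 *v = (__m128 *)p;                 /* view four floats as one vector */
    *v = _mm_add_ps(*v, _mm_set1_ps(1.0f));  /* add 1.0f to each of the four */
}

/* Option 2: no cast. Explicit unaligned load and store work for any address. */
void add_one_load(float *p)
{
    __m128 v = _mm_loadu_ps(p);
    _mm_storeu_ps(p, _mm_add_ps(v, _mm_set1_ps(1.0f)));
}
```

Both functions do the same work; the second is the safer choice when you do not know how the data was allocated.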
One of you asked what happens when we do an SSE vector load from an arbitrary address in a matrix: will the load return values from a row or from a column? The answer is that it returns values from a row (i.e. consecutive columns in the same row), because C stores multidimensional arrays in row-major order, so the elements of a row are adjacent in memory. (You can read a short introduction to this here: http://en.wikipedia.org/wiki/Row-major_order )
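If you want to check this yourself, a sketch along the following lines should do. The helper load_four() and the 4x8 matrix shape are made up for this example.

```c
#include <xmmintrin.h>

/* Loads four consecutive floats starting at m[row][col] and copies them to out.
 * Assumption: col + 3 < 8, so the load stays inside one row. */
void load_four(float m[4][8], int row, int col, float out[4])
{
    __m128 v = _mm_loadu_ps(&m[row][col]);  /* m[row][col..col+3] */
    _mm_storeu_ps(out, v);
}
```

If you fill the matrix so that m[i][j] = 10*i + j and load from row 1, column 2, the four values you get back are 12, 13, 14 and 15: four neighbouring columns of row 1, not four rows of column 2.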
Please let me know if you are facing any problems with the assignment.