Essay
Consider the following code that adds two matrices A and B and stores the result in a matrix C:
for (i= 0 to 15) {
    for (j= 0 to 63) {
        C[i][j] = A[i][j] + B[i][j];
    }
}
Suppose we have a quad-core multiprocessor and the elements of the matrices A, B, and C are stored in row-major order. Which of the following two parallelizations is better, and why? What about when they are stored in column-major order?
(a) For each Pk in {0, 1, 2, 3}:
for (i= 0 to 15) {
    // Inner Loop Parallelization
    for (j= Pk*15 + Pk to (Pk+1)*15 + Pk) {
        C[i][j] = A[i][j] + B[i][j];
    }
}
(b) For each Pk in {0, 1, 2, 3}:
// Outer Loop Parallelization
for (i= Pk*3 + Pk to (Pk+1)*3 + Pk) {
    for (j= 0 to 63) {
        C[i][j] = A[i][j] + B[i][j];
    }
}
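One way to reason about the two schemes is to look at which memory offsets each core touches under each layout, since contiguous accesses generally make better use of spatial locality. The following is a minimal sketch, not a definitive analysis: the helper names (`offset`, `inner_split`, `outer_split`, `contiguous_runs`) are hypothetical, the loop bounds are copied from (a) and (b) with inclusive upper limits, and the 16x64 matrix is assumed to be stored densely with no padding.

```python
ROWS, COLS = 16, 64  # matrix shape taken from the loops above


def offset(i, j, order):
    """Linear element offset of [i][j]; order is 'row' or 'col' (assumed dense storage)."""
    return i * COLS + j if order == "row" else j * ROWS + i


def inner_split(pk):
    """Scheme (a): every row, columns Pk*15 + Pk .. (Pk+1)*15 + Pk inclusive."""
    cols = range(pk * 15 + pk, (pk + 1) * 15 + pk + 1)
    return [(i, j) for i in range(ROWS) for j in cols]


def outer_split(pk):
    """Scheme (b): rows Pk*3 + Pk .. (Pk+1)*3 + Pk inclusive, every column."""
    rows = range(pk * 3 + pk, (pk + 1) * 3 + pk + 1)
    return [(i, j) for i in rows for j in range(COLS)]


def contiguous_runs(cells, order):
    """Count the separate contiguous offset ranges covered by one core's cells."""
    offs = sorted(offset(i, j, order) for i, j in cells)
    return 1 + sum(1 for a, b in zip(offs, offs[1:]) if b != a + 1)


if __name__ == "__main__":
    for order in ("row", "col"):
        for name, split in (("(a) inner", inner_split), ("(b) outer", outer_split)):
            runs = [contiguous_runs(split(pk), order) for pk in range(4)]
            print(f"{order}-major, {name}: contiguous runs per core = {runs}")
```

Under these assumptions each core handles 256 elements in both schemes; what differs is whether those elements form one contiguous block or many short, separated pieces, and that flips when the layout changes from row major to column major.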
Correct Answer:

When they are stored in row major order,...