Essay
Consider two multiprocessor machines with the following configurations:
(a) Machine 1, a NUMA machine with two processors, each with 512 MB of local memory, a local memory access latency of 20 cycles per word, and a remote memory access latency of 60 cycles per word.
(b) Machine 2, a UMA machine with two processors and 1 GB of shared memory with an access latency of 40 cycles per word.
Suppose an application has two threads running on the two processors, each of which needs to access an entire array of 4096 words. Is it possible to partition this array across the local memories of the NUMA machine so that the application runs faster on it than on the UMA machine? If so, specify the partitioning. If not, by how many more cycles should the UMA memory latency be worsened for a partitioning on the NUMA machine to enable a faster run than on the UMA machine? Assume that memory operations dominate the execution time.
-------
Correct Answer:

Suppose we have x words on one processor...
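The arithmetic behind the question can be checked with a short script. This is only a sketch under stated assumptions, not necessarily the full verified answer: it assumes each thread reads every word of the array exactly once, the two threads run fully in parallel with no contention or caching, and the run time is that of the slower thread. The names `numa_time`, `uma_time`, and `x` (the number of words placed in processor 1's local memory) are illustrative.

```python
# Parameters taken from the question.
ARRAY_WORDS = 4096
LOCAL, REMOTE, UMA = 20, 60, 40  # cycles per word


def numa_time(x, total=ARRAY_WORDS, local=LOCAL, remote=REMOTE):
    """Run time in cycles when x words sit in processor 1's local memory
    and the remaining (total - x) words sit in processor 2's local memory.
    Assumption: the application finishes when the slower thread finishes."""
    t1 = local * x + remote * (total - x)   # thread on processor 1
    t2 = remote * x + local * (total - x)   # thread on processor 2
    return max(t1, t2)


def uma_time(latency=UMA, total=ARRAY_WORDS):
    """Run time in cycles on the UMA machine: every access costs the same."""
    return latency * total


# Try every possible split and keep the one with the smallest run time.
best_x = min(range(ARRAY_WORDS + 1), key=numa_time)
best_numa = numa_time(best_x)

# Smallest whole-cycle increase in UMA latency that makes NUMA strictly faster.
extra = 0
while uma_time(UMA + extra) <= best_numa:
    extra += 1

print(best_x, best_numa, uma_time(), extra)
```

Under these assumptions the best split is an even one (2048 words in each local memory), which gives 20·2048 + 60·2048 = 163,840 cycles, exactly the UMA figure of 40·4096 = 163,840 cycles. No partitioning makes the NUMA machine strictly faster as specified; any increase in UMA latency tips the balance, so with whole-cycle latencies one extra cycle per word (41 cycles) is enough.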
Q1: Applying the send/receive programming model as outlined
Q2: Suppose we have a dual core chip
Q3: Why should there be stride-access for vector
Q4: How would you rewrite the following sequential
Q5: Consider the following GPU that consists of
Q6: Consider a multi-core processor with 64
Q7: Besides network bandwidth and bisection bandwidth, two
Q8: Vector architecture exploits the data-level parallelism to
Q9: Consider a multi-core processor with heterogeneous cores:
Q10: Consider the following code that adds two