Essay
Consider a multi-core processor with 64 cores where the first shared cache is at level L3 (L1 and L2 are private to each core). Suppose an application needs to compute the sum of the values of all the nodes in a perfectly balanced binary tree T (a tree where every node has either two or zero children and all the leaves are at the same depth/level). Assuming that an add operation takes 10 units of time, the total sum can be computed sequentially in 10*n units of time, where n is the number of nodes in T. This ignores the time to load nodes and traverse links in the tree (assume it is factored into the add). One way to compute the sum in parallel is to have two arrays of length 64: (a) an input array holding the roots of the 64 leaf subtrees, and (b) an output array holding the partial sum that each core computes for its leaf subtree. Both arrays are indexed by the id (0 to 63) of the core that is responsible for that entry. Furthermore, assume the following:
The input array is already filled in with the roots of the leaf subtrees, and each core's entry is in that core's L1 cache.
Core 0 is responsible for handling the internal nodes that are not a part of any of the 64 subtrees. It is also responsible for computing the final sum once the partial sums are filled in.
Every time a partial sum is filled in the output array by a core, another core can fill in its partial sum only after a minimum delay of 50 units (due to cache invalidations).
In order to achieve a speedup of greater than 2 (over sequential code), what is the minimum number of nodes that should be present in the tree?
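The setup above can be turned into a small cost model. The sketch below is one possible reading of the problem, not the verified answer. It assumes that the 64 subtree sums run fully in parallel, that the 64 writes to the output array are serialized with 63 gaps of 50 units, that core 0 then spends one add per internal node (63 nodes) and one add per partial sum (64 sums), and that none of these phases overlap. The function names and the no-overlap assumption are illustrative choices.

```python
CORES = 64        # number of cores, one leaf subtree each
ADD = 10          # cost of one add, in time units
WRITE_DELAY = 50  # minimum gap between two writes to the output array


def sequential_time(n):
    """One add per node, as stated in the problem."""
    return ADD * n


def parallel_time(n):
    """Estimated parallel time under the assumptions in the lead-in."""
    internal = CORES - 1                    # 63 nodes above the 64 subtree roots
    subtree = (n - internal) // CORES       # nodes in each leaf subtree
    compute = ADD * subtree                 # all cores sum their subtree at once
    writes = WRITE_DELAY * (CORES - 1)      # assumed: 63 gaps between 64 writes
    core0 = ADD * internal + ADD * CORES    # assumed: internal nodes, then final sum
    return compute + writes + core0


def min_nodes_for_speedup(target=2.0):
    """Smallest perfectly balanced tree (n = 2**levels - 1) beating the target."""
    levels = 7  # 7 levels (127 nodes) is the smallest tree with 64 non-empty subtrees
    while True:
        n = 2**levels - 1
        if sequential_time(n) > target * parallel_time(n):
            return n
        levels += 1


if __name__ == "__main__":
    n = min_nodes_for_speedup()
    print(n, sequential_time(n) / parallel_time(n))
```

Under this model, a tree with 511 nodes cannot reach a speedup of 2, because the serialized writes alone cost 3150 units against a sequential time of 5110. The next perfectly balanced size, 1023 nodes, gives 10230 / 4570, a speedup of roughly 2.24. Different assumptions about overlap or about the number of final adds change the parallel time but not this conclusion.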
Correct Answer:

Let n be the total number of nodes in th...
Q1: Applying the send/receive programming model as outlined
Q2: Suppose we have a dual core chip
Q3: Why should there be stride-access for vector
Q4: How would you rewrite the following sequential
Q5: Consider the following GPU that consists of
Q7: Besides network bandwidth and bisection bandwidth, two
Q8: Vector architecture exploits the data-level parallelism to
Q9: Consider a multi-core processor with heterogeneous cores:
Q10: Consider the following code that adds two
Q11: Consider a system with two multiprocessors with