Tuesday, November 25, 2008

LATE

No, not this post. That's the name of the algorithm for deciding which tasks to speculatively execute. That is the essence of this paper, and unlike most of the other papers we've read, which have, for the most part, either speculated about or extrapolated their results to the internet, it has a lot of actual experiments to back it up.

The basic idea is that the MapReduce scheduler has a few underlying assumptions that cause problems when working on heterogeneous systems. Among these assumptions are that all nodes execute tasks at more or less the same pace, and that all phases of reduction operate at roughly the same speed.

The first assumption leads to problems when detecting stragglers. A node that runs twice as slow as average (possibly due to resource contention) is treated essentially the same as a node that runs ten times as slow as average. In the original scheduler, all "stragglers" were treated the same, and were assigned to idle nodes based on locality. The LATE scheduler prioritizes these nodes by estimated finish time, which they calculate by using the work completion estimate divided by the time spent to guess a rate at which the node processes work, and then to extrapolate its time remaining.

The second assumption can lead to scenarios in which a certain percentage of tasks raise the average significantly. For example, the copying, sorting, and reducing phases are considered to each be 1/3rd of the work. However, the copying phase usually takes the longest by a significant margin. In particular if a job finishes almost immediately after the copying phase, this means that some jobs will be at 100% complete, while others will not yet be at 33%. This disparity causes a spike in the average % completion, which then labels all uncompleted jobs stragglers, thereby causing excessive speculative executions.

LATE also fixes a problem that arises when jobs do not finish in waves. Under normal circumstances, jobs all take roughly the same amount of time, and finish roughly at the same time. Thus, a job that is taking longer than others to finish can safely be considered a "straggler". But in a heterogeneous system, this is not necessarily true, which is why the LATE scheduler's completion time heuristic performs better than the previous heuristic, which just measured progress.

Their results show that they almost always do better than both the original scheduler, as well as the scheduler with no speculative execution for the best, worst, and average cases. In fact, they do better by a factor of two in some cases. There is some talk about extending the estimated completion time calculation to understand different "phases", but that seems needlessly complicated. However, it is noted that this scheduler may have problems if early phases run faster than late phases.

No comments: