Tuesday, November 25, 2008


No, not this post. That's the name of the algorithm (LATE, for Longest Approximate Time to End) for deciding which tasks to speculatively execute. That is the essence of this paper, and unlike most of the other papers we've read, which have, for the most part, either speculated about or extrapolated their results to the internet, this one has a lot of actual experiments to back it up.

The basic idea is that the MapReduce scheduler has a few underlying assumptions that cause problems when working on heterogeneous systems. Among these assumptions are that all nodes execute tasks at more or less the same pace, and that all phases of reduction operate at roughly the same speed.

The first assumption leads to problems when detecting stragglers. A node that runs twice as slow as average (possibly due to resource contention) is treated essentially the same as a node that runs ten times as slow as average. In the original scheduler, all "stragglers" were treated the same, and were assigned to idle nodes based on locality. The LATE scheduler instead prioritizes tasks by estimated finish time: it divides the fraction of work completed by the time spent so far to estimate each task's progress rate, and then uses that rate to extrapolate the time remaining.
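A minimal sketch of that finish-time heuristic, using hypothetical field names (the paper's Hadoop implementation is more involved than this):

```python
def estimate_time_left(progress, elapsed):
    """Estimate a task's remaining time from its progress so far.

    progress: fraction of work completed (0.0 - 1.0)
    elapsed: seconds the task has been running
    A sketch of the heuristic described above, not the paper's code.
    """
    if progress == 0:
        return float("inf")            # no progress yet: treat as slowest
    rate = progress / elapsed          # estimated work completed per second
    return (1.0 - progress) / rate     # extrapolate remaining work at that rate

def pick_speculation_candidate(tasks):
    """Among running tasks, speculate on the one estimated to finish last."""
    return max(tasks, key=lambda t: estimate_time_left(t["progress"], t["elapsed"]))
```

Note that a task at 90% done after 9 seconds (about 1 second left) loses out to a task at 20% done after 10 seconds (about 40 seconds left), even though the latter has a higher absolute progress rate of elapsed time.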

The second assumption can lead to scenarios in which a small fraction of tasks raises the average progress significantly. For example, the copy, sort, and reduce phases of a reduce task are each counted as 1/3 of the work, but the copy phase usually takes the longest by a significant margin. If a task finishes almost immediately after its copy phase, some tasks will be at 100% complete while others are not yet at 33%. This disparity causes a spike in the average completion percentage, which then labels all unfinished tasks as stragglers, thereby causing excessive speculative execution.
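The weighting can be shown in a couple of lines (a toy model of the progress score, not Hadoop's actual code):

```python
def reduce_progress(phase, frac_in_phase):
    """Progress score when copy, sort, and reduce each count as 1/3 of the work.

    phase: 0 = copy, 1 = sort, 2 = reduce
    frac_in_phase: fraction of that phase completed
    A hypothetical helper illustrating the weighting described above.
    """
    return (phase + frac_in_phase) / 3.0

# A task that just finished its (slow) copy phase sits at 33%...
just_copied = reduce_progress(0, 1.0)
# ...while a task whose quick sort and reduce phases also finished is at 100%.
finished = reduce_progress(2, 1.0)
```

The jump from 1/3 straight to 1.0 for each finishing task is what drags the average up and makes the still-copying tasks look like stragglers.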

LATE also fixes a problem that arises when tasks do not finish in waves. Under normal circumstances, tasks all take roughly the same amount of time and finish at roughly the same time. Thus, a task that is taking longer than the others can safely be considered a "straggler". But in a heterogeneous system this is not necessarily true, which is why the LATE scheduler's completion-time heuristic performs better than the previous heuristic, which just measured progress.

Their results show that they almost always do better than both the original scheduler, as well as the scheduler with no speculative execution for the best, worst, and average cases. In fact, they do better by a factor of two in some cases. There is some talk about extending the estimated completion time calculation to understand different "phases", but that seems needlessly complicated. However, it is noted that this scheduler may have problems if early phases run faster than late phases.

Monday, November 24, 2008

Policy Aware Switching

I would just like to start by saying that while this paper presents a nice system, it is relatively long by virtue (or vice) of verbosity.

This paper essentially presents a system similar to the DOA we studied a few weeks ago. The idea is that p-switches delegate the standard middlebox functions to boxes that are not on the path into the datacenter. This approach is somewhat easier to deploy, because it does not require a new naming scheme and can be implemented one datacenter at a time.

Each p-switch has a configuration that basically uses the source MAC address and the 5-tuple of each packet to decide where the packet should go next. The 5-tuple consists of five fields: source IP, source port, destination IP, destination port, and transport protocol. In the configuration, each (5-tuple, source) pair has a "next hop" associated with it (wildcards allowed). Thus, a packet that comes from (for example) a firewall, even though none of its 5-tuple fields has changed, can still be directed to the appropriate next hop (for example, a load balancer), because the p-switch can deduce its source.
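A toy model of such a rule table, with a hypothetical encoding (the real p-switch matches rules in a far more switch-friendly way):

```python
WILDCARD = "*"

def matches(rule_field, value):
    """A rule field matches either exactly or via wildcard."""
    return rule_field == WILDCARD or rule_field == value

def next_hop(rules, prev_hop_mac, five_tuple):
    """Return the next hop for a packet, matched on the previous hop's MAC
    plus the 5-tuple (src IP, src port, dst IP, dst port, protocol).

    `rules` is an ordered list of ((prev_hop, 5-tuple-pattern), next_hop)
    pairs; first match wins. This encoding is illustrative, not the paper's.
    """
    for (rule_prev, pattern), hop in rules:
        if matches(rule_prev, prev_hop_mac) and all(
            matches(r, v) for r, v in zip(pattern, five_tuple)
        ):
            return hop
    return None
```

With one rule keyed on the firewall's MAC and a wildcard rule below it, the same unmodified packet is sent first to the firewall and then, on its second pass through the p-switch, to the load balancer.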

Using a clever combination of wildcards and extra rules, the system also handles packets that are addressed to one of the middleboxes themselves (say, a load balancer that will rewrite the IP header), making sure that each packet passes through the same middleboxes on the way in and on the way out.

Rules can have multiple values for the "next hop" field, allowing the p-switch to load-balance the work of a single middlebox across multiple middleboxes of the same type. To make sure that all packets from a single flow go through the same middlebox instance, the p-switches use consistent hashing.
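A generic consistent-hashing sketch of how a p-switch might pin each flow to one instance (names and parameters here are illustrative, not from the paper):

```python
import hashlib
from bisect import bisect

class ConsistentHash:
    """Map flows to middlebox instances on a hash ring, so each flow always
    hits the same instance and removing an instance only remaps a fraction
    of flows. A generic sketch, not the paper's implementation."""

    def __init__(self, instances, replicas=100):
        # Each instance gets many virtual points on the ring for balance.
        self.ring = sorted(
            (self._hash(f"{inst}#{i}"), inst)
            for inst in instances
            for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def lookup(self, flow_five_tuple):
        """Return the instance owning the first ring point at or after the
        flow's hash (wrapping around)."""
        h = self._hash(str(flow_five_tuple))
        idx = bisect(self.keys, h) % len(self.ring)
        return self.ring[idx][1]
```

Because the mapping depends only on the flow's hash and the ring, every p-switch that shares the configuration computes the same instance for a given flow.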

I have nothing bad to say about this paper, except that the recovery time from failures appears to be as large as 3 seconds. That's a huge amount of downtime or lost packets for a datacenter. Of course, that may be a result of the implementation, but it would be nice to see that number reduced.

Thursday, November 20, 2008


Note: I lost internet in between starting and finishing this post. I've had internet for a minute or two right now.

This paper essentially presents a protocol for accomplishing multicast in which there is a designated sender, and all other nodes are receivers; in other words, they are subscribing to the data being sent.

To accomplish this in an efficient manner, the authors suggest a dynamically constructed tree of bounded degree. Given a constant k, they keep each group's size between k and 3k-1. This means that it is always possible to split a group that has grown too large, and always possible to merge a group that has shrunk too small with a neighbor (if their combined size exceeds 3k-1, split the large one, and then merge with either of the resulting halves). This is essentially a concept stolen from B-trees.
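The size invariant can be illustrated with a toy rebalancing routine (k = 4 is just an example value, and only the undersized-first-group case described above is handled, to keep the sketch short):

```python
K = 4  # example bound: group sizes stay between K and 3*K - 1

def rebalance(group_a, group_b):
    """If group_a has shrunk below K members, merge it into neighbor group_b;
    if the merged group would exceed 3K - 1, split it evenly instead.
    Returns the list of resulting groups. A toy sketch of the invariant,
    not the paper's protocol."""
    if len(group_a) < K:
        merged = group_a + group_b
        if len(merged) <= 3 * K - 1:
            return [merged]
        # Merged group would be too big: split into two legal halves.
        mid = len(merged) // 2
        return [merged[:mid], merged[mid:]]
    return [group_a, group_b]
```

Since a legal neighbor has at least K members, the merged group has at least K; and since it has at most (K-1) + (3K-1) members, each half of a split stays within the bounds, so the invariant is always restorable.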

The basic concept behind multicast in this setup is that messages propagate from the root down to the leaves. Each new node joins at the lowest level of the tree. Each group of nodes in the tree has a leader, which is a member not only of its own level, but also of the level one above. This continues recursively until there is a single root. The leader of a group is its geographic center; this enables a new node to find its group in O(log n) time by querying nodes successively lower in the hierarchy for their lists of children. With each query, the new node moves to the child closest to itself, and eventually ends up in the lowest level of the structure with the leader that is closest to itself.
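The join descent might look roughly like this, assuming a simple coordinate-based distance metric as a stand-in for whatever latency/geography measure the protocol actually uses:

```python
class Leader:
    """A group leader in the hierarchy; internal leaders have child leaders."""
    def __init__(self, pos, children=()):
        self.pos = pos                    # coordinates for the distance metric
        self.children = list(children)

def dist(a, b):
    """Squared Euclidean distance; a placeholder for a real latency metric."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def find_group(root, newcomer_pos):
    """Walk down from the root, always moving to the child leader closest to
    the newcomer, until reaching a bottom-level leader. For a balanced tree
    this takes O(log n) queries."""
    node = root
    while node.children:
        node = min(node.children, key=lambda c: dist(c.pos, newcomer_pos))
    return node
```

The newcomer then joins the group led by the returned bottom-level leader.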

Nodes in the same group send each other heartbeat messages. When a node goes silent for long enough, it is considered dead. When the leader goes silent, a new leader is elected through a distributed algorithm that was not entirely clear to me.

They run some simulations, and find that NICE performs slightly better than Narada in all metrics, including stress and path length, and does significantly better in terms of control overhead. But all the simulations were run with 128 nodes, which seems kind of small for the kind of multicast they're discussing (news feeds, stock tickers, etc.).

The one thing that I think could be seriously improved is the criterion for leader selection. Certainly, it is important to have a leader who is geographically close to all nodes in the group. However, it also seems important to have a leader with a reliable internet connection. After all, if packets don't get to the leader, they will not get to the leaves; but if the leader is simply unreliable rather than dead, then that particular node will continue to be the leader and to do a disservice to the rest of the nodes in its group. Some kind of redundant leadership might be nice in this situation.


This paper essentially describes an implementation of a reliability layer on top of a multicast protocol. The authors state that they will implement reliability by avoiding ACKs (and the ACK implosion problem), as well as sender-based reliability protocols (since this would mean the sender would have to track participation).

In the model that they use for multicast, participants in a multicast group can leave and join at will. All messages are delivered to all participants, and all participants are both senders and receivers. The actual multicast is achieved using some sort of structure for message propagation; this structure can be a chain, a star, a tree, or anything else.

The first part of this paper covers loss detection and recovery. Each participant has a unique ID, and its messages are sent out with that ID as well as a sequence number. These numbers apply to distinct application-level messages, rather than packets. Whenever a node sees a gap in the sequence numbers from a source, it assumes that there has been a loss, and prepares to request a recovery. It waits a random amount of time that also depends on its "distance" from the sender.
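The per-source gap detection can be sketched in a few lines (a toy model; the exact numbering details are assumptions):

```python
def detect_losses(expected_next, received_seq):
    """Given the next sequence number we expected from a source, and the
    number that actually arrived, return the missing message numbers to
    request. A sketch of the per-source gap detection described above."""
    if received_seq <= expected_next:
        return []                         # in order (or a duplicate): no gap
    return list(range(expected_next, received_seq))
```

Because the numbers name application-level messages rather than packets, a single detected gap may correspond to many lost packets.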

By randomizing the amount of time that they wait, these nodes help avoid the synchronization problem. When a node sends a request for missing data, it is multicast like all other messages. This means that other nodes who want the same data can now back off (reset and double the timers on their requests) so as to have a minimal number of requests for each piece of missing data. Any node that hears a recovery request and has the requested data can respond to the request, thereby (hopefully) improving the response time for these requests. Not pictured here: RTT estimator (that doesn't require synchronized clocks), race condition avoidance (wait 3*RTT estimate).
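A sketch of such a request timer, with illustrative constants rather than the paper's tuned values:

```python
import random

def request_delay(distance_to_sender, backoff_count, c1=2.0, c2=2.0):
    """Delay before multicasting a repair request: scaled by the estimated
    distance to the sender, with a random component so that nearby receivers
    don't all fire at once, and doubled each time a duplicate request for
    the same data is overheard. c1 and c2 are illustrative constants, not
    the paper's values."""
    base = random.uniform(c1, c1 + c2) * distance_to_sender
    return base * (2 ** backoff_count)
```

A node that overhears another node's request for the same data increments its backoff count and restarts its timer, so usually only one request per loss actually goes out.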

They then run a couple of simulations of the algorithm's performance on several different topologies. They conclude that the number of retransmissions per lost data message is low, the amount of time to recover a lost message (divided by the RTT) is low, and so forth. However, as they said in the paper, "We do not claim to be presenting realistic topologies or typical patterns of packet loss." So, in other words, this is all pretty suspect.

Lastly, they present a theoretical improvement where the wait time is adjusted based on the number of duplicate requests that a node sees, as well as the average request delay that it has. They then consider using TTL values to localize recovery requests and retransmissions (under the assumption that a loss implies a single failure somewhere in the topology rather than general network unreliability).

In general, I liked the first part of this paper a lot more than the rest of it. It presented an algorithm and implementation that was actually in use, and served a very specific purpose. As they state in the paper, it is unlikely that anyone will create a multicast application that satisfies everyone's needs, so it is better to have a more application-specific framework for multicast.

Monday, November 17, 2008


This paper is kind of short (I'm not complaining), and generally outlines difficulties with modern routing across the internet, especially with respect to wireless networks, ad-hoc networks, and sensor networks. It then suggests a partitioning system in order to deal with these difficulties, and presents some design decisions that it thinks would be best.

The first thing that is suggested is a partitioning into various independent networks, which can be addressed by name. One of these, of course, would be the internet as we know it today; the rest would be networks such as military ad-hoc networks. The key to communication between different networks would be DTN gateways. In general, a message would be destined for a particular network, with a particular address within that network. The DTN gateways then do late resolution, where the latter half of the address is resolved only once the data enters the appropriate network.

I like some of these ideas, and must disagree with others. Partitioning the network into multiple address spaces is exactly the kind of thing that the original internet was supposed to overcome. The idea of IP over everything still has merit (in my personal opinion). However, I think using DTN gateways to control traffic moving into a wireless network is a relatively good idea, for the reasons below.

The main idea behind DTN gateways is that they will control all traffic headed into their particular network. This includes long-term storage for data that is destined for nodes that are currently disconnected, as well as address translation/lookup. The long-term storage makes reliability much easier for long-term connections (essentially, the responsibility for delivering the data is delegated to the DTN gateway), and address translation is good for networks such as sensor networks, which like to use address compression to save bandwidth.

The authors have a security suggestion that involves edge routers in the internet verifying public-key encrypted information from end nodes. This seems like a generally bad idea, given the computational cost, and the possibility of cracking public-key encryption given sufficient time and resources (of course, this depends on the strength of the encryption). Still, they recognize the need for authentication, which is something the internet doesn't do well right now.

On the whole, I'm not sure this paper provides very good solutions to today's problems, but it does seem to identify the problems very well.


The Data-Oriented Transfer service (DOT) is essentially a layer that abstracts the transfer of large files into a service that can be used via API calls at the application layer.

The basic concept behind DOT is that file transfer is used by a large number of protocols (HTTP, FTP, SMTP, etc.), each of which could benefit from the separation between the transfer of metadata and the transfer of the actual file. DOT suggests that each file transferred should get a unique ID, and that a sender should upload files to DOT, and then pass a unique ID (as well as some hints) to the receiver, who can then receive the file from DOT.
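A toy stand-in for that upload-then-fetch flow (the real DOT chunks data and negotiates transfer plugins; this only illustrates the content-derived ID idea, and all names here are made up):

```python
import hashlib

class DotService:
    """Toy model of a DOT-style transfer service: store data under a
    content-derived object ID, and let a receiver fetch it by that ID."""

    def __init__(self):
        self.store = {}

    def put(self, data: bytes) -> str:
        """Upload data; return its unique, content-derived object ID."""
        oid = hashlib.sha256(data).hexdigest()
        self.store[oid] = data
        return oid

    def get(self, oid: str) -> bytes:
        """Fetch previously uploaded data by its object ID."""
        return self.store[oid]

# The sender uploads the bytes, then passes only the small ID (plus hints)
# over the existing control channel, e.g. inside the SMTP or HTTP exchange.
dot = DotService()
oid = dot.put(b"large attachment bytes")
assert dot.get(oid) == b"large attachment bytes"
```

Deriving the ID from the content also means two identical uploads share a single stored copy, which is what makes the caching argument possible at all.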

The suggested benefits of this scheme are:
1) abstraction - the underlying transfer method can be changed from a TCP connection to a torrenting scheme without any trouble.
2) caching - by caching the files at local nodes, you can reduce the overall bandwidth use of multiple file transfers
3) faster transfer - in theory DOT can split a file up and send it to its destination over multiple links, thereby getting the file there faster

The authors claim that this file transfer scheme will be widely applicable. Actually, I think this transfer scheme is very narrowly applicable. Certainly, for large e-mail attachments, this scheme has SOME merit (see discussion on expiration below). However, the authors provide nearly no performance data, and they avoid discussing performance for small files. This leads us to believe that for small files (such as webpages, which constitute a good chunk of internet traffic), the DOT scheme has too much overhead to actually be useful. In the world of the web, it is probably best suited to embedded videos, which are usually handled by content distribution networks anyway.

There is a serious flaw in DOT concerning the expiration and caching of data. Let's begin with caching. As we saw in the DNS caching paper, most of the requests that go out across the internet are to servers that are visited exactly once. The rest are to a few very popular servers. It is unreasonable to expect caching of large files to be effective in a scenario where caching of DNS entries is not effective. Sure, DOT could cache files for very popular websites, but this would in essence be duplicating work already done by CDNs.

The expiration of data in DOT is also a serious issue. According to this paper, data expires when the sender sends notice to the DOT that the data need no longer be stored. This makes the DOT vulnerable to denial of service attacks by clients that upload data, and then never mark it as deletable. Furthermore, consider the e-mail sender we discussed above. Suppose he sends an e-mail with a large attachment out to a large number of recipients. When is the correct time to mark the attachment as deletable? It is likely to be AFTER all recipients have downloaded the attachment. Of course, it is very difficult for the sender to know when all the recipients have finished downloading the file; this would likely require some sort of acknowledgement of receipt of the file. And, of course, if some of the e-mails bounce, or a single address forwards to multiple recipients, this will make counting much more difficult. In other words, the protocol must be modified in order to safely deploy DOT. Or, the sender can just never mark his attachment for deletion.

I'm not actually sure what kind of client would regularly use DOT to transfer files. BitTorrent is fairly dominant in the large file transfer market, and the rest of file transfers are likely to be over HTTP, in which case the files are likely to be small, and the overhead is likely to be significant. I'm not sure what problem the DOT system solves.

Thursday, November 13, 2008


Basically, what is being proposed here is a system that will be present throughout all the different levels of any internet application so as to be able to help people debug problems that they encounter. The idea is that the internet is getting REALLY complex (which is very true - just look at how much material we have for this class), and therefore, we need some sort of tool that will help us debug problems.

X-Trace generally assumes that the flow of packets across the internet generates a tree of causality. There's a very good example of this in the paper, with a web browser at the root, one subtree of DNS requests, and another subtree in which an HTTP request leads to an Apache call. X-Trace accomplishes this kind of tracing with metadata that is carried through all levels of a request. The metadata includes a unique task identifier, an operation identifier, a parent ID, and some options. The task identifier allows reports from the same request to be grouped together, whereas the parent ID allows these reports to be arranged into the appropriate tree structure.

Thus, for an intermediate device, application, or library to implement X-Trace, it must essentially implement two different calls: PushNext() and PushDown(). PushNext() indicates that it is pushing the X-Trace metadata to the next logical node in the eventual X-Trace tree. PushDown() indicates that the metadata is being pushed to a lower layer.
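A minimal model of the metadata and those two calls (field names and ID generation here are illustrative, not the actual X-Trace wire format):

```python
import itertools

_ids = itertools.count(1)  # toy generator for fresh operation IDs

class XTraceMetadata:
    """Toy model of the metadata described above: a task ID shared by the
    whole request, plus a per-operation ID and a parent ID that let the
    collector rebuild the causal tree from individual reports."""

    def __init__(self, task_id, op_id, parent_id):
        self.task_id, self.op_id, self.parent_id = task_id, op_id, parent_id

    def push_next(self):
        # Same layer, next logical node: a new operation whose parent is us.
        return XTraceMetadata(self.task_id, next(_ids), self.op_id)

    def push_down(self):
        # Lower layer (e.g. HTTP handing off to TCP): also a child of us.
        return XTraceMetadata(self.task_id, next(_ids), self.op_id)
```

Every report carries (task_id, op_id, parent_id), so the collector can group reports by task and then stitch the parent pointers into the tree.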

Once everyone has implemented these calls, they can generate reports for each X-Trace request. These reports are sent to a local handler (for security purposes). The X-Trace metadata also has a tag that allows them to (optionally) send the report to the original requester. This step is optional, since a company like Wikipedia or Google may not want to reveal any information about their internal composition.

The major issue with X-Trace seems to be deployment. For some protocols, such as HTTP, deployment is made easy through the use of extension headers. For others, such as TCP, this becomes much more difficult, and requires some changes to the Linux kernel. For UDP, it is impossible without either changing the protocol itself, or adding a tiny layer on top of it. If X-Trace can be deployed across the internet, and protocols can be altered so as to support it, then it will be an invaluable debugging tool. However, the likelihood of such a deployment seems small.