Tuesday, November 25, 2008

LATE

No, not this post. That's the name of the algorithm (Longest Approximate Time to End) for deciding which tasks to speculatively execute. That algorithm is the essence of this paper, and unlike most of the other papers we've read, which have, for the most part, either speculated about or extrapolated their results to the internet, this one has a lot of actual experiments to back it up.

The basic idea is that the MapReduce scheduler has a few underlying assumptions that cause problems when working on heterogeneous systems. Among these assumptions are that all nodes execute tasks at more or less the same pace, and that all phases of reduction operate at roughly the same speed.

The first assumption leads to problems when detecting stragglers. A node that runs twice as slow as average (possibly due to resource contention) is treated essentially the same as a node that runs ten times as slow as average. In the original scheduler, all "stragglers" were treated the same and were assigned to idle nodes based on locality. The LATE scheduler instead prioritizes these tasks by estimated finish time: it divides each task's progress so far by the time the task has been running to get a progress rate, and then uses that rate to extrapolate the task's time remaining.
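
To make that concrete, here is a minimal sketch of the estimate in Python. The names and numbers are mine, not the paper's, and it ignores the paper's caps on how many speculative copies run and on which nodes they run.

```python
def time_left(progress_score, elapsed_secs):
    """Estimate a task's remaining time from its observed progress rate.

    progress_score: fraction of the task completed so far (0.0 - 1.0).
    elapsed_secs:   how long the task has been running.
    """
    if progress_score <= 0.0:
        return float('inf')                    # no progress yet; assume the worst
    rate = progress_score / elapsed_secs       # work completed per second
    return (1.0 - progress_score) / rate       # remaining work / observed rate

# LATE speculates on the running task with the LONGEST estimated time left,
# rather than on any task whose progress merely falls below the average.
tasks = [
    {'id': 'a', 'progress': 0.9, 'elapsed': 90},   # fast node
    {'id': 'b', 'progress': 0.3, 'elapsed': 90},   # slow node
]
candidate = max(tasks, key=lambda t: time_left(t['progress'], t['elapsed']))
print(candidate['id'])   # 'b' is the better speculation candidate
```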

The second assumption can lead to scenarios in which a certain fraction of tasks raises the average progress significantly. For example, the copy, sort, and reduce phases of a reduce task are each counted as 1/3 of its work, even though the copy phase usually takes the longest by a significant margin. In particular, if a task finishes almost immediately after the copy phase, some tasks will be at 100% complete while others are not yet at 33%. This disparity causes a spike in the average percent completion, which then labels all unfinished tasks stragglers, thereby causing excessive speculative execution.
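
A toy illustration of that spike, assuming the equal 1/3 weighting of the copy, sort, and reduce phases and the mark-anything-20%-below-average straggler rule described in the paper; the specific numbers below are made up.

```python
# Hadoop-style reduce progress: copy, sort, and reduce each count as 1/3,
# even though copy usually dominates the actual running time.
def reduce_progress(copy_frac, sort_frac, reduce_frac):
    return (copy_frac + sort_frac + reduce_frac) / 3.0

# A few reducers raced through copy and finished; the rest are still copying.
finished = [reduce_progress(1.0, 1.0, 1.0)] * 3      # 1.00 each
copying  = [reduce_progress(0.8, 0.0, 0.0)] * 5      # ~0.27 each

scores = finished + copying
avg = sum(scores) / len(scores)
print(avg)   # ~0.54: the finished tasks drag the average far above the copiers,
             # so the 20%-below-average rule marks every copier a "straggler".
```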

LATE also fixes a problem that arises when tasks do not finish in waves. Under normal circumstances, tasks all take roughly the same amount of time and finish at roughly the same time, so a task that is taking longer than the others can safely be considered a "straggler". In a heterogeneous system this is not necessarily true, which is why the LATE scheduler's completion-time heuristic performs better than the previous heuristic, which just compared progress.

Their results show that LATE almost always does better than both the original scheduler and a scheduler with no speculative execution, in the best, worst, and average cases; in some cases it is better by a factor of two. There is some talk about extending the estimated-completion-time calculation to understand different "phases", but that seems needlessly complicated. The authors do note, however, that this scheduler may have problems if early phases run faster than late phases.

Monday, November 24, 2008

Policy Aware Switching

I would just like to start by saying that while this paper presents a nice system, it is relatively long by virtue (or vice) of verbosity.

This paper essentially presents a system similar to the DOA we studied a few weeks ago. The idea is that there are p-switches that delegate the standard middlebox functions to boxes that are not on the path into the datacenter. This approach is somewhat easier to deploy, because it does not require a new naming scheme and can be implemented one datacenter at a time.

Each p-switch has a configuration that uses the source MAC address and the 5-tuple of each packet to decide where the packet should go next. The 5-tuple consists of five fields: source IP, source port, destination IP, destination port, and transport protocol. Each (5-tuple, source) pair in the configuration has a "next hop" associated with it (wildcards are allowed). Thus a packet that arrives from, say, a firewall can still be directed to the appropriate next hop (say, a load balancer) even though none of its 5-tuple fields has changed, because the p-switch can deduce where it came from.
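
A rough sketch of the kind of rule table and lookup I picture; the field names and example rules are mine, not the paper's.

```python
# Each rule: (previous hop, partial 5-tuple pattern) -> next hop.
# An empty 'match' acts as a wildcard. This only sketches the matching idea.
RULES = [
    # traffic arriving from the internet-facing port goes to the firewall first
    {'src': 'internet',   'match': {'dst_port': 80}, 'next_hop': 'firewall-1'},
    # the same packets, now arriving *from* the firewall, go to the load balancer
    {'src': 'firewall-1', 'match': {'dst_port': 80}, 'next_hop': 'loadbal-1'},
    {'src': 'loadbal-1',  'match': {},               'next_hop': 'servers'},
]

def next_hop(prev_hop, pkt):
    """pkt is a dict with the 5-tuple fields, e.g. 'src_ip', 'dst_port', 'proto'."""
    for rule in RULES:
        if rule['src'] not in ('*', prev_hop):
            continue
        if all(pkt.get(k) == v for k, v in rule['match'].items()):
            return rule['next_hop']
    return 'default'   # no policy applies; forward normally

print(next_hop('firewall-1', {'src_ip': '1.2.3.4', 'dst_port': 80, 'proto': 'tcp'}))
# -> 'loadbal-1': same 5-tuple, but the previous hop shows it already cleared the firewall
```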

The system also handles situations where packets are addressed to one of the middleboxes themselves (say, a load balancer that will rewrite the IP header): a clever combination of wildcards and extra rules makes sure that each packet passes through the same sequence of middleboxes on the way in and on the way out.

Rules can have multiple values for the "next hop" field, allowing the p-switch to spread the load of a single middlebox role across multiple instances of the same type. To make sure that all packets from a single flow go through the same instance, the p-switches use consistent hashing.
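
A minimal consistent-hashing sketch for pinning flows to instances; this shows the general technique rather than the paper's exact implementation, and the instance names are invented.

```python
import hashlib
from bisect import bisect

def h(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

class ConsistentHash:
    """Map each flow to one middlebox instance; adding or removing an instance
    only remaps the flows that hashed near it on the ring."""
    def __init__(self, instances, vnodes=64):
        self.ring = sorted((h(f"{inst}#{i}"), inst)
                           for inst in instances for i in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def pick(self, five_tuple):
        idx = bisect(self.keys, h("|".join(map(str, five_tuple)))) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHash(['fw-1', 'fw-2', 'fw-3'])
flow = ('10.0.0.5', 12345, '171.64.1.1', 80, 'tcp')
print(ring.pick(flow))   # every packet of this flow lands on the same firewall
```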

I have nothing bad to say about this paper, except that the recovery time from failures appears to be as large as 3 seconds. That's a huge amount of downtime or lost packets for a datacenter. Of course, that may be a result of the implementation, but it would be nice to see that number reduced.

Thursday, November 20, 2008

NICE

Note: I lost internet between starting and finishing this post, and have only had it back for a minute or two.

This paper essentially presents a protocol for application-layer multicast in which there is a designated sender and all other nodes are receivers; in other words, they are subscribing to the data being sent.

To accomplish this in an efficient manner, the authors build a dynamically maintained hierarchy of clusters of bounded size. Given a constant k, each cluster's size is kept between k and 3k-1. This means that it is always possible to split a cluster that has grown too large, and always possible to merge a cluster that has shrunk too small with a neighbor (if their combined size exceeds 3k-1, split the larger one and then merge with either of the resulting halves). This is essentially a concept stolen from B-trees.
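
A sketch of how the [k, 3k-1] invariant can be maintained, with clusters as plain lists. The split/merge policy details here (which neighbor to merge with, where to split) are my own simplification, not the paper's exact rules.

```python
K = 3                      # the cluster-size constant from the paper

def maintain(cluster, clusters):
    """Keep every cluster's size within [K, 3K-1] by splitting or merging."""
    if len(cluster) > 3 * K - 1:                 # too big: split into two halves,
        mid = len(cluster) // 2                  # each half still has at least K members
        clusters.remove(cluster)
        clusters.extend([cluster[:mid], cluster[mid:]])
    elif len(cluster) < K and len(clusters) > 1: # too small: merge with a neighbor
        other = min((c for c in clusters if c is not cluster), key=len)
        clusters.remove(cluster)
        other.extend(cluster)
        if len(other) > 3 * K - 1:               # the merged cluster may now be too big,
            maintain(other, clusters)            # in which case it gets split again

clusters = [list(range(5)), list(range(5, 7))]   # sizes 5 and 2 with K = 3
maintain(clusters[1], clusters)
print([len(c) for c in clusters])                # -> [7]: merged, still <= 3K-1 = 8
```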

The basic concept behind multicast in this setup is that messages propagate from the root down to the leaves. Each new node joins at the lowest layer of the hierarchy. Each cluster of nodes has a leader, which is a member not only of its own layer but also of the layer above; this continues recursively until there is a single root. The leader of a cluster is the member at its center (in terms of network distance), which enables a new node to find its place in O(log(n)) steps: it queries leaders successively lower in the hierarchy for their members, chooses the one closest to itself at each step, and eventually ends up in the lowest layer under the leader closest to itself.
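
A sketch of that top-down join, with the hierarchy as nested dicts and a stand-in latency function; the data layout and names are invented for illustration.

```python
def latency(a, b):
    """Stand-in for an actual RTT measurement between hosts a and b."""
    return abs(hash((a, b))) % 100

def join(new_node, cluster):
    """cluster: {'leader': name, 'children': [sub-clusters], 'members': [...]}.
    Walk down the hierarchy, at each layer descending into the child cluster
    whose leader is closest (by measured latency) to the joining node."""
    while cluster['children']:
        cluster = min(cluster['children'],
                      key=lambda c: latency(new_node, c['leader']))
    cluster['members'].append(new_node)      # only O(log n) layers are visited
    return cluster['leader']

top = {'leader': 'A', 'members': [], 'children': [
    {'leader': 'A', 'children': [], 'members': ['A', 'B']},
    {'leader': 'C', 'children': [], 'members': ['C', 'D']},
]}
print(join('Z', top))   # the bottom-layer leader Z ends up under
```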

Nodes in the same cluster send each other heartbeat messages. When a node goes silent for long enough, it is considered dead. When the leader goes silent, a new leader is elected through a distributed algorithm that was not entirely clear to me.

They run some simulations and find that NICE performs slightly better than Narada in all metrics, including stress and path length, and significantly better in terms of control overhead. But the simulations are run with 128 nodes, which seems kind of small for the kind of multicast they're discussing (news feeds, stock tickers, etc.).

The one thing that I think could be seriously improved is the criterion for leader selection. Certainly, it is important to have a leader that is close to all the nodes in its cluster. However, it also seems important to have a leader with a reliable internet connection. After all, if packets don't reach the leader, they will not reach the leaves; and if the leader is merely unreliable rather than dead, that node will continue to be the leader and to do a disservice to the rest of its cluster. Some kind of redundant leadership might be nice in this situation.

SRM

This paper essentially describes an implementation of a reliability layer on top of a multicast protocol. The authors state that they want to achieve reliability while avoiding ACKs (and the ACK-implosion problem) and sender-based reliability protocols (since the latter would mean the sender has to track participation).

In the model that they use for multicast, participants in a multicast group can leave and join at will. All messages are delivered to all participants, and all participants are both senders and receivers. The actual multicast is achieved using some sort of structure for message propagation; this structure can be a chain, a star, a tree, or anything else.

The first part of this paper is loss detection and recovery: each participant has a unique ID, and its messages are sent out with that ID and a sequence number. These numbers apply to distinct application-level messages rather than to packets. Whenever a node sees a gap in the sequence numbers, it assumes there has been a loss and prepares to request a recovery. It waits a random amount of time that also depends on its "distance" from the sender.

By randomizing the amount of time they wait, these nodes help avoid the synchronization problem. When a node sends a request for missing data, it is multicast like all other messages, which means that other nodes missing the same data can back off (reset and double their request timers) so that each piece of missing data generates a minimal number of requests. Any node that hears a recovery request and has the requested data can respond, thereby (hopefully) improving the response time for these requests. Also in the paper, but not covered here: an RTT estimator that doesn't require synchronized clocks, and race-condition avoidance (waiting 3x the RTT estimate).
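
A sketch of the request timer as I understand it: the wait is drawn from an interval scaled by the estimated distance to the source, and doubles each time a duplicate request for the same data is overheard. C1 and C2 are the paper's tunable constants; the function and its signature are mine.

```python
import random

C1, C2 = 2.0, 2.0      # tunable constants (the paper discusses how to set them)

def request_delay(dist_to_source, backoff=0):
    """Pick how long to wait before multicasting a repair request.

    dist_to_source: estimated one-way delay to the original sender (seconds).
    backoff:        number of duplicate requests for this data already overheard.
    """
    lo = C1 * dist_to_source
    hi = (C1 + C2) * dist_to_source
    return (2 ** backoff) * random.uniform(lo, hi)

# First attempt: wait somewhere in [C1*d, (C1+C2)*d]. If someone else's request
# is heard first, the node resets its timer with backoff+1 instead of sending,
# so most losses end up triggering only one request.
print(request_delay(dist_to_source=0.05))
print(request_delay(dist_to_source=0.05, backoff=1))
```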

They then do a couple of simulations of the algorithm's performance on various topologies. They conclude that the number of retransmissions per lost data message is low, that the time to recover a lost message (divided by the RTT) is low, and so forth. However, as they say in the paper, "We do not claim to be presenting realistic topologies or typical patterns of packet loss." So in other words, this is all pretty suspect.

Lastly, they present a theoretical improvement where the wait time is adjusted based on the number of duplicate requests that a node sees, as well as the average request delay that it has. They then consider using TTL values to localize recovery requests and retransmissions (under the assumption that a loss implies a single failure somewhere in the topology rather than general network unreliability).

In general, I liked the first part of this paper a lot more than the rest of it. It presented an algorithm and implementation that was actually in use, and served a very specific purpose. As they state in the paper, it is unlikely that anyone will create a multicast application that satisfies everyone's needs, so it is better to have a more application-specific framework for multicast.

Monday, November 17, 2008

DTN

This paper is kind of short (I'm not complaining), and generally outlines difficulties with modern routing across the internet, especially with respect to wireless networks, ad-hoc networks, and sensor networks. It then suggests a partitioning system in order to deal with these difficulties, and presents some design decisions that it thinks would be best.

The first suggestion is a partitioning into various independent networks, each addressable by name. One of these, of course, would be the internet as we know it today; the rest would be networks such as military ad-hoc networks. The key to communication between different networks would be DTN gateways. In general, a message would be destined for a particular network, with a particular address within that network. The DTN gateways then do late resolution: the latter half of the address is resolved only once the data enters the appropriate network.

I like some of these ideas and must disagree with others. Partitioning the network into multiple address spaces is exactly the kind of thing that the original internet was supposed to overcome, and the idea of IP over everything still has merit (in my personal opinion). However, I think the idea of using DTN gateways to control traffic moving into a wireless network is a relatively good one, for the reasons below.

The main idea behind DTN gateways is that they will control all traffic headed into their particular network. This includes long-term storage for data that is destined for nodes that are currently disconnected, as well as address translation/lookup. The long-term storage makes reliability much easier for long-term connections (essentially, the responsibility for delivering the data is delegated to the DTN gateway), and address translation is good for networks such as sensor networks, which like to use address compression to save bandwidth.
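
A toy gateway illustrating the store-and-forward plus late-resolution idea; the naming and connectivity model here is entirely invented, not the paper's design.

```python
import collections

class DTNGateway:
    """Sketch of a gateway for one region: resolve region-local names only once
    a bundle arrives here ("late binding"), and hold bundles for nodes that are
    currently unreachable."""
    def __init__(self, local_names):
        self.local_names = local_names                  # node name -> local address
        self.connected = set()                          # nodes reachable right now
        self.pending = collections.defaultdict(list)    # custody storage

    def receive(self, bundle):
        region, node = bundle['dest']                   # e.g. ('sensor-net-7', 'node-19')
        if node in self.connected:
            self.deliver(node, bundle)
        else:
            self.pending[node].append(bundle)           # store until the node reappears

    def node_up(self, node):
        self.connected.add(node)
        for bundle in self.pending.pop(node, []):
            self.deliver(node, bundle)

    def deliver(self, node, bundle):
        addr = self.local_names[node]                   # late resolution happens here
        print('delivering', bundle['data'], 'to', addr)

gw = DTNGateway({'node-19': '0x4f'})                    # compressed local address
gw.receive({'dest': ('sensor-net-7', 'node-19'), 'data': 'reading=42'})
gw.node_up('node-19')                                   # -> delivering reading=42 to 0x4f
```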

The authors have a security suggestion that involves edge routers in the internet verifying public-key encrypted information from end nodes. This seems like a generally bad idea, given the computational cost, and the possibility of cracking public-key encryption given sufficient time and resources (of course, it depends on the strength of the encryption). Still, they recognize the need for authentication, which is something the internet doesn't do well right now.

On the whole, I'm not sure this paper provides very good solutions to today's problems, but it does seem to identify the problems very well.

DOT

The Data-Oriented Transfer (DOT) service essentially turns the transfer of large files into a separate service that applications use via API calls.

The basic concept behind DOT is that file transfer is used by a large number of protocols (HTTP, FTP, SMTP, etc.), each of which could benefit from separating the transfer of metadata from the transfer of the actual file. DOT gives each transferred file a unique ID: the sender hands the file to DOT and then passes the ID (along with some hints) to the receiver through the application protocol, and the receiver fetches the file from DOT.
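
Roughly the shape of the split I have in mind, with the object ID being a content hash; the function names are illustrative, not DOT's actual API.

```python
import hashlib

STORE = {}                      # stand-in for the transfer service's storage/cache

def dot_put(data: bytes) -> str:
    """Sender side: hand the bytes to the transfer service, get back an object ID."""
    oid = hashlib.sha256(data).hexdigest()
    STORE[oid] = data
    return oid

def dot_get(oid: str, hints=None) -> bytes:
    """Receiver side: fetch the object by ID; 'hints' could name mirrors, caches,
    or a multi-path/torrent-style plugin -- the application never sees which."""
    return STORE[oid]

# The application protocol (SMTP, HTTP, ...) now only carries the small ID + hints:
oid = dot_put(b"big attachment bytes ...")
body = f"see attachment: dot://{oid}"          # what actually travels in the e-mail
attachment = dot_get(oid, hints=["cache.local"])
```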

The suggested benefits of this scheme are:
1) abstraction - the underlying transfer method can be changed from a TCP connection to a torrenting scheme without any trouble.
2) caching - by caching the files at local nodes, you can reduce the overall bandwidth use of multiple file transfers
3) faster transfer - in theory DOT can split a file up and send it to its destination over multiple links, thereby getting the file there faster

The authors claim that this file transfer scheme will be widely applicable. Actually, I think this transfer scheme is very narrowly applicable. Certainly, for large e-mail attachments, this scheme has SOME merit (see discussion on expiration below). However, the authors provide nearly no performance data, and they avoid discussing performance for small files. This leads us to believe that for small files (such as webpages, which constitute a good chunk of internet traffic), the DOT scheme has too much overhead to actually be useful. In the world of the web, it is probably best suited to embedded videos, which are usually handled by content distribution networks anyway.

There is a serious flaw in DOT concerning the expiration and caching of data. Let's begin with caching. As we saw in the DNS caching paper, most of the requests that go out across the internet are to servers that are visited exactly once. The rest are to a few very popular servers. It is unreasonable to expect caching of large files to be effective in a scenario where caching of DNS entries is not effective. Sure, DOT could cache files for very popular websites, but this would in essence be duplicating work already done by CDNs.

The expiration of data in DOT is also a serious issue. According to the paper, data expires when the sender notifies DOT that the data no longer needs to be stored. This makes DOT vulnerable to denial-of-service attacks by clients that upload data and then never mark it as deletable. Furthermore, consider the e-mail sender we discussed above. Suppose he sends an e-mail with a large attachment to a large number of recipients. When is the correct time to mark the attachment as deletable? It should be AFTER all recipients have downloaded it, but it is very difficult for the sender to know when that has happened; this would likely require some sort of acknowledgement of receipt of the file. And, of course, if some of the e-mails bounce, or a single address forwards to multiple recipients, counting becomes even harder. In other words, the protocol must be modified in order to safely deploy DOT. Or the sender can just never mark his attachment for deletion.

I'm not actually sure what kind of client would regularly use DOT to transfer files. BitTorrent is fairly dominant in the large file transfer market, and the rest of file transfers are likely to be over HTTP, in which case the files are likely to be small, and the overhead is likely to be significant. I'm not sure what problem the DOT system solves.

Thursday, November 13, 2008

X-Trace

Basically, what is being proposed here is a system that will be present throughout all the different levels of any internet application so as to be able to help people debug problems that they encounter. The idea is that the internet is getting REALLY complex (which is very true - just look at how much material we have for this class), and therefore, we need some sort of tool that will help us debug problems.

X-Trace generally assumes that the flow of packets across the internet will generate a tree of causality. There's a very good example of this in the paper, with the root being a web browser's request, one subtree of DNS lookups, and another subtree in which the HTTP request leads to an Apache call. X-Trace accomplishes this kind of tracing with metadata that is carried along at all layers of a request. The metadata includes a unique task identifier, an operation identifier, a parent id, and some options. The task identifier allows reports from the same request to be grouped together, whereas the parent id allows these reports to be arranged into the appropriate tree structure.

Thus, for an intermediate device, application, or library to implement X-Trace, it must essentially implement two different calls: PushNext() and PushDown(). PushNext() indicates that it is pushing the X-Trace metadata to the next logical node in the eventual X-Trace tree. PushDown() indicates that the metadata is being pushed to a lower layer.
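
A sketch of what those two calls do to the metadata; I'm inventing the field layout here, and the real wire format carries more (flags, edge types, options).

```python
import os
import dataclasses

def new_op_id():
    return os.urandom(4).hex()

@dataclasses.dataclass
class XTraceMetadata:
    task_id: str        # same for every report belonging to one request
    op_id: str          # this node/operation
    parent_id: str      # the operation that caused this one

def push_next(md: XTraceMetadata) -> XTraceMetadata:
    """Propagate to the next node at the SAME layer (e.g. HTTP proxy -> server)."""
    return XTraceMetadata(md.task_id, new_op_id(), parent_id=md.op_id)

def push_down(md: XTraceMetadata) -> XTraceMetadata:
    """Copy the metadata into the layer BELOW (e.g. HTTP request -> TCP segment)."""
    return XTraceMetadata(md.task_id, new_op_id(), parent_id=md.op_id)

# Reports carry (task_id, op_id, parent_id); grouping by task_id and linking
# parent_id -> op_id reconstructs the causal tree offline.
root = XTraceMetadata("task-42", new_op_id(), parent_id="none")
tcp = push_down(root)
nxt = push_next(tcp)
print(root, tcp, nxt, sep="\n")
```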

Once everyone has implemented these calls, they can generate reports for each X-Trace request. These reports are sent to a local handler (for security purposes). The X-Trace metadata also has a field that allows the reports to optionally be sent to the original requester; this is optional because a company like Wikipedia or Google may not want to reveal any information about its internal composition.

The major issue with X-Trace seems to be deployment. For some protocols, such as HTTP, deployment is made easy through the use of extension headers. For others, such as TCP, this becomes much more difficult, and requires some changes to the Linux kernel. For UDP, it is impossible without either changing the protocol itself, or adding a tiny layer on top of it. If X-Trace can be deployed across the internet, and protocols can be altered so as to support it, then it will be an invaluable debugging tool. However, the likelihood of such a deployment seems small.

Internet Measurement

This paper basically analyzes the performance of TCP across the internet. By using more than 30 different sites, they get a relatively good collection of flows across the internet. They measure a few different variables to determine whether or not TCP is working as well as it could be. Among these variables are packet reordering, packet replication, packet corruption, bottleneck bandwidth, packet loss, packet loss independence, and packet delay.

The goal here is to provide suggestions for improving the TCP protocol (or implementations of it). Let's start with packet reordering. They find that the only real effect packet reordering has on TCP is that it can occasionally trigger retransmissions when delaying the duplicate acks would have sufficed. To counteract this, they note that reducing the number of duplicate acks required for retransmission to 2, while having the receiver wait 20 ms before transmitting a duplicate ack, reduces the number of unnecessary retransmissions. However, if these two changes are not deployed together, each of them on its own actually degrades TCP performance, so they consider the combination impractical to deploy.

They found a little bit of packet replication but could not find a good explanation; they did notice, however, that it was site-specific. In terms of packet corruption, they note that about 1 packet in 5,000 arrives corrupted. With a 16-bit checksum, roughly 1 in 2^16 of those corrupted packets will be accepted anyway, which works out to about one accepted corrupted packet per 300 million, though this calculation probably fails to take into account the nature of bit corruption, as well as the way the checksum is computed. They still recommend increasing the number of bits in the checksum. Corruption of ack packets seems much less common, possibly due to the smaller payload.

They introduce a new method for determining bottleneck bandwidth. In general, when two packets are sent in immediate succession, they sit behind each other in the queue of some router, and the interval between their arrivals is a good estimate of the time it took to send one of the two packets. They propose computing this value at the RECEIVER, because doing so removes any noise caused by the delay that ACKs encounter on the way back; and since they discover that paths can be very asymmetric, the method also avoids having to compensate for the asymmetry. Because some internet paths are spread across several links, or load-balanced, they extend the idea from packet pairs to bunches of 4 or 5 packets. Their results indicate that estimates made at the SENDER are only accurate around 60% of the time, which is a shame, because it means the TCP sender cannot reliably use the timing of ACKs to estimate the bottleneck bandwidth.
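
A sketch of the receiver-side calculation; extending from a pair to a bunch of 4-5 packets just means dividing the bytes of the whole bunch by the spread of its arrival times. The function and numbers are mine.

```python
def bottleneck_estimate(pkt_size_bytes, arrival_times):
    """Estimate bottleneck bandwidth from packets sent back-to-back.

    arrival_times: receiver timestamps (seconds) of a bunch of equal-size packets
    transmitted in immediate succession. Each gap approximates the time the
    bottleneck link needs to forward one packet.
    """
    spread = arrival_times[-1] - arrival_times[0]
    bytes_after_first = pkt_size_bytes * (len(arrival_times) - 1)
    return bytes_after_first / spread          # bytes per second

# Two 1500-byte packets arriving 1.2 ms apart -> ~1.25 MB/s (~10 Mb/s) bottleneck.
print(bottleneck_estimate(1500, [0.0000, 0.0012]))
```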

Next, they examine loss rates. In general, they note that most links are in one of two states: quiescent or busy. In the quiescent state there is no packet loss; in the busy state the loss rate is usually a few percent. They notice that the state of a link is closely correlated with the time of day, and that a quiescent link usually stays that way for several hours. They are able to conclude that packet losses are correlated with each other and are not independent. They suggest using RED in routers instead of tail-drop. They also note that packet loss is sometimes completely asymmetric (by measuring the packet loss rates in both directions).

They then measure the effectiveness of TCP retransmission. They note that Solaris has some problems estimating round-trip time, and exclude it from their main results. They then note that only about 1/6th of retransmissions are unavoidable, and suggest that SACKs can be used to cut down on the number of retransmissions resulting from coarse feedback.

In general, this paper is really thick, and contains a lot of different computations, as well as a lot of different suggestions for improving TCP. I think this paper is really important, because it analyzes just how well TCP is ACTUALLY working as a protocol across the internet. It's one thing to develop a model and to theorize that one algorithm will work better than another, but it's another to actually do measurements.

Wednesday, November 5, 2008

i3

This paper presents the Internet Indirection Infrastructure, which is basically yet another addressing scheme built on top of IP. It is, however, slightly more flexible than the last one I complained about. In particular, it makes multicast and load balancing fairly easy.

Once again, the paper suggests a Chord structure somewhere within the internet that stores a collection of m-bit identifiers. Any given machine can store a trigger, which in its simplest form is an (identifier, address) pair, in this DHT. Messages sent across the internet are then addressed to these identifiers instead of to IP addresses; the identifier in each message is matched against the identifiers in the DHT (sometimes using partial matches), and the message is forwarded to the IP address found in the matching trigger.

This system allows for some fancy stuff, such as using partial matches to do multicast (each recipient puts in a trigger with the same identifier) or load balancing (have k random bits at the end of the identifier, and you will be directed to the server with the largest partial match). It also allows for some degree of source routing, by using identifier stacks instead of just plain identifiers.
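
A tiny sketch of trigger matching over hex-string identifiers; the real system uses m-bit IDs with an exact-match requirement on a prefix, so treat this only as the flavor of the matching, with invented names throughout.

```python
TRIGGERS = []          # list of (identifier, destination address) pairs

def insert_trigger(identifier: str, addr: str):
    TRIGGERS.append((identifier, addr))

def common_prefix(a: str, b: str) -> int:
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def match(packet_id: str):
    """Return the addresses of the trigger(s) that best match the packet's ID.
    Multicast falls out naturally: several receivers insert triggers on the same ID."""
    if not TRIGGERS:
        return []
    best = max(common_prefix(packet_id, t) for t, _ in TRIGGERS)
    return [addr for t, addr in TRIGGERS if common_prefix(packet_id, t) == best]

insert_trigger("beef1234", "10.0.0.1")   # receiver 1
insert_trigger("beef1234", "10.0.0.2")   # receiver 2 -> a multicast group of two
print(match("beef1234"))                 # ['10.0.0.1', '10.0.0.2']
```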

Once again, I would like to complain about the terrible things this is going to do to both latency and round trip time across the internet. The latency stretch incurred by this system seems to be on the order of 4-6, which I think is fairly unacceptable. Furthermore, either packets must always be routed through the i3 system (which has the triangle routing problem), or destination IPs will be cached at the sender, which makes the cool features such as multicast and source routing much more difficult. I also think that they have not really addressed the issues associated with source routing (you can tie up network resources without creating routing loops).

Interestingly, I think this system might work very well as an opt-in system rather than an internet standard. It enables some very cool features, but it has some clear drawbacks (i.e., the hit to latency and quality of service) that I'd rather not see when accessing CNN or Google.

DOA

This paper essentially suggests an alternate addressing scheme built on top of IP, known as DOA (delegation-oriented architecture). It purports to address the problems created by NATs and firewalls, namely that they violate the architectural principles of the internet. It claims to be better than IPv6 because its addresses do not reflect any kind of network topology (though what is wrong with that, I cannot tell).

The basic idea is that every machine anywhere has a 160-bit EID. EIDs can be resolved to IP addresses through a global lookup table, which the authors suggest implementing as a DHT. Using a NAT or a firewall then becomes simple: 1) you register your information (your EID-to-IP mapping) with the NAT or firewall, and 2) you place an entry in the global table that maps your EID to the address of the NAT or firewall. Once these two steps are done, a packet destined for your EID makes its way to the NAT or firewall through normal IP routing, and from there to you, using the state that the NAT or firewall keeps locally.
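
A toy version of those two steps, with the global table as a plain dict; real DOA records carry more (delegate chains, signatures), so this is just the lookup-and-forward skeleton with made-up names and addresses.

```python
EID_TABLE = {}               # stand-in for the global DHT: EID -> forwarding address

# Step 1: the host registers with its delegate (the NAT/firewall keeps local state).
NAT_STATE = {}               # delegate's private mapping: host EID -> internal IP
def register_with_nat(host_eid, internal_ip):
    NAT_STATE[host_eid] = internal_ip

# Step 2: the host points its EID at the delegate in the global table.
def publish(host_eid, delegate_ip):
    EID_TABLE[host_eid] = delegate_ip

def route(packet):
    """Sender side: resolve the EID once, then use ordinary IP to the delegate,
    which consults its own local state to reach the host behind it."""
    delegate = EID_TABLE[packet['dst_eid']]      # one DHT lookup (~138 ms in the paper)
    internal = NAT_STATE[packet['dst_eid']]      # done by the delegate, not the sender
    return delegate, internal

register_with_nat('eid:ab12cd34', '192.168.1.7')
publish('eid:ab12cd34', '128.32.0.9')
print(route({'dst_eid': 'eid:ab12cd34'}))        # ('128.32.0.9', '192.168.1.7')
```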

This allows for some nice things, such as chaining intermediary elements, and the use of intermediate nodes that are not directly on the path between your own machine and other nodes on the internet.

The costs, though, are significant. This system is bad news for latency, since the average amount of time to resolve an EID was on the order of 138 ms. If that is just their test system, then it is reasonable to assume that this would increase (at least a little) for a larger DHT. After having read the paper on DNS performance, which points out that there are a large number of sites that are visited only once, this probably means that the quality of internet users' experiences would suffer at least a little. Most importantly, many connections are short-lived http connections, and users would like their pages to load quickly.

My own personal complaint about this system (which extends beyond the latency problem noted above) is that it essentially imposes a structure on the internet through its use of a DHT to do EID resolution. While it definitely solves the exhaustion of IPv4 addresses, so does IPv6. And they both impose some sort of structure on the internet. For IPv6, that structure is hierarchical addressing. For DOA, it is a DHT across the internet. The hierarchical system has been shown to have at least some merit (as in the case of DNS), whereas a DHT seems like it would suffer from frequent node losses which, while they would not affect correctness, would at the very least affect performance.

Sunday, November 2, 2008

DNS Caching

This paper analyzes the performance of DNS; in particular, it attempts to measure and account for DNS packets that are sent unnecessarily. While it seems like a daunting task, they do have some good results.

Their experimental setup consists of a single node placed where a university network connects to the rest of the internet. This node observes both DNS lookups and ordinary TCP connections crossing that link, which lets the authors correlate TCP connections with the DNS lookups that preceded them. Their results consist of three major findings.

1. A large number of DNS lookups go unanswered. They found approximately 4 million answered lookups, with approximately 1.2 query packets for each, and approximately 10 million query packets total. This means that the amount of DNS traffic can be reduced significantly by fixing problems such as incorrect names, or the "loop-back" issue. They suggest configuring DNS clients not to conduct lookups for names that are not fully qualified.

2. Cache-sharing is not that useful past about 25 clients. They measured this by simulating a hypothetical cache and grouping clients by source IP into hypothetical shared caches. The end result was that going from no sharing to sharing among 25 clients improved performance, but increasing to over a thousand clients brought no significant further improvement. They explain this with the "popularity" theory: more than half of the DNS lookups were for unique names that were queried only once. In other words, there is a very small number of servers that are very popular, and a very large number that are relatively unknown. Caching only helps for the popular names, and a modest number of clients is enough to keep all of them cached. (A sketch of this kind of trace-driven grouping experiment appears after point 3.)

3. NS caching is incredibly useful. The miss rate for NS records is approximately 25%, which means that if NS caching were turned off, the number of DNS queries reaching the root servers would increase by roughly a factor of four. Since that is highly undesirable, NS caching should remain as-is; this is in contrast to A-record caching, which was addressed in point 2.
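
Referring back to finding 2, here is a sketch of that kind of trace-driven grouping experiment (ignoring TTLs and using made-up trace entries), just to show the shape of the methodology rather than the paper's actual simulator.

```python
import random
from collections import defaultdict

def hit_rate(trace, group_size):
    """Trace-driven estimate: randomly group clients, give each group one shared
    cache (TTLs ignored here), and count how many lookups the cache absorbs."""
    clients = list({c for c, _ in trace})
    random.shuffle(clients)
    group_of = {c: i // group_size for i, c in enumerate(clients)}
    seen, hits = defaultdict(set), 0
    for client, name in trace:
        g = group_of[client]
        if name in seen[g]:
            hits += 1
        else:
            seen[g].add(name)
    return hits / len(trace)

trace = [('c1', 'cnn.com'), ('c2', 'cnn.com'), ('c3', 'one-off-7381.example')]
print(hit_rate(trace, group_size=25))   # only the popular name ever hits the cache
```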

DNS

This paper assumes that the reader knows the details of name lookup as they stood in 1988, which makes it somewhat difficult to understand the finer details of the paper. Below is my best attempt at understanding this paper.

The problems with the existing system, which was a HOSTS.txt file, were twofold. First, the file was growing large and unwieldy. Second, the administrative difficulties associated with the file were growing more than linearly. As a result, the DNS system was proposed.

DNS consists of multiple zones, each responsible for name lookups within itself. DNS supports multiple levels of hierarchy by allowing each zone to delegate sub-zones, which are in turn responsible for their own lookups. This has the advantage of allowing different organizations to use different numbers of levels within their naming schemes. Each sub-zone gets a label, which is a string of 8-bit values (octets). A label cannot exceed 63 octets, and the total name cannot exceed 256 octets (though apparently those maximums are subject to change).

The DNS system consists of both DNS servers and DNS resolvers. The choice to separate the two stems from the view that storing the data and the algorithms for accessing it are different concerns: the servers store the data, whereas the resolvers implement the algorithms that access and distribute it. Different resolvers can therefore be used on different systems so as to better serve end-users, who may have a wide variety of operating systems and architectures.

One of the most interesting things about this paper is that the authors specifically took into consideration the difficulty of getting DNS adopted. They decided on a relatively lightweight system so as not to discourage people from using DNS, but they also wanted the system to be easily extensible; these two goals led to a delicate balancing act.

I think the most interesting part of this paper is the discussion of failures. In particular, the tendency of system administrators to copy examples straight out of the documentation, or to make mistakes (such as switching the identifier and TTL fields), led to incredibly poor performance across the system. Furthermore, their desire to simply get a system that worked, rather than one that worked well, undermined the efforts of the DNS designers. For example, system administrators would apparently not place their top-level servers in different networks, yet claimed that the servers were not a single point of failure anyway.