Thursday, November 13, 2008

Internet Measurement

This paper basically analyzes the performance of TCP across the internet. By measuring traffic between more than 30 different sites, they get a relatively good sample of internet flows. They measure a few different variables to determine whether or not TCP is working as well as it could be. Among these variables are packet reordering, packet replication, packet corruption, bottleneck bandwidth, packet loss, packet loss independence, and packet delay.

The goal here is to provide suggestions for ways to improve the TCP protocol (or implementations thereof). Let's start with packet reordering. They find that the only real effect packet reordering has on TCP is that it can occasionally trigger retransmissions when simply waiting out the duplicate acks would have sufficed. To counteract this, they note that if the number of duplicate acks required to trigger a retransmission is reduced to 2, while the receiver waits 20 ms before sending a duplicate ack (giving a reordered packet a chance to arrive), the number of unnecessary retransmissions goes down. However, if these two changes are not deployed together, each of them on its own actually degrades TCP performance, so they consider the change impractical to deploy.
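
To convince myself of how the two tweaks interact, here is a tiny sketch (my own toy model, not anything from the paper) of a receiver that only emits a duplicate ack if a hole is still unfilled after a configurable delay:

```python
def count_dup_acks(arrivals, next_expected, delay_ms):
    """arrivals: list of (time_ms, seq) in arrival order.
    The receiver generates a dup ack for a hole only if the hole is still
    unfilled delay_ms after the out-of-order packet arrived.
    (Very simplified: a single hole, no cumulative-ack advancement.)"""
    dup_acks = 0
    received = set()
    for i, (t, seq) in enumerate(arrivals):
        received.add(seq)
        if seq > next_expected and next_expected not in received:
            # does the missing segment show up within the delay window?
            filled = any(s == next_expected and t2 <= t + delay_ms
                         for t2, s in arrivals[i + 1:])
            if not filled:
                dup_acks += 1
    return dup_acks

# Seq 2 is reordered behind seqs 3 and 4 by a few milliseconds:
arrivals = [(0, 1), (8, 3), (9, 4), (12, 2), (18, 5)]
print(count_dup_acks(arrivals, next_expected=2, delay_ms=0))   # 2 dup acks: a
                                                               #   threshold of 2 retransmits spuriously
print(count_dup_acks(arrivals, next_expected=2, delay_ms=20))  # 0 dup acks: the hole
                                                               #   fills before any dup ack is sent
```

The standard threshold of 3 would have ridden out this reordering burst anyway; the point of the pairing is that the lower threshold speeds up loss recovery while the receiver-side delay filters out the reordering, and neither change is safe on its own.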

They found a little bit of packet replication, but could not find a good explanation for it. They did notice, however, that it was site-specific. In terms of packet corruption, they note that about 1 packet in 5,000 arrives corrupted. Since the TCP checksum is only 16 bits, they argue that roughly one in 300 million packets (5,000 × 65,536) will be accepted with undetected corruption, though this probably fails to take into account the nature of bit corruption, as well as the way the checksum is computed. They still recommend increasing the number of bits in the checksum. Corruption of ack packets seems much less common, possibly because of their smaller payload.
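
The arithmetic behind that acceptance rate (my own back-of-the-envelope version, assuming corruption is uniform and every checksum value is equally likely to collide) is simply:

```python
corruption_rate = 1 / 5000        # about 1 data packet in 5,000 arrives corrupted
undetected = 1 / 2**16            # chance the 16-bit checksum misses the damage
print(1 / (corruption_rate * undetected))   # ~3.3e8: roughly one corrupted packet
                                            # accepted per 300 million packets
```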

They introduce a new method for determining the bottleneck bandwidth. In general, when two packets are sent in immediate succession, they sit behind each other in the queue of some router, and the time interval between their arrivals is a good estimate of the time it took the bottleneck link to transmit one of the two packets. They introduce the notion of computing this value at the RECEIVER, because this removes any trouble caused by the delay that ACKs encounter on their way back. Furthermore, since they discover that paths can be very asymmetric, a receiver-side method does not need to compensate for that asymmetry. And since some internet paths are spread across several links, or load-balanced, they extend the idea from packet pairs to bunches of 4 or 5 packets. Their results indicate that estimates made at the SENDER are only accurate around 60% of the time, which is a shame, because it means that a TCP sender cannot reliably use the timing of ACKs to estimate the bottleneck bandwidth.
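
A minimal sketch of what the receiver-side computation might look like (my own illustration, not the paper's estimator, which does considerably more filtering): if k back-to-back packets of equal size queue up at the bottleneck, the spread of their arrival times reflects the bottleneck's per-packet transmission time.

```python
def bottleneck_estimate(arrival_times_s, pkt_bytes):
    """arrival_times_s: receiver timestamps of a bunch of back-to-back packets."""
    spread = arrival_times_s[-1] - arrival_times_s[0]
    return (len(arrival_times_s) - 1) * pkt_bytes * 8 / spread   # bits/sec

# Two 512-byte packets arriving 4.1 ms apart suggest roughly a 1 Mbit/s bottleneck;
# a bunch of 4 or 5 packets smooths out noise from multi-channel or load-balanced links.
print(bottleneck_estimate([0.0, 0.0041], 512))                    # ~1.0e6
print(bottleneck_estimate([0.0, 0.0041, 0.0082, 0.0123], 512))    # ~1.0e6
```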

Next, they examine loss rates. In general, they note that most paths are in one of two states: quiescent or busy. In the quiescent state, there is essentially no packet loss; in the busy state, the loss rate is usually a few percent. They notice that the state of a path is closely correlated with the time of day, and that a quiescent path usually stays in that state for several hours. They are also able to conclude that packet losses are correlated with each other, and are not independent. They suggest using RED in routers instead of tail drop. They also note that packet loss is sometimes completely asymmetric (they measure the loss rates in both directions).
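
For reference, the RED idea they endorse works roughly like this (textbook RED with made-up thresholds, not anything from this paper): rather than dropping only when the queue overflows, the router drops arriving packets with a probability that ramps up as the average queue length grows, which spreads losses out instead of producing correlated bursts.

```python
def red_drop_probability(avg_queue, min_th=5, max_th=15, max_p=0.1):
    if avg_queue < min_th:
        return 0.0                       # lightly loaded: no drops at all
    if avg_queue >= max_th:
        return 1.0                       # overloaded: drop everything arriving
    # linear ramp between the two thresholds
    return max_p * (avg_queue - min_th) / (max_th - min_th)

for q in (3, 7, 12, 20):
    print(q, red_drop_probability(q))
```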

They then measure the effectiveness of TCP retransmission. They note that Solaris has some problems estimating round-trip time, and exclude it from their main results. They then note that only about 1/6th of retransmissions are unavoidable, and suggest that SACKs can be used to cut down on the number of retransmissions resulting from coarse feedback.
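
A little sketch of why SACK helps (my own illustration): cumulative acks only report the highest in-order byte received, so after a loss the sender may resend data the receiver already has, whereas SACK blocks name the ranges that did arrive and let the sender fill in just the holes.

```python
def holes(cum_ack, sack_blocks, highest_sent):
    """Return the byte ranges still missing at the receiver."""
    have = sorted(sack_blocks)           # (start, end) ranges received past cum_ack
    missing, cursor = [], cum_ack
    for start, end in have:
        if start > cursor:
            missing.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < highest_sent:
        missing.append((cursor, highest_sent))
    return missing

# Receiver has bytes up to 1000, plus 2000-3000 and 4000-5000:
print(holes(1000, [(2000, 3000), (4000, 5000)], 6000))
# [(1000, 2000), (3000, 4000), (5000, 6000)] -- only these need retransmission
```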

In general, this paper is really dense; it contains a lot of different computations, as well as a lot of different suggestions for improving TCP. I think this paper is really important, because it analyzes just how well TCP is ACTUALLY working as a protocol across the internet. It's one thing to develop a model and to theorize that one algorithm will work better than another, but it's another to actually do the measurements.

Wednesday, November 5, 2008

i3

This paper presents the Internet Indirection Infrastructure, which is basically yet another addressing scheme built on top of IP. It is, however, slightly more flexible than the last one I complained about. In particular, it makes multicast and load balancing fairly easy.

Once again, the paper suggests a Chord-like structure somewhere within the internet, this time storing triggers keyed by m-bit identifiers. Any given machine can insert a trigger, which in its simplest form pairs an identifier with an address, into this DHT. Messages sent across the internet are then addressed to identifiers instead of to IP addresses: the identifier in a message is matched against the identifiers in the DHT (sometimes using partial matches), and the message is forwarded to the address found in the matching trigger.

This system allows for some fancy stuff, such as using exact matches on a shared identifier to do multicast (each recipient inserts a trigger with the same identifier) or partial matches to do load balancing (the last k bits of each trigger's identifier are random, and a packet is directed to the server whose trigger shares the longest prefix with it). It also allows for some degree of source routing, by using identifier stacks instead of just plain identifiers.
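
Here is a rough sketch of the matching rule as I understand it (the parameter names and the 64-bit exact-match prefix are my assumptions for illustration, and a real i3 server would pick a single winner for anycast rather than returning them all):

```python
def matching_addresses(packet_id, triggers, exact_bits=64, m=256):
    """triggers: list of (identifier, address); identifiers are m-bit ints."""
    def prefix_len(a, b):
        return m - (a ^ b).bit_length()      # number of leading bits in common
    candidates = []
    for tid, addr in triggers:
        p = prefix_len(packet_id, tid)
        if p >= exact_bits:                  # the first exact_bits must match
            candidates.append((p, addr))
    if not candidates:
        return []
    best = max(p for p, _ in candidates)
    return [addr for p, addr in candidates if p == best]

# Three receivers insert triggers under the same 64-bit prefix:
prefix = 0xABCD << (256 - 64)
triggers = [(prefix | 0x1111, "10.0.0.1"),   # random suffix -> anycast candidate
            (prefix | 0x9F20, "10.0.0.2"),   # these two share one full identifier,
            (prefix | 0x9F20, "10.0.0.3")]   #   so both receive the packet (multicast)
print(matching_addresses(prefix | 0x9F20, triggers))  # ['10.0.0.2', '10.0.0.3']
print(matching_addresses(prefix | 0x1234, triggers))  # ['10.0.0.1'] (longest partial match)
```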

Once again, I would like to complain about the terrible things this is going to do to both latency and round trip time across the internet. The latency stretch incurred by this system seems to be on the order of 4-6, which I think is fairly unacceptable. Furthermore, either packets must always be routed through the i3 system (which has the triangle routing problem), or destination IPs will be cached at the sender, which makes the cool features such as multicast and source routing much more difficult. I also think that they have not really addressed the issues associated with source routing (you can tie up network resources without creating routing loops).

Interestingly, I think this system might work very well as an opt-in system rather than an internet-wide standard. It enables some very cool features, but it has some clear drawbacks (e.g. for latency and QoS) that I'd rather not see when accessing CNN or Google.

DOA

This paper essentially suggests an alternate addressing scheme built on top of IP known as DOA (Delegation-Oriented Architecture). It purports to address the problems created by NATs and firewalls - namely, that they violate the architectural principles of the internet. It claims to be better than IPv6 because DOA addresses do not reflect any kind of network topology (though what is wrong with that, I cannot tell).

The basic idea is that every machine anywhere has a 160-bit EID. These EIDs can be resolved to IP addresses through a global lookup table, which they suggest implementing as a DHT. Using a NAT or a firewall then becomes simple: 1) you register your information (your EID-to-IP mapping) with the NAT or firewall, and 2) you place an entry in the global table that maps your EID to the address of the NAT or firewall. After these two steps, a packet destined for your EID makes its way to the NAT or firewall through normal IP routing, and from there to you, using the state that the NAT or firewall keeps locally.
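
A minimal sketch of the resolution step (entirely my own structure and naming, not the paper's API): the global table maps an EID either directly to an IP address or to another EID that has been delegated to, so resolving may chase a short chain before producing something IP can route to.

```python
global_table = {
    "eid:host-42":    ("delegate", "eid:firewall-7"),   # host hides behind a firewall
    "eid:firewall-7": ("ip", "198.51.100.9"),           # firewall is directly reachable
}

def resolve(eid, max_hops=8):
    for _ in range(max_hops):
        kind, value = global_table[eid]
        if kind == "ip":
            return value              # hand the packet to ordinary IP routing
        eid = value                   # follow the delegation chain
    raise RuntimeError("delegation chain too long")

# A packet addressed to eid:host-42 is routed (via IP) to the firewall, which
# then uses its own local state to forward it the rest of the way.
print(resolve("eid:host-42"))         # 198.51.100.9
```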

This allows for some nice things, such as chaining intermediary elements, and the use of intermediate nodes that are not directly on the path between your own machine and other nodes on the internet.

The costs, though, are significant. This system is bad news for latency, since the average time to resolve an EID was on the order of 138 ms - and if that was just their test deployment, it is reasonable to assume the number would increase (at least a little) for a larger DHT. The paper on DNS performance pointed out that a large number of sites are visited only once, so these resolutions often cannot be amortized by caching, which means the quality of internet users' experiences would suffer at least a little. Most importantly, many connections are short-lived HTTP connections, and users would like their pages to load quickly.

My own personal complaint about this system (which extends beyond the latency problem noted above) is that it essentially imposes a structure on the internet through its use of a DHT to do EID resolution. While it definitely solves the exhaustion of IPv4 addresses, so does IPv6. And they both impose some sort of structure on the internet. For IPv6, that structure is hierarchical addressing. For DOA, it is a DHT across the internet. The hierarchical system has been shown to have at least some merit (as in the case of DNS), whereas a DHT seems like it would suffer from frequent node losses which, while they would not affect correctness, would at the very least affect performance.

Sunday, November 2, 2008

DNS Caching

This paper analyzes the performance of DNS; in particular, it attempts to measure and account for DNS packets that are sent unnecessarily. While it seems like a daunting task, they do have some good results.

Their experimental setup consists of a measurement node placed where a university network connects to the rest of the internet. The node records both DNS lookups and outgoing TCP connections crossing that link, so the TCP connections can be correlated with the DNS lookups that preceded them. Their results consist of three major findings.

1. A large number of DNS lookups go unanswered. They found approximately 4 million answered lookups, with approximately 1.2 query packets for each - roughly 4.8 million packets - out of approximately 10 million query packets total, so over half of the query packets never produced an answer. This means that the amount of DNS traffic could be reduced significantly by fixing problems such as incorrect names, or the "loop-back" issue. They suggest configuring DNS clients not to conduct lookups for names that are not fully qualified.

2. Cache-sharing is not that useful past about 25 clients. They measured this by simulating a hypothetical cache, grouping clients by source IP into hypothetical shared caches. The end result was that going from no sharing to a cache shared by 25 clients improved performance, but growing the group to over a thousand clients brought no significant further improvement. They explain this with a "popularity" argument: more than half of the DNS lookups were for unique names that were queried only once. In other words, a very small number of names are very popular, and a very large number are looked up rarely. Caching only helps for the popular names, so it doesn't take an enormous number of clients to keep all of the popular names in the cache (there is a toy simulation of this effect after this list).

3. NS caching is incredibly useful. The miss rate on NS records is approximately 25%, which means that if NS caching were turned off, the number of DNS queries reaching the root servers would increase by roughly a factor of four. Since this is highly undesirable, NS caching should remain as-is; this is in contrast to A-record caching, which was addressed in item 2.
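
Here is the toy simulation promised under item 2 (my own, with made-up parameters: an infinite shared cache, no TTLs, and a Zipf-like popularity distribution over 10,000 names). The shared-cache hit rate climbs quickly as the first few dozen clients are aggregated and then flattens out, because a handful of popular names dominate while the long tail is looked up only once no matter how many clients share the cache.

```python
import itertools
import random

def hit_rate(num_clients, lookups_per_client=200, num_names=10_000, s=1.0):
    weights = [1 / (rank ** s) for rank in range(1, num_names + 1)]   # Zipf-ish
    cum = list(itertools.accumulate(weights))
    cache, hits, total = set(), 0, 0
    for _ in range(num_clients * lookups_per_client):
        name = random.choices(range(num_names), cum_weights=cum)[0]
        total += 1
        if name in cache:
            hits += 1
        cache.add(name)            # infinite shared cache, no TTL, for simplicity
    return hits / total

for n in (1, 5, 25, 100, 1000):
    print(n, round(hit_rate(n), 2))
```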

DNS

This paper assumes that the reader knows the details of name lookup as they stood in 1988, which makes it somewhat difficult to understand the finer details of the paper. Below is my best attempt at understanding this paper.

The problems with the existing system, which was a centrally maintained HOSTS.TXT file, were twofold. First, the file was growing large and unwieldy. Second, the administrative difficulties associated with it were growing more than linearly. As a result, the DNS system was proposed.

DNS consists of multiple zones, and each zone is responsible for the names within it. DNS supports multiple levels of hierarchy by allowing each zone to delegate sub-zones, which are in turn responsible for their own names. This has the advantage of allowing different organizations to use different numbers of levels within their naming schemes. Each sub-zone gets a label, which is a string of octets (8-bit values). The length of a label cannot exceed 63 octets, and the total length of a name cannot exceed 256 octets (though apparently, those maximums are subject to change).
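
A small sketch of those limits in practice (my own code, using the numbers quoted above): a name is split into labels, each label is at most 63 octets, and the whole name - counting a length octet per label plus the terminating root octet - has to fit under the overall cap.

```python
MAX_LABEL = 63        # octets per label
MAX_NAME = 256        # octets for the whole name, per the limit quoted above

def check_name(name: str):
    labels = name.rstrip(".").split(".")
    for label in labels:
        if not 0 < len(label.encode()) <= MAX_LABEL:
            raise ValueError(f"label too long or empty: {label!r}")
    # each label is stored as a length octet plus its data, plus a final root octet
    wire_length = sum(len(l.encode()) + 1 for l in labels) + 1
    if wire_length > MAX_NAME:
        raise ValueError("name exceeds the overall length limit")
    return labels

print(check_name("www.lcs.mit.edu"))    # ['www', 'lcs', 'mit', 'edu']
```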

The DNS system consists of both DNS servers and DNS resolvers. These two roles are separated because storing the data and accessing it call for different protocols and different software. The servers store the data, whereas the resolvers implement the algorithms that access and distribute it. In other words, different resolvers can be used on different systems so as to better serve the end users, who may be running a wide variety of operating systems or architectures.

One of the most interesting things about this paper is that they specifically took into consideration the difficulty of adopting DNS. They decided to use a relatively lightweight system so as not to discourage people from using DNS. They also wanted the system to be easily extensible; these two goals lead to a delicate balancing act.

I think the most interesting part of this paper is the discussion of failures. In particular, the tendency of system administrators to copy examples straight out of the documentation, or to make configuration mistakes (such as swapping the identifier and TTL fields), led to incredibly poor performance across the system. Furthermore, administrators' desire to simply get a system that worked, rather than one that worked well, undermined the efforts of the DNS designers. For example, administrators would apparently not place their top-level servers in different networks, yet claimed that this did not constitute a single point of failure.

Thursday, October 30, 2008

Chord

Chord is a DHT (see previous post). The basic idea is that it provides a single function: lookup(key). Keys are generated with the SHA-1 hash function, which more or less guarantees that they are spread uniformly. Each node is also given an identifier using the SHA-1 hash function. The nodes are arranged in a circle by hash value, and each key is stored at its successor: the first node whose hash value is equal to or follows the key on the circle (with wrapping). Equivalently, each node stores all keys that fall between its predecessor's hash value and its own.

Queries are routed through the network using pointers from each node to various other nodes. Each node has a successor, as well as a list of "fingers". The fingers point to the nodes responsible for points halfway around the ring, a quarter of the way around, an eighth, and so forth. To process a query, a node chooses the finger that most closely precedes the requested key, and forwards the request to that node. The request propagates through such hops until it reaches the node responsible for the key.

Note that in order to maintain correctness, the fingers need not exist, and only the successor pointers need to be accurate. The successors and fingers are updated by a periodic stabilization process.
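
A stripped-down sketch of the lookup walk (my own simplification, ignoring joins, stabilization, and the actual RPCs; the ring and finger tables below are filled in by hand for a tiny 3-node example):

```python
M = 6                                        # identifiers live in a space of 2**M values

def in_interval(x, a, b):
    """True if x lies in the circular interval (a, b]."""
    return (a < x <= b) if a < b else (x > a or x <= b)

def lookup(key, node, ring):
    """ring: dict node_id -> {'successor': id, 'fingers': [ids...]}."""
    hops = 0
    while not in_interval(key, node, ring[node]['successor']):
        # closest preceding finger: the finger furthest along the ring but still before key
        preceding = [f for f in ring[node]['fingers']
                     if in_interval(f, node, key) and f != key]
        node = max(preceding, key=lambda f: (f - node) % 2**M,
                   default=ring[node]['successor'])
        hops += 1
    return ring[node]['successor'], hops     # the successor of `key` stores it

# Tiny 3-node ring with ids 8, 32, and 50 (finger tables filled in by hand):
ring = {
    8:  {'successor': 32, 'fingers': [32, 32, 32, 32, 32, 50]},
    32: {'successor': 50, 'fingers': [50, 50, 50, 50, 50, 8]},
    50: {'successor': 8,  'fingers': [8, 8, 8, 8, 8, 32]},
}
print(lookup(54, 8, ring))   # (8, 1): key 54 wraps around to node 8, reached in one hop
```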

They tested the network in simulation, with 1000 nodes, and then again in practice, with 10. In both cases, they found that it helped to have several virtual nodes at each real machine, as a way of spreading out the keys across more nodes. They found that the number of hops that a query required was indeed O(log(n)), and that the network was indeed resilient to massive numbers of node failures.

The main complaint I have about Chord is that they expect applications to build redundancy on top of this system, probably by creating n keys for each piece of data that they want to store (where n is the number of copies), and then either 1) looking them up one at a time until one lookup succeeds or 2) looking them all up at once, and taking the first result. The first one wastes a lot of the application's time in the case of a failure, and the second one wastes a lot of bandwidth.

I think it would be interesting to implement Chord with redundancy built into it. Not only would applications not have to worry about implementing this on top of Chord, but there might be good performance characteristics as well. Instead of searching for the one and only copy of a given key, you could search for the copy of the key that can be found by searching the "closest" finger. Alternatively, you could store link quality data, and transmit your query over the link with the lowest latency. Or, if there is data being stored at the nodes, perhaps you wish to find the copy of the data with the highest bandwidth to the origin of the request. But all of these would need some sort of built-in redundancy.
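
To make the idea concrete, here is roughly what I have in mind (a sketch of my own, not something Chord provides): derive r replica keys per item by hashing the name with an index, and let a wrapper try them in whatever order the lookup layer prefers, so the application sees a single call.

```python
import hashlib

def replica_keys(name, r=3):
    """Derive r independent keys for one item by hashing its name plus an index."""
    return [int.from_bytes(hashlib.sha1(f"{name}#{i}".encode()).digest(), "big")
            for i in range(r)]

def redundant_lookup(name, lookup, r=3):
    """Try each replica key until one lookup succeeds.
    `lookup` is the underlying Chord-style lookup(key) -> value or None."""
    for key in replica_keys(name, r):
        value = lookup(key)       # could instead be ordered by finger proximity
        if value is not None:     # or by measured link latency, as suggested above
            return value
    return None
```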

Wednesday, October 29, 2008

DHT Survey

This article is exactly as described: a survey of DHTs. The article first describes a few of the recent approaches to P2P systems, and why they are poor choices for scalability. Among current architectures are trees; trees have a problem with failures of high-level nodes, which can do a lot of damage to the system. Another approach is a simple broadcast-to-all-neighbors mechanism, which ends up sending too many messages in increasingly large systems. The superpeer approach used by KaZaA also has single point of failure problems.

Thus, a modern area of research is the distributed hash table (DHT), which implements one function (lookup), and in which no node is more important than any other. Among the various implementations of the DHT are Chord, Pastry, Tapestry, Kademlia, and CAN. Chord arranges nodes in a ring, with "fingers" that point halfway around the circle, a quarter of the way around the circle, an eighth of the way around the circle, and so forth.

Pastry, Tapestry, and Kademlia use a tree-like structure. Each node stores several leaves (nodes with nearby identifiers), as well as a routing table: a collection of nodes covering every possible value of the first digit, a collection covering the possible values of the second digit among nodes that share the first digit, and so forth. Thus at any node, you can find a link to another node that shares at least one more digit with the target than you do, with the digit size determined by the implementation.
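
A minimal sketch of that routing table (my own, with hexadecimal digits and a hand-built table): row i holds nodes that share the first i digits with us but differ in digit i, so each hop extends the prefix shared with the target by at least one digit.

```python
def shared_prefix_len(a, b):
    n = 0
    while n < len(a) and a[n] == b[n]:
        n += 1
    return n

def next_hop(my_id, target, routing_table):
    """routing_table[i][d] = a node id sharing i digits with my_id and having
    digit d in position i (or absent if we know of no such node)."""
    i = shared_prefix_len(my_id, target)
    if i == len(my_id):
        return my_id                          # we are the target's home node
    return routing_table[i].get(target[i])    # a node matching one more digit

table_for_65a1 = {0: {'d': 'd4f2'}, 1: {'1': '6123'}, 2: {'9': '659c'}}
print(next_hop('65a1', 'd46b', table_for_65a1))   # 'd4f2' (0 shared digits)
print(next_hop('65a1', '6190', table_for_65a1))   # '6123' (1 shared digit)
```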

CAN uses a d-dimensional coordinate space to arrange its nodes, with each node owning a zone of that space. Each node keeps a pointer to every neighbor whose zone shares a (d-1)-dimensional face with its own. A new node chooses a random point in the space and splits the zone of the node that currently owns that point in half. Lookup requests are routed along as straight a line as possible through the space toward the point corresponding to the key.
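
And a minimal sketch of CAN's greedy forwarding (my own, in two dimensions with a hand-built set of zones): each step hands the message to whichever neighbor's zone center is closest to the target point, until the target falls inside the current node's zone.

```python
def center(zone):
    return tuple((lo + hi) / 2 for lo, hi in zone)

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def route(target, node, zones, neighbors):
    """zones: node -> ((x_lo, x_hi), (y_lo, y_hi)); neighbors: node -> [nodes]."""
    path = [node]
    while not all(lo <= t < hi for t, (lo, hi) in zip(target, zones[node])):
        node = min(neighbors[node], key=lambda n: dist2(target, center(zones[n])))
        path.append(node)
    return path

# Unit square split into four quadrants A..D:
zones = {'A': ((0.0, 0.5), (0.0, 0.5)), 'B': ((0.5, 1.0), (0.0, 0.5)),
         'C': ((0.0, 0.5), (0.5, 1.0)), 'D': ((0.5, 1.0), (0.5, 1.0))}
neighbors = {'A': ['B', 'C'], 'B': ['A', 'D'], 'C': ['A', 'D'], 'D': ['B', 'C']}
print(route((0.9, 0.9), 'A', zones, neighbors))   # ['A', 'B', 'D'] (tie at A goes to B,
                                                  #   since it is listed first)
```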

The key point of all of these algorithms is that there is some sort of geometry, some sort of identifier, and some sort of distance function that can tell you how far apart two given identifiers are. The overall goal of a DHT is to provide a reliable way to look up keys that is independent of the joins and departures of the machines that make up the DHT.