Tuesday, November 25, 2008

LATE

No, not this post. LATE (Longest Approximate Time to End) is the name of the algorithm for deciding which tasks to speculatively execute. That is the essence of this paper, and unlike most of the other papers we've read, which have for the most part either speculated about or extrapolated their results to the internet, this one has a lot of actual experiments to back it up.

The basic idea is that the MapReduce scheduler has a few underlying assumptions that cause problems when working on heterogeneous systems. Among these assumptions are that all nodes execute tasks at more or less the same pace, and that all phases of reduction operate at roughly the same speed.

The first assumption leads to problems when detecting stragglers. A node that runs twice as slow as average (possibly due to resource contention) is treated essentially the same as a node that runs ten times as slow as average. In the original scheduler, all "stragglers" were treated the same, and speculative copies were assigned to idle nodes based on locality. The LATE scheduler instead prioritizes tasks by estimated finish time: it divides a task's progress so far by the time the task has been running to estimate a progress rate, and then extrapolates the time remaining.
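Here is a minimal sketch of that estimate in Python, assuming each running task exposes a progress score between 0 and 1 and its elapsed running time; the function and variable names are mine, not the paper's:

    def estimate_time_left(progress_score, elapsed_seconds):
        """Extrapolate a task's remaining time from its observed progress rate."""
        if progress_score <= 0:
            return float("inf")  # no progress yet; treat as the slowest candidate
        progress_rate = progress_score / elapsed_seconds   # fraction of work per second
        return (1.0 - progress_score) / progress_rate      # seconds of work left

    # Speculate on the running task with the largest estimated time left.
    tasks = {"t1": (0.9, 60.0), "t2": (0.3, 60.0), "t3": (0.05, 60.0)}
    estimates = {t: estimate_time_left(p, e) for t, (p, e) in tasks.items()}
    print(max(estimates, key=estimates.get))  # -> "t3", the task that will finish last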

The second assumption can lead to scenarios in which a certain percentage of tasks raises the average significantly. For example, the copy, sort, and reduce phases of a reduce task are each counted as 1/3 of the work, but the copy phase usually takes the longest by a significant margin. In particular, if a task finishes almost immediately after the copy phase, some tasks will be at 100% complete while others will not yet be at 33%. This disparity causes a spike in the average percent completion, which then labels all uncompleted tasks as stragglers, thereby causing excessive speculative execution.

LATE also fixes a problem that arises when tasks do not finish in waves. Under normal circumstances, tasks all take roughly the same amount of time and finish at roughly the same time, so a task that is taking longer than the others can safely be considered a "straggler". In a heterogeneous system this is not necessarily true, which is why the LATE scheduler's estimated-completion-time heuristic performs better than the previous heuristic, which just measured progress.

Their results show that they almost always do better than both the original scheduler, as well as the scheduler with no speculative execution for the best, worst, and average cases. In fact, they do better by a factor of two in some cases. There is some talk about extending the estimated completion time calculation to understand different "phases", but that seems needlessly complicated. However, it is noted that this scheduler may have problems if early phases run faster than late phases.

Monday, November 24, 2008

Policy Aware Switching

I would just like to start by saying that while this paper presents a nice system, it is relatively long by virtue (or vice) of verbosity.

This paper essentially presents a system similar to the DOA we studied a few weeks ago. The idea is that there are p-switches that delegate the standard middlebox functions to boxes that are not on the physical path into the datacenter. This approach is somewhat easier to deploy, because it does not require a new naming scheme and can be rolled out one datacenter at a time.

Each p-switch has a configuration that basically uses the source MAC address and the 5-tuple of each packet to decide where the packet should go next. The 5-tuple is the list of 5 fields: source IP, source port, destination IP, destination port, and transport protocol. The configuration works by associating a "next hop" with each (5-tuple, source) pair, with wildcards allowed. Thus, a packet that comes from (for example) a firewall can still be directed to the appropriate next hop (for example, a load balancer), even though none of its 5-tuple fields has changed, because the p-switch can deduce where the packet has already been from its source.
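To make the matching concrete, here is a toy sketch with a made-up rule format (previous-hop source, 5-tuple pattern, next hop) and "*" as the wildcard; this is my own simplification, not the paper's actual rule syntax:

    # More specific rules are listed first, since the first matching rule wins.
    RULES = [
        (("firewall-mac",     ("*", "*", "10.0.0.5", "80", "tcp")), "loadbalancer-mac"),
        (("loadbalancer-mac", ("*", "*", "10.0.0.5", "80", "tcp")), "webserver-mac"),
        (("*",                ("*", "*", "10.0.0.5", "80", "tcp")), "firewall-mac"),
    ]

    def matches(pattern, value):
        return pattern == "*" or pattern == value

    def next_hop(prev_hop, five_tuple):
        for (src_pattern, tuple_pattern), hop in RULES:
            if matches(src_pattern, prev_hop) and all(
                    matches(p, v) for p, v in zip(tuple_pattern, five_tuple)):
                return hop
        return None  # no policy applies; forward the packet normally

    pkt = ("1.2.3.4", "5555", "10.0.0.5", "80", "tcp")
    print(next_hop("client-mac", pkt))    # -> firewall-mac (start of the chain)
    print(next_hop("firewall-mac", pkt))  # -> loadbalancer-mac (same 5-tuple, new source)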

The system also handles situations where packets are addressed to one of the middleboxes themselves (say, a load balancer) that will then rewrite the IP header; a clever combination of wildcards and extra rules ensures that each packet passes through the same middleboxes on the way in and on the way out.

Rules can have multiple values for the "next hop" field, allowing the p-switch to spread the work of a single logical middlebox across multiple middleboxes of the same type. To make sure that all packets from a single flow go through the same middlebox instance, the p-switches use consistent hashing.
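A rough sketch of how consistent hashing could pin each flow to one instance of a replicated middlebox (my own simplified version, not the p-switch's actual mechanism):

    import hashlib
    from bisect import bisect_right

    class ConsistentHash:
        """Map flow 5-tuples onto a ring of middlebox instances."""
        def __init__(self, instances, replicas=100):
            self.ring = sorted(
                (self._h(f"{inst}-{i}"), inst)
                for inst in instances for i in range(replicas))
            self.keys = [k for k, _ in self.ring]

        @staticmethod
        def _h(s):
            return int(hashlib.sha1(s.encode()).hexdigest(), 16)

        def pick(self, five_tuple):
            # All packets of a flow hash to the same point, hence the same instance,
            # and adding or removing an instance only remaps a small fraction of flows.
            idx = bisect_right(self.keys, self._h("|".join(five_tuple))) % len(self.ring)
            return self.ring[idx][1]

    ch = ConsistentHash(["fw1", "fw2", "fw3"])
    flow = ("1.2.3.4", "5555", "10.0.0.5", "80", "tcp")
    print(ch.pick(flow))   # same answer every time for this flow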

I have nothing bad to say about this paper, except that the recovery time from failures appears to be as large as 3 seconds. That's a huge amount of downtime or lost packets for a datacenter. Of course, that may be a result of the implementation, but it would be nice to see that number reduced.

Thursday, November 20, 2008

NICE

Note: I lost internet in between starting and finishing this post. I've had internet for a minute or two right now.

This paper essentially presents a protocol for accomplishing multicast in which there is a designated sender, and all other nodes are receivers; in other words, they are subscribing to the data being sent.

To accomplish this in an efficient manner, the authors suggest a dynamically constructed tree of bounded degree. Given a constant k, they keep each cluster's size between k and 3k-1. This means that it is always possible to split a cluster that has grown too large, and always possible to merge two clusters when one has become too small (if their combined size exceeds 3k-1, split the larger one and then merge with either of the resulting halves). This is essentially a concept borrowed from B-trees.

The basic concept behind multicast in this setup is that messages propagate from the root down to the leaves. Each new node joins at the lowest level of the tree. Each group of nodes in the tree has a leader, which is also a member of the level one above; this continues recursively until there is only one root. The leader of a group is its center (the member closest to all the others), which enables a new node to find a group in O(log(n)) time by querying nodes that are successively lower in the hierarchy for their lists of children. With each query, the new node chooses the child closest to itself, and eventually ends up in the lowest level of the structure, in the group whose leader is closest to itself.
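A sketch of that top-down join walk, under the assumption that each node object exposes the members of the cluster it leads and that the joining node can probe its distance to each of them; the class and function names are mine, not NICE's:

    from dataclasses import dataclass, field

    @dataclass
    class Node:
        name: str
        layer: int                                    # 0 is the lowest layer
        children: list = field(default_factory=list)  # members of the cluster it leads

    def join(distance_to, root):
        """Walk down the hierarchy, moving to the closest child at each layer."""
        current = root
        while current.layer > 0:
            current = min(current.children, key=distance_to)
        return current  # leader of the lowest-layer cluster the new node should join

    # Tiny two-layer example; the distances are made up (e.g. RTT probes).
    a, b = Node("A", 0), Node("B", 0)
    root = Node("root", 1, children=[a, b])
    dist = {"A": 120, "B": 15, "root": 60}
    print(join(lambda n: dist[n.name], root).name)   # -> "B"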

Nodes in the same group send each other heartbeat messages. When a node goes silent for long enough, it is considered dead. When the leader goes silent, a new leader is elected through a distributed algorithm that was not entirely clear to me.

They run some simulations, and find that NICE performs slightly better than Narada in all metrics, including stress and path length, and does significantly better in terms of control overhead. But the simulations are run with 128 nodes, which seems kind of small for the kind of multicast that they're discussing (news feeds, stock tickers, etc.).

The one thing that I think could be seriously improved is the criterion for leader selection. Certainly, it is important to have a leader that is close to all nodes in the group. However, it also seems important to have a leader with a reliable internet connection. After all, if packets don't get to the leader, they will not get to the leaves; but if the leader is merely unreliable rather than dead, that particular node will continue to be the leader and to do a disservice to the rest of the nodes in its group. Some kind of redundant leadership might be nice in this situation.

SRM

This paper essentially describes an implementation of a reliability layer on top of a multicast protocol. The authors state that they will implement reliability by avoiding ACKs (and the ACK implosion problem), as well as sender-based reliability protocols (since this would mean the sender would have to track participation).

In the model that they use for multicast, participants in a multicast group can leave and join at will. All messages are delivered to all participants, and all participants are both senders and receivers. The actual multicast is achieved using some sort of structure for message propagation; this structure can be a chain, a star, a tree, or anything else.

The first part of this paper is loss detection and recovery. Each participant has a unique ID, and its messages are sent out with that ID as well as a sequence number. These numbers apply to distinct application-level messages, rather than packets. Whenever a node sees a gap in the sequence numbers, it assumes that there has been a loss, and prepares to request a recovery. It waits a random amount of time that also depends on its "distance" from the sender.

By randomizing the amount of time that they wait, these nodes help avoid the synchronization problem. When a node sends a request for missing data, it is multicast like all other messages. This means that other nodes who want the same data can now back off (reset and double the timers on their requests) so as to have a minimal number of requests for each piece of missing data. Any node that hears a recovery request and has the requested data can respond to the request, thereby (hopefully) improving the response time for these requests. Not pictured here: RTT estimator (that doesn't require synchronized clocks), race condition avoidance (wait 3*RTT estimate).
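A sketch of that randomized, distance-scaled request timer and its backoff; the constants and names here are illustrative, not the paper's exact parameters:

    import random

    def request_delay(distance_to_source, backoff_count, c1=2.0, c2=2.0):
        """Wait a random time proportional to our distance from the original sender.
        Each time we hear someone else request the same data, we back off (double
        the range) and reschedule instead of sending our own request."""
        lo = c1 * distance_to_source
        hi = (c1 + c2) * distance_to_source
        return (2 ** backoff_count) * random.uniform(lo, hi)

    # A node close to the source tends to ask first; everyone else suppresses.
    print(request_delay(distance_to_source=0.010, backoff_count=0))  # roughly 20-40 ms
    print(request_delay(distance_to_source=0.200, backoff_count=1))  # roughly 0.8-1.6 s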

They then do a couple of simulations of the algorithm's performance on various different topologies. They conclude that the number of retransmissions per lost data message is low, the amount of time to recover a lost message (divided by the RTT) is low, and so forth. However, as they say in the paper, "We do not claim to be presenting realistic topologies or typical patterns of packet loss." So in other words, this is all pretty suspect.

Lastly, they present a theoretical improvement where the wait time is adjusted based on the number of duplicate requests that a node sees, as well as the average request delay that it has. They then consider using TTL values to localize recovery requests and retransmissions (under the assumption that a loss implies a single failure somewhere in the topology rather than general network unreliability).

In general, I liked the first part of this paper a lot more than the rest of it. It presented an algorithm and implementation that was actually in use, and served a very specific purpose. As they state in the paper, it is unlikely that anyone will create a multicast application that satisfies everyone's needs, so it is better to have a more application-specific framework for multicast.

Monday, November 17, 2008

DTN

This paper is kind of short (I'm not complaining), and generally outlines difficulties with modern routing across the internet, especially with respect to wireless networks, ad-hoc networks, and sensor networks. It then suggests a partitioning system in order to deal with these difficulties, and presents some design decisions that it thinks would be best.

The first thing that is suggested is a partitioning into various independent networks, which can be addressed by name. One of these, of course, would be the internet as we know it today; the rest would be networks such as military ad-hoc networks. The key to communication between different networks would be DTN gateways. In general, a message would be destined for a particular network, with a particular address within that network. The DTN gateways then do late resolution: the latter half of the address is resolved only once the data enters the appropriate network.

I like some of these ideas, and must disagree with others. Partitioning the network into multiple address spaces is exactly the kind of thing that the original internet was supposed to overcome. The idea of IP over everything still has merit (in my personal opinion). However, I think the idea of using DTN gateways to control traffic moving into a wireless network is a relatively good idea, for the reasons described below.

The main idea behind DTN gateways is that they will control all traffic headed into their particular network. This includes long-term storage for data that is destined for nodes that are currently disconnected, as well as address translation/lookup. The long-term storage makes reliability much easier for long-term connections (essentially, the responsibility for delivering the data is delegated to the DTN gateway), and address translation is good for networks such as sensor networks, which like to use address compression to save bandwidth.

The authors have a security suggestion that involves edge routers in the internet verifying public-key encrypted information from end nodes. This seems like a generally bad idea, given the intensity of computation, and the ability to crack public-key encryption given sufficient time and resources (of course, it depends on the strength of the encryption). Still, they recognize the need for authentication, which is something that the internet doesn't do well right now.

On the whole, I'm not sure this paper provides very good solutions to today's problems, but it does seem to identify the problems very well.

DOT

The data-oriented transfer service is essentially an abstraction layer that abstracts the transfer of large files into a service that can be used via API calls at the application layer.

The basic concept behind DOT is that file transfer is used by a large number of protocols (HTTP, FTP, SMTP, etc.), each of which could benefit from the separation between the transfer of metadata and the transfer of the actual file. DOT suggests that each file transferred should get a unique ID, and that a sender should upload files to DOT, and then pass a unique ID (as well as some hints) to the receiver, who can then receive the file from DOT.

The suggested benefits of this scheme are:
1) abstraction - the underlying transfer method can be changed from a TCP connection to a torrenting scheme without any trouble.
2) caching - by caching the files at local nodes, you can reduce the overall bandwidth use of multiple file transfers
3) faster transfer - in theory DOT can split a file up and send it to its destination over multiple links, thereby getting the file there faster
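To make the intended workflow concrete, here is a toy sender/receiver sketch; the FakeDot class and its put/get methods are illustrative stand-ins, not DOT's real API:

    import hashlib

    class FakeDot:
        """Stand-in for the transfer service, just to make the workflow concrete."""
        def __init__(self):
            self.store = {}
        def put(self, data: bytes) -> str:
            oid = hashlib.sha1(data).hexdigest()   # content-derived unique object ID
            self.store[oid] = data
            return oid
        def get(self, oid: str, hints=None) -> bytes:
            # A real service would pick a transfer method (direct TCP, a cache,
            # multiple paths) using the hints; the application only names *what*
            # it wants, not *how* to fetch it.
            return self.store[oid]

    dot = FakeDot()
    # Sender uploads the data and passes the object ID plus hints to the receiver
    # in-band (e.g. inside the SMTP message); the receiver fetches it separately.
    descriptor = {"oid": dot.put(b"big attachment bytes"), "hints": ["sender.example.com"]}
    print(dot.get(descriptor["oid"], hints=descriptor["hints"]))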

The authors claim that this file transfer scheme will be widely applicable. Actually, I think this transfer scheme is very narrowly applicable. Certainly, for large e-mail attachments, this scheme has SOME merit (see discussion on expiration below). However, the authors provide nearly no performance data, and they avoid discussing performance for small files. This leads us to believe that for small files (such as webpages, which constitute a good chunk of internet traffic), the DOT scheme has too much overhead to actually be useful. In the world of the web, it is probably best suited to embedded videos, which are usually handled by content distribution networks anyway.

There is a serious flaw in DOT concerning the expiration and caching of data. Let's begin with caching. As we saw in the DNS caching paper, most of the requests that go out across the internet are to servers that are visited exactly once. The rest are to a few very popular servers. It is unreasonable to expect caching of large files to be effective in a scenario where caching of DNS entries is not effective. Sure, DOT could cache files for very popular websites, but this would in essence be duplicating work already done by CDNs.

The expiration of data in DOT is also a serious issue. According to this paper, data expires when the sender sends notice to the DOT that the data need no longer be stored. This makes the DOT vulnerable to denial of service attacks by clients that upload data, and then never mark it as deletable. Furthermore, consider the e-mail sender we discussed above. Suppose he sends an e-mail with a large attachment out to a large number of recipients. When is the correct time to mark the attachment as deletable? It is likely to be AFTER all recipients have downloaded the attachment. Of course, it is very difficult for the sender to know when all the recipients have finished downloading the file; this would likely require some sort of acknowledgement of receipt of the file. And, of course, if some of the e-mails bounce, or a single address forwards to multiple recipients, this will make counting much more difficult. In other words, the protocol must be modified in order to safely deploy DOT. Or, the sender can just never mark his attachment for deletion.

I'm not actually sure what kind of client would regularly use DOT to transfer files. BitTorrent is fairly dominant in the large file transfer market, and the rest of file transfers are likely to be over HTTP, in which case the files are likely to be small, and the overhead is likely to be significant. I'm not sure what problem the DOT system solves.

Thursday, November 13, 2008

X-Trace

Basically, what is being proposed here is a system that will be present throughout all the different levels of any internet application so as to be able to help people debug problems that they encounter. The idea is that the internet is getting REALLY complex (which is very true - just look at how much material we have for this class), and therefore, we need some sort of tool that will help us debug problems.

X-Trace generally assumes that the flow of packets across the internet will generate a tree of causality. There's a very good example of this in the paper, with the root being a web browser, and the leaves being a tree of DNS requests, and a tree with an HTTP request that leads to an Apache call. X-Trace accomplishes this kind of tracing with some metadata that is kept by all levels of a request. The metadata includes a unique identifier, an operation identifier, a parent id, and some options. The unique identifier allows reports from the same request to be grouped together, whereas the parent id allows these reports to be arranged into the appropriate tree structure.

Thus, for an intermediate device, application, or library to implement X-Trace, it must essentially implement two different calls: PushNext() and PushDown(). PushNext() indicates that it is pushing the X-Trace metadata to the next logical node in the eventual X-Trace tree. PushDown() indicates that the metadata is being pushed to a lower layer.
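A rough sketch of how that metadata might be threaded through a request; the dictionary layout and helper names are my own simplification (real X-Trace metadata is a compact binary format):

    import itertools, uuid

    _op_counter = itertools.count(1)

    def new_task():
        """Metadata for the root of a new request tree."""
        return {"task_id": uuid.uuid4().hex, "op_id": next(_op_counter), "parent_id": None}

    def push_next(md):
        """Propagate to the next node at the same layer (e.g. proxy -> origin server)."""
        return {"task_id": md["task_id"], "op_id": next(_op_counter), "parent_id": md["op_id"]}

    def push_down(md):
        """Propagate to a lower layer (e.g. the TCP connection under an HTTP request).
        In this sketch it manipulates the IDs the same way as push_next; the real
        difference is where in the protocol stack the metadata gets carried."""
        return {"task_id": md["task_id"], "op_id": next(_op_counter), "parent_id": md["op_id"]}

    root = new_task()
    http = push_next(root)   # next hop in the HTTP layer
    tcp = push_down(http)    # the TCP connection underneath that hop
    # Reports keyed by task_id can later be stitched into a tree via parent_id.
    print(root, http, tcp, sep="\n")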

Once everyone has implemented these calls, they can generate reports for each X-Trace request. These reports are sent to a local handler (for security purposes). The X-Trace metadata also has a tag that allows them to (optionally) send the report to the original requester. This step is optional, since a company like Wikipedia or Google may not want to reveal any information about their internal composition.

The major issue with X-Trace seems to be deployment. For some protocols, such as HTTP, deployment is made easy through the use of extension headers. For others, such as TCP, this becomes much more difficult, and requires some changes to the Linux kernel. For UDP, it is impossible without either changing the protocol itself, or adding a tiny layer on top of it. If X-Trace can be deployed across the internet, and protocols can be altered so as to support it, then it will be an invaluable debugging tool. However, the likelihood of such a deployment seems small.

Internet Measurement

This paper basically analyzes the performance of TCP across the internet. By using more than 30 different sites, they get a relatively good collection of flows across the internet. They measure a few different variables to determine whether or not TCP is working as well as it could be. Among these variables are packet reordering, packet replication, packet corruption, bottleneck bandwidth, packet loss, packet loss independence, and packet delay.

The goal here is to provide suggestions for ways to improve the TCP protocol (or implementation thereof). Let's start with packet reordering. They find that the only real effect that packet reordering has on TCP is that it can occasionally trigger retransmissions, when delaying the duplicate acks would have sufficed. To counteract this, they note that if the number of duplicate acks required to retransmit is reduced to 2, but the receiver waits 20 ms before transmitting a duplicate ack, it reduces the number of unnecessary retransmissions. However, if these two changes are not used together, each of them will actually degrade TCP performance, and so they consider it impractical to deploy.

They found a little bit of packet replication, but could not find a good explanation. They did notice, however, that it was site-specific. In terms of packet corruption, they noted that about 1 packet in 5,000 arrives corrupted. With a 16-bit checksum, they argue that there is a one in 3,000,000 chance that a corrupted packet will be accepted, though this probably fails to take into account the nature of bit corruption, as well as the way the checksum is computed. They still recommend increasing the number of bits in the checksum. Corruption of ACK packets seems much less common, possibly due to the smaller payload.

They introduce a new method for determining bottleneck bandwidth. In general, when two packets are sent in immediate succession, they sit behind each other in the queue of some router, and the time interval between their arrivals is a good estimate for the time it took to send one of the two packets. They introduce the notion of computing this value at the RECEIVER, because this removes any trouble caused by the delay that ACKs encounter on their way back. Furthermore, since they discover that links can be very asymmetric, this method does not need to compensate for the asymmetry. Since some internet connections are spread across several links, or load-balanced, they extend the technique from packet pairs to bunches of 4 or 5 packets. Their results indicate that estimates made at the SENDER are only accurate around 60% of the time, which is a shame, because it means that the TCP sender cannot use the timing of ACKs to estimate the bottleneck bandwidth.
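A sketch of the receiver-side packet-pair/bunch estimate, with made-up arrival times:

    def bottleneck_estimate(arrival_times, packet_size_bytes):
        """For packets sent back to back, each inter-arrival gap approximates the
        time the bottleneck needed to forward one packet, so bandwidth ~= size / gap.
        Using several packets (a 'bunch') and taking the median smooths out cases
        where the pair is split across parallel or load-balanced links."""
        gaps = [b - a for a, b in zip(arrival_times, arrival_times[1:])]
        estimates = sorted(packet_size_bytes / g for g in gaps if g > 0)
        return estimates[len(estimates) // 2]

    # Four 1500-byte packets arriving ~1.2 ms apart -> roughly 1.25 MB/s bottleneck.
    print(bottleneck_estimate([0.0000, 0.0012, 0.0024, 0.0037], 1500))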

Next, they examine loss rates. In general, they note that most links are in one of two states: quiescent or busy. In the quiescent state, there is no packet loss. In the busy state, the loss rate is usually a few percent. They notice that the state of a link is closely correlated with the time of day, and that a quiescent link usually stays in that state for several hours. They are able to conclude that packet losses are correlated with each other, and are not independent. They suggest using RED in routers instead of tail-drop. They also note that packet loss is sometimes completely asymmetric (by measuring the packet loss rates in both directions).

They then measure the effectiveness of TCP retransmission. They note that Solaris has some problems estimating round-trip time, and exclude it from their main results. They then note that only about 1/6th of retransmissions are unavoidable, and suggest that SACKs can be used to cut down on the number of retransmissions resulting from coarse feedback.

In general, this paper is really thick, and contains a lot of different computations, as well as a lot of different suggestions for improving TCP. I think this paper is really important, because it analyzes just how well TCP is ACTUALLY working as a protocol across the internet. It's one thing to develop a model and to theorize that one algorithm will work better than another, but it's another to actually do measurements.

Wednesday, November 5, 2008

i3

This paper presents the Internet Indirection Infrastructure, which is basically yet another addressing scheme built on top of IP. It is, however, slightly more flexible than the last one I complained about. In particular, it makes multicast and load balancing fairly easy.

Once again, the paper suggests a Chord structure somewhere within the internet that will store a collection of identifiers of m bits. Any given machine can store a trigger, which in its simplest form stores an identifier and a location, in this DHT. Then, messages that are sent across the internet are sent to these identifiers instead of to IP addresses. The identifiers in the messages are matched to identifiers in the DHT (sometimes using partial matches), and sent to the IP addresses found in the associated triggers.

This system allows for some fancy stuff, such as using partial matches to do multicast (each recipient puts in a trigger with the same identifier) or load balancing (have k random bits at the end of the identifier, and you will be directed to the server with the largest partial match). It also allows for some degree of source routing, by using identifier stacks instead of just plain identifiers.
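A toy sketch of trigger matching with short bit-string identifiers; real i3 identifiers are much longer and the exact-match threshold is larger, so treat this as illustration only:

    def common_prefix_len(a, b):
        n = 0
        for x, y in zip(a, b):
            if x != y:
                break
            n += 1
        return n

    def route(packet_id, triggers, exact_bits=8):
        """Send the packet to the trigger with the longest prefix match, provided
        at least the first exact_bits bits match exactly."""
        best = max(triggers, key=lambda t: common_prefix_len(packet_id, t), default=None)
        if best is None or common_prefix_len(packet_id, best) < exact_bits:
            return None
        return triggers[best]

    # Multicast: several receivers insert triggers with the same identifier prefix.
    # Load balancing: receivers append random bits, and the closest match wins.
    triggers = {
        "1010101000010110": "10.0.0.1",
        "1010101011110000": "10.0.0.2",
    }
    print(route("1010101011011111", triggers))   # -> 10.0.0.2 (longest prefix match)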

Once again, I would like to complain about the terrible things this is going to do to both latency and round trip time across the internet. The latency stretch incurred by this system seems to be on the order of 4-6, which I think is fairly unacceptable. Furthermore, either packets must always be routed through the i3 system (which has the triangle routing problem), or destination IPs will be cached at the sender, which makes the cool features such as multicast and source routing much more difficult. I also think that they have not really addressed the issues associated with source routing (you can tie up network resources without creating routing loops).

Interestingly, I think this system might work very well as an opt-in system rather than an internet standard. It enables some very cool features, but it has some clear drawbacks (e.g., the cost in latency and QoS) that I'd rather not see when accessing CNN or Google.

DOA

This paper essentially suggests an alternate addressing scheme built on top of IP known as DOA (delegation-oriented architecture). It purports to address the problems created by NATs and firewalls - that is, that they violate the principles of the internet. It claims to be better than IPv6, because the addresses do not reflect any kind of network topology (though what is wrong with that, I cannot tell).

The basic idea is that every machine anywhere has a 160-bit EID. These EIDs can be resolved to IPs through the use of a global lookup table. They suggest that this global lookup table be implemented as a DHT. Then, using a NAT or a firewall becomes simple. 1) You send your information (EID to IP mapping) to the NAT or firewall. 2) You place an entry in the EID mapping that maps your EID to the address of the NAT or firewall. After these two steps have been accomplished, a packet destined for your EID will make its way to the NAT or firewall through normal IP routing, and from there to you, with information that your NAT or firewall keeps in its local state.
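A toy model of the resulting resolution: the global table maps an EID either to an IP address or to another EID, and the sender follows the delegation chain. This is my own simplification of the scheme described above:

    EID_TABLE = {
        "eid:host":     "eid:firewall",   # step 2: the host points its EID at its firewall
        "eid:firewall": "192.0.2.1",      # the firewall's EID resolves to its real IP
    }
    # Step 1 (not shown): the host registers its EID -> private IP mapping *at* the
    # firewall, which keeps it in local state and forwards inbound packets.

    def resolve(eid, table, max_hops=8):
        """Follow the delegation chain until we reach something that looks like an IP."""
        for _ in range(max_hops):
            target = table.get(eid)
            if target is None or not target.startswith("eid:"):
                return target            # an IP address (or unknown)
            eid = target
        raise RuntimeError("delegation chain too long")

    print(resolve("eid:host", EID_TABLE))   # -> 192.0.2.1, then normal IP routing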

This allows for some nice things, such as chaining intermediary elements, and the use of intermediate nodes that are not directly on the path between your own machine and other nodes on the internet.

The costs, though, are significant. This system is bad news for latency, since the average amount of time to resolve an EID was on the order of 138 ms. If that is just their test system, then it is reasonable to assume that this would increase (at least a little) for a larger DHT. After having read the paper on DNS performance, which points out that there are a large number of sites that are visited only once, this probably means that the quality of internet users' experiences would suffer at least a little. Most importantly, many connections are short-lived http connections, and users would like their pages to load quickly.

My own personal complaint about this system (which extends beyond the latency problem noted above) is that it essentially imposes a structure on the internet through its use of a DHT to do EID resolution. While it definitely solves the exhaustion of IPv4 addresses, so does IPv6. And they both impose some sort of structure on the internet. For IPv6, that structure is hierarchical addressing. For DOA, it is a DHT across the internet. The hierarchical system has been shown to have at least some merit (as in the case of DNS), whereas a DHT seems like it would suffer from frequent node losses which, while they would not affect correctness, would at the very least affect performance.

Sunday, November 2, 2008

DNS Caching

This paper analyzes the performance of DNS; in particular, it attempts to measure and account for DNS packets that are sent unnecessarily. While it seems like a daunting task, they do have some good results.

Their experimental setup consists of a single node that is placed at the location where a university network connects to the rest of the internet. The node then measures both DNS lookups and regular connections across the node. They can then correlate their TCP connections with their DNS lookups. Their results consist of 3 major findings.

1. A large number of DNS lookups go unanswered. They found approximately 4 million answered lookups, with approximately 1.2 query packets for each, and approximately 10 million query packets total. This means that the amount of DNS traffic can be reduced significantly by fixing problems such as incorrect names, or the "loop-back" issue. They suggest configuring DNS clients not to conduct lookups for names that are not fully qualified.

2. Cache-sharing is not that useful past about 25 clients. They measured this by having a hypothetical cache, and then grouping clients by source IP into hypothetical shared caches. The end result was that shifting from no sharing to sharing by 25 clients improved performance, but increasing to over a thousand clients had no significant performance improvement. They explained this with the "popularity" theory. More than half of the DNS lookups were to unique names that were queried only once. In other words, there is a very small number of servers that are very popular, and a very large number that are relatively unknown. Caching is only good for the popular servers, and therefore doesn't require an enormous number of clients to cache all the popular servers.

3. NS caching is incredibly useful. The miss rate on name servers is approximately 25%, which means that if NS caching were turned off, the number of DNS queries making it to the root would increase by a factor of four. Since this is highly undesirable, NS caching should remain as-is; this is in contrast to A-record caching, which was addressed in item number 2.

DNS

This paper assumes that the reader knows the details of name lookup as they stood in 1988, which makes it somewhat difficult to understand the finer details of the paper. Below is my best attempt at understanding this paper.

The problems with the existing system, which was a HOSTS.txt file, were twofold. First, the file was growing large and unwieldy. Second, the administrative difficulties associated with the file were growing more than linearly. As a result, the DNS system was proposed.

DNS consists of multiple zones. Each zone is responsible for DNS lookups within its own zone. DNS supports multiple levels of hierarchy by allowing each zone to delegate sub-zones, which are then responsible for their own DNS lookups. This has the advantage of allowing different organizations to have different numbers of levels within their naming schemes. Each sub-zone gets a label, which can be a collection of 8-bit values. The length of a label cannot exceed 63 such values, and the total length of a name cannot exceed 256 values (though apparently, those maximums are subject to change).

The DNS system consists of both DNS servers and DNS resolvers. The choice to separate these two systems stems from the fact that the storage of data and the protocols for accessing it should be different. The servers store data, whereas the resolvers implement the algorithms that enable them to access and distribute the data. In other words, different resolvers can be used throughout different systems so as to better serve the end-users, who may have a wide variety of operating systems or architectures.

One of the most interesting things about this paper is that they specifically took into consideration the difficulty of adopting DNS. They decided to use a relatively lightweight system so as not to discourage people from using DNS. They also wanted the system to be easily extensible; these two goals lead to a delicate balancing act.

I think the most interesting part of this paper is the part where they discuss failures. In particular, the tendency of system administrators to just copy examples out of the documentation, or to make mistakes (such as switching the identifier and TTL fields), led to incredibly poor performance across the system. Furthermore, their desire to simply get a system that worked, rather than a system that worked well, undermined the efforts of the DNS designers. In particular, system administrators would apparently not place their top-level servers in different networks, yet claimed that those servers were not a single point of failure anyway.

Thursday, October 30, 2008

Chord

Chord is a DHT (see previous post). The basic idea is that it provides a single function: lookup(key). The keys are generated from the SHA-1 hash function, which more or less guarantees their randomness. Each node is also given a key using the SHA-1 hash function. The nodes are arranged in a circle by hash value, and each node stores all keys that fall between its predecessor's hash value and its own (with wrapping); in other words, a key is stored at the first node whose hash value is equal to or follows it on the circle.

The way queries are distributed through the network is with pointers from one node to various neighbors. Each node has a successor, as well as a list of "fingers". The fingers point to nodes that are halfway around the DHT, a quarter of the way around, an eighth, and so forth. To process a query, a node chooses the finger with the greatest hash value that does not exceed the value of the requested key, and forwards the request to that node. The request then propagates through the chain until it reaches the correct node.
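A minimal sketch of one forwarding step on a small identifier ring; the numbers are made up, and real Chord uses 160-bit identifiers:

    def in_interval(x, a, b, ring=2**16):
        """True if x lies in the half-open interval (a, b] on the ring."""
        a, b, x = a % ring, b % ring, x % ring
        return (a < x <= b) if a < b else (x > a or x <= b)

    def lookup(key, node, successor, fingers, ring=2**16):
        """One forwarding step: if the key falls between us and our successor, the
        successor owns it; otherwise forward to the closest preceding finger."""
        if in_interval(key, node, successor, ring):
            return ("owner", successor)
        for f in sorted(fingers, reverse=True):
            if in_interval(f, node, key, ring):
                return ("forward", f)
        return ("forward", successor)

    # Node 1000 with successor 9000 and fingers roughly 1/2, 1/4, 1/8 of the way around.
    print(lookup(40000, node=1000, successor=9000, fingers=[33768, 17384, 9192]))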

Note that in order to maintain correctness, the fingers need not exist, and only the successor pointers need to be accurate. The successors and fingers are updated by a periodic stabilization process.

They tested the network in simulation, with 1000 nodes, and then again in practice, with 10. In both cases, they found that it helped to have several virtual nodes at each real machine, as a way of spreading out the keys across more nodes. They found that the number of hops that a query required was indeed O(log(n)), and that the network was indeed resilient to massive numbers of node failures.

The main complaint I have about Chord is that they expect applications to build redundancy on top of this system, probably by creating n keys for each piece of data that they want to store (where n is the number of copies), and then either 1) looking them up one at a time until one lookup succeeds or 2) looking them all up at once, and taking the first result. The first one wastes a lot of the application's time in the case of a failure, and the second one wastes a lot of bandwidth.

I think it would be interesting to implement Chord with redundancy built in to it. Not only would applications not have to worry about implementing this on top of Chord, but there might be good performance characteristics as well. Instead of searching for the one and only copy of a given key, you could search for the copy of the key that can be found by searching the "closest" finger. Alternatively, you could store link quality data, and transmit your query over the link with the lowest latency. Or, if there is data being stored at the nodes, perhaps you wish to find the copy of the data with the highest bandwidth to the origin of the request. But all of these would need some sort of built-in redundancy.

Wednesday, October 29, 2008

DHT Survey

This article is exactly as described: a survey of DHTs. The article first describes a few of the recent approaches to P2P systems, and why they are poor choices for scalability. Among current architectures are trees; trees have a problem with failures of high-level nodes, which can do a lot of damage to the system. Another approach is a simple broadcast-to-all-neighbors mechanism, which ends up sending too many messages in increasingly large systems. The superpeer approach used by KaZaA also has single point of failure problems.

Thus, a modern area of research is the distributed hash table (DHT), which implements one function (lookup), and in which no node is more important than any other node. Among the various implementations of the DHT are Chord, Pastry, Tapestry, Kademlia, and CAN. Chord uses a circular linked list, with "fingers", which point halfway around the circle, a quarter of the way around the circle, an eighth of the way around the circle, and so forth.

Pastry, Tapestry, and Kademlia use a tree structure. In the tree structure, they store several leaves (nearby nodes), as well as a collection of nodes that cover the space of the first digit, a collection that cover the space of the first two digits, a collection for the first three digits, and so forth. Thus at any node, you can find a link to another node which shares the first n digits, where n is determined by the implementation.

CAN uses n-dimensional space to represent its nodes. Each node has a pointer to any neighbor with which it shares an (n-1)-dimensional hyperplane. New nodes choose a random location in the space, and then force the existing node that owns that location to split its zone in half. Lookup requests are routed along a straight-as-possible line through the space.

The key point of all of these algorithms is that there is some sort of geometry, some sort of identifier, and some sort of distance function which can tell you the difference between two given identifiers. The key point of the DHT is to provide a reliable way to lookup keys independent of joins and disconnects of the machines that create the DHT.

Active Networks

This paper basically proposes allowing user extensions to internet routing. It claims a variety of other applications, but this is the only one that I don't think is currently possible. It comes with two basic ideas: signed code, and capsules.

The signed code is signed by some authority that guarantees that it uses the API provided, as well as that it uses the internet in some reasonable manner. It is distributed to nodes throughout the internet using soft-state, and has a cryptographically secure hash associated with the code itself. Note that this makes it difficult to spoof code.

The signed code then operates on "capsules", which are essentially packets with the additional ANTS header information. The additional header also has the hash of the code that is supposed to be processing it. This feature acts as both naming and security. It allows a wide variety of different services to be deployed across the internet.

There are two obvious benefits of Active Networks. The first is as a way to test alternate routing protocols. The second is as a way to test alternate queuing protocols. Unfortunately, the code must be implemented in Java, and therefore has poor performance characteristics (the Java requirement is important for proof-of-correctness concerns, including being strongly typed). This means it's not spectacularly useful except to test the correctness of new algorithms.

The possible pitfalls (as they have noted in the paper) include starvation of other internet services, not only by malicious users, but also by bugs in the code. These bugs need not be limited to a single node - if a routing algorithm routes in a loop, then there is no cure for the resources that will be consumed along the loop. This proposal basically suffers from the same problems as source routing, which has been turned off by most routers in the internet.

Tuesday, October 28, 2008

RON

The basic idea behind RON is that internet routing is not perfect, and that the BGP protocol is slow to recover from failures in the internet; therefore, sometimes we want to route packets through other hosts.

There are two basic kinds of failures that RON tries to work around: outages and performance failures. In the case of an outage, a path on the internet becomes unusable: few or none of the packets reach their destination, and connections are completely unusable. In the case of a performance failure, the performance is so poor that the application is unable to function. For example, it is possible that the link between two nodes has great throughput, but the latency is high; this would mean the link is unusable for an application like SSH.

The basic idea is to have a collection of nodes that can route through each other in order to route around failures in the core of the internet. Each node periodically measures the quality of its links to each of the other N-1 nodes, and broadcasts this information to all nodes. Thus, each node ends up with a link-state table of size O(N^2). It can then choose a path in the graph through which to route its packets. This path can be a single link, or multiple links. A client on each machine receives packets, and sends them to their next hop. In this way, RON allows applications to route around failures in the internet.
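A sketch of how a node might pick a path from that table for a latency-sensitive application, assuming (as I believe the paper does) that a single intermediate hop is usually enough:

    import itertools

    def best_overlay_path(src, dst, nodes, latency):
        """Pick the direct path or the best single-intermediate overlay path,
        whichever has lower measured latency. latency[(a, b)] holds the most
        recent probe measurement from a to b."""
        candidates = [((src, dst), latency[(src, dst)])]
        for mid in nodes:
            if mid not in (src, dst):
                candidates.append(((src, mid, dst),
                                   latency[(src, mid)] + latency[(mid, dst)]))
        return min(candidates, key=lambda c: c[1])

    nodes = ["A", "B", "C"]
    latency = {p: 0.100 for p in itertools.permutations(nodes, 2)}
    latency[("A", "C")] = 0.500          # the direct internet path is currently bad
    print(best_overlay_path("A", "C", nodes, latency))   # -> route via B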

Results: RON takes an average of 18 seconds to route around failures in the internet. It can improve throughput by a factor of 2 or cut the loss rate by 0.05 or more in 5% of cases (each - it's possible that these two sets overlap).

One of the great things about RON is that it allows for application-specific path optimization.

Monday, October 20, 2008

Router Topology

This paper states that measurements like degree properties are insufficient to properly construct a network topology that truly resembles the internet. It presents several different construction mechanisms for graphs with the same degree distributions, and then compares them in terms of both their likelihood and their achieved bandwidth. Likelihood is defined as a ratio between L_max - L(graph) and L_max - L_min. The value L(graph) is given by \sum_{(i,j)\in E} d_i * d_j, where d_i is the degree of node i, and the (i,j) are the edges in the graph.

The paper first spends a lot of time stating that even though two different networks may have similar degree distributions, their topologies may be very different. Near the end, it describes several different methods for creating the graphs of which it speaks. Some of the graphs are as follows:
  • Preferential Attachment - Start with 3 connected nodes, then each new node attaches to one with probability proportional to its current degree. Add additional links until the degree distribution works out.
  • GRG - uses the degree sequence from PA to create a random (possibly unconnected) graph. Results in highly connected central nodes.
  • Heuristically Optimal Topology - starts with a few lightly connected core routers, and then high degree gateways attached to these routers, spanning out into trees of edge routers.
  • Abilene-inspired Topology - I wasn't exactly sure what this was.
  • Sub-optimal Topology - The core routers are a chain, instead of being lightly connected.
It then compares the bandwidths of these topologies, which all have the same degree distribution. It finds that the HOT has the highest bandwidth, followed by the Abilene-inspired Topology. It also finds that in general, the high bandwidth topologies have lower likelihoods.

Overall, this paper builds nicely on the previous power laws paper, but I would like to see a comparison of the other properties of these topologies that were explored in the power laws paper (such as the expected neighborhood size per number of hops, and the eigenvalues). Still, it actually provides a method for constructing an internet-resembling topology, which is good for simulations and theoretical calculations.

Power Laws

I am not sure how much can be said about this paper. It's not that the paper is short on material, it's just that I can't think of any way to summarize it other than stating the laws that they have discovered.

The basic idea of this paper is that various properties of the internet follow power laws. That is, variables are proportional to other variables raised to constant powers. These constants may vary a little over time, or, as in the case of the eigenvalues, may be true constants.

Here are the results:
  • The outdegree of a node is proportional to its "rank" (its rank according to its degree) raised to a power R, which ranged from -0.82 to -0.74.
  • The frequency of a given outdegree is proportional to that outdegree raised to a power O, which seems to be a constant between -2.2 and -2.15.
  • The number of neighbors of a node that are within h hops is proportional to h raised to a power H.
  • The eigenvalue lambda_i (where the eigenvalues are sorted in decreasing order) is proportional to i^(-0.5), regardless of era.
With the exception of the rank exponent R, these powers seem to be constants regardless of the growth of the internet.

Sunday, October 19, 2008

Diffusion

I'm not certain when this paper was published, but it appears to be a preliminary guess at the operation of sensor networks. The key idea here seems to be that both requests and data will diffuse through the network.

The requests are shaped in the form of a type of data to be gathered, a location in which it is to be gathered, and a rate at which it is to be gathered. Each node that matches the request not only gathers the data for it, but it also combines similar requests, and remains agnostic towards the intended receivers.

Nodes then broadcast their information to all nearby nodes. Intended receivers then reinforce good paths by sending the same requests with higher data rates to the nodes from which they first received data. In this way, a receiver essentially carves out a path that its data will flow through.

I don't know why they're trying to avoid using the typical routing protocols that have been developed. Maybe it's to save power. I didn't see a comparison between this protocol and traditional routing protocols such as DSDV and DSR (other than omniscient multicast). I am not certain what they mean by omniscient multicast. In theory, the omniscient multicast they describe would be a lower bound on power consumption, but they claim to do five times better. They claim that this has something to do with the result of in-network aggregation, but a factor of five seems unlikely.

Saturday, October 18, 2008

TAG

This paper presents a way to collect aggregate values (thus it is called Tiny AGgregation) across sensor networks. The key difference between a sensor network and a normal network is the need to conserve power, which is what this paper addresses by decreasing both the size of packets and the number of packets which must be sent by one of these sensors.

They first present an SQL-like query language, which basically allows the following keywords: SELECT, GROUP BY, HAVING, WHERE, and some aggregation functions (as well as EPOCH and DURATION, which aren't that important to the algorithm). Next, they build a tree of nodes (where each node has exactly one parent, and the root is the user), using an arbitrary routing algorithm. This tree is updated periodically.

Next, they classify the aggregates that they can use. There are several different properties that are important. The first is monotonicity. Monotonic aggregates include things like MAX and MIN, and their monotonicity can be used to 1) evaluate some HAVING predicates before passing data to a parent, potentially saving one communication, and 2) snoop on other nodes' transmissions to determine whether or not to send your own data. The second property is the type of partial state that must be transmitted per node. The list of types contains Distributive, Algebraic, Holistic, Unique, and Content-Sensitive, which basically boils down to either constant per node or linear in the total number of children. This has a significant effect on the amount of communication and power consumption of nodes, especially parents close to the user. The other properties are Exemplary/Summary and Duplicate Sensitive, which affect which optimizations can be applied.

Given these properties, each aggregate function is assigned a set of these properties, as well as two functions: a combining function and an evaluating function. The combining function allows each individual node to combine data from multiple children. The evaluating function allows the user to compute the actual desired value at the root. Each node computes which group it belongs to, and separates the data by group id. Each node then hears the data from all of its children, and combines data for identical group ids. It then passes this information on to its parent. This allows computation to be done at the nodes, rather than by the user, which saves transmissions and therefore power.
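As a concrete example, AVERAGE can be expressed with a constant-size partial state of (sum, count); this is my own minimal rendering of that idea:

    # The partial state record is (sum, count): constant size per node, combined
    # up the tree, and only turned into the final answer at the root.

    def init(sensor_value):
        return (sensor_value, 1)

    def combine(state_a, state_b):
        return (state_a[0] + state_b[0], state_a[1] + state_b[1])

    def evaluate(state):
        total, count = state
        return total / count

    # A parent merging its own reading with two children's partial states:
    parent = init(20.0)
    children = [(55.0, 3), (10.0, 1)]        # (sum, count) received from each child
    for c in children:
        parent = combine(parent, c)
    print(evaluate(parent))                   # -> 17.0, the average over 5 readings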

Wednesday, October 8, 2008

COPE

This paper describes COPE, which is a way of increasing throughput by XORing various packets together. The basic idea is that you can take advantage of the broadcast nature of wireless to transmit multiple packets at once to multiple destinations. The easiest way to do this is to XOR several packets together, which you can do if you know that the destinations have all but one of the packets you are about to send. This is made possible by a combination of eavesdropping on packets and storing them for a short duration.

An important question is how the information about packet availability is transmitted between nodes. From what I understand, they piggyback on normal packets inside the XOR header. If a node has no packets to send, it sends out this information in periodic updates. Packets are cached for about half a second before they disappear. In the case that no explicit information is available about whether or not a node has cached a particular packet, a sender may guess based on the link reliability. When transmitting, a sender requires that the overall probability that everyone has the desired packets must be at least some threshold (80% in this case).

No sender will ever encode two packets headed for the same next hop, so each sender keeps a set of virtual queues, one for each neighbor. However, XORing large packets with small packets is somewhat wasteful, so there are actually two virtual queues for each neighbor. Then, to figure out which packets to XOR together, the algorithm simply considers the packets at the head of each queue. The algorithm never delays packets, so if no XORing is possible, it sends just one packet.
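A rough sketch of that coding decision; the probability bookkeeping here is my own simplification of what COPE actually tracks:

    def can_decode(batch, knows_prob, threshold):
        """Every next hop must (probably) already have all packets except its own."""
        for next_hop, pid in batch:
            p = 1.0
            for _, other_pid in batch:
                if other_pid != pid:
                    p *= knows_prob(next_hop, other_pid)
            if p < threshold:
                return False
        return True

    def choose_packets(head_of_queues, knows_prob, threshold=0.8):
        """Greedily grow the XOR batch from the heads of the per-neighbor queues
        (one candidate packet per next hop), never delaying a packet."""
        batch = []
        for pkt in head_of_queues:               # pkt = (next_hop, packet_id)
            if can_decode(batch + [pkt], knows_prob, threshold):
                batch.append(pkt)
        return batch                              # XOR these together and broadcast once

    # Relay holds one packet for B and one for C; each overheard the other's packet.
    heard = {("B", "pkt-for-C"): 0.95, ("C", "pkt-for-B"): 0.90}
    print(choose_packets([("B", "pkt-for-B"), ("C", "pkt-for-C")],
                         lambda hop, pid: heard.get((hop, pid), 0.0)))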

They get great results out of this, not just because they can reduce the number of transmissions that need to be made, but also because this algorithm can increase fairness at various relays. If a relay is receiving traffic from two sources, and must then retransmit both packets, it can reduce the number of packets it needs to send from 2 to 1. This helps with fairness, since each of the three nodes is now sending one packet, instead of the relay trying to send at a 2:1 ratio.

ExOR

If this paper tells us what ExOR stands for, I couldn't find it. In any case, ExOR is a routing algorithm that takes advantage of variations in signal quality and the broadcast nature of wireless networks. The main idea behind ExOR is that the source broadcasts every packet (where packets are grouped into batches). Then, every receiver re-broadcasts packets in order of priority, where the priority is determined by the sender and sent with the ExOR header. The retransmitters do this by determining their place on the list, and then waiting for everybody above them on the priority list to transmit. Retransmitters only retransmit packets that have not been transmitted by higher priority nodes.

How do sources determine priority? From what I could tell in the paper, periodic pairwise tests are conducted, and then the information is transmitted throughout the network. Each node therefore ends up with a full map of the network and therefore knows the preferred intermediate nodes for any destination.

The claim is that this algorithm is great for bulk data transfer, because it can take advantage of 1) packets that travel farther than expected and 2) packets that don't travel as far as expected. Packets that travel farther than expected are simply that much closer to the destination, which is good. Packets that don't travel as far as expected would normally be retransmitted by traditional routing algorithms, whereas they at least make some progress with ExOR.

I have a few complaints about ExOR. First of all, it is useful only for bulk transfers. It is very good at improving throughput, but the latency cost is (I imagine) rather high (this is not mentioned in the paper). The paper gets around this by saying that it uses split TCP connections, and that ExOR should only be used for bulk file transfer. This naturally violates the E2E principle, not only because the TCP connection is being split, but also because there must be some decision maker at the wireless-wired interconnect that decides whether or not to use ExOR. This decision maker probably has to inspect packets (to see if they are file transfers of some sort - such as FTP or torrents), and guess on this issue. This is less than ideal.

ADDENDUM (8:56 PM): I forgot to mention that this is unlikely to work for incredibly dynamic networks, and is much better in static networks, where the broadcast-your-neighbors mechanism works much better (as we saw in the previous paper comparing DSDV and DSR).

Tuesday, October 7, 2008

Routing Protocols

This paper compares several protocols: DSDV, TORA, DSR, and AODV. Another pair of papers I should have read in the opposite order. Anyway, the metrics it uses are the percentage of packets that get through, overhead introduced by the routing protocol (in number of packets), and path optimality (difference between length of chosen path and minimum length of path). It uses a simulation in which nodes move around a rectangular area at random speeds, with a variety of predetermined rest-times. These rest times range from 0 to 900 seconds (continuous motion to no motion). Furthermore, this is the first paper to gather metrics while multiple transmitters are sending data (10, 20, or 30).

First, a bit about the algorithms. DSDV (Destination-Sequenced Distance Vector) is an algorithm in which each node maintains data about the next hop for each destination. Nodes regularly broadcast updates to each other whenever the network topology changes. These broadcasts include sequence numbers, allowing other nodes to choose the most recent updates.

All other routing protocols are based on dynamic requests for routes. TORA (Temporally Ordered Routing Algorithm) responds to requests by attempting to compute the "height" of each node with respect to the destination, and then sending an update back to the requester. DSR (Dynamic Source Routing) works by simply asking for a route, and then allowing the entirety of the found route to be propagated backwards through updates. AODV (Ad hoc On demand Distance Vector) is essentially a combination of the DSR and DSDV algorithms. It keeps the same state as DSDV, but its updates are done by request-update, instead of periodic pushes.

In summary, all algorithms get approximately the same throughput across different rest times, except for DSDV, which cannot make its paths converge for small rest times. However, the overhead (in number of packets) of DSR is the smallest. The overhead of AODV is smaller in bytes than the overhead of DSR, but those bytes are distributed across many small packets. Most overheads decrease as rest time increases, with the exception of DSDV, which remains constant. Despite this, DSDV does second-best in path optimality, next to DSR.

ETX

This paper puts forth a relatively simple metric, and then measures it against the current metric, which is number of hops. It then discusses open issues.

The problem with number of hops is twofold: 1) links can be asymmetric, causing transmissions to work just fine in one direction while ACKs in the other direction are not received, and 2) the metric chooses longer, less reliable links over shorter, more reliable links. To solve this, each link is given a delivery ratio in each direction (directed, to avoid the asymmetry problem), based on the number of broadcast probe packets received from the neighbor within a certain window (ten seconds), with probes sent at one per second. The likelihood that a packet is delivered and acknowledged over a link is the product of the two directions' ratios, so the expected number of transmissions over that link is 1/(product of the two ratios).
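A minimal rendering of the metric itself; the probe parameters mirror the ones described above:

    def delivery_ratio(probes_received, window_seconds=10, probes_per_second=1):
        """Fraction of the broadcast probes expected in the window that arrived."""
        return probes_received / (window_seconds * probes_per_second)

    def link_etx(df, dr):
        """Expected transmissions for one successful data + ACK exchange on a link."""
        if df * dr == 0:
            return float("inf")
        return 1.0 / (df * dr)

    def path_etx(links):
        """Route metric: the sum of the per-link expected transmission counts."""
        return sum(link_etx(df, dr) for df, dr in links)

    # A link that delivered only 5 of the last 10 probes in each direction:
    df = delivery_ratio(5)
    dr = delivery_ratio(5)
    print(link_etx(df, dr))                     # 4.0 expected transmissions
    print(path_etx([(0.9, 0.9), (0.9, 0.9)]))   # ~2.47: two good hops beat one bad one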

This metric is then used in both DSDV and DSR, and the results are compared. The main measure is goodput for point-to-point connections going at maximum speed. This is a somewhat troublesome metric, since TCP doesn't necessarily go at maximum speed (or UDP either for that matter), and there are usually multiple network users. Regardless, the result is that the ETX metric does not perform significantly worse than number of hops, and performs much better for some links for which number of hops got essentially 0 goodput.

There are several outstanding problems with ETX. First, the reliability of a given link for packets of the size sent to measure reliability is not necessarily the same for either larger or smaller packets. This causes ETX to overestimate the reliability for data paths, and underestimate reliability for ACK paths.

It might be possible to keep a histogram of reliability scores rather than a simple running average. Test packets could be of random length, between the size of an ACK and the maximum packet size. The histogram might then be used to determine transmission ratios for varying packet lengths. The histogram could be as simple as three buckets (small, medium, large).
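
A rough sketch of what that bookkeeping might look like (bucket boundaries, names, and the example numbers are arbitrary):

    # Per-link delivery statistics bucketed by packet size, as suggested above.

    BUCKETS = {"small": 128, "medium": 512, "large": 1500}   # bytes, hypothetical

    def bucket_for(size):
        for name, limit in BUCKETS.items():
            if size <= limit:
                return name
        return "large"

    class LinkStats:
        def __init__(self):
            self.sent = {b: 0 for b in BUCKETS}
            self.acked = {b: 0 for b in BUCKETS}

        def record(self, size, delivered):
            b = bucket_for(size)
            self.sent[b] += 1
            self.acked[b] += int(delivered)

        def delivery_ratio(self, size):
            b = bucket_for(size)
            return self.acked[b] / self.sent[b] if self.sent[b] else None

    stats = LinkStats()
    stats.record(60, True)       # ACK-sized probe got through
    stats.record(1400, False)    # full-sized probe did not
    print(stats.delivery_ratio(60), stats.delivery_ratio(1400))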

Thursday, October 2, 2008

Roofnet

Roofnet is essentially a way to get wireless internet working with a distribution of nodes that is neither random nor planned. The goal is to have relatively few internet connections servicing a community of users who will connect through this wireless mesh.

In order to accomplish this, a collection of Roofnet nodes was distributed, each installed by a local community member; some of these nodes are wired to an ISP. If a given node finds itself to be connected, it broadcasts itself as a gateway. Otherwise, it acts as a DHCP server. Roofnet nodes use NAT to translate the IPs they assign to attached devices. Routing works as follows: each node keeps a list of nearby nodes, as well as the quality of the links between them. It also maintains this list for other nodes in the network, and is able to request such a list if its own is deficient. It can then run Dijkstra's algorithm to determine the best path between itself and an internet gateway.

This routing algorithm can receive updates in multiple ways. First, if the quality of a link changes, the nodes at either end of that link will notice and update themselves. Second, other nodes will periodically ask for updates. Third, nodes can overhear other nodes' updates and adjust their own lists. To estimate the throughput of a multi-hop connection, the inverse of the sum of the inverses of the individual hop throughputs is used. No reason is given for this (or at least I could not find one), other than the fact that it seemed to match empirical data.
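
For what it's worth, that estimate is just a harmonic-style combination of the per-hop numbers; a quick sketch with made-up throughputs:

    # Estimated multi-hop throughput = 1 / sum(1 / per-hop throughput).

    def path_throughput(hop_throughputs):
        return 1.0 / sum(1.0 / t for t in hop_throughputs)

    print(path_throughput([5.0, 5.0]))   # two 5 Mbps hops -> about 2.5
    print(path_throughput([5.0, 1.0]))   # one slow hop dominates -> about 0.83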

Suffice it to say that the paper finds that they get decent throughput, and that they do so without using RTS/CTS. My question is this: how many users can they have connected to such a network at once (no more than 256 due to internal IP representations? no more than 65536 due to NAT?). Is it possible to set up port forwarding for services that users may require? How many simultaneous connections can be made before congestion drives throughput to a standstill? In other words, would Roofnet be a suitable way to connect a community? Also, who would be willing to foot the bill for the internet that Roofnet connects to?

Some of those questions are technical, and some are more logistical. But after reading through the paper (perhaps I did not read carefully enough), I still don't know what causes reduced throughput across a multi-hop connection. They claim that it is due to interference between packets transmitted by different nodes, but that seems unlikely to account for all of the reduction (especially because RTS/CTS didn't help). Is it because the routers are slow? Is it because there are many people on each given connection, and packets going across multiple connections must be scheduled multiple times? Somebody tell me if I completely missed this section of the paper.

Wednesday, October 1, 2008

Modeling Wireless Links

This paper is essentially a survey of the properties of wireless links. The paper does a good job of splitting wireless links into three prevalent categories: WLAN, cellular links, and satellite. It then makes a long list of ways in which wireless links differ from regular links, and describes how this affects each of the three categories.

The first characteristic of wireless links is losses due to errors rather than congestion. The paper points out that this is a relatively rare occurrence, but that when errors do occur, they are somewhat clustered, and affect the link in both directions. Alongside error losses is packet reordering, which the paper claims is currently relatively rare, but has the potential to have an adverse effect on transport protocols if the reordering is severe.

In addition to the above issues, there is also the issue of variation - both in bandwidth and in delay. Variation in delay can be caused by a number of factors, including link-layer recovery and handovers. Variations in bandwidth also exist in wireless systems (due to scheduling); these kinds of variations are likely to confuse transport layer protocols, which assume that neither the bandwidth nor the RTT is fluctuating too wildly. A fluctuating RTT can be the cause of unnecessary retransmits, or unnecessarily long recovery times. Fluctuating bandwidths can cause underuse of the link or sudden congestion.

Other important aspects include bandwidth asymmetry (which is fairly obvious) and handovers, which can cause the loss of all in-flight packets and a temporary but large increase in delay, as well as permanent changes in both bandwidth and RTT.

Monday, September 29, 2008

Wireless TCP

This paper essentially addresses the problem of using TCP over lossy wireless networks by comparing the available proposed solutions to this situation and extensions to TCP. These are split into three categories: 1) link-layer solutions, 2) connection-splitting solutions, and 3) E2E solutions. The link-layer solutions attempt to do reliability at the link layer. Connection-splitting solutions split the connection into two pieces, allowing the base station to send ACKs upon receiving packets, and then making sure that the base station reliably transmits the data to the wireless nodes. E2E solutions simply modify the TCP behavior at the end nodes.

There are a lot of different methods of improving TCP performance. Among them are SACKs (selective acknowledgements), which let the sender know exactly which packets were received, so as to avoid unnecessarily resending data. There is also explicit loss notification, which (as the name suggests) explicitly notifies the sender when a packet has been lost. This is advantageous because it helps the sender differentiate between losses that occur because of congestion and losses that occur because of noise in the link.

Other methods include retransmitting on the first duplicate ack, which is beneficial for lossy networks, where a loss is most likely to be due to noise rather than congestion. The final method is duplicate ack suppression at base stations, which gives the base station some time to retransmit the packet before it forwards the duplicate acks to the endpoint; this means that the endpoint will not decrease its congestion window when a non-congestion loss occurs, thereby keeping the throughput high.
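
Here is a rough sketch of the base-station bookkeeping I imagine for duplicate-ack suppression (the class, its callbacks, and the caching policy are my own assumptions, not code from the paper):

    # A base station that caches segments headed to the mobile host and
    # handles duplicate ACKs locally instead of forwarding them.

    class SnoopingBaseStation:
        def __init__(self):
            self.cache = {}       # starting seq -> segment not yet ACKed over the air
            self.last_ack = -1

        def on_data_to_mobile(self, seq, segment):
            self.cache[seq] = segment          # remember it for local retransmission

        def on_ack_from_mobile(self, ack_no, forward_ack, retransmit_wireless):
            if ack_no > self.last_ack:         # new data acknowledged: clean up, pass it on
                self.last_ack = ack_no
                for seq in [s for s in self.cache if s < ack_no]:
                    del self.cache[seq]
                forward_ack(ack_no)
            elif ack_no in self.cache:
                # Duplicate ACK for a cached segment: retransmit locally and
                # drop the duplicate so the sender never shrinks its window
                # for a loss that had nothing to do with congestion.
                retransmit_wireless(self.cache[ack_no])

    bs = SnoopingBaseStation()
    bs.on_data_to_mobile(1000, b"seg-1000")
    bs.on_data_to_mobile(2000, b"seg-2000")
    bs.on_ack_from_mobile(2000, print, lambda s: None)                          # new ACK, forwarded
    bs.on_ack_from_mobile(2000, print, lambda s: print("local retransmit", s))  # dup, suppressed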

As it turns out, the best method is a link-layer approach that both suppresses duplicate acks at the base station and uses SMART-based selective acknowledgments. This is called a TCP-aware algorithm because it must understand the meaning of the acknowledgements in order to work.

There's not much more to say about this paper other than the fact that there may be some overhead in determining that the connections being created are, in fact, TCP connections. It would be interesting to run the same experiments with both TCP and UDP and to determine how much overhead is involved in examining the packets. It might also be interesting to run some similar experiments with protocols that use NACKs instead of ACKs.

MACAW

While there are a lot of different corner cases that this paper presents, and a lot of different reasons for the algorithms presented, the final protocol is quite simple to explain. The protocol is essentially RRTS-RTS-CTS-DS-DATA-ACK. This is improved over the MACA algorithm, which only had RTS-CTS-DATA. The paper lists one-by-one the reasons for this algorithm. Let's start with the DS (data sending). Whenever another transfer is in progress, a node must wait until all of that data has been sent before sending an RTS. However, a node that overhears the RTS but not the CTS (and cannot sense the data itself) has no way of knowing whether, or for how long, data is being sent, so it does not know when to send its own RTS. The DS message solves this problem by announcing how long the data transmission will take, thereby letting all nearby nodes know when they can send their next RTS.

We can continue with the RRTS. If a node cannot send a CTS because it is currently overhearing traffic from some other stream, then it must notify potential senders when that traffic is done. It does so by using an RRTS, which tells the potential senders that they may now send an RTS message. Without the RRTS, potential senders would randomly send RTS messages, with almost no chance of doing so at a time when there was no interference.

The last addition to the protocol is the ACK. This comes straight out of the E2E paper, which says that if something can be done more efficiently at the lower levels, it should be done there, rather than pushed to the ends. In this case, the use of the ACK allows for much higher throughput, because wireless links are considerably noisier and less reliable than physical links. Since the loss of a packet does not necessarily mean congestion of the link in question, TCP's congestion avoidance is not an appropriate reliability mechanism for wireless networks. Thus, the reliability is programmed in at the link layer.

There is one more significant issue that this paper raises, which is the question of the back-off counter. The paper suggests using a shared value for this back-off counter, since everyone should have a consistent view of the network. Anyone who receives a back-off value immediately stores it. The paper also suggests multiple back-off values: one for the sender and one for the receiver. If an RTS is sent but gets no response, this indicates congestion at the receiver; if a CTS is sent but the data never arrives, this indicates congestion at the sender. The relevant back-off counter is adjusted accordingly. At the same time, to avoid massive fluctuations in back-off counters, the increase/decrease algorithm is changed to an increase by a factor of 1.5 and a decrease by a constant of 1. All of these, when put together, allow a consistent view of the network by all nodes, as well as a fairly stable back-off counter.
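
The gentler back-off rule is easy to state in code; a sketch (the clamp bounds are my assumption, while the factor of 1.5 and the decrement of 1 come from the discussion above):

    # Multiplicative increase by 1.5 on failure, linear decrease by 1 on success.

    BO_MIN, BO_MAX = 2, 64   # assumed clamp values, not from the paper

    def backoff_after_failure(bo):
        return min(int(bo * 1.5), BO_MAX)

    def backoff_after_success(bo):
        return max(bo - 1, BO_MIN)

    bo = 2
    for outcome in ["fail", "fail", "ok", "ok", "fail"]:
        bo = backoff_after_failure(bo) if outcome == "fail" else backoff_after_success(bo)
        print(outcome, bo)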

This paper mentions almost nothing about nodes entering and leaving the ranges of base stations. What happens if a node leaves the range of a base station while the base station is transmitting to it? Does the lack of an ACK mean that the base station will keep retrying? Similarly, suppose a node enters the network. How does it go about making contact with a base station while making sure not to interfere with transmissions that are currently in progress? It could listen for DS messages and then send an RTS at the next available time slot, but this would certainly not work if it were alone in the network. These questions seem to be left unanswered by the paper.

Tuesday, September 23, 2008

Gb Switched Router

This paper gives a general overview of modern techniques used to improve router performance. In retrospect, it would have been much better to read this paper before reading the other paper. Too late now. This paper covers the need for, and the operation of, switched backplanes, as well as the iSLIP and ESLIP protocols.

The general idea behind routers is that there are various linecards which take incoming packets and forward them to a decision maker, and are also able to transmit outgoing packets. The key issue is the decision maker. In early routers, it was a single CPU. As this CPU became the bottleneck in routers, it was replaced with an individual decision maker for each linecard, with a bus connecting the various CPUs (so that packets can move from one linecard to another). Then, since the bus became the bottleneck, the need for switched backplanes arose.

A switched backplane is essentially a grid that allows any linecard to connect to any other linecard. The only caveat is that a linecard can only connect to a single other linecard at a time. Therefore, there needs to be a scheduler to open and close connections on the grid. The algorithm used to determine which connections to open/close is called iSLIP. It works by allowing the sources to request destinations, letting the destinations choose a source in round-robin fashion, and then letting the sources choose a destination in round-robin fashion. The sources have multiple requests because of virtual output queuing (VOQ). In VOQ, each source has multiple queues so that no packet is stuck behind other packets which cannot reach their destinations.
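
To make the request-grant-accept round concrete, here is a single-iteration sketch of the idea (my own simplification; the real iSLIP runs several iterations and only updates the pointers for matches made in the first one):

    # requests[i] is the set of outputs for which input i has a queued cell
    # (its non-empty VOQs); grant_ptr and accept_ptr are the round-robin
    # pointers of the outputs and inputs respectively.

    def islip_iteration(requests, grant_ptr, accept_ptr, n):
        # Grant phase: each output grants the first requesting input at or
        # after its round-robin pointer.
        grants = {}                                  # input -> outputs that granted it
        for out in range(n):
            requesters = {i for i in range(n) if out in requests[i]}
            for k in range(n):
                cand = (grant_ptr[out] + k) % n
                if cand in requesters:
                    grants.setdefault(cand, set()).add(out)
                    break
        # Accept phase: each input accepts the first granting output at or
        # after its own pointer; pointers advance only when a grant is accepted.
        matches = []
        for inp, offered in grants.items():
            for k in range(n):
                cand = (accept_ptr[inp] + k) % n
                if cand in offered:
                    matches.append((inp, cand))
                    accept_ptr[inp] = (cand + 1) % n
                    grant_ptr[cand] = (inp + 1) % n
                    break
        return matches

    # Input 0 wants outputs {1, 2}, input 1 wants {1}, input 2 wants {0}.
    print(islip_iteration([{1, 2}, {1}, {0}], [0, 0, 0], [0, 0, 0], 3))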

A few other improvements are possible, such as the use of ESLIP to accommodate multicast messages, and the use of priority queues to more tightly control delay.

This paper introduces a classification system for blocking in the backplane that is quite useful. The first kind is called Head of Line (HOL) blocking. HOL blocking occurs when there is no VOQ, and a packet is unable to reach its destination even though that destination is idle because that packet is second in the queue; the first packet is destined for a destination that is currently busy. While VOQ solves this issue, it does not solve the issue of input blocking or output blocking (where a packet at a given input/output cannot be transmitted because that input/output is currently transmitting a different packet). These issues cannot be solved except in the case of multicast, when they are solved by ESLIP.

Monday, September 22, 2008

Scaling Routers

I'm going to be honest here. I have no idea what this paper said. Not that it was a bad paper. I think this paper should stay in the syllabus. It's just that I would prefer to have read some more background material on router design so that I had the necessary experience to understand this paper.

That being said, this paper looks like it discusses some upcoming problems in scaling routers to deal with ever-increasing internet traffic, which seems like a good thing to do. Everybody wants their downloads to go faster, from movies and music to software updates. However, the paper seems to assume that in the future, we will have individual routers with lots of packets being pumped through them simultaneously. While I think that's possible, it seems like this router is being designed for high speed links across the continent. With large companies strategically placing caches close to clients, this kind of router may not be necessary. Then again, who knows?

The basic ideas behind this paper are doing very simple, yet elegant load balancing, and simplifying the meshing architecture. In order to get large amounts of throughput, they do a very simple round-robin type of load balancing. At the same time, by using optics to increase the speeds of the links connecting linecards, they are able to reduce the necessary number of links in the mesh.

The paper also introduces an algorithm called Full Ordered Frames First (FOFF) which allows it to guarantee that packets from the same connection are not transmitted out of order. Frankly, this seems like a somewhat silly restriction to put on a router. All of the papers we have read so far in this class have told us that the internet does not guarantee in-order delivery. The author claims that out-of-order delivery can cause unnecessary retransmissions in TCP. While these unnecessary retransmissions seem unlikely given the current implementation of TCP, if it is indeed the case that such a router would misbehave, it seems like the best thing to do would be to modify the TCP protocol.

Thursday, September 18, 2008

CSZ

This paper provides a hypothetical solution to the real-time flow vs. data flow issue. It suggests that real-time flows that need certain guarantees from the network can explicitly request these guarantees, and that using routers with WFQ implemented will achieve these guarantees. Similarly, adaptable real-time applications (those that only ask for best effort) are served by a FIFO+ queue within the WFQ to address their needs. However, these applications must provide guarantees to the network that they will operate within certain bounds (by providing a rate and a bucket size). If multiple levels of service are desired (such as an extra level of service for datagrams), then multiple FIFO+ queues can be used within the WFQ.

The idea behind a FIFO+ queue is that a packet should not be repeatedly penalized at many different hops. Therefore, any router that causes a packet to be delayed (more than the average delay time for packets of its priority) will mark the packet with the extra delay time. A router receiving that packet can then "pretend" to have received it earlier. This reduces the overall jitter that will be experienced by packets on long trips. As with most algorithms, I have trouble imagining a world in which a Tier 2 ISP would not take advantage of this by marking all of the packets from its own customers.
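
A rough sketch of the bookkeeping as I understand it (the class, the averaging constant, and the header field are all my own inventions):

    # Each hop tracks the average queueing delay for the class and accumulates
    # each packet's "excess delay" in a header field; later hops then treat
    # the packet as if it had arrived that much earlier.

    class FifoPlusHop:
        def __init__(self):
            self.avg_delay = 0.0

        def on_depart(self, pkt, queueing_delay, alpha=0.1):
            self.avg_delay = (1 - alpha) * self.avg_delay + alpha * queueing_delay
            pkt["excess_delay"] += queueing_delay - self.avg_delay

        def effective_arrival(self, pkt, actual_arrival):
            # A downstream hop "pretends" the packet arrived earlier by its
            # accumulated excess delay and orders its FIFO by that time.
            return actual_arrival - pkt["excess_delay"]

    hop = FifoPlusHop()
    pkt = {"excess_delay": 0.0}
    hop.on_depart(pkt, queueing_delay=5.0)                    # delayed more than average
    print(hop.effective_arrival(pkt, actual_arrival=100.0))   # scheduled as if it came earlier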

Once again, the most difficult part of this algorithm is the incentive. There must be some sort of incentive (other than being a good Samaritan) for a user to declare their packets to be of the "lowest" priority. Once again, economic incentives are an option, except that nobody wants to pay by the byte (just ask a Verizon Wireless customer).

Overall, I think one of the most important things that this paper has done is to classify the different kinds of traffic that flow across the internet. There exists a distinction between a two-way video chat on Skype, a video being watched on YouTube, and a high resolution trailer for Starcraft 2 that's only available for download; this paper has done an excellent job of making these distinctions clear.

Future Internet

This paper was written in 1994, and is quite insightful in that it predicts the rise of real-time video and audio links across the internet. The paper attempts to provide design decisions for the future of the internet. In particular, it suggests that there be multiple tiers of service within the internet, and that higher tiers should be explicitly requested by applications. The final suggestion is that there be admission control (i.e. the internet should reject some flows) to increase the quality of service for currently existing flows.

There are so many problems here, I don't even know where to begin. The suggestion of multiple tiers of service within the internet is a fine suggestion (because, in fact, real-time applications have very different needs when compared to applications such as FTP). Furthermore, the suggestion that the distinction be done above the IP layer (perhaps at the protocol layer) is also a good suggestion, because it allows the IP layer to serve its current purpose without taking on an unrelated purpose.

The suggestion that applications should explicitly request better service, while seemingly logical at first, creates a lot of difficulties in practice. The largest problem to be faced here is the problem of incentives. If there is no incentive to request the lower quality of service, everybody will request the higher quality of service. If the incentive is, as the paper suggests, monetary, then this will create additional problems. Never mind the fact that people generally prefer to be billed a flat rate for unlimited internet access; if a remote host starts a high QoS TCP connection to an unsuspecting victim, will he be charged for the ACKs that he sends back? As another example, it would seem impossible for a server to declare that it wanted to stream videos to users at a high QoS, while still allowing users to "opt out" of the high QoS based on financial considerations. In general, connections between different nodes that request different QoS levels will be tricky as a whole.

The admission control suggestion seems to be more applicable to traditional telephone conversations than it is to the internet. As far as the internet is concerned, everything is a packet, and there are no flows. Rejecting one flow so that another has a higher QoS would be very difficult to do on a router that is situated between two endpoints for several reasons: 1) the router may not see all of the traffic of that flow, making it difficult to reject the whole flow and 2) keeping track of which flows need high QoS and rejecting other flows because of it would entail large amounts of state within the router. The idea of rejecting flows so that others can have higher QoS is even more ridiculous when we consider where the responsibility should fall. Can it be done by any router? Can it be done by all ASs or just Tier 1 service providers? Who can decide which flows are more important than others? These issues will plague the idea of admission control if anybody ever tries to implement it.

I am even more surprised that this paper doesn't have a large focus on authentication. One of the largest issues of the modern internet is that in general, it is easy to spoof your own identity. If this issue could be solved, it would lead to great improvements in the internet from fairer routing to simpler authentication methods/interfaces. This issue is not considered very much in this paper. I am very disappointed with the paper's general lack of foresight.

Tuesday, September 16, 2008

XCP

This paper defines a new protocol in which routers and end nodes cooperate in order to achieve optimal flow and fairness. The router simply computes the difference between its current use and its optimal use, and then sends messages to nodes telling them how much to increase or decrease their window sizes by. Similarly, the end nodes must tell the routers their current congestion window size and estimated round trip time. By using information returned by the router (piggybacking on an ACK), the end nodes can reach optimal utilization in one round trip, as opposed to the many needed by TCP.
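
A very rough sketch of the control loop as I understand it (the constants and the proportional split below are my assumptions, not numbers lifted from the paper):

    # The router turns spare bandwidth and standing queue into an aggregate
    # feedback, then divides that feedback among flows using the cwnd and RTT
    # each sender advertises in its packet headers.

    def aggregate_feedback(capacity, input_rate, queue, avg_rtt,
                           alpha=0.4, beta=0.226):
        spare = capacity - input_rate
        return alpha * avg_rtt * spare - beta * queue     # bytes to add or shed

    def per_flow_feedback(phi, flows):
        # flows: list of (cwnd, rtt) pairs. Splitting proportionally to each
        # flow's rate is a simplification; the real fairness controller is
        # more subtle than this.
        total = sum(cwnd / rtt for cwnd, rtt in flows)
        return [phi * (cwnd / rtt) / total for cwnd, rtt in flows]

    phi = aggregate_feedback(capacity=125000.0, input_rate=100000.0,
                             queue=4000.0, avg_rtt=0.1)
    print(per_flow_feedback(phi, [(10000, 0.1), (30000, 0.1)]))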

This system is especially useful in that it allows for efficient communications across mediums such as satellite links, which have high bandwidth, but also high round trip times. The high round trip times mean that algorithms such as TCP take an enormous amount of time to use all of the available bandwidth (since the congestion window is adjusted by 1 per round trip). In contrast, XCP allows it to be done in just one round trip, which should (hopefully) yield desirable results in such situations.

The only flaw is the obvious inability to control misbehaving users. Users who simply transmit too many packets are likely to get away with it, because the only thing the router does is ask them to slow down. It won't start dropping their packets until its queues are full, so a misbehaving client can really steal a lot of bandwidth from other users. Even when you try to push this responsibility to the edge routers, there is a problem: the Tier 2 ASs may try to "cheat" their Tier 1 providers to get more bandwidth for their own customers.

However, this proves the point that the internet could be much more efficient if only we were able to trust the end users. It also introduces the very important notion of separating fairness control from efficiency control. By using a single algorithm to determine by how much the total flow needs to change, and a completely separate algorithm to determine how to distribute that adjustment between different nodes, they have essentially turned the router into two components which can be modified separately. This is important for future updates.

Speaking of updates, this paper glosses over the distribution issue. While it is true that it can probably coexist with TCP, separate queues for each is probably not ideal. Neither is sending a TCP packet to check for the existence of XCP, especially if that packet has to go through many routers to perform the check properly. And lastly, the installation of XCP one AS at a time is known as a "forklift upgrade", and is generally considered to be one of the worst ways to do an upgrade.

Monday, September 15, 2008

RED

This paper blathers on for pages and pages about nothing at all. I find it extremely boring to read. That being said, this paper describes a Random Early Detection algorithm for detecting congestion in gateways and signalling the nodes attached to the gateway by "marking" packets, which consists of either setting a bit in the header or dropping the packet. In order to accomplish this, it uses exponential averaging to keep track of the average queue length. This allows the algorithm to behave well in cases of "bursty" traffic, while still controlling the flow should it grow too large. It has a minimum threshold and a maximum threshold. When the calculated average is below the minimum threshold, no packets are marked. Above the maximum threshold, all packets are marked. Between the two thresholds, packets are marked with a probability to be calculated by the method below.

As the average queue length increases, the probability of marking a packet increases linearly from 0 to 1 between the minimum threshold and the maximum threshold. Thus, the formula looks something like p = (qlength - minthresh)/(maxthresh - minthresh). However, this formula generally leads to clumped markings. Since we would like the markings to be evenly spread out, we count the number of packets sent since the last marking, and then multiply this probability p by a factor: 1/(1 - count * p). With this formula, we can achieve a much better spacing between marked packets, and still (through tweaking of parameters) have the same number of marked packets. I didn't see a great mathematical justification for this in the paper. You can now skip reading the paper. That's really all there is to it. The rest is fluff.
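
Putting the two formulas together, a minimal sketch (mine, not the paper's code) looks like this; note that the paper actually scales the linear ramp by a configurable maximum probability, which I expose here as max_p:

    def update_avg(avg_q, instantaneous_q, w=0.002):
        # Exponentially weighted moving average of the queue length.
        return (1 - w) * avg_q + w * instantaneous_q

    def red_mark_probability(avg_q, min_th, max_th, count, max_p=1.0):
        if avg_q < min_th:
            return 0.0
        if avg_q >= max_th:
            return 1.0
        p_b = max_p * (avg_q - min_th) / (max_th - min_th)
        return p_b / (1 - count * p_b) if count * p_b < 1 else 1.0

    print(red_mark_probability(35, 20, 80, count=0))   # 0.25
    print(red_mark_probability(35, 20, 80, count=2))   # 0.5: marks get spread out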

I'd really recommend removing this paper from the syllabus. The idea is simple enough that it can be explained in five minutes, and the results are not fascinating. Furthermore, I find the graphs that are drawn very hard to read/understand.

The only benefits of this algorithm are 1) the small amount of state needed (an algorithm like fair queuing needs a lot of state, but this algorithm needs only a few variables per router) and 2) the ability to easily handle "bursty" traffic. The good news is that this algorithm is quite easy to implement on routers that need to handle many, many connections and cannot be bothered to keep state for all of them. The bad news is that I can't see any other purpose for it.

The paper compares the algorithm to random dropping and FCFS, which are not fantastic policies to begin with. I would be more interested to see a comparison of this algorithm to fair queuing or CSFQ. In particular, an analysis of the processing time per packet would be a very interesting paper to read.

Thursday, September 11, 2008

Core Stateless Fair Queuing

This paper provides a quick and dirty solution to the large amount of processing power needed for Fair Queuing. Instead of doing the processing at the routers, this system uses the routers along the border of an AS to estimate the rate of traffic that a given flow is generating. It then labels the packets with this information, and allows the internal routers within the AS to drop packets at random, based on their current congestion. These packets are dropped probabilistically upon arrival at routers, which use exponential averaging to estimate what the "fair share" of traffic should be.
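
For concreteness, here is a sketch of the core-router drop decision as I read it (the max(0, 1 - fair/label) rule is the part I'm confident about; the units and example numbers are made up):

    import random

    # Each packet carries the edge router's estimate of its flow's rate; the
    # core drops it with probability max(0, 1 - fair_rate / labeled_rate).

    def csfq_drop(labeled_rate, fair_rate):
        p_drop = max(0.0, 1.0 - fair_rate / labeled_rate)
        return random.random() < p_drop

    # A flow labeled at 4 Mbps against a 1 Mbps fair share loses roughly
    # three quarters of its packets.
    drops = sum(csfq_drop(4.0, 1.0) for _ in range(10000))
    print(drops / 10000.0)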

This paper was not a fantastic read, for a couple of reasons. First and foremost was the confusing notation used to describe its exponential averaging scheme. Second were the tiny graphs that describe the results (I can barely make out which line is supposed to be which).

This paper very nicely addresses the issue of fair sharing of bandwidth for small regions, but I find it unlikely that such a system would extend to Tier 1 ISPs, which undoubtedly have enormous amounts of data flowing across their borders at any given time.

This paper also leaves out an important issue that was addressed by the Fair Queuing paper, which is latency for connections that are well under their fair share. I suspect that this kind of queuing does absolutely nothing in that regard, since it does not rearrange the packets that it receives. In this case, its best contribution is the ability to punish TCP users who try to exceed their fair share of bandwidth. However, they still can't do this to UDP users, which is an important issue.

Wednesday, September 10, 2008

Fair Queuing

This paper essentially answers the questions I raised last time about adversarial endpoints who abuse the congestion avoidance schemes to get more bandwidth for themselves. In essence, the paper describes a queuing algorithm that can be used by gateways to do two things: (1) ensure fair use of the bandwidth, and (2) punish misbehaving nodes.

The idea is relatively simple. The gateway considers packets on a source-destination pair basis. This allows it to distribute resources based on individual connections, while minimizing the amount of damage that a malicious user can do (the malicious user can get more than their fair share of bandwidth, but cannot do any useful work). Then, each incoming packet is assigned a number that is roughly the sum of its arrival time (or, if later, the finish number of the same connection's previous packet) and the time it will take to send the packet. The packet with the lowest value is transmitted first. There is an additional modification to the algorithm that gives an advantage to connections that have been idle for a while (by subtracting a small nonnegative value delta).
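
A simplified sketch of that bookkeeping (the real algorithm tracks a "round number" rather than wall-clock arrival times, and the names and tie-breaking counter here are mine):

    import heapq
    import itertools

    class FairQueue:
        def __init__(self, delta=0.0):
            self.last_finish = {}      # flow (src, dst) -> finish number of its last packet
            self.heap = []
            self.delta = delta         # small promptness bonus for recently idle flows
            self._tie = itertools.count()

        def enqueue(self, flow, packet, arrival, tx_time):
            start = max(arrival - self.delta, self.last_finish.get(flow, 0.0))
            finish = start + tx_time
            self.last_finish[flow] = finish
            heapq.heappush(self.heap, (finish, next(self._tie), packet))

        def dequeue(self):
            return heapq.heappop(self.heap)[2] if self.heap else None

    fq = FairQueue(delta=0.1)
    fq.enqueue(("A", "B"), "pkt1", arrival=0.0, tx_time=1.0)
    fq.enqueue(("C", "D"), "pkt2", arrival=0.2, tx_time=0.5)
    print(fq.dequeue())   # pkt2: its finish number (0.6) beats pkt1's (1.0)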

This essentially solves the problem of having adversarial TCP nodes which do not obey congestion avoidance rules and allow their sending windows to grow too large. And by punishing nodes that misbehave, it actually encourages the use of congestion avoidance algorithms, whereas a node attached to a round robin scheduler actually sacrifices some bandwidth by using the congestion avoidance algorithms.

The paper has lots of experimental data, but the data comes from relatively simple network graphs. It would be interesting to see the performance of this algorithm with a very large network with many gateways. One of the issues that is raised in the paper is that the algorithm must be both smart and fast, and this may not be possible for a gateway that has an enormous number of packets going through it.

Tuesday, September 9, 2008

Congestion Avoidance

This paper approaches the problem of congestion avoidance from a more practical and less theoretical point of view than the previous paper. It answers one very important question that I had after reading the last paper, which was the question of initial conditions. The "slow start" described in this paper is an exponential growth algorithm that will bring the TCP window to approximately the right size before the additive increase/multiplicative decrease algorithm is used.

This paper also addresses the issue of retransmission, which is specific to TCP (unlike the previous paper, which assumed a constant stream of data, this paper actually considers the ACKs). They provide an improved algorithm for calculating the amount of time to wait before retransmission that fares better under both light and heavy congestion. In particular, it does not unnecessarily retransmit when a packet has been delayed rather than dropped. This is due to an improved measure of the variance in round-trip time. Having put these two pieces in place, it becomes clear that a timeout is almost always due to packet loss (from congestion) rather than any other cause. This means that the timeout can be used as the signal that indicates that the network is congested.
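
For reference, the mean-plus-deviation retransmit timer usually attributed to this paper looks roughly like the sketch below (the gains of 1/8 and 1/4 and the factor of four are the commonly cited values; treat the specifics as my assumption rather than a quotation):

    def rto_update(srtt, rttvar, sample, g=0.125, h=0.25, k=4):
        err = sample - srtt
        srtt = srtt + g * err                      # smoothed round-trip time
        rttvar = rttvar + h * (abs(err) - rttvar)  # smoothed deviation
        return srtt, rttvar, srtt + k * rttvar     # new timeout value

    srtt, rttvar = 0.5, 0.25
    for sample in [0.48, 0.52, 0.9, 0.5]:          # one delayed (not lost) packet
        srtt, rttvar, rto = rto_update(srtt, rttvar, sample)
        print(round(rto, 3))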

This paper is quite good at explaining TCP's current implementation. It does not delve into a lot of mathematical depth, but it backs up its claims with experimental results. This paper, in conjunction with the other one, is a very good introduction to TCP in its current form.

This paper touches on the question of an adversarial client by mentioning that the gateway can be configured to drop excess packets from a client that is using more than its share of the available bandwidth. The detection of such a scenario and the policy for the packets dropped is vague.

The paper argues that the use of the timeout to detect congestion is superior to the use of an extra bit sent from the gateway to the client, since this method does not require the modification of any hardware in the network. However, it also heavily cites the previous paper on additive increase/multiplicative decrease, which clearly states that it is assuming distinct time steps and that the results of one time step are available before the next transmission. By using the timeout, this is clearly no longer the case. An important question is whether or not this inconsistency matters, and whether the algorithm would perform better if the gateways actually did return a bit that broadcasted whether or not it was experiencing congestion.

Monday, September 8, 2008

Increase/Decrease Algorithms

EDIT: Minor note - I hadn't read the other paper yet when I wrote this summary.

This paper starts with some simple assumptions: that we wish to build a distributed congestion avoidance system, which will take as input two values: the amount of data transmitted at the last time step, and whether the bottleneck was underutilized or congested last time step. Furthermore, it assumes that the new amount of data to be transmitted will be a linear function of the amount at the previous time step. It then tries to achieve the following goals: efficiency (hovering around the point of congestion), fairness, convergence, and distribution.

The last goal is easily achieved simply by the setup of the scenario. The other three goals can be described as follows:
  • efficiency - If the system is overutilized at one time step, the total utilization will go down for the next step. Similarly, if it is underutilized, the total utilization will go up for the next step.
  • fairness - Eventually, all clients will have approximately equal utilization of the bottleneck.
  • convergence - The system should eventually stabilize. Preferably, it should stabilize quickly, and the oscillations around the optimal point should be small.
By writing out a set of equations describing the utilization, and then manipulating them to guarantee the above criteria, the paper concludes that the decrease must be multiplicative, and that the increase may be additive and multiplicative. The paper also notes that a pure additive increase converges fastest, and therefore concludes that additive increase with multiplicative decrease is the best linear formula that satisfies the four conditions.
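
The resulting window update is tiny; here is a sketch with illustrative constants (a = 1 and b = 0.5 are choices of mine, not values mandated by the paper):

    # Additive increase by a when the network is underutilized, multiplicative
    # decrease by b when it signals congestion.

    def aimd_step(window, congested, a=1.0, b=0.5):
        return window * b if congested else window + a

    w = 10.0
    for congested in [False, False, True, False, False]:
        w = aimd_step(w, congested)
        print(w)   # 11, 12, 6, 7, 8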

This explains why the TCP window grows and shrinks the way it does in current implementations. Most implementations of TCP have a window size that grows additively and shrinks multiplicatively. The indicator of whether or not there is congestion, in this case, is obviously the presence or absence of an ACK.

This paper makes the flawed assumption that all endpoints are willing to comply with the proposed algorithm. Since the algorithm is supposed to guarantee fairness, this is a terrible assumption. An adversarial client can ignore the multiplicative decrease and simply transmit at an unusually high rate. This will cause the adversary to have an unfairly large share of the bottleneck, while the well-behaved clients are told to continue reducing their transmission rates.

One important question that the paper doesn't address is the transience of the system. This is inherent in the way the internet works; clients will come and go frequently. There is a question of whether or not the system is still guaranteed to converge to the optimal efficiency if clients continue to arrive and leave.

This ties directly into the question of how quickly the system converges. If a single client is underutilizing its connection, does it take a long time or a short time for that client to approach a fair utilization? What about short-lived connections? What kind of initial conditions should a client set if it has just joined the network? And somewhat importantly, does this system inherently favor new clients, old clients, or neither? If it favors one over the other, is it possible for an adversary to get better performance than its peers by dropping and establishing a connection many times over? Or by keeping a connection alive long after it is no longer useful in the hopes that it will be useful again in the future?

Thursday, September 4, 2008

Inferring AS Relationships

This paper describes a technique that can be used to infer a set of relationships between different ASes. It first classifies these relationships into several different categories: provider-customer, sibling, and peer. The provider-customer relationship is obvious in that the provider shares its customers' routing information, but the customer does not share its provider's routing information. Peers will share routing information with each other, but do not forward routing information obtained from one peer to others. And finally, siblings share all routing information with each other.

The main way that they attempt to classify these relationships from data is to look at all the AS paths in the routing tables. Their key assumption is that no AS path goes from a provider down to a customer and then back up to a provider. This makes sense, given the definition of the provider-customer relationship, but it does not always hold in practice. The algorithm then attempts to find the topmost provider in each path, and partitions the path accordingly. By repeating this process for a large collection of AS paths, the algorithm gains enough information to determine all of the relationships between ASes.
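
Roughly, the per-path step might look like the sketch below (using node degree as a stand-in for "topmost provider" is my simplification of the heuristic, and the AS numbers and degrees in the example are made up):

    # Mark edges climbing toward the top provider as customer->provider and
    # edges after it as provider->customer.

    def annotate_path(path, degree, relations):
        """path: list of AS numbers; degree: AS -> degree in the AS graph;
        relations: dict filled with (provider, customer) pairs."""
        top = max(range(len(path)), key=lambda i: degree.get(path[i], 0))
        for i in range(top):                       # uphill portion of the path
            relations[(path[i + 1], path[i])] = "provider-customer"
        for i in range(top, len(path) - 1):        # downhill portion
            relations[(path[i], path[i + 1])] = "provider-customer"
        return relations

    rels = {}
    annotate_path([7018, 701, 3356, 25], {7018: 90, 701: 120, 3356: 200, 25: 3}, rels)
    print(rels)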

This algorithm has an interesting application: detecting incorrectly configured routers. By using an automated system to detect the current state of inter-AS relationships, the algorithm can determine when certain gateways have been incorrectly configured, and are forwarding reachability information that they are not supposed to be. By running the algorithm and then comparing the results with the intended use, it is easy to find misconfigured routers.

The most important improvements upon the algorithm seem to be in the area of adjusting for small anomalies in the routing tables. The algorithm was very successful at determining customer-provider relationships, but it was extremely unsuccessful at determining peer or sibling relationships. Part of the issue here may be that the reachability information being shared between different ASes may fall under different policies depending on the IP prefix. It would be interesting to try to separate these different policies for these different prefixes.

Interdomain Routing

This paper is an informal discussion of the BGP. It initially discusses the reasons for the existence of such a protocol; this includes a small lesson on the structure of the internet (with multiple tiers of ISPs, from universities and small businesses at tier 3 up to tier 1 ISPs). The goal of BGP4 is to allow different autonomous systems to exchange reachability information. A given tier 2 ISP wishes to inform peers about all of its customers (thereby providing its customers faster connections to customers of competing ISPs in the same region), and at the same time, it wishes to acquire reachability information from its provider (thereby giving its customers access to the parts of the internet that the ISP knows nothing about). There are a few additional constraints, such as the fact that no ISP wants to export information about its peers to other peers (since it is not benefiting directly from any traffic that its peers route through its gateways).

The protocol is fairly simple; it is essentially a collection of attributes that is forwarded from one gateway to the next (and is sometimes changed when forwarded). The first is LOCAL_PREF, which is set locally and is typically used to prefer routes learned from customers; this is the knob used (as stated above) to decide which reachability information to prefer and, in turn, which to forward. The second is the ASPATH, the list of ASes that the announcement has traversed; its length is used as a factor to determine which routing information is optimal. MED is a number an AS can use to tell a neighbor which of several links between them should be preferred (for example, asking a provider to carry traffic the long way across its own backbone rather than handing it off immediately). There is also the eBGP vs iBGP rule (prefer routes learned directly from external peers to routes that were propagated through internal gateways), as well as the IGP cost of reaching the next hop (the internal path). If everything is a tie, then the AS employs some deterministic tiebreaker.
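
The selection process described above boils down to a lexicographic comparison; here is a sketch (the attribute names are the standard BGP ones, but the dictionary layout and example routes are mine, and strictly speaking MED is only compared among routes from the same neighboring AS, which I ignore here):

    def bgp_preference_key(route):
        return (
            -route["local_pref"],                         # higher LOCAL_PREF wins
            len(route["as_path"]),                        # then shorter AS path
            route["med"],                                 # then lower MED
            0 if route["learned_via"] == "eBGP" else 1,   # prefer eBGP over iBGP
            route["igp_cost"],                            # then lower cost to the next hop
            route["router_id"],                           # deterministic tiebreaker
        )

    def best_route(routes):
        return min(routes, key=bgp_preference_key)

    candidates = [
        {"local_pref": 100, "as_path": [3356, 25], "med": 0,
         "learned_via": "eBGP", "igp_cost": 10, "router_id": "10.0.0.1"},
        {"local_pref": 100, "as_path": [701, 3356, 25], "med": 0,
         "learned_via": "iBGP", "igp_cost": 5, "router_id": "10.0.0.2"},
    ]
    print(best_route(candidates)["router_id"])   # the shorter AS path wins here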

This paper is fairly unclear at times (since it actually appears to be lecture notes, I hope nobody's offended). It does a poor job of distinguishing between the protocol itself (which is fairly simple) and the task of designing a system that will efficiently and correctly implement the protocol. Often, there are references to the topology of the gateway connections, which is a topic that (while very interconnected with the protocol) should probably be treated separately.

One of the reasons the topology is best treated separately is that (as the paper says) there are several open problems in determining the best topology. In particular, in the section on failover and scalability, it is noted that in order to have a backup path in place in case of a failure, it is necessary to pad the ASPATH that comes across a particular connection for particular IP prefixes. This is obviously not a scalable solution, and the question of how to do this is open-ended.

Tuesday, September 2, 2008

ARPA Protocol Design

This paper essentially describes the reasons for the split between the TCP and IP protocols, as well as for the adoption of the UDP protocol. On top of that, it explains why the internet as it is today uses packets and gateways rather than anything else. The first reason was for survivability. In case some part of the internet were to become non-functional, it was determined that the end users should not experience any issues as long as there remained a physical path between endpoints. Therefore, it was decided that the network should be stateless.

Second, it was important to link together many existing networks which did not share a lot of common features. The only features that were assumed were the ability to use some sort of addressing, and the ability to send packets across the network. The datagram based approach allowed the internet to work across all existing networks. Furthermore, one of the design goals was to allow a variety of different types of communication to occur over the internet. By providing the minimum possible building blocks, the datagram approach is essentially allowing many different protocols to be used, as long as those protocols are implemented by the endpoints of communication.

I think there are two priorities which could have been better addressed during the design of the internet. The first one is listed as the seventh and last priority in the paper: accountability. At the time, it was not considered very important, but it has actually become somewhat important right now to be able to trace cyber criminals back to their origins (consider hackers who write Trojans, or web hosts that contain illegal pornography). While this task has been taken over by ISPs, there is probably still room for improvement in this area.

The second priority is not mentioned at all in the paper: protection against malicious users. For protocols like TCP, it is assumed that during times of high congestion, each end user will use the same algorithm to reduce the amount of data they are sending through the network. Under normal circumstances, this ensures that the network will remain functional for all users. However, it is very possible (since these protocols are implemented on the host machines) for a malicious user to simply transmit as much data as he possibly can during a time of high congestion, thereby increasing his own throughput while punishing those who have implemented the protocol correctly (by decreasing everyone else's throughput). It would have been nice to see at least an in-depth discussion of this.