Amin Tootoonchian, Aurojit Panda, Sylvia Ratnasamy, Scott Shenker
The key challenge in scaling distributed applications often lies in managing shared state. While the distributed shared memory abstraction is a natural fit, retaining the performance benefits of single-node shared memory applications in a distributed setting remains challenging. Tasvir is a software distributed shared memory system that lets applications retain memory-local performance and gracefully trade user-configurable visibility delay for scale. Tasvir’s design enables optimizations for state synchronization that are otherwise hard to achieve at the application level. With microbenchmarks, we show that Tasvir adds negligible overhead for read-heavy workloads, while its overhead for write-all workloads is around 10% of the application's runtime -- almost an order of magnitude lower than application-level log-replay. We are currently exploring Tasvir’s use cases in NFV, key-value stores, and machine learning. So far, we have added Tasvir support to the Redis key-value store; early results suggest that Tasvir adds negligible overhead in a write-heavy YCSB benchmark.
Emmanuel Amaro, Peter X. Gao, Aurojit Panda, Sylvia Ratnasamy, Scott Shenker
Demand paging is the fundamental mechanism for extending physical memory in operating systems. Despite its obvious necessity, paging is considered harmful due to its high performance cost. In this paper we argue that this performance hit is not fundamental, and can be rectified by speeding up access to swap storage and improving the design of the paging mechanism itself. New technologies, such as RDMA and 3D XPoint, have lowered access latencies, so this paper focuses on how to improve the paging mechanism. To that end, we present SyncSwap, a redesign of the swap subsystem that is 1.1× to 22.3× faster than Infiniswap, the most recent RDMA-based swap design. We further show how the increased performance of SyncSwap can potentially increase datacenter utilization.
Ethan J. Jackson, Aurojit Panda, Kevin Lin, Luise Valentin, Johann Schleier-Smith, Nicholas Sun, Melvin Walls, Vivian Fang, Yuen Mei Wan, Scott Shenker
Recent industry trends indicate a shift toward programmatic management of distributed infrastructure. While the benefits of infrastructure APIs are widely understood, decidedly less attention has been paid to the design of such APIs. The de facto standard approach -- a RESTful interface paired with a YAML representation -- leads to unnecessary complexity for both container orchestrator implementations and distributed application developers. We argue that a better API for programmatic infrastructure is a general-purpose programming language. Such a language allows specification of distributed applications with strong primitives for abstraction, composition, and sharing, all while allowing deployment engines to remain ignorant of high-level constructs. We present Quilt, an open source project that demonstrates these principles with two components: Quilt.js, a JavaScript framework tailored to distributed application specification, and the Quilt Reference Implementation, which deploys Quilt.js specifications across multiple cloud providers.
Zafar Qazi, Melvin Walls, Aurojit Panda, Vyas Sekar, Sylvia Ratnasamy, Scott Shenker
Cellular traffic continues to grow rapidly, making the scalability of the cellular infrastructure a critical issue. However, there is mounting evidence that the current Evolved Packet Core (EPC) is ill-suited to meeting these scaling demands: EPC solutions based on specialized appliances are expensive to scale, while recent software EPCs perform poorly, particularly with increasing numbers of devices or signaling traffic. We postulate that the poor scaling of existing EPC systems stems from the manner in which the system is decomposed, which leads to device state being duplicated across multiple components, which in turn results in frequent interactions between the different components. Instead, we design an alternate system architecture, PEPC, in which state for a single device is consolidated in one location and EPC functions are (re)organized for efficient access to this consolidated state. We prototype PEPC and show that it achieves high and scalable performance.
Kay Ousterhout, Christopher Canel, Max Wolffe, Sylvia Ratnasamy, Scott Shenker
In today's data analytics frameworks, many users struggle to reason about the performance of their workloads. Without an understanding of what factors are most important to performance, users can't determine what configuration parameters to set and what hardware to use to optimize runtime. The Monotasks project explores an execution model designed to make it easy for users to reason about performance bottlenecks. Rather than breaking jobs into tasks that pipeline many resources, we propose breaking jobs into monotasks - units of work that each use a single resource. We have found that explicitly separating the use of different resources simplifies reasoning about performance without sacrificing performance. Furthermore, separating the use of different resources allows for new optimizations to improve performance.
Aisha Mushtaq, Radhika Mittal, James Murphy McCauley, Sylvia Ratnasamy, Scott Shenker
In this paper we examine which factors in congestion control algorithms are key to achieving good performance. We find that the most essential feature is the switch scheduling algorithm: congestion control mechanisms that use Shortest-Remaining-Processing-Time (SRPT) achieve superior performance as long as the rate-setting algorithm at the end host is reasonable. We further find that while SRPT's performance is quite robust to end host behaviors, the performance of schemes that use scheduling algorithms like FIFO and FQ depends far more crucially on the rate-setting algorithm, and their performance is typically worse than what can be achieved with SRPT. Given these findings, we ask whether it is practical to realize SRPT in switches without requiring custom hardware. We analyze a simple scheme that emulates SRPT using only a small number of priority-scheduled queues and show that it achieves performance close to true SRPT in datacenter contexts. Finally, we describe the end host and switch software changes required to deploy this SRPT emulation in a datacenter.
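As a rough illustration of the emulation idea above, the sketch below (not code from the paper; the thresholds, queue count, and names are invented) buckets flows into a handful of strict-priority FIFO queues by remaining flow size, so that flows with less data left are served first, approximating SRPT:

```python
import bisect
from collections import deque

# Illustrative byte thresholds splitting flows into 4 priority levels;
# real deployments would tune these to the flow-size distribution.
THRESHOLDS = [10_000, 100_000, 1_000_000]

def priority_for(remaining_bytes):
    """Map remaining flow size to a priority level (0 = highest).
    Smaller remaining size -> higher priority, as in SRPT."""
    return bisect.bisect_left(THRESHOLDS, remaining_bytes)

class PrioritySwitchPort:
    """A port with a few strict-priority FIFO queues emulating SRPT."""
    def __init__(self, levels=len(THRESHOLDS) + 1):
        self.queues = [deque() for _ in range(levels)]

    def enqueue(self, pkt, remaining_bytes):
        self.queues[priority_for(remaining_bytes)].append(pkt)

    def dequeue(self):
        for q in self.queues:  # strict priority: lowest index first
            if q:
                return q.popleft()
        return None
```

With exponentially spaced thresholds, a small constant number of queues covers a wide range of flow sizes, which is why commodity priority-scheduled hardware suffices.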
Radhika Mittal, Alexander Shpiner, Aurojit Panda, Eitan Zahavi, Arvind Krishnamurthy, Sylvia Ratnasamy, Scott Shenker
In recent years, the usage of RDMA in datacenter networks has increased significantly, with RoCE (RDMA over Converged Ethernet) emerging as the canonical approach to deploying RDMA in Ethernet-based datacenters. RoCE NICs only achieve good performance when run over a lossless network, which is done through the use of Ethernet's Priority Flow Control (PFC) mechanism. However, PFC introduces significant problems, such as head-of-the-line blocking, congestion spreading, and occasional deadlocks. In this paper, we ask: is PFC fundamentally required for deploying RDMA over Ethernet, or is its use merely an artifact of the current RoCE NIC design? We find that while PFC is indeed needed for current RoCE NICs, it is unnecessary (and sometimes significantly harmful) when one updates RoCE NICs to a more appropriate (yet still feasible) design. Thus, our findings suggest that to avoid the many problems with PFC in RDMA datacenters, we should adopt this new RoCE NIC design.
Aurojit Panda, Sangjin Han, Keon Jang, Melvin Walls, Sylvia Ratnasamy, Scott Shenker
The move from hardware middleboxes to software network functions, as advocated by NFV, has proven more challenging than expected. Developing new NFs remains a tedious process, with developers frequently having to rediscover and reapply the same set of optimizations, while current techniques for safely running multiple NFs (using VMs or containers) incur high performance overheads. In this paper we describe NetBricks, a new NFV framework that aims to improve both the building and running of NFs. For building NFs we take inspiration from databases and modern data analytics frameworks (e.g., Spark and MapReduce) and build a framework with a small set of customizable network processing elements. To improve execution performance, NetBricks builds on safe languages and runtimes to provide isolation in software, rather than relying on hardware isolation. NetBricks provides memory isolation comparable to VMs, without the associated performance penalties. To provide efficient I/O, we introduce a novel technique called zero-copy software isolation.
Peter Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, Scott Shenker
Traditional datacenters are designed as a collection of servers, each of which tightly couples the resources required for computing tasks. Recent industry trends suggest a paradigm shift to a disaggregated datacenter (DDC) architecture containing a pool of resources, each built as a standalone resource blade and interconnected using a network fabric. A key enabling (or blocking) factor for disaggregation will be the network - to support good application-level performance it becomes critical that the network fabric provide low latency communication even under the increased traffic load that disaggregation introduces. In this paper, we use a workload-driven approach to derive the minimum latency and bandwidth requirements that the network in disaggregated datacenters must provide to avoid degrading application-level performance and explore the feasibility of meeting these requirements with existing system designs and commodity networking technology.
Radhika Mittal, Rachit Agarwal, Sylvia Ratnasamy, Scott Shenker
In this paper we address a seemingly simple question: is there a universal packet scheduling algorithm? More precisely, we analyze (both theoretically and empirically) whether there is a single packet scheduling algorithm that, at a network-wide level, can perfectly match the results of any given scheduling algorithm. We find that in general the answer is ``no''. However, we show theoretically that the classical Least Slack Time First (LSTF) scheduling algorithm comes closest to being universal and demonstrate empirically that LSTF can closely replay a wide range of scheduling algorithms. We then evaluate whether LSTF can be used in practice to meet various network-wide objectives by looking at popular performance metrics (such as average FCT, tail packet delays, and fairness); we find that LSTF performs comparably to the state-of-the-art for each of them. We also discuss how LSTF can be used in conjunction with active queue management schemes (such as CoDel and ECN) without changing the core of the network.
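The core of LSTF can be sketched in a few lines: each packet carries a slack value, each hop transmits the packet with the least remaining slack, and time spent waiting is charged against the slack before the packet moves downstream. The sketch below is illustrative only; the class and method names are ours, not the paper's:

```python
import heapq
import itertools

class LSTFQueue:
    """Per-port queue that always transmits the packet with the least
    remaining slack -- a sketch of Least Slack Time First (LSTF)."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal slack

    def enqueue(self, pkt, slack):
        heapq.heappush(self._heap, (slack, next(self._counter), pkt))

    def dequeue(self, wait_time=0):
        """Pop the least-slack packet; subtract the time it spent waiting
        so downstream hops see the updated slack value."""
        if not self._heap:
            return None
        slack, _, pkt = heapq.heappop(self._heap)
        return pkt, slack - wait_time
```

The key property is that the scheduling decision at every hop depends only on the slack carried in the packet header, which is what lets LSTF replay such a wide range of other schedulers by choosing initial slack values appropriately.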
Amin Tootoonchian, Aurojit Panda, Chang Lan, Melvin Walls, Katerina Argyraki, Sylvia Ratnasamy, Scott Shenker
Carriers have long provided performance guarantees in the form of service-level objectives (SLOs). The Network Function Virtualization (NFV) movement is causing carriers to replace dedicated middleboxes with Virtual Network Functions (VNFs) consolidated on shared servers. In their haste to adopt this more cost-effective approach, they have ignored the question of how (and even whether) one can achieve SLOs with software packet processing. The key challenge is the high variability and unpredictability in throughput and latency introduced when VNFs are consolidated. We present ResQ, a resource manager for NFV that guarantees a high degree of performance isolation among consolidated VNFs, and enforces performance SLOs for multi-tenant NFV clusters in a resource-efficient manner. We show that using ResQ, for a wide range of VNFs, the maximum throughput and 95th percentile latency degradation is below 2.9% and 2.5% (compared to 44.3% and 24.5%) respectively. ResQ also achieves 60%-236% better resource efficiency for enforcing SLOs that contain contention-sensitive VNFs compared to previous work.
Shoumik Palkar, Chang Lan, Sangjin Han, Keon Jang, Aurojit Panda, Christian Maciocco, Sylvia Ratnasamy, Joshua Reich, Luigi Rizzo, Scott Shenker
By moving network appliance functionality from proprietary hardware to software, Network Function Virtualization promises to bring the advantages of cloud computing to network packet processing. However, the evolution of cloud computing (particularly for data analytics) has greatly benefited from application-independent methods for scaling and placement that achieve high efficiency while relieving programmers of these burdens. NFV has no such general management solutions. To fill this gap, we present E2 -- a scalable and application-agnostic scheduling framework for packet processing.
Sangjin Han, Keon Jang, Dongsu Han, Sylvia Ratnasamy
Modern NICs implement various features in hardware, such as protocol offloading, multicore support, traffic control, and self-virtualization. This approach exposes several issues: protocol dependence, limited hardware resources, and incomplete/buggy/non-compliant implementations. Even worse, the slow evolution of hardware NICs, due to increasingly overwhelming design complexity, cannot keep pace with new protocols and rapidly changing network architectures. We introduce the SoftNIC architecture to fill the gap between hardware capabilities and user demands. Our current SoftNIC prototype implements sophisticated NIC features on a few dedicated processor cores, while assuming only streamlined functionality in hardware. Preliminary evaluation results show that most NIC features can be implemented in software at minimal performance cost, while the flexibility of software provides further potential benefits.
Justine Sherry, Peter X. Gao, Soumya Basu, Aurojit Panda, Mazhiar Manesh, Luigi Rizzo, Christian Maciocco, Sylvia Ratnasamy, Arvind Krishnamurthy, Scott Shenker
Middleboxes -- such as proxies, WAN optimizers, and intrusion detection systems (IDSes) -- are often stateful, keeping logs of active connections, port mappings, packet caches, and other data about users, connections, and services. When middleboxes fail, lost state can lead to reset connections, lost logs, and security concerns. Hence, like other systems, it is desirable to design middleboxes for high availability, with automatic failover when a device suffers a hardware or electrical failure; such failover should ensure that no state is lost. However, middleboxes are a challenging target for HA, first because their state changes rapidly (sometimes even updating per-packet) and second, because latency expectations for packet service are typically under a millisecond to avoid inflating flow completion times. To this end, we present FTMB, a record-and-replay approach to middlebox failover which records middlebox state without imposing a heavy latency penalty on traffic.
Aurojit Panda, Murphy McCauley, Amin Tootoonchian, Ahmed ElHassany, Vjeko Brajkovic, Barath Raghavan, Scott Shenker
With the increasing prevalence of middleboxes, networks today are capable of doing far more than merely delivering packets. In fact, to realize their full potential for both supporting innovation and generating revenue, we should think of carrier networks as service-delivery platforms. This requires providing open interfaces that allow third-parties to leverage carrier-network infrastructures in building global-scale services. In this position paper, we take the first steps towards making this vision concrete by identifying a few such interfaces that are both simple-to-support and safe-to-deploy (for the carrier) while being flexibly useful (for third-parties).
Colin Scott, Andreas Wundsam, Barath Raghavan, Aurojit Panda, Andrew Or, Jefferson Lai, Eugene Huang, Zhi Liu, Ahmed El-Hassany, Sam Whitlock, H.B. Acharya, Kyriakos Zarifis, Scott Shenker
Software bugs are inevitable in software-defined networking control software, and troubleshooting is a tedious, time-consuming task. In this paper we discuss how to improve control software troubleshooting by presenting a technique for automatically identifying a minimal sequence of inputs responsible for triggering a given bug, without making assumptions about the language or instrumentation of the software under test. We apply our technique to five open source SDN control platforms -- Floodlight, NOX, POX, Pyretic, and ONOS -- and illustrate how the minimal causal sequences our system found aided the troubleshooting process.
Jonas Fietz, Sam Whitlock, George Ioannidis, Ed Bugnion, Katerina Argyraki
Where should multi-tenant data centers implement the network abstractions (virtual broadcast domains, virtual private networks, security groups) that are exposed to tenants? The current trend is to implement these abstractions at the hypervisor, in most cases without any support from physical networking equipment. With this project, we explore the idea of leveraging the architectural support present in commodity ASICs to offload the implementation of network abstractions from the hypervisors. The project focuses on two specific scenarios: (a) VXLAN networks that expose to their tenants virtual broadcast domains and (b) non-virtualized networks like Amazon EC2 that expose to their tenants VM IP addresses and security groups. Our key difference from the current state of practice and research is that we think of the compute rack -- the top-of-rack switch and its associated servers -- as the natural building block for datacenter infrastructure, rather than individual hypervisors, with the hope that we can demonstrate greater performance and scalability, without loss of flexibility.
Aurojit Panda, Katerina Argyraki, Scott Shenker
We explore how to verify useful properties about networks that include "dynamic" elements, whose state and functionality may depend on previously observed traffic, e.g., caches, WAN optimizers, firewalls, and DPI boxes. We present the design and implementation of a tool that takes as input a network specification and verifies properties such as "traffic from host A will never reach host B directly or indirectly (e.g., through caching)"; or "traffic from A to B will always pass through a given middlebox (e.g., firewall or transcoder)." Our tool leverages recent advances in model checking. The challenge lies in scaling model checking with network size and complexity, and we address this by (a) modeling only globally visible middlebox behavior and (b) defining and focusing on "rest of network oblivious" (RONO) properties --- properties that hold for a given traffic class independently from the rest of the network state. We have implemented our approach and can verify realistic invariants on very large networks containing 30,000 middleboxes.
Mihai Dobrescu, Katerina Argyraki
Software dataplanes are emerging as an alternative to traditional hardware switches and routers, promising programmability and short time to market. These advantages are set against the risk of disrupting the network with bugs, unpredictable performance, or security vulnerabilities. This project explores the feasibility of verifying software dataplanes to ensure smooth network operation. For general programs, verifiability and performance are competing goals; our starting point is that software dataplanes are different---we can write them in a way that enables verification and preserves performance. The goal of the project is to produce (a) a set of rules for writing verification-friendly dataplane software and (b) a verification tool that takes as input a software dataplane that follows these rules, and (dis)proves that the dataplane satisfies useful properties like crash-freedom, bounded-execution, and filtering properties.
Seyed Fayazbakhsh, Sagar Chaki, Vyas Sekar
Many recent efforts have leveraged Software-Defined Networking (SDN) capabilities to enable new and more efficient ways of testing the correctness of a network’s forwarding behaviors. However, realistic network settings induce two additional sources of complexity that fall outside the scope of existing SDN testing frameworks: (1) the complex nature of real-world data planes (e.g., stateful firewalls, dynamic behaviors of proxy caches), and (2) the complexity of intended network policies (e.g., service chaining). In this project, we envision a new testing framework called FlowTest for testing such stateful and dynamic network policies, one that systematically explores the state space of the network data plane to verify its behavior w.r.t. policy goals.
Zafar Qazi, Samir Das, Vyas Sekar
Network functions virtualization (NFV) is an appealing vision that promises to dramatically reduce capital and operating expenses for cellular providers. However, existing efforts in this space leave open broad issues about how NFV deployments should be instantiated or how they should be provisioned. In this project, we plan to develop a quantitative framework that will help network operators systematically evaluate the potential benefits that different points in the NFV design space can offer.
Zafar Qazi, Luis Chiang, Cheng Chun Tu, Minlan Yu, Vyas Sekar
Networks today rely on middleboxes to provide critical performance, security, and policy compliance capabilities. Achieving these benefits and ensuring that the traffic is directed through the desired sequence of middleboxes requires significant manual effort and operator expertise. In this respect, Software-Defined Networking (SDN) offers a promising alternative. Middleboxes, however, introduce new aspects (e.g., policy composition, resource management, packet modifications) that fall outside the purview of traditional L2/L3 functions that SDN supports (e.g., access control or routing). The SIMPLE project is developing an SDN-based policy enforcement layer for efficient middlebox-specific "traffic steering". In designing SIMPLE, we take an explicit stance to work within the constraints of legacy middleboxes and existing SDN interfaces. To this end, we address algorithmic and system design challenges to demonstrate the feasibility of using SDN to simplify middlebox traffic steering. In doing so, we also take a significant step toward addressing industry concerns surrounding the ability of SDN to integrate with existing infrastructure and support L4–L7 capabilities.
Seyed Fayazbakhsh, Luis Chiang, Minlan Yu, Jeff Mogul, Vyas Sekar
Middleboxes provide key security and performance guarantees in networks. Unfortunately, the dynamic traffic modifications they induce make it difficult to reason about network management tasks such as access control, accounting, and diagnostics. This also makes it difficult to integrate middleboxes into SDN-capable networks and leverage the benefits that SDN can offer. In response, we develop the FlowTags architecture. FlowTags-enhanced middleboxes export tags to provide the necessary causal context (e.g., source hosts or internal cache hit/miss state). SDN controllers can configure the tag generation and tag consumption operations using new FlowTags APIs. These operations help restore two key SDN tenets: (i) bindings between packets and their "origins," and (ii) ensuring that packets follow policy-mandated paths. We develop new controller mechanisms that leverage FlowTags. We show the feasibility of minimally extending middleboxes to support FlowTags. We also show that FlowTags imposes low overhead over traditional SDN mechanisms. Finally, we demonstrate the early promise of FlowTags in enabling new verification and diagnosis capabilities.
Seyed Fayazbakhsh, Mike Reiter, Vyas Sekar
Network function outsourcing (NFO) enables enterprises and small businesses to achieve the performance and security benefits offered by middleboxes (e.g., firewall, IDS) without incurring high equipment or operating costs that such functions entail. In order for this vision to fully take root, however, we argue that NFO customers must be able to verify that the service is operating as intended w.r.t.: (1) functionality (e.g., did the packets traverse the desired sequence of middlebox modules?); (2) performance (e.g., is the latency comparable to an "in-house" service?); and (3) accounting (e.g., are the CPU/memory consumption being accounted for correctly?). In this work, we formalize these requirements and present a high-level roadmap to address the challenges involved.
Norbert Egi, Sylvia Ratnasamy, Mike Reiter, Guangyu Shi, Vyas Sekar
Today middlebox platforms are expensive and closed systems, with little or no hooks for extensibility. Furthermore, they are acquired from independent vendors and deployed as standalone devices with little cohesiveness in how the ensemble of middleboxes is managed. As network requirements continue to grow in both scale and variety, this bottom-up approach puts middlebox deployments on a trajectory of growing device sprawl with corresponding escalation in capital and management costs. To address this challenge, we present CoMb, a new architecture for middlebox deployments that systematically explores opportunities for consolidation, both at the level of building individual middleboxes and in managing a network of middleboxes. Our initial prototype implementation in Click showed that CoMb reduces network provisioning costs by 1.8-2.5x and reduces the load imbalance in a network by 2-25x.
Praveen Naga Katta, Haoyu Zhang, Michael Freedman, and Jennifer Rexford
Software-defined infrastructure relies on a logically-centralized controller to run higher-level applications and control a distributed collection of nodes (e.g., switches and middleboxes). If the controller fails, the network fails. Replicating the controller is essential for reliability. However, conventional fault-tolerance techniques like replicated state machines are not sufficient for replicating the controller, because the underlying network nodes maintain state across multiple interactions with the controller. We are designing and prototyping a new replication technique -- an extension of replicated state machines -- that enables seamless failover from one controller instance to another, while offering a simple abstraction to controller applications.
Praveen Naga Katta, Omid Alipourfard, Jennifer Rexford, and David Walker
To realize fine-grained policies in SDNs, hardware switches leverage Ternary Content Addressable Memory (TCAM) to match packets on multiple header fields at line rate. However, most commodity switches have just a few thousand TCAM entries. Although future switches will have larger TCAMs, the cost and power requirements of TCAM remain a limitation. We propose a hardware-software hybrid switch design that relies on rule caching to provide large rule tables at low cost. Unlike traditional caching solutions, we neither cache individual rules (to respect rule dependencies) nor compress rules (to preserve the per-rule traffic counts). Instead we "splice" long dependency chains to cache smaller groups of rules while preserving the semantics of the network policy. Our design satisfies four core criteria: (1) elasticity (combining the best of hardware and software switches), (2) transparency (faithfully supporting native OpenFlow semantics, including traffic counters), (3) fine-grained rule caching (placing popular rules in the TCAM, despite dependencies on less-popular rules), and (4) adaptability (to enable incremental changes to the rule caching as the policy changes).
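The splicing idea can be illustrated with a toy model (the set-based match representation and all names are ours, not the project's): to cache a popular rule whose semantics depend on overlapping higher-priority rules, also install narrowed "cover" entries that punt just the overlapping traffic back to the software switch, rather than caching the whole dependency chain:

```python
# Toy model: a rule matches a set of header values; higher "prio" wins.
def overlaps(m1, m2):
    return bool(m1 & m2)

def cache_with_cover(rules, wanted):
    """Install `wanted` in the TCAM without breaking policy semantics:
    for each overlapping higher-priority rule, add a 'cover' entry,
    narrowed to the overlap, that punts to the software switch."""
    tcam = []
    deps = [r for r in rules
            if r["prio"] > wanted["prio"]
            and overlaps(r["match"], wanted["match"])]
    for dep in sorted(deps, key=lambda r: -r["prio"]):
        tcam.append({"prio": dep["prio"],
                     "match": dep["match"] & wanted["match"],  # narrowed
                     "action": "to_software"})
    tcam.append(wanted)
    return tcam
```

The cover entries are what "splice" the dependency chain: packets that would have hit the less-popular higher-priority rules still get correct treatment (in software, where counters are also updated), while the popular rule enjoys TCAM line rate.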
Srinivas Narayana, Jennifer Rexford, and David Walker
Monitoring the flow of traffic along network paths is essential for SDN programming and troubleshooting. For example, traffic engineering requires measuring the ingress-egress traffic matrix; debugging a congested link requires determining the set of sources sending traffic through that link; and locating a faulty device might involve detecting how far along a path the traffic makes progress. We introduce a query language that allows each SDN application to specify queries independently of the forwarding state or the queries of other applications. The queries use a regular-expression-based path language that includes SQL-like "groupby" constructs for count aggregation. We track the packet trajectory directly on the data plane by converting the regular expressions into an automaton, and tagging the automaton state (i.e., the path prefix) in each packet as it progresses through the network.
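A toy version of this compilation step, with an invented query and switch names: the path query "traverses switch A and later switch C" becomes a small automaton whose current state travels in the packet's tag field and is advanced at each hop:

```python
# Automaton for the illustrative query "path passes through A, then later C".
# States: 0 = start, 1 = saw A, 2 = saw A then C (accepting).
TRANSITIONS = {
    (0, "A"): 1,
    (1, "C"): 2,
}
ACCEPT = {2}

def on_hop(tag, switch):
    """Run at each switch in the data plane: advance the automaton state
    stored in the packet's tag. Unmatched hops keep the current state
    (i.e., the query is an unanchored 'A then C' pattern)."""
    return TRANSITIONS.get((tag, switch), tag)

def traverse(path):
    """Simulate a packet walking the given list of switches."""
    tag = 0
    for sw in path:
        tag = on_hop(tag, sw)
    return tag in ACCEPT
```

Because the tag encodes the matched path prefix, the query is answered entirely in the data plane: the controller only needs to count packets whose tag reaches an accepting state, with no per-packet involvement.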
Xin Jin, Jennifer Gossels, Jennifer Rexford, and David Walker
To realize the vision of SDN--an "app store" for network-management services--we need a way to compose applications developed for different controller platforms. For instance, an enterprise may want to combine a firewall written on OpenDaylight with a load balancer on Ryu and a monitoring application on Floodlight. To make this vision a reality, we propose a new kind of hypervisor that allows multiple applications to collaborate in processing the same traffic. Inspired by past work on Frenetic, our hypervisor supports a flexible configuration language that can combine packet-processing rules from different applications using composition operators. The hypervisor also protects the network from misbehaving controller applications by applying access-control policies that constrain what each controller can see or do, and can present each controller with its own view of the network topology. Our prototype is an extension of ON.Lab's OpenVirteX platform.
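A minimal sketch of what such composition operators might look like (our own illustrative model, not the hypervisor's actual API): each application is a function from a packet to a list of output packets, and operators combine applications sequentially (one's output feeds the next) or in parallel (both see the same traffic):

```python
def sequential(f, g):
    """Sequential composition: packets processed by f, then each of
    f's outputs processed by g (e.g., firewall then load balancer)."""
    def composed(pkt):
        out = []
        for p in f(pkt):
            out.extend(g(p))
        return out
    return composed

def parallel(f, g):
    """Parallel composition: both applications see the packet; the
    results are combined (e.g., routing alongside monitoring)."""
    def composed(pkt):
        return f(pkt) + g(pkt)
    return composed

# Illustrative applications (names are ours): a firewall that drops
# telnet traffic, and a monitor that marks packets it has counted.
def firewall(pkt):
    return [] if pkt.get("port") == 23 else [pkt]

def monitor(pkt):
    return [dict(pkt, seen=True)]
```

In the hypervisor setting, the same operators would be applied to the rule tables the controllers install rather than to packets directly, with the hypervisor compiling the composed policy into a single flow table.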
Peng Sun, Laurent Vanbever, and Jennifer Rexford
Large enterprises, like university campuses and corporations, typically connect to the Internet via multiple upstream providers. Most of the traffic comes from external servers to internal clients, leading to congestion on the incoming links at the border routers. However, existing techniques for traffic engineering focus mainly on multi-homed route control for outgoing traffic. Inbound traffic engineering is more challenging, because Internet routing is destination-based, meaning that the sending network selects the path. In this work, we design an SDN-based solution that controls which entry point receives which set of traffic flows. We evaluate our technique "in the wild" using the Transit Portal, and on local measurements from the Princeton campus network.