Taurus: An Intelligent Data Plane

Tushar Swamy, Alexander Rucker, Muhammad Shahbaz, Neeraja Yadwadkar, Yaqi Zhang, and Kunle Olukotun
Stanford University

ABSTRACT
Emerging trends such as cloud computing, the internet of things, and augmented and virtual reality demand highly responsive, available, secure, and scalable networks to meet users’ quality-of-experience expectations. Operators currently manage these networks and protocols with a variety of ad hoc tools and scripts; however, the unpredictable and complex interactions between network conditions and workloads make such manual tuning difficult.

Machine learning (ML) can help approximate and automate these complex interactions that govern today’s hyperscale datacenter networks [3, 4, 7]. Recent proposals generate ML models for networks to produce recommendations for policies like routing and congestion control [16]. At present, these models run on a logically centralized control plane that infers learned policies, causing delays of tens of milliseconds when updating network devices [5, 9]. This is because modern reconfigurable switching devices (e.g., RMT [1]) lack the operations (i.e., loops and multiplication) needed to run these ML models in the data plane. Therefore, for policies like anomaly detection, where inputs to the ML model may vary over time (e.g., payload size or time-windowed features [15]), most packets, even those of a single flow, need to traverse the control plane, significantly increasing load on the controller and inflating flow latencies [11].

In this paper, we present Taurus, an intelligent data plane architecture for ML inference at line rate. Taurus extends the Protocol Independent Switch Architecture (PISA) [1, 8] by adding an ML-capable block with a map-reduce abstraction to the match-action table pipeline (Figure 1a). The map-reduce block receives pre-processed network and packet features from the parser and the preceding match-action tables, and feeds its results to the following match-action tables for post-processing to set the network action (e.g., drop, route, or encapsulate a packet based on the prediction). The design of the map-reduce block is based on a spatial SIMD architecture that can support a variety of ML models. It is composed of Compute Units (CUs) and Memory Units (MUs) interleaved in a grid and joined by a static interconnect (Figure 1b) [13]. CUs are composed of programmable Functional Units (FUs) and registers organized across lanes and stages; a CU can perform a map, a reduction, or both. Restricting CUs to these map and reduce patterns allows high performance for regularly structured applications (e.g., ML) with low configuration overhead.
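To make the map-reduce abstraction concrete, the following minimal Python sketch expresses one inference as an element-wise map followed by a reduction, then maps the prediction to a network action as the following match-action tables would. It is an illustration only, not the Taurus programming interface: the linear form of the SVM, the function names (`map_lanes`, `reduce_lanes`, `linear_svm_predict`, `select_action`), and all feature, weight, and bias values are hypothetical.

```python
from functools import reduce
import operator


def map_lanes(fn, *vectors):
    """'Map': apply fn independently to each lane (element position)."""
    return [fn(*lane) for lane in zip(*vectors)]


def reduce_lanes(fn, vector, init):
    """'Reduce': combine the values of all lanes into a single scalar."""
    return reduce(fn, vector, init)


def linear_svm_predict(features, weights, bias):
    """Linear SVM decision (an assumed model form): sign(w . x + b)."""
    products = map_lanes(operator.mul, features, weights)  # map: x_i * w_i
    score = reduce_lanes(operator.add, products, bias)     # reduce: sum + b
    return 1 if score >= 0 else 0                          # 1 = anomalous


def select_action(label):
    """Post-processing step: turn the prediction into a network action."""
    return "drop" if label == 1 else "forward"


# Hypothetical pre-processed packet features and model parameters.
weights = [1.0, -0.75, 0.25, 2.0]
bias = 0.5
benign = [0.5, 2.0, 1.0, 0.0]
suspect = [0.5, 0.0, 1.0, 1.0]

for name, features in (("benign", benign), ("suspect", suspect)):
    label = linear_svm_predict(features, weights, bias)
    print(name, select_action(label))
```

A dense neural-network layer fits the same pattern: each output neuron is a map (multiply) followed by a reduce (add), and the activation function is a further map over the outputs.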

![Figure 1: Taurus data plane architecture](image)

Table 1 shows that the cost of adding ML models to a network data plane is small. Taurus can run simple models such as SVM-based anomaly detection [10] with as little as 6.1% area and 1.1% power overhead. The Deep Learning (DL) network [14] consumes more resources, but its area and power overheads are still under 12% and 2%, respectively. Both models meet the line rate of high-end switches, one billion packets per second (1 GPkt/s). The third application, Indigo [16], is an endpoint application for congestion control that could be deployed on Taurus-based network interface cards (NICs). Indigo’s DL network is unrolled to meet a 40 Gbps line rate for minimum-sized packets (i.e., 0.08 GPkt/s). While the original DL network ran once every 10 ms, a Taurus-based NIC pipeline runs Indigo at 12.5 ns intervals. With Taurus, we demonstrate that data plane devices can perform ML inference at line rate with latencies several orders of magnitude lower than those of traditional control-plane approaches [2, 5, 12].
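The packet-rate figures for Indigo can be checked with a few lines of arithmetic. The sketch below assumes a 64-byte minimum packet with no preamble or inter-frame gap; that size is an assumption, not stated in the text, chosen because it reproduces the quoted 0.08 GPkt/s after rounding. The 12.5 ns interval then follows from the rounded rate.

```python
LINE_RATE_BPS = 40e9           # 40 Gbps NIC line rate (from the text)
MIN_PACKET_BYTES = 64          # assumption: minimum packet size, no framing overhead
ORIGINAL_INTERVAL_NS = 10e6    # original Indigo model ran once every 10 ms


def packets_per_second(line_rate_bps, packet_bytes):
    """Packets per second when back-to-back packets saturate the link."""
    return line_rate_bps / (packet_bytes * 8)


def interval_ns(pkts_per_s):
    """Time between consecutive packets (one inference each), in ns."""
    return 1e9 / pkts_per_s


exact_rate = packets_per_second(LINE_RATE_BPS, MIN_PACKET_BYTES)
rounded_rate = 80e6            # 0.08 GPkt/s, the value quoted in the text

print(f"exact rate:   {exact_rate / 1e9:.6f} GPkt/s")
print(f"interval:     {interval_ns(rounded_rate)} ns at 0.08 GPkt/s")
print(f"speedup:      {ORIGINAL_INTERVAL_NS / interval_ns(rounded_rate):,.0f}x")
```

Under these assumptions the exact rate is 78.125 MPkt/s, and moving from a 10 ms to a 12.5 ns inference interval is a factor of 800,000, which is consistent with the claim of several orders of magnitude.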

<table>
<thead>
<tr>
<th>App</th>
<th>Model</th>
<th>Perf. (GPkt/s)</th>
<th>Area (mm²)</th>
<th>Power (mW)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Anomaly</td>
<td>SVM</td>
<td>1.00</td>
<td>4.59</td>
<td>263</td>
</tr>
<tr>
<td>Anomaly</td>
<td>DNN</td>
<td>1.00</td>
<td>8.80</td>
<td>506</td>
</tr>
<tr>
<td>Indigo</td>
<td>LSTM</td>
<td>0.08</td>
<td>17.73</td>
<td>1018</td>
</tr>
</tbody>
</table>

Table 1: Performance, area, and power overheads for three different application models. Overheads are calculated relative to a 300 mm² chip with 4 reconfigurable pipelines [6], each drawing an estimated 25 W.
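The percentage overheads quoted in the text can be reproduced from Table 1 under one reading of the caption, sketched below in Python. The reading is an assumption: each row is taken to give the area and power of a single map-reduce block, with one block per pipeline, so area is multiplied by the 4 pipelines and compared with the 300 mm² chip, while power is compared with one pipeline's 25 W. This interpretation is used because it matches the 6.1% and 1.1% figures quoted for the SVM.

```python
CHIP_AREA_MM2 = 300.0         # from the Table 1 caption
NUM_PIPELINES = 4             # from the Table 1 caption
PIPELINE_POWER_MW = 25_000.0  # 25 W per pipeline, from the caption

# (area in mm^2, power in mW) per row of Table 1
TABLE_1 = {
    "Anomaly/SVM": (4.59, 263),
    "Anomaly/DNN": (8.80, 506),
    "Indigo/LSTM": (17.73, 1018),
}


def area_overhead_pct(block_area_mm2):
    """Assumed: one block per pipeline, relative to the whole chip."""
    return 100.0 * NUM_PIPELINES * block_area_mm2 / CHIP_AREA_MM2


def power_overhead_pct(block_power_mw):
    """Assumed: one block's power relative to one pipeline's power."""
    return 100.0 * block_power_mw / PIPELINE_POWER_MW


for name, (area, power) in TABLE_1.items():
    print(f"{name}: area {area_overhead_pct(area):.1f}%, "
          f"power {power_overhead_pct(power):.1f}%")
```

Under this reading, the SVM comes to about 6.1% area and 1.1% power, and the DNN to about 11.7% area and 2.0% power.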

REFERENCES


