Traditionally, large scale-up servers used cache-coherent buses for inter-processor communications. These proprietary buses and servers are very costly and power-hungry. Today’s powerful x86 servers replace proprietary scale-up architectures with low-cost machines connected through high-speed, low-latency clustered interconnects. This article examines the cost and power benefits of these clusters compared to scale-up architectures, and explains how Ethernet can be tunneled through a PCI Express (PCIe) fabric to provide a very-high-performance, low-cost cluster interconnect suitable for storage IO.
High-performance computing clusters, the first step towards stateless computing with storage IO, can be deployed within next-generation data centers at a fraction of the cost and power of proprietary scale-up architectures, and with very high performance. When Ethernet is tunneled through PCIe on x8 links, each processor gets 32 Gbps of throughput with today’s PCIe Gen 2 (5 GT/s) signaling and roughly 64 Gbps with the upcoming Gen 3 (8 GT/s) signaling rate. Storage IO operations per second (IOPS) can be improved dramatically with Ethernet tunneling and newer protocols such as Fibre Channel over Ethernet (FCoE). PCIe, as a highly reliable protocol, provides the integrity and dependability needed for Data Center Ethernet (DCE) server fabrics.
SMP Architectures
Symmetric Multi Processing (SMP) connects CPUs through very high-bandwidth, low-latency, cache-coherent interconnects to create a high-performance machine using non-uniform memory access (NUMA). Certain applications on these machines are highly sensitive to bandwidth and latency. Industry-leading suppliers (e.g., IBM, HP, Sun, Cray/SGI) have created proprietary low-latency interconnects to connect these processors.
Figure 1 shows an eight-way SMP machine, with seemingly redundant cross connections between the two four-way machines limiting the worst-case path between any two processor complexes to three hops.

[Figure 1: Eight-Way SMP Machine]
IO is connected below the processor complexes via bridges from the proprietary interconnect to PCIe.
Alternative architectures can support inter-processor communications by tunneling Ethernet through a PCIe fabric connected to each processor. This can eliminate the proprietary cache-coherent fabric, reducing cost and power consumption. PCIe IO virtualization results in even higher CPU utilization and further savings in processor cycles.
Software Architecture in SMP Machines
The software architecture presents a single coherent memory image above multiple CPUs and cores. The single image boots from DAS, NAS or SAN and brings in the application. The application runs homogeneously and is unaware that it is running on multiple processors. I/O interrupts can be handled by any of the processors, with the load distributed among them to minimize cache-to-cache transfers. This architecture is very easy to program but exacts a penalty in high cost, due to its proprietary nature, and in voracious power consumption.
The software stack on these SMP machines is shown below for different traffic types.

[Figure 2: Software Stack in SMP]
Scale-out Architectures
Scale-out architecture has been built to address some of the issues discussed above. It scales easily at a fraction of the cost of SMP, and it significantly reduces the power, cost and management effort of the machines. These architectures are shown in Figure 3.
SMP machines are proprietary, costly and power-hungry, and are being replaced with scale-out architectures consisting of clusters of x86 machines running Linux in data centers.
These architectures employ very efficient, high-speed networking technologies and RDMA protocols. RDMA provides highly efficient, single-copy application-buffer-to-application-buffer transfers, which especially benefit the long file transfers associated with storage applications. Every machine has FCoE adaptors that send the storage traffic using FC and networking traffic using Ethernet. These converged adaptors send Fibre Channel over Ethernet and standard Ethernet traffic over a single wire from the server. These Ethernet packets can be tunneled through the PCIe fabric for low-latency host-to-host communications within the local cluster, or routed outside the PCIe domain using a standard Ethernet PHY.

[Figure 3: Scale-Out Architectures]
Comparison of SMP Machines with Scale-Out Architectures
As shown earlier, SMP machines are limited by their cost, power consumption and proprietary nature. Their applications are being rewritten so that each application can be divided and run in parallel on multiple CPUs in a cluster. 1U/2U x86 servers are becoming more popular and run very fast compared to proprietary SMP machines. The benefits of clustering depend largely on a high-speed interconnect, and tunneling Ethernet through PCIe allows PCIe to be used as a high-speed clustering interconnect.
The other reason for the popularity of clusters is the big push for Linux and other open-source software within the data centers. This trend will continue and expand as more applications are ported to Linux clusters.
PCI Express: The IO Interconnect of Current, Future Systems
PCIe has dominated the server and PC arena for years, taking over from the PCI standard, which dates back to 1992 and with which PCIe is software-compatible. PCIe packetizes PCI transfers and implements them via point-to-point links using SERDES that today run at 5 GT/s per lane.
In servers, PCIe is used to connect IO devices to root complexes/chipsets that interface to the processors. PCIe switches from vendors such as PLX Technology are used to increase I/O fan-out. Enhanced to support the MR-IOV standard, those switches also can be used to connect multiple processors/root complexes to shared I/O devices to implement I/O virtualization, while retaining essential software compatibility.
Figure 4 shows how PCIe is used to connect multiple RCs/chipsets to their IO devices.

[Figure 4: Typical Enterprise-Class Servers]
The IO devices are typically Ethernet and Fibre Channel, or FCoE, Myrinet, Quadrics or InfiniBand with gateways. These devices are typically dual-ported to connect to two fabrics for redundancy purposes. Adding Ethernet tunneling to the PCIe fabric allows it to be used as a cluster interconnect at the same time.
Tunneling Ethernet through PCI Express
Tunneling higher-level protocols over PCIe is a novel approach; with it, communications are achieved without any disruption to the existing protocol stacks. Newly defined PCIe multicast extensions, along with tunneling Ethernet through PCIe, shave off Ethernet NIC and Ethernet switch latency, while using the same interfaces on the host protocol stack. This approach provides huge benefits in today’s networks, where Ethernet dominates and is used as the network of choice for all types of inter-processor communication and external traffic.
This approach requires Ethernet end-point protocol engines integrated within PCIe switches on server cards.

[Figure 5: Ethernet Tunneling through PCIe & FCoE]
Ethernet is tunneled between servers on the PCIe fabric while shared and virtualized standard FCoE devices provide external connectivity to networking and storage devices.

[Figure 6: Ethernet Tunneling For Host-To-Host Communications and Shared Virtualized FCoE Storage Devices]
Advantages of Ethernet Tunneling through PCI Express
There are many advantages to tunneling Ethernet packets through PCIe, including:
- PCIe has higher bandwidth than Ethernet (32 Gbps on a Gen 2 x8 link today and 64 Gbps with Gen 3, vs. 10 Gbps);
- PCIe has lower latency compared to any other protocol;
- PCIe has the lowest per-port pricing (about one-tenth) compared to any higher-level protocol;
- Ethernet tunneling saves power and cost; and
- Ethernet tunneling uses the same software stack, protecting software investment.
These advantages are compelling.
Unicast Traffic
Unicast traffic is sent from one processor to another and is destined for a specific target processor. It is the simplest type of traffic a processor sends, and it is supported by Ethernet tunneling through PCIe.
Typically, Ethernet traffic is checked at layer 2 before the destination is recognized. This is done through a lookup on {VLAN tag, Dest MAC address, Port} to determine whether the packet belongs to a known destination route. The VLAN tag provides broadcast containment: layer-2 switches use it so that one VLAN’s traffic does not appear on another VLAN. This is why the VLAN tag is used along with the DMAC to find the correct destination.
When {VLAN, DMAC, Port} misses in the layer-2 lookup table, layer 3 is checked at the router to find a route to the packet’s destination. If this lookup also misses, the packet takes the default route. The simplest form of layer-2 switching is used within data centers or in the switches nearest the servers.
These layer-2 switches are simple to manage and provide an easy way to manage the servers and external networks.
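The lookup order described above (layer-2 table, then layer 3, then the default route) can be sketched in a few lines of Python. This is only an illustrative model, not any vendor's implementation: the class name `Forwarder`, the table layouts, and the exact-match layer-3 table (a real router would use longest-prefix matching) are all assumptions made for the example.

```python
from typing import Dict, NamedTuple, Tuple


class L2Key(NamedTuple):
    """Layer-2 lookup key: {VLAN tag, destination MAC, ingress port}."""
    vlan: int
    dmac: str
    port: int


class Forwarder:
    """Hypothetical model of the lookup cascade: layer 2, layer 3, default."""

    def __init__(self, default_route: str) -> None:
        self.l2_table: Dict[L2Key, str] = {}
        # Simplification: exact match on destination IP instead of
        # longest-prefix match.
        self.l3_table: Dict[str, str] = {}
        self.default_route = default_route

    def lookup(self, vlan: int, dmac: str, port: int,
               dest_ip: str) -> Tuple[str, str]:
        """Return (layer that resolved the packet, egress destination)."""
        hit = self.l2_table.get(L2Key(vlan, dmac.lower(), port))
        if hit is not None:
            return ("layer-2", hit)
        hit = self.l3_table.get(dest_ip)
        if hit is not None:
            return ("layer-3", hit)
        return ("default", self.default_route)
```

Because the VLAN tag is part of the key, the same destination MAC on a different VLAN misses the layer-2 table and falls through to layer 3, which is the broadcast-containment behavior the text describes.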
Multicast Traffic
Multicast traffic is controlled through various mechanisms on Ethernet. The simplest is IP multicast, which requires several protocols to manage it.
Typically, IP multicast rides on Ethernet multicast packets. IP multicast packets are sent with the low-order 23 bits of the multicast group address copied into the Ethernet DMAC. Bit 40 in the DMAC is set to indicate that it is a multicast packet.
While multicasts are sent this way, protocols such as PIM, including PIM Dense Mode, are applied on top of this mechanism to make sure that these multicast addresses do not overlap.
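As a rough illustration of the mapping described above, the following Python sketch derives the Ethernet multicast DMAC for an IPv4 group and tests the multicast bit. It assumes the standard IPv4 mapping, in which the low-order 23 bits of the group address are placed behind the fixed 01:00:5E prefix; the function names are chosen for this example only.

```python
import ipaddress


def multicast_ip_to_mac(group: str) -> str:
    """Map an IPv4 multicast group to its Ethernet multicast DMAC."""
    addr = ipaddress.IPv4Address(group)
    if not addr.is_multicast:
        raise ValueError(f"{group} is not an IPv4 multicast address")
    low23 = int(addr) & 0x7FFFFF            # keep the low-order 23 bits
    mac = (0x01005E << 24) | low23          # fixed 01:00:5E prefix
    return ":".join(f"{(mac >> shift) & 0xFF:02x}"
                    for shift in range(40, -8, -8))


def is_multicast_mac(mac: str) -> bool:
    """True when bit 40 of the 48-bit DMAC is set.

    Bit 40 is the least significant bit of the first octet.
    """
    value = int(mac.replace(":", ""), 16)
    return bool((value >> 40) & 1)
```

Because only 23 of the 28 group bits survive the mapping, several IP groups share one DMAC. For example, 224.0.0.1 and 224.128.0.1 both map to 01:00:5e:00:00:01, which is one reason group addresses need to be managed at a higher layer.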
PCIe has added a multicast extension that supports this requirement for tunneling Ethernet protocol through PCIe switches.
Software Stack with Ethernet Tunneling
The following diagram shows the server software stack. If an Ethernet controller is implemented as an IO endpoint in the PCIe switch, it can tunnel Ethernet through PCIe. A layer-2 table within the switch is consulted on a per-packet basis for routing. A multicast table is also used when the Ethernet header calls for multicast. These tables are used to map Ethernet routing information into PCIe addresses for PCIe address routing.

[Figure 7: Host Software Stack]
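The per-packet table lookup described in this section might be modeled as follows. This is a minimal sketch under stated assumptions: the `TunnelSwitch` class, its two dictionaries, and the PCIe addresses used with it are hypothetical and stand in for whatever table format a real switch uses.

```python
from dataclasses import dataclass, field
from typing import Dict, List


def format_mac(raw: bytes) -> str:
    """Render six raw bytes as a colon-separated MAC string."""
    return ":".join(f"{b:02x}" for b in raw)


@dataclass
class TunnelSwitch:
    """Hypothetical tables mapping Ethernet DMACs to PCIe addresses."""
    # Unicast DMAC -> PCIe address of the destination host's window.
    unicast: Dict[str, int] = field(default_factory=dict)
    # Multicast DMAC -> PCIe addresses of every member host.
    multicast: Dict[str, List[int]] = field(default_factory=dict)

    def route(self, frame: bytes) -> List[int]:
        """Return the PCIe addresses an Ethernet frame is tunneled to."""
        dmac = frame[:6]                    # DMAC is the first six bytes
        key = format_mac(dmac)
        if dmac[0] & 0x01:                  # multicast bit in first octet
            return list(self.multicast.get(key, []))
        addr = self.unicast.get(key)
        return [] if addr is None else [addr]
```

An empty result means the tables have no entry for the frame. What a switch does in that case (drop, flood, or hand the frame to an external port) is outside this sketch.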
As shown below, multiple hosts can communicate with each other using the Ethernet tunneling feature of PCIe switches. This feature provides processor-to-processor communication.

[Figure 8: Host-to-Host communication]
HPC with IO Virtualization, FCoE
High-performance computing is a separate market segment where scale-out architecture is replacing traditional SMP architectures. This architecture requires a very high-speed, high-bandwidth, low-latency memory-to-memory copy interconnect between processors. Most of the technologies in this interconnect area are limited to 10 to 20 Gbps. PCIe Gen 2 x8 can provide up to 32 Gbps and Gen 3 will provide up to 64 Gbps. This kind of bandwidth at very low latency from processor to processor is ideal for HPC applications.
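The x8 bandwidth figures quoted above follow from simple arithmetic on the per-lane signaling rate and the line encoding. The sketch below assumes the standard PCIe parameters (5 GT/s with 8b/10b encoding for Gen 2, 8 GT/s with 128b/130b encoding for Gen 3) and computes the raw per-direction data rate before any packet or protocol overhead; the function name is chosen for this example.

```python
from fractions import Fraction

# Per-lane signaling rate in GT/s for each PCIe generation.
RATE_GT_S = {2: 5, 3: 8}
# Fraction of transferred bits that carry data after line encoding.
ENCODING = {2: Fraction(8, 10), 3: Fraction(128, 130)}


def link_gbps(gen: int, lanes: int) -> float:
    """Raw per-direction data rate of a PCIe link in Gbps."""
    return float(RATE_GT_S[gen] * lanes * ENCODING[gen])
```

With these assumptions, `link_gbps(2, 8)` gives 32 Gbps, and `link_gbps(3, 8)` gives roughly 63 Gbps, which is commonly rounded to 64 Gbps.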
The HPC interconnect must support networking and I/O connectivity as well as interprocessor communications. External IO connectivity could be supplied by Ethernet NICs or FCoE converged network adaptors (CNAs). Processor-to-processor communication will go through Ethernet devices embedded in the PCIe switches, where all the processes and IP addresses are known a priori. These devices’ MAC addresses can be programmed into the lookup table of the PCIe switch. Also, ARP broadcast packets can be sent from a host to all the upstream ports of the switch using PCIe multicast. Separate unicast ARP replies from each processor communicate their MAC addresses to the ARP requestor and to the PCIe switches. When each application completes ARP, the PCIe fabric is fully configured for host-to-host communications.
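The ARP-based configuration described above can be illustrated with a small simulation. It is a sketch only: the `ArpLearningSwitch` class and its port numbering are hypothetical, and it assumes standard ARP behavior in which only the owner of the requested IP address sends a unicast reply.

```python
from typing import Dict, List, Tuple


class ArpLearningSwitch:
    """Hypothetical model of a PCIe switch learning MACs from ARP traffic."""

    def __init__(self, hosts: Dict[int, Tuple[str, str]]) -> None:
        # Upstream port number -> (IP address, MAC address) of the host.
        self.hosts = hosts
        # Learned table: MAC address -> upstream port number.
        self.mac_table: Dict[str, int] = {}

    def arp_request(self, src_port: int, target_ip: str) -> List[int]:
        """Flood an ARP request; return the ports that received it."""
        _, src_mac = self.hosts[src_port]
        self.mac_table[src_mac] = src_port          # learn the requestor
        # The broadcast goes to every other upstream port (PCIe multicast).
        flooded = [p for p in self.hosts if p != src_port]
        for port in flooded:
            ip, mac = self.hosts[port]
            if ip == target_ip:
                # The unicast reply passes back through the switch,
                # so the switch learns the responder's MAC as well.
                self.mac_table[mac] = port
        return flooded
```

Once every host has taken part in at least one request or reply, `mac_table` holds an entry per host, which corresponds to the fully configured fabric described in the text.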
IP multicast traffic is mapped onto Ethernet packets, then tunneled through the PCIe fabric to multiple destinations using PCIe multicast.
The NAS devices connected to networking devices will provide network file virtualization. iSCSI traffic can run with software-controlled initiators.
Future Work on IO Virtualization, Ethernet Tunneling
More developments are underway with IO virtualization and Ethernet tunneling through PCIe. Many of the same examples discussed here can be extended to server and embedded systems to realize Gen 2/Gen 3 speed, latency and IO virtualization with Ethernet tunneling through PCIe.
Jack Regula is chief scientist and Shreyas Shah is chief systems architect at PLX Technology, Sunnyvale, CA.

