For some applications, the move to 5 Gbps-per-lane PCI Express (PCIe) Gen 2 may not provide sufficient performance improvement. When an application involves the transmission of information at a high rate to multiple destinations, dualcast and/or multicast can provide a further boost. Multicast has been defined as “the delivery of information to a group of destinations simultaneously using the most efficient strategy to deliver the messages over each link of the network only once, creating copies only when the links to the destinations split.” Dualcast is a subset of multicast limited to two recipients.
The definition of multicast given above implies that the information to be replicated is sent to the switch just once. The switch then clones the dualcast/multicast packets for each destination. Doing this saves bandwidth both in memory at the source, where the information initially resides, and on the PCIe link to the switch. When the information to be replicated dominates the transmission requirements, dualcast can save significant cost or power by allowing a narrower PCIe link to be used, or it can allow a product to scale to next-generation performance requirements without an architectural change.
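The link-width saving can be illustrated with a rough calculation. The sketch below is a simplification, not a sizing tool: it assumes a Gen 2 lane carries about 4 Gbps of usable data after 8b/10b encoding, ignores packet header and flow control overhead, and uses a hypothetical function name and example rates.

```python
# Usable data rate of one Gen 2 lane after 8b/10b encoding (5 GT/s * 8/10).
GEN2_LANE_GBPS = 4.0
STANDARD_WIDTHS = (1, 2, 4, 8, 16)

def min_link_width(payload_gbps, copies, dualcast):
    """Smallest standard link width that carries the source's traffic.

    With replicated unicast the source sends every copy itself, so its link
    carries payload_gbps * copies. With dualcast it sends the payload once
    and the switch creates the second copy.
    Protocol overhead is ignored (an assumption of this sketch).
    """
    required = payload_gbps if dualcast else payload_gbps * copies
    for width in STANDARD_WIDTHS:
        if width * GEN2_LANE_GBPS >= required:
            return width
    raise ValueError("requirement exceeds an x16 Gen 2 link")

# Hypothetical example: 12 Gbps of data that must reach two destinations.
print(min_link_width(12, copies=2, dualcast=False))  # x8 link needed
print(min_link_width(12, copies=2, dualcast=True))   # x4 link suffices
```

In this illustrative case, dualcast halves the bandwidth required on the source's link, which is what allows the narrower width.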
PCIe switch vendors have begun to implement dualcast in their products. PLX Technology’s Gen 2 switches feature its Dual Cast implementation while also supporting non-transparent bridging, making them ideal for redundant storage controllers that implement redundant disk write buffers.
Switch Implementations of Dualcast
A switch must know how to identify a packet to be dualcast or multicast and where to send each copy. In PCIe, it is natural to encode this information in the address and to define a capability structure that allows the addressing and routing information to be configured. PLX’s Gen 2 switches include a number of dualcast Base Address Registers (BARs). For each posted packet that hits in a dualcast BAR, a copy is created and sent to a configurable port with an address translation defined in an associated Translation Register. The original packet continues to its normal, address routed destination. The copy is routed to a different destination by virtue of its translated address. Because of the address translation, legacy devices can be used as targets of the dualcast and indeed can’t distinguish dualcast packets from unicast packets.
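The address-based matching and translation described above can be modeled in a few lines. This is a behavioral sketch only, not PLX's register layout: the class, field names, port numbers, and addresses are all hypothetical, and the normal routing decision is passed in rather than modeled.

```python
from dataclasses import dataclass

@dataclass
class DualcastBar:
    base: int             # start of the address window that triggers dualcast
    size: int             # window size in bytes
    translated_base: int  # base applied to the copy (stands in for the Translation Register)
    copy_port: int        # configurable egress port for the copy

    def contains(self, addr):
        return self.base <= addr < self.base + self.size

def route_posted_write(addr, bars, normal_port):
    """Return the (egress_port, address) pairs produced by one posted write.

    The original packet always goes to its normal, address-routed port.
    If the address hits a dualcast BAR, one translated copy is added.
    """
    outputs = [(normal_port, addr)]
    for bar in bars:
        if bar.contains(addr):
            offset = addr - bar.base
            outputs.append((bar.copy_port, bar.translated_base + offset))
            break  # dualcast: at most one copy
    return outputs

# Hypothetical configuration: writes to 0x8000_0000..0x8FFF_FFFF are also
# copied out port 3 at the same offset within a window at 0xC000_0000.
bars = [DualcastBar(base=0x8000_0000, size=0x1000_0000,
                    translated_base=0xC000_0000, copy_port=3)]
for port, address in route_posted_write(0x8000_1000, bars, normal_port=1):
    print(port, hex(address))
```

Because the copy leaves the switch as an ordinary write to the translated address, the target sees nothing that marks it as a dualcast packet.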
In this dualcast mechanism, little or no change to the system memory map is required to accommodate dualcast. The user simply has to decide which address regions are to be copied to a second range via the dualcast mechanism to land in a different destination endpoint. Furthermore, the entire dualcast mechanism is implemented and controlled via a single capability structure in a single switch and needs no other hardware support elsewhere in the system.
Efficient implementations of dualcast or multicast retain a single copy of the packet in an internal buffer until it has been sent to multiple locations instead of physically replicating the packet prior to transmission. With appropriate buffer architecture, this can be done without blocking traffic addressed to other ports when one of the dualcast target ports is backlogged, unless or until all of the switch’s buffer memory is consumed.
There is a ready fit between the process of transmitting a packet from a single buffer location multiple times (and, ideally, simultaneously) and PCIe flow control. The switch implements a check-off mechanism that notes when a data link layer acknowledgement packet has been received from the link partner of every egress port out of which the packet has been transmitted. Only then does the switch de-allocate the buffer location containing the original packet, potentially allowing a flow control update to be sent to the link partner at its ingress link. At each egress port, PCIe’s retry protocol operates to guarantee error-free transmission of each copy.
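The check-off mechanism amounts to tracking which egress ports still owe an acknowledgement for a shared buffer entry. The following is a minimal software model under that reading; the class and method names are hypothetical, and real hardware would track this per buffer location with far more state.

```python
class SharedPacketBuffer:
    """One buffered packet that is transmitted out several egress ports."""

    def __init__(self, egress_ports):
        # Ports whose link partner has not yet acknowledged the packet.
        self.pending = set(egress_ports)
        self.freed = False

    def ack(self, port):
        """Record a data link layer ACK from the link partner on `port`.

        Returns True exactly once: when the last outstanding ACK arrives and
        the buffer location is de-allocated. At that point the switch could
        send a flow control update on the ingress link.
        """
        self.pending.discard(port)
        if not self.pending and not self.freed:
            self.freed = True
            return True
        return False

buf = SharedPacketBuffer(egress_ports=[1, 3])
print(buf.ack(1))  # one copy acknowledged, buffer still held
print(buf.ack(3))  # last copy acknowledged, buffer released
```

Until the final acknowledgement arrives, the single stored copy remains available for replay on whichever egress port still needs it.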
At the system level, some adjustments need to be made. If information coming in an ingress port is forwarded out multiple egress ports, then the traffic rate at other ingress ports must be reduced below wire speed, in proportion to the multicast traffic and the replication factor of the dualcast/multicast, or congestion will result. Some applications have an inherent rate limit below the threshold of congestion. Others may require explicit source rate limiting mechanisms.
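A simplified model makes the rate constraint concrete. The sketch assumes every link has the same capacity and that each dualcast copy consumes its full rate on its target egress port; the function names and figures are illustrative only.

```python
def egress_headroom(link_capacity, dualcast_rate):
    """Capacity left on one dualcast target port for traffic from other ingress ports."""
    return max(0.0, link_capacity - dualcast_rate)

def total_egress_load(dualcast_rate, replication_factor, other_rates):
    """Aggregate egress load: each replicated stream counts once per copy."""
    return dualcast_rate * replication_factor + sum(other_rates)

# Hypothetical numbers in arbitrary but consistent units (for example GB/s).
# A 3.0 dualcast stream on 4.0-capacity links leaves 1.0 on each target port,
# so other sources sending to those ports must stay at or below that rate.
print(egress_headroom(4.0, 3.0))
print(total_egress_load(3.0, 2, [1.0, 0.5]))
```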
Dualcast Applications
The primary motivation for the use of dualcast is either efficiency or bandwidth conservation. Nevertheless, dualcast applications may be further categorized by whether the intent is simply to accelerate an application or to provide an efficient means of implementing redundancy. Dual-headed graphics is an example of use of dualcast purely for acceleration. Perhaps the best example of the use of dualcast in a redundant system is the storage controller diagrammed in Figure 1.
Redundant Storage Controller
High-availability considerations dictate the implementation of a storage controller as two identical field-replaceable units, or FRUs. Each FRU has an embedded processor, memory used as a disk write buffer, a host interface ASIC and a disk drive controller. The two FRUs are interconnected via PCIe and non-transparent bridging to implement host failover and mirroring of the disk write buffers.

Figure 1 -- Redundant storage controller using PCIe switches with non-transparent bridging and Dual Cast
The external host interface may employ Fibre Channel, Ethernet, or InfiniBand protocol. Whatever the protocol, the designers are under pressure to make it run at wire speed and to upgrade the speed of the wire at each generation. This task is made more difficult by the fact that data integrity requirements force packets arriving at the host interface to be sent upwards into the controller twice, essentially doubling the link bandwidth requirements if dualcast isn’t used.
Packets coming into the host interface ASIC from its external connections contain disk write data. This is first stored in write buffers in the local CPU’s RAM and later—sometimes much later—written to disk via the drive controllers. To protect against data loss, the contents of the write buffers are mirrored in each FRU by using dualcast of the DMA write from the host interface ASIC to memory. The dualcast paths are shown in dashed red and blue lines in Figure 1. Previously, these transmissions were done by replicated unicast. The DMA controller in the host interface ASIC was simply programmed to move the data twice. With dualcast, this becomes a single operation, perhaps allowing a doubling of speed of the external host link with only modest changes to the internal architecture.
Once data has been written into both CPUs’ memories, it remains there until an opportune time to write it to disk. Once so written, the memory image can be freed up. The primary processor can inform the secondary processor that this has been done by writing through the non-transparent bridges to a completion queue. In the event of a failure of the primary processor, the secondary processor can take over seamlessly because it has a copy of all the data and knows precisely how far the primary processor progressed before it failed. The secondary processor can infer that the primary processor has failed if it doesn’t receive some number of heartbeat messages on schedule; these, too, arrive via the PCIe link and non-transparent bridge.
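The heartbeat-based failure inference can be sketched as a simple timeout check on the secondary processor. The class name, the interval, and the missed-beat limit below are assumptions for illustration; the article does not specify how heartbeats are scheduled or counted.

```python
class HeartbeatMonitor:
    """Runs on the secondary; tracks heartbeats written by the primary."""

    def __init__(self, interval, missed_limit, start=0):
        self.interval = interval          # expected time between heartbeats
        self.missed_limit = missed_limit  # consecutive misses tolerated
        self.last_beat = start

    def beat(self, now):
        """Call when a heartbeat arrives through the non-transparent bridge."""
        self.last_beat = now

    def primary_failed(self, now):
        """True once `missed_limit` heartbeat intervals pass with no beat."""
        return now - self.last_beat >= self.interval * self.missed_limit

# Hypothetical timing: one heartbeat every 10 ticks, fail over after 3 misses.
monitor = HeartbeatMonitor(interval=10, missed_limit=3)
monitor.beat(now=100)
print(monitor.primary_failed(now=120))  # two intervals elapsed, still waiting
print(monitor.primary_failed(now=130))  # three intervals elapsed, take over
```

On takeover, the secondary would consult the completion queue to determine which mirrored write buffers have not yet been committed to disk.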
Dual-headed Graphics
Dual-headed graphics for high-end gaming systems illustrates the use of dualcast purely for acceleration. In such systems, two GPUs collaborate to paint a single screen. One is responsible for the top half of the screen and the other for the bottom. Images displayed in these systems are predominantly computer-generated animations. The graphics engines draw wireframes from vector lists, finely divide surfaces into triangles and then render the triangles with solid colors. In such a scenario, the vector list and other commands can be dualcast to both GPUs. After this, each GPU will draw the portion of each vector that falls into its half of the screen. Image frames are created continuously with a tradeoff between time spent rendering and a desire to maximize the smoothness of motion in the images. Each time a new image is computed, the bitmap of half of the image is copied through the switch from one GPU to the other. The use of dualcast increases the rate at which a frame can be created and thus the smoothness of any motion in the image.

Figure 2 -- Dual-headed graphics with a 48-lane PCIe switch
Summary
Storage controllers are prime examples of the use of dualcast to support redundant data transmission requirements. Systems with multiple graphics engines can employ dualcast purely for acceleration. Diverse applications that formerly required replicated unicast packet transmissions now can be accelerated by the use of dualcast over a PCIe system interconnect. Additional opportunities for such acceleration will arise when the implementation of multicast over PCIe is standardized by the PCI-SIG later this year.
Jack Regula is chief scientist at PLX Technology.



