23-Jan-25

Data Center Storage and Lossless Ethernet

HPE Aruba Networking data centers support Data Center Bridging (DCB) protocols that create lossless Ethernet fabrics to support storage area networks, big data analytics, and artificial intelligence (AI) applications.


Storage Over Ethernet Challenges

Traditional IEEE 802.3 Ethernet relies on higher-layer protocols, such as TCP, to provide reliable data delivery. Data transmitted over an Ethernet network can be lost between source and destination hosts, which incurs a performance penalty on applications sensitive to data loss.

Storage performance is particularly sensitive to packet loss. TCP can guarantee data delivery at the transport layer by sequencing data segments and performing retransmission when loss occurs, but the need to perform TCP retransmissions for storage significantly reduces the performance of applications depending on that storage.

Advances in storage technology, such as SSD flash memory and the Non-Volatile Memory express (NVMe) protocol, deliver read/write performance that exceeds the capacity of traditional storage networking protocols, such as FibreChannel. The performance bottleneck in a storage area network (SAN) has moved from the storage media to the network.

Remote Direct Memory Access (RDMA) was developed to provide high-performance storage communication between two networked hosts using the proprietary InfiniBand (IB) storage network. IB guarantees medium access and no packet loss, and requires a special host bus adapter (HBA) for communication. The IB HBA receives and writes data directly to host memory using dedicated hardware, bypassing both traditional protocol decapsulation and the host’s primary CPU. This reduces latency, improves performance, and frees CPU cycles for other application processes.

**RoCE Data Transfer**

Ethernet solutions offer high-speed networking interfaces, making them attractive options for storage communication if the reliability issue can be solved. RDMA over Converged Ethernet (RoCE) is a protocol developed by the InfiniBand Trade Association (IBTA) to extend RDMA reliability and enhanced performance over a low-cost Ethernet network. A converged network adapter (CNA) performs the task of writing received data directly to memory and enables Ethernet as the underlying communication protocol. A lossless data communication path to support RoCE is created by modifying both Ethernet host and switch behavior.

RoCE version 1 (RoCEv1) encapsulates IB Layer 3 addressing and RDMA data directly into an Ethernet frame. Ethernet replaces the IB Layer 1 and 2 functions, and a dedicated EtherType value identifies RDMA as the Ethernet payload.

**RoCEv1 Ethernet Frame**

RoCE version 2 (RoCEv2) replaces IB Layer 3 addressing with IP. It encapsulates IB Layer 4 headers and RDMA data in a UDP payload. This strategy makes RoCEv2 routable over IPv4 and IPv6 networks. RoCEv2 is the most common implementation of RoCE.

**RoCEv2 Packet**

The lossless Ethernet optimizations implemented in CX switches improve data center performance for applications using both RoCE and non-RoCE protocols such as standard iSCSI. In addition to storage communication, RoCE enhances the performance of database operations, big data analytics, and generative AI.

Non-Volatile Memory express (NVMe) is an intra-device data transfer protocol that leverages multi-lane data paths and direct communication to the CPU provided by PCIe to move large amounts of data at a high rate with low latency. NVMe is designed specifically for solid state drives (SSDs) as a replacement for the decades-old Serial Advanced Technology Attachment (SATA) protocol. NVMe over Fabrics (NVMe-oF) extends NVMe to work between networked hosts. NVMe-oF works over multiple protocols, including RoCE.

The primary challenge of running RDMA over Ethernet is overcoming link congestion, the most common cause of dropped frames in a modern Ethernet network. Link congestion occurs when frames arrive at a switch faster than an outgoing port can transmit them. Link congestion has two common causes. First, the receive and transmit ports on a switch operate at different speeds, so the higher-speed port can receive data faster than the lower-speed port can transmit it. Second, a switch receives a large number of frames on multiple interfaces destined to the same outgoing interface. In both cases, the switch can queue surplus frames in a memory buffer until the outgoing port is able to transmit them. When buffer memory becomes full, additional incoming frames are dropped until buffer space is freed. This results in TCP retransmissions and poor application performance.
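A rough calculation illustrates how quickly congestion can exhaust a buffer. The Python sketch below assumes a 100 Gbps ingress port, a 25 Gbps egress port, and a 32 MB shared buffer; all three values are example assumptions, not the specification of any CX model.

```python
# Illustrative only: how quickly a speed mismatch exhausts a shared buffer.
INGRESS_GBPS = 100   # assumed receiving port speed
EGRESS_GBPS = 25     # assumed transmitting port speed
BUFFER_MB = 32       # assumed shared packet buffer available to the egress port

fill_rate_gbps = INGRESS_GBPS - EGRESS_GBPS      # net rate at which the buffer grows
buffer_bits = BUFFER_MB * 8 * 1024 * 1024        # buffer capacity in bits
time_to_fill_ms = buffer_bits / (fill_rate_gbps * 1e9) * 1e3

print(f"Buffer fills in ~{time_to_fill_ms:.2f} ms during a sustained burst")
# ~3.58 ms: after that, frames are dropped until the egress port drains the queue.
```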

Building Reliable Ethernet

A lossless Ethernet fabric can be created by connecting a contiguous set of switches and hosts that employ a set of strategies to prevent frame drops for a particular application.

Three primary Quality of Service (QoS) strategies manage competing demands for buffer memory and switch port bandwidth: dedicated switch buffers for an application, flow-control, and guaranteed media access for an application. Combining these three strategies enables a lossless Ethernet fabric for storage and other applications.

The following table displays the key DCB protocols supported by HPE Aruba Networking CX data center switches.

| Data Center Bridging Component | Description |
|---|---|
| PFC: Priority Based Flow Control | Establishes queues that do not drop packets by preventing buffer exhaustion. |
| ETS: Enhanced Transmission Selection | Defines bandwidth reservations for traffic classes so that lossless and lossy traffic can coexist on the same link. |
| DCBx: Data Center Bridging Exchange Protocol | Exchanges PFC and ETS information between devices on a link using Link Layer Discovery Protocol (LLDP) to simplify configuration. |

In addition to the protocols above, CX switches support IP Explicit Congestion Notification (ECN). IP ECN is a Layer 3 flow-control method that allows any switch in the communication path to notify a traffic receiver of the presence of congestion. After receiving a congestion notification, the receiving host sends a direct, IP-based congestion notification to the traffic source to slow its data transmission rate.

Enhancements in RoCE have produced two different versions. RoCEv1 relies on the base DCB protocols in the table above and is not supported over a routed IP network. RoCEv2 enables IP routing of RoCE traffic, includes IP ECN support, and is the protocol version most often referenced by the term “RoCE.”

Priority Flow Control

Ethernet pause frames introduced link-level flow control (LLFC) to Ethernet networks in the IEEE 802.3x specification. When necessary, a traffic receiver can request a directly connected traffic source to pause transmission for a short period of time, allowing the receiver to process queued frames and avoid buffer exhaustion. The traffic source can resume transmitting frames after the requested pause period expires. The receiver also can inform the source that a pause is no longer needed, so the source can resume transmitting frames before the original pause period expires.

Priority Flow Control (PFC) works in conjunction with quality of service (QoS) queues to enhance Ethernet pause frame function. PFC can pause traffic on a per-application basis by associating applications with a priority value. When PFC pauses traffic associated with an individual priority value, traffic assigned to other priorities is unaffected and can continue to transmit.

On a link, both the CX switch and the attached device must locally assign a priority to application traffic and signal that priority to the peer on the link. Traffic priority can be signaled using either 802.1p Priority Code Point (PCP) values or Differentiated Services Code Point (DSCP) values.

PCP Priority Marking

The IEEE 802.1Qbb standard uses 802.1p PCP values in an 802.1Q header to assign application traffic priority. The three-bit PCP field allows for eight Class of Service (CoS) priority values (0-7). PCP-based PFC requires the use of trunk links with VLAN tags to add an 802.1Q header to a frame.

The diagram below illustrates the PCP bits used to specify 802.1p CoS priorities in the 802.1Q header of an Ethernet frame.

**802.1Q Header Diagram in Ethernet Frame**

By default, there is a one-to-one mapping of CoS priorities to local priorities on the switch used for frame queueing.
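As an illustration of the relationship between the PCP bits and local priorities, the following Python sketch parses a 16-bit 802.1Q Tag Control Information (TCI) field and applies the default one-to-one CoS-to-local-priority mapping. The helper name and the example PCP/VLAN values are illustrative, not part of any switch API.

```python
# A minimal sketch of the 3-bit PCP value inside the 16-bit 802.1Q TCI field.
def parse_tci(tci: int):
    """Split a 16-bit TCI into PCP, DEI, and VLAN ID."""
    pcp = (tci >> 13) & 0x7      # top 3 bits: 802.1p Class of Service (0-7)
    dei = (tci >> 12) & 0x1      # drop eligible indicator
    vlan_id = tci & 0xFFF        # bottom 12 bits: VLAN ID
    return pcp, dei, vlan_id

# Example: storage traffic tagged with PCP 4 on VLAN 100 (values chosen for illustration)
tci = (4 << 13) | 100
pcp, dei, vlan_id = parse_tci(tci)

# Default behavior described above: CoS priority maps one-to-one to the local priority
local_priority = pcp
print(pcp, vlan_id, local_priority)   # 4 100 4
```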

DSCP Priority Marking

Lossless behavior between two data center hosts requires that both hosts and all switches in the data path have a consistent PFC configuration. If a routed-only interface is in the data path, application priority can be preserved by specifying a priority using the DSCP bits in the IP header. DSCP bits also can be used to mark application traffic priority on both 802.1Q tagged and untagged switch access ports.

The diagram below illustrates the DSCP bits located in the legacy Type-of-Service (ToS) field of the IP header.

**DSCP bits in ToS Field of IP header**

The six-bit DSCP field allows for 64 DiffServ priority values. By default, DiffServ values are mapped in sequential groups of eight to each of the eight individual local-priority values.
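The default grouping described above reduces to a simple integer division. The Python sketch below illustrates that default mapping only; the helper function and example DSCP values are not part of any switch API.

```python
# A minimal sketch of the default DSCP-to-local-priority mapping:
# 64 DiffServ values grouped sequentially, eight per local priority.
def default_local_priority(dscp: int) -> int:
    """Map a 6-bit DSCP value (0-63) to one of eight local priorities (0-7)."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be 0-63")
    return dscp // 8

print(default_local_priority(26))   # AF31 (26) -> local priority 3
print(default_local_priority(46))   # EF (46)   -> local priority 5
```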

CX switches support a mix of CoS and DSCP priority values by allowing each interface to specify which QoS marking method is trusted. When a mix of strategies is present on different switch ports, traffic may require re-marking between Layer 2 CoS priority values and Layer 3 DSCP values.

Responding to the growth of routed spine-and-leaf data center architectures and VXLAN overlays, an increasing number of hosts and storage devices support DSCP-based priority marking. This enables consistent QoS markings across a routed domain without the need to translate between Layer 2 CoS values and Layer 3 DSCP values on network switches.

In addition to CoS and DSCP values, CX switches can apply a classifier policy to ingress traffic to assign priorities (PCP, DSCP, and local) based on header field values in the packet.

When a frame is encapsulated for VXLAN transport, the QoS DSCP priority of the encapsulated traffic is honored in the outer VXLAN packet’s IP header to ensure proper queueing.
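The following Python sketch illustrates that behavior in simplified form: the inner packet's DSCP is carried into the outer VXLAN packet's IP header so transit switches queue the tunneled traffic with the original priority. ECN handling and all other encapsulation details are omitted, and the field names are illustrative only.

```python
def encapsulate_vxlan(inner_tos: int, vni: int) -> dict:
    """Copy the inner DSCP (upper 6 bits of the ToS byte) into the outer IP header."""
    inner_dscp = inner_tos >> 2
    outer_tos = inner_dscp << 2          # same DSCP; ECN bits left clear for simplicity
    return {"outer_tos": outer_tos, "vni": vni, "inner_tos": inner_tos}

# Example: an inner packet marked DSCP 26 keeps that priority in the outer header
print(encapsulate_vxlan(inner_tos=26 << 2, vni=10100))
```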

PFC Operations

CX data center switches support a special shared QoS buffer pool dedicated to lossless traffic. The CX 8325, 10000, and 9300 models support up to three lossless pools. Typically, only one lossless queue is defined for storage traffic. Each lossless pool is assigned a size, headroom capacity, and associated local priority values. The buffers assigned to a lossless pool are allocated from the total available buffer memory on the device, which is assigned to a single lossy pool by default. The CX 8100 and 8360 support a single, fixed lossless pool for smaller data centers.

Received frames are assigned a local priority value based on a mapping of PCP and DSCP values to local priority values. A frame is placed into the special lossless buffer pool when its local priority value is associated with a lossless queue. When a port’s allocation of the shared lossless buffer pool nears exhaustion, packet drops are avoided by notifying the directly-connected sender to stop transmitting frames with the queue’s associated priority value for a short period of time. The headroom pool stores packets that arrive at the interface after a pause in transmission was requested for the associated priority.
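The following Python sketch models that pause decision in a simplified way. The threshold and headroom values are arbitrary, and real switches use platform-specific buffer accounting (xoff/xon thresholds and headroom sizing); the sketch only shows the per-priority logic described above.

```python
# A simplified, illustrative model of the per-priority PFC pause decision.
class LosslessQueue:
    def __init__(self, priority: int, shared_limit: int, headroom: int):
        self.priority = priority
        self.shared_limit = shared_limit   # shared lossless pool allocation (bytes)
        self.headroom = headroom           # absorbs in-flight frames after pause (bytes)
        self.depth = 0
        self.paused = False

    def enqueue(self, frame_bytes: int) -> str:
        self.depth += frame_bytes
        if not self.paused and self.depth >= self.shared_limit:
            self.paused = True
            # Pause only this priority; other priorities keep transmitting.
            return f"send PFC pause for priority {self.priority}"
        if self.depth > self.shared_limit + self.headroom:
            return "drop"   # headroom exhausted; should not occur when sized correctly
        return "queued"

q = LosslessQueue(priority=4, shared_limit=200_000, headroom=150_000)
for size in (150_000, 100_000, 40_000):
    print(q.enqueue(size))
# queued
# send PFC pause for priority 4
# queued  (arrives after the pause request; absorbed by headroom)
```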

PFC support is included on the CX 8325, 9300, 10000, 8360, and 8100. However, traffic arriving on a CX 10000 with a QoS priority associated with a lossless queue will not be sent to the AMD Pensando Distributed Processing Unit (DPU) for policy enforcement or enhanced monitoring.

The diagram below illustrates the queuing relationship between a sender and a CX switch receiver with two queues defined using CoS priority values. All priorities are mapped to the default lossy queue or to a single lossless queue. Using two queues on the CX platform provides the best queue depth and burst absorption.

**802.1p Priority Based PFC**

A PFC pause notification briefly stops transmissions related to a single application by its association with a priority queue number.

**802.1p Priority Based PFC**

Storage is the most common application for lossless Ethernet. Applying the diagram above to a storage scenario, all storage traffic is assigned a PCP value of 4, which is mapped to local-priority 4. When storage traffic is received on the CX switch, it is placed in the lossless QoS queue dedicated for storage. Traffic assigned to the lossy queue does not impact buffer availability for the lossless storage traffic. When the lossless storage queue on the CX switch reaches a threshold nearing exhaustion, a pause frame is sent to inform the sender to pause only storage traffic. All other traffic from the sender continues to be forwarded and is placed in the shared lossy queue on the CX switch, if buffers are available.

PFC is the preferred flow-control strategy, but it requires data center hosts to support marking traffic priority appropriately. PFC is built into specialized HBAs and is required for RoCE compliance.

LLFC can enable lossless Ethernet when implemented in combination with other QoS components for prioritization, queueing, and transmission. Many virtual and physical storage appliances do not support PFC or other DCB protocols, but LLFC is widely supported on most standard Ethernet network interface cards (NICs). Implementing LLFC extends the benefits of lossless data transport to hosts that do not support PFC and for non-RoCE protocols.

All traffic received on a switch port using LLFC is treated as lossless. It is recommended to minimize sending lossy traffic from a host connected to a link using LLFC.

When a CX switch sends an LLFC pause frame to an attached device, it pauses all traffic from that source instead of from a single targeted application. The pause in transmission gives the switch time to transmit frames in its lossless queues and prevents frame drops.

Application traffic priority is typically not signaled from a source limited to link-level flow control. In place of the source marking traffic priority, a classifier policy is implemented on the CX ingress port to identify application traffic that should be placed in a lossless queue by matching defined TCP/IP characteristics. When an interface also trusts DSCP or CoS priority values, the trusted QoS markings are honored and take precedence over a custom policy prioritization.
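As a conceptual illustration, the Python sketch below classifies traffic from an LLFC-only host by TCP/IP characteristics and assigns it a local priority. The iSCSI port match and the priority values are example choices, not a specific CX classifier configuration.

```python
# Illustrative classification of ingress traffic when the host cannot mark priority.
LOSSLESS_PRIORITY = 4   # local priority mapped to the lossless queue (example value)
DEFAULT_PRIORITY = 0    # everything else stays in the lossy queue

def classify(protocol: str, dst_port: int) -> int:
    """Assign a local priority based on simple TCP/IP characteristics."""
    if protocol == "tcp" and dst_port == 3260:   # iSCSI target port
        return LOSSLESS_PRIORITY
    return DEFAULT_PRIORITY

print(classify("tcp", 3260))   # 4 -> lossless queue
print(classify("tcp", 443))    # 0 -> lossy queue
```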

Enhanced Transmission Selection (ETS)

ETS allocates a portion of the available transmission time on a link to an application using its association with a priority queue number. This helps to ensure buffer availability by guaranteeing that the application traffic has sufficient bandwidth to transmit queued frames. This behavior reduces the probability of congestion and dropped frames.

Allocation of bandwidth is divided among traffic classes. ETS is implemented on CX switches using QoS scheduling profiles, where locally defined queues are treated as a traffic class. Traffic is associated with a queue by associating it with a local priority value. Traffic can be mapped to local priorities based on DSCP priorities, CoS priorities, or TCP/IP characteristics using a classifier policy.

CX 8325, 10000, and 9300 switches apply a deficit weighted round robin (DWRR) strategy to calculate a queue’s bandwidth allocation by applying a weight to each queue in a scheduling profile. The following example shows the resulting percentage of bandwidth associated with a queue for the collective set of weights.

| Queue Number | Weight | Guaranteed Bandwidth |
|---|---|---|
| Queue 0 (Lossy) | 8 | 40% |
| Queue 1 (Lossless) | 10 | 50% |
| Queue 2 (Lossless) | 2 | 10% |

In the example above, storage traffic can be assigned to queue 1, which guarantees storage traffic the ability to consume up to 50% of the link’s bandwidth. When a class of traffic is not consuming its full allocation, other classes are permitted to use it. This enables the link to operate at full capacity, while also providing a guaranteed allocation to each traffic class. When a link is saturated, each class can consume only the bandwidth allocated to it based on the assigned weights.
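The guaranteed shares in the table follow directly from the weights: each queue's share is its weight divided by the sum of all weights in the scheduling profile, as the short Python check below shows. The queue labels are illustrative.

```python
# Reproduce the guaranteed bandwidth percentages from the DWRR weights above.
weights = {"queue0_lossy": 8, "queue1_lossless": 10, "queue2_lossless": 2}
total = sum(weights.values())

for queue, weight in weights.items():
    share = weight / total * 100
    print(f"{queue}: {share:.0f}% guaranteed bandwidth")
# queue0_lossy: 40%, queue1_lossless: 50%, queue2_lossless: 10%
```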

Multiple scheduling profiles can be defined, but an individual port is assigned a single profile that governs its transmission schedule.

The following diagram illustrates traffic arriving on a switch, being placed in a queue, and the reserved bandwidth per queue of the outgoing port. Scheduling enforcement occurs when the outgoing port is saturated and the ingress rate for each traffic class meets or exceeds the reserved bandwidth configured on the outgoing port.

**ETS Bandwidth Reservation**

When the outgoing port is not oversubscribed, the actual transmission rate of each queue can differ from its configured allocation. The unused bandwidth allocation of one class may be consumed by another class. For example, if the port is transmitting at 75% of its capacity, where 60% is from queue 0, 20% is from queue 1, and 5% is from queue 2, the switch does not need to enforce the scheduling algorithm. The lossy traffic in queue 0 is allowed to consume the unused capacity assigned to other traffic classes and transmit at a higher rate than the schedule specifies.

Data Center Bridging Exchange (DCBx)

DCBx-capable hosts dynamically adopt the PFC and ETS values advertised by CX switches. This ensures a consistent configuration between data center hosts and attached switches. DCBx also informs compute and storage hosts of application-to-priority mappings, which ensures that traffic requiring lossless queuing is marked appropriately. Lossless Ethernet configuration on connected hosts becomes a plug-and-play operation by removing the administrative burden of configuring PFC, ETS, and application priority mapping on individual hosts.

DCBx is a link-level communication protocol that employs Link Layer Discovery Protocol (LLDP) to share settings. PFC, ETS, and application priority settings are advertised from the switch using specific LLDP Type-Length-Value (TLV) data records. CX switches set the willing bit to 0 in all TLVs to indicate that they are not willing to change their configuration to match a peer's configuration. CX switches support both IEEE and Converged Enhanced Ethernet (CEE) DCBx versions.

IP Explicit Congestion Notification (ECN)

IP ECN is a flow-control mechanism that reduces traffic transmission rates between hosts when a network switch or router in the path signals that congestion is present. IP ECN can be used between hosts separated by multiple network devices and on different routed segments. It is required for RoCEv2 compliance.

Hosts that support IP ECN set one of two reserved Type of Service (ToS) bits in the IP header to a value of 1. When a switch or router in the communication path experiences congestion, it sets the remaining zero ECN bit to 1, which informs the traffic receiver that congestion is present in the communication path.

**ECN bits in ToS Field of IP header**
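The two ECN bits occupy the low-order bits of the ToS byte, below the six DSCP bits. The Python sketch below shows the RFC 3168 codepoints and how a congested device re-marks an ECN-capable packet to Congestion Experienced (CE); the DSCP value used is just an example.

```python
# A minimal sketch of the two ECN bits in the IP ToS / Traffic Class byte.
NOT_ECT, ECT_1, ECT_0, CE = 0b00, 0b01, 0b10, 0b11   # RFC 3168 ECN codepoints

def build_tos(dscp: int, ecn: int) -> int:
    """DSCP occupies the upper 6 bits, ECN the lower 2 bits."""
    return (dscp << 2) | ecn

def mark_ce(tos: int) -> int:
    """What a congested device does: set both ECN bits (CE) without touching DSCP."""
    return tos | CE

tos = build_tos(dscp=26, ecn=ECT_0)     # host signals ECN capability
print(bin(tos), bin(mark_ce(tos)))      # 0b1101010 -> 0b1101011
```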

When the traffic receiver is notified of congestion, it signals this to the source by sending an IP unicast message. The source responds by reducing its data transmission rate for a brief period of time.

**RoCEv2 IP ECN Process**

IP ECN smooths traffic flows under most conditions, reducing the need for PFC to trigger a full pause, except as a fast acting mechanism to address microbursts.

IP ECN also can be implemented to improve the performance of non-RoCE protocols, such as iSCSI.

Data Center Networking for AI

Introduction

Artificial Intelligence (AI) has revolutionized industries, driving exponential growth in applications. However, AI workloads demand high-performance computing, low-latency networks, and scalable storage. To address these requirements, an AI-optimized data center network must be designed and built on a comprehensive framework that supports AI workloads.

Traditional data center technologies struggle to meet AI workload demands, necessitating cutting-edge solutions in compute, storage, and networking. Specialized AI data centers require tailored back-end training and front-end inference fabric designs. While Graphics Processing Units (GPUs) and InfiniBand networking have emerged as key technologies, the single-sourced, proprietary nature of InfiniBand drives up costs. In response, enterprises are embracing Ethernet as a cost-effective, open networking alternative for AI data centers, optimizing GPU performance while minimizing costs.

To accelerate AI adoption, data center networks play a vital role in optimizing GPU interconnectivity and performance. Reducing Job Completion Time (JCT) is critical for faster training and cost savings. Furthermore, rapid market response to demand is essential for successful AI deployment. In response, the industry is shifting toward an open, competitive market fueled by GPU diversity and Ethernet, the most widely deployed Layer 2 technology. This transition promises to alleviate reliance on single-vendor solutions, promoting flexibility, scalability, and cost-effectiveness.

The Data Center Network for GenAI

Advances in GenAI have made AI and machine learning (ML) a new part of the corporate business toolset. Data centers are the engines behind AI, and data center networks play a critical role in interconnecting and maximizing the utilization of expensive GPU servers.

GenAI training, measured by job completion time (JCT), is a massive parallel processing problem. A fast and reliable network fabric is needed to get the most out of the GPUs. The right network and design are key to optimizing ROI and maximizing savings on AI applications.

A typical AI-Optimized DC Network for Generative AI workload consists of:

  • Compute Nodes: High-performance servers with AI-optimized processors (e.g., GPUs, TPUs)
  • Storage: High-capacity storage with low-latency access (e.g., NVMe, SSD)
  • Networking: High-density, low-latency switches and AI-optimized network protocols
  • Software: AI frameworks (e.g., TensorFlow, PyTorch) and network management tools

Generative AI Best Practice Architecture

The architecture for AI best practice includes frontend, backend, and storage fabrics. These fabrics have a symbiotic relationship, each providing unique functions for the training and inference tasks in this architecture.

**GPU arch**

Front-End Network

The frontend network for GenAI plays a critical role in ensuring high-performance, low-latency connectivity for AI and machine learning (ML) workloads. Design considerations include utilizing high-speed Ethernet switches, such as 100GbE or 400GbE, to interconnect AI servers and storage. Additionally, implementing EVPN-VXLAN enables efficient traffic management and scalability. HPE Aruba Networking CX series switches, combined with HPE Central and AFC management tools, provide a robust and automated solution. By adopting these best practices, organizations can build a high-performance frontend network optimized for AI workloads.

Storage Network

A high-performance storage fabric is crucial for AI and machine learning (ML) workloads, requiring low-latency and high-bandwidth connectivity. The HPE storage fabric solution utilizes CX series switches, enabling 100GbE or 200GbE connectivity to the storage arrays. Protocols such as RoCEv2 and NVMe-oF run over a converged Ethernet fabric to provide lossless transport for storage traffic.

GPU Clusters

GPU clusters, also referred to as the GPU fabric, provide the massive parallel computing power needed to process large datasets and complex neural networks rapidly, accelerating training time and enabling fast, efficient inference on new data. GPUs are designed with thousands of cores that can perform calculations simultaneously, ideal for the parallel processing required to train large GenAI models. With the parallel processing capabilities of GPUs, the time needed to train complex GenAI Large Language Models (LLMs) is significantly reduced. GPU clusters can be scaled up by adding more GPU nodes to handle larger and more complex datasets as needed.

Back-end Network

Backend networks are specialized networks connecting GPU clusters for distributed Large Language Model (LLM) training, enabling high-bandwidth data transfer and efficient parallel computation. This requires a high-performance Rail-Optimized or Rail-Only network architecture featuring low latency, a robust design, no link oversubscription between workload and fabric, and a lossless Ethernet data center fabric.

HPE CX series switches (100/200/400GbE) address high-bandwidth and low-latency demands. AI-optimized network protocols, such as Global Load Balancing (GLB) for end-to-end load balancing, efficiently manage the elephant flows inherent in AI workloads. Network automation streamlines management and configuration, adapting to dynamic AI workloads. Additionally, robust security measures safeguard AI systems and data, protecting against vulnerabilities and threats.

Overview of GPU Server and Interconnects

GPU servers utilize internal high-bandwidth PCIe switches for efficient interconnectivity, facilitating communication between key components: CPU to GPU, GPU to NIC (Network Interface Card), and NIC to NIC bidirectional communication. As illustrated in the GPU Server architecture schematic, these servers employ very high-bandwidth NIC adapters (100/200/400/800G) to network and scale a large group of GPUs. This architecture is specifically designed to support the demanding requirements of training Large Language Models (LLMs).

**GPU arch**

The NVIDIA NVSwitch is a high-speed switch chip (>900GB/s) connecting multiple GPUs through NVLink interfaces. It enhances intra-server communication and bandwidth while reducing latency for compute-intensive workloads.

Backend GPU Fabric for GenAI Training

The Backend Data Center fabric consists of multiple GPUs connected to form a cluster, enabling distributed training and model parallelism. Distributed training splits the training process across multiple GPUs, while model parallelism divides the model into smaller parts processed by different GPUs. Communication protocols like NCCL (NVIDIA Collective Communication Library) and MPI (Message Passing Interface) facilitate efficient communication between GPUs.

LLM training involves massive datasets and complex models, requiring thousands of GPU server nodes to process efficiently. High-performance networking (100/200/400GbE) ensures scalable and fast data transfer. Real-time communication between nodes is crucial for parallel processing. Low latency (<100 µs) enables faster iteration and convergence. LLM training is computationally intensive and time-consuming. Redundancy and failover mechanisms guarantee uninterrupted processing. Quality of Service ensures critical tasks receive sufficient bandwidth, optimizing overall performance.

Packet loss and congestion significantly impair Large Language Model (LLM) training, reducing throughput, increasing latency, and affecting model accuracy. This results in high job completion time (JCT). Reliable networks must minimize packet loss and congestion to ensure timely and accurate results.

The RoCEv2 protocol was designed to provide a low-cost Ethernet alternative to InfiniBand networks. It leverages RDMA (Remote Direct Memory Access) technology to bypass CPU overhead, reducing latency and increasing throughput. The protocol increases data delivery performance by enabling direct access to remote memory, as shown in the figure. RoCEv2 uses the PFC and ECN protocols to implement lossless behavior in the data center.

**rail-optimized arch**

The Backend GPU Fabric also enables Global Load Balancing techniques on the network links to deploy a congestion-free environment, reduce latency, and support the low-entropy elephant flows that are typical of GPU message exchange communications. These flows need better load balancing than the typical ECMP used in enterprise DC networks. Solutions that achieve this in the switch hardware chipset perform at the highest efficiency, while software-driven solutions can be equally good for small and medium LLM training.

When designing backend fabrics, the primary objective is to achieve a lossless architecture that maximizes throughput and minimizes latency and network interference for AI traffic flows. To accomplish this, several design architectures are available, including 3-stage CLOS, NVIDIA's Rail-Optimized, and Rail-Only architectures. These architectures leverage leaf switches, spine switches, and super-spine switches. Notably, Rail-Only simplifies the design by utilizing solely leaf switches to build the backend fabric.

Rail-Optimized Backend GPU Fabric

This architecture consists of leaf-spine network switches, as shown in the figure. The GPU servers are connected to the leaf/spine following the rail-optimized technique. Rail-optimized architecture is a design methodology that organizes data center infrastructure into logical rails or paths, ensuring efficient data flow. The figure below shows a group of eight GPU servers with 8 GPUs each. A full rail stripe populates as many GPUs as the leaf has ports, creating optimized paths for GPU-to-GPU communication.

The first GPU from each (NVIDIA DGX H100) GPU server connects to the Leaf1 switch, all second GPUs connect to Leaf2, and so on. This helps optimize RDMA message flow between GPU servers by using just the leaf switch when the communication stays within the group of 8xGPU servers. The spine is used only when a message needs to cross over to the next logical rail of GPU servers. The following figure shows fully populated GPU servers forming two logical rail stripes.

**rail-optimized arch**

Here, both leaf and spine switches have 64x400GbE ports. With 1:1 subscription, 32x400GbE ports connect to the spines and the other 32x400GbE ports connect to 32 GPU servers. Each spine uses 32x400GbE ports to connect to each rail stripe.

**rail-optimized arch**
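The port arithmetic behind this example can be summarized in a short Python sketch. The switch port count, 1:1 subscription, and eight GPUs (one NIC each) per server are taken from the example above; the variable names are illustrative.

```python
# Back-of-the-envelope port math for the rail-optimized example above.
LEAF_PORTS = 64              # 64x400GbE leaf switch
GPUS_PER_SERVER = 8          # one leaf (rail) per GPU position

downlinks = LEAF_PORTS // 2              # 1:1 subscription: half down, half up
uplinks = LEAF_PORTS - downlinks
servers_per_stripe = downlinks           # each server uses one port per rail leaf
leaves_per_stripe = GPUS_PER_SERVER      # one leaf per rail
gpus_per_stripe = servers_per_stripe * GPUS_PER_SERVER

print(f"{downlinks} downlinks / {uplinks} uplinks per leaf")
print(f"{leaves_per_stripe} leaves, {servers_per_stripe} GPU servers, "
      f"{gpus_per_stripe} GPUs per rail stripe")
# 32 downlinks / 32 uplinks per leaf
# 8 leaves, 32 GPU servers, 256 GPUs per rail stripe
```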

The fabric implements lossless Ethernet transport and uses the RoCEv2, PFC, and ECN protocols to make sure no packets are dropped. It enables Global Load Balancing techniques on the network links to deploy a congestion-free environment, reduce latency, and support the elephant flows that are typical of GPU communications. These low-entropy elephant flows need better load balancing than the typical ECMP used in enterprise DC networks.

Rail-Only Architecture

The Rail-Only architecture differs significantly from Rail-Optimized, eliminating spine nodes and utilizing solely leaf nodes to connect GPU servers. As illustrated, this design employs eight 64x400GbE switches, supporting up to 64 GPU servers. To scale GPU capacity, higher-port-density switches are required, limiting this architecture's scalability for large GPU clusters. However, Rail-Only suits smaller enterprises whose LLM training workloads fit comfortably within this framework.

The fabric likewise implements lossless Ethernet transport using the RoCEv2, PFC, and ECN protocols, and it enables Global Load Balancing techniques on the network links to deploy a congestion-free environment, reduce latency, and support the low-entropy elephant flows typical of GPU communications, which need better load balancing than the typical ECMP used in enterprise DC networks.

**rail-only arch**

The Rail-Only network architecture offers significant cost savings, primarily due to the elimination of spine switches. Compared to Rail-Optimized, this design reduces capital expenditures (CapEx) by approximately 50-75%.

Validation Criteria

  • Throughput: 100GbE minimum throughput per port
  • Latency: <10 µs latency for real-time AI applications
  • Scalability: Support for 1000+ AI nodes, as applicable
  • Security: Compliance with industry-standard security protocols

Conclusion

Among various designs being explored for backend training networks, Rail-Optimized and Rail-Only architectures stand out as cost-effective solutions offering balanced performance. For GenAI training backend data center fabric, hardware-based solutions provide optimal performance but come at a high cost. Software-driven approaches offer a more effective balance between features and deployment costs.

Small and Medium Enterprises (SMEs) seeking compact deployment footprints can leverage Rail-only architecture for efficient training within reasonable timelines. This approach reduces capital expenditures (CapEx) by approximately 50% compared to Rail-Optimized architecture, primarily by eliminating spine switch costs.

HPE Aruba Networking offers a comprehensive portfolio, featuring 100/200/400GbE CX series switches, powered by the AOS-CX network operating system. This combination provides a robust, lossless solution, supporting essential protocols and features for efficient data center connectivity.

Unlock the full potential of your AI initiatives with our expert-designed networking architecture, optimized for high-performance, scalability and security. Maximize throughput and minimize latency, scale effortlessly and protect your data with robust security measures. This transformative solution empowers businesses with faster insights from AI applications, enhanced collaboration and productivity, and a future-proof infrastructure, giving you the competitive edge you need to succeed.

Storage Positioning

Storage in a data center is typically deployed as a SAN, part of hyper-converged infrastructure (HCI), or as disaggregated HCI (dHCI).

SANs comprise one or more dedicated storage appliances that are connected to servers over a network. A proprietary network using storage based protocols, such as FibreChannel, can be used to connect servers to storage. However, IP-based solutions over an Ethernet network provide a high-bandwidth, low-cost option, with accelerating adoption levels. Common IP-based SAN protocols include iSCSI and RoCE.

HCI converges the storage and compute capabilities of off-the-shelf x86 infrastructure, providing a cloud-like resource management experience. Each x86 host in the HCI environment provides both distributed storage and compute services. The local storage on an HCI cluster member can be used by any other member of the cluster. This provides a simple scaling model, where adding an x86 node adds both storage and compute to the cluster.

The HPE SimpliVity dHCI solution divides compute and storage resources into separate physical host buckets to allow flexible scaling of one resource at a time. In the traditional HCI model, both storage and compute must be increased together when adding an x86 node. This can be costly if only an increase in one resource is required. For example, if additional compute is required and storage is already adequately provisioned in an HCI solution, significant additional storage is still added to the cluster regardless of the need. dHCI supports scaling compute and storage individually, while using x86 hardware for both compute and storage services.

All the above storage models improve performance when using lossless Ethernet.

Parallel Storage Network

Traditionally, a storage network is deployed in parallel with a data network using proprietary network hardware and protocols to support the reliability needs of storage protocols such as FibreChannel and InfiniBand. TCP/IP-based storage models enabled the migration to lower-cost Ethernet-based network infrastructure, and using a parallel set of storage Ethernet switches became a common method of avoiding competition between storage and data hosts for network bandwidth.

When implementing a dedicated storage network over Ethernet, congestion resulting in dropped frames can still occur, so deploying the full suite of Layer 2 DCB protocols (DCBx, PFC, and ETS) is recommended to maximize storage networking performance.

The diagram below illustrates a dedicated Ethernet storage network deployed in parallel to a data network. Lossless Ethernet protocols are recommended even when using a dedicated storage network.

**Parallel Storage Network**

Converged Data/Storage Network

High speed top-of-rack (ToR) switches with high port density facilitate the convergence of storage and data networks onto the same physical Ethernet switch infrastructure. An organization can maximize its budgetary resources by investing in a single network capable of handling data and storage needs.

A converged storage and data network requires queueing and transmission prioritization to ensure that network resources are allocated appropriately for high-speed storage performance. IP ECN provides additional flow-control options to smooth traffic flow and improve performance. DCBx is beneficial to automate PFC and ETS host configuration.

The diagram below illustrates protocols and positioning to achieve lossless Ethernet in a two-tier data center model.

**Parallel Storage Network**

A spine and leaf network architecture allows linear scaling to reduce oversubscription and competition for network resources. This is achieved by adding spine switches to increase east-west network capacity. Spine and leaf networks use Layer 3 protocols between data center racks, which requires mapping 802.1p priorities to DSCP values to ensure consistent QoS prioritization of traffic across the network infrastructure.
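Where 802.1p priorities must be preserved across routed hops, a common convention maps each CoS value to the corresponding class-selector DSCP, as sketched below. The mapping shown is illustrative and should match the QoS policy actually deployed on the fabric.

```python
# One common convention for carrying an 802.1p CoS value across a routed hop:
# map CoS n to the class-selector DSCP (n * 8), and map back on the far side.
def cos_to_dscp(cos: int) -> int:
    return cos * 8            # CS0-CS7 class-selector code points

def dscp_to_cos(dscp: int) -> int:
    return dscp // 8

print(cos_to_dscp(4), dscp_to_cos(32))   # storage at CoS 4 <-> DSCP 32 (CS4)
```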

iSCSI

iSCSI is one of the most prevalent general purpose SAN solutions. Standard iSCSI is TCP-based and supports routed IP connectivity, but initiators and targets are typically deployed on the same Layer 2 network. Lossless Ethernet is not a requirement for iSCSI, but it can improve overall performance. Many iSCSI storage arrays using 10 Gbps or faster network cards support PFC and ETS.

When PFC is not supported, LLFC can be used to achieve a lossless Ethernet fabric. Separate switching infrastructure can be implemented to avoid competition between storage and compute traffic, but lossless Ethernet enables the deployment of a single converged network to reduce both capital and operating expenditures.

The following diagram illustrates the components of a converged data and iSCSI storage network.

**iSCSI L2 Lossless Topology**

High Availability

Applications using lossless Ethernet are typically critical to an organization’s operations. To maintain application availability and business continuity, redundant links from Top-of-Rack (ToR) switches provide attached hosts continued connectivity in case of a link failure. Use a data center network design that provides redundant communication paths and sufficient bandwidth for the application. The Data Center Connectivity Design guide details network design options.

CX Switch Lossless Ethernet Support

The following illustration summarizes HPE Aruba Networking CX data center switch support for lossless Ethernet and storage protocols, and the feature requirements for common storage protocols.

**CX Switch Protocol Support Matrix**

HPE Storage Validation for CX Switches

Single Point Of Connectivity Knowledge (SPOCK) is a database that compiles validated compatibility for HPE Storage components, including CX switches. HPE Aruba Networking CX 8325 and CX 9300 series switches have been SPOCK validated and approved by the HPE Storage Networking Team.