Design and Evaluation of Communications Protocols over InfiniBand WAN

Overview

Lower costs, coupled with the rapid pace of hardware (CPU, memory, etc.) development, are driving organizations to deploy multiple clusters over time. While the increasing number of these loosely coupled cluster deployments has improved the computational capabilities of these organizations, tighter integration with better communication protocols is expected to pave the way for significantly higher performance gains in such cluster-of-cluster scenarios. InfiniBand WAN promises to be such an enabling communication interconnect, allowing its high performance protocols to be extended to WAN (i.e. cluster-of-cluster) scenarios.

Objectives

  • Study the characteristics of InfiniBand WAN in the context of cluster-of-cluster scenarios.
  • Identify the potential benefits and limitations of the various advanced features of InfiniBand and basic InfiniBand-based designs.
  • Study the effect of WAN characteristics like high delay on various protocols.
  • Leverage the advanced capabilities of IB WAN to design efficient communication extensions for popular WAN protocols such as HTTP and FTP.
  • Evaluate and propose optimizations for communication protocols used by various middleware libraries in the context of both point-to-point communication and collectives.
  • Study the effect of WAN delay on end applications and quantify the extent of the benefits of the proposed optimizations and possible overlap potential therein.

Description

The following figure shows our testbed. We currently connect the two clusters shown using a pair of Obsidian Longbow XRs.


This research is supported in part by NSF grants #CNS-0403342 and #CNS-050942; and equipment donations from Intel, Mellanox and Obsidian.

Results

In our experiments, we use nodes of the following cluster, connected by a pair of Obsidian Longbow XRs. The cluster consists of 64 Intel Xeon Quad dual-core processor nodes with 6GB RAM. The cluster nodes are equipped with IB DDR memfree MT25208 HCAs, and OFED 1.3 drivers were used. Some nodes of the cluster are also equipped with Mellanox ConnectX IB adapters.

Performance of HPC Middleware over InfiniBand WAN

High performance interconnects such as InfiniBand (IB) have enabled large scale deployments of High Performance Computing (HPC) systems. High performance communication and IO middleware such as MPI and NFS over RDMA have also been redesigned to leverage the performance of these modern interconnects. With the advent of long haul InfiniBand (IB WAN), IB applications now have inter-cluster reach. While this technology is intended to enable high performance network connectivity across WAN links, it is important to study and characterize the actual performance that existing IB middleware achieves in these emerging IB WAN scenarios.

In this context, we study and analyze the performance characteristics of various HPC middleware. We use the Obsidian IB WAN routers for inter-cluster connectivity. Our results show that many of the applications absorb smaller network delays fairly well. However, most approaches are severely impacted in high delay scenarios, and communication protocols need to be optimized for higher delays to improve performance. Our experimental results show that techniques such as WAN-aware protocols, transferring data using large messages (message coalescing) and using parallel data streams can improve communication performance (by up to 50%) in high delay scenarios. Overall, these results demonstrate that IB WAN technologies can make cluster-of-clusters architectures a feasible platform for HPC systems.
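To illustrate the message coalescing idea mentioned above, the following is a minimal sketch and not the implementation used in this work. It batches small messages into one larger transfer so that fewer sends have to cross the high-delay link. The names `MessageCoalescer`, `post`, `flush` and `split_coalesced`, the 4-byte length prefix, and the 64 KB threshold are all assumptions made for this illustration.

```python
from typing import Callable, List


class MessageCoalescer:
    """Hypothetical helper: buffers small messages and sends them as one large one."""

    def __init__(self, send: Callable[[bytes], None], threshold: int = 64 * 1024) -> None:
        self._send = send            # callback that puts one buffer on the wire
        self._threshold = threshold  # flush once this many bytes are pending (assumed value)
        self._pending: List[bytes] = []
        self._pending_bytes = 0

    def post(self, payload: bytes) -> None:
        # Prefix each message with its length so the receiver can split the batch.
        framed = len(payload).to_bytes(4, "big") + payload
        self._pending.append(framed)
        self._pending_bytes += len(framed)
        if self._pending_bytes >= self._threshold:
            self.flush()

    def flush(self) -> None:
        # Send everything that is pending as a single large message.
        if self._pending:
            self._send(b"".join(self._pending))
            self._pending.clear()
            self._pending_bytes = 0


def split_coalesced(buffer: bytes) -> List[bytes]:
    """Receiver side: recover the individual messages from one coalesced buffer."""
    messages = []
    offset = 0
    while offset < len(buffer):
        length = int.from_bytes(buffer[offset:offset + 4], "big")
        offset += 4
        messages.append(buffer[offset:offset + length])
        offset += length
    return messages


if __name__ == "__main__":
    wire: List[bytes] = []
    coalescer = MessageCoalescer(wire.append, threshold=32)
    for _ in range(10):
        coalescer.post(b"abcd")
    coalescer.flush()
    print(len(wire), "sends for 10 messages")
```

The trade-off in such a design is that a message may wait in the buffer until the threshold is reached, so a real protocol would also need a timeout or an explicit flush at synchronization points.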

Results

Here, we measure the peak bandwidth of the WAN link under different transmission protocols as the network delay increases. IB verbs achieve the highest bandwidth and sustain it well across the whole range of delays (WAN distances), whereas TCP/IP bandwidth drops quickly with increasing delay, which in turn degrades performance.
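One common way to reason about why a window-based protocol such as TCP loses bandwidth as delay grows is the bound throughput <= window / RTT. The sketch below is a simplified model under stated assumptions, not a reproduction of the measurements above: it ignores packet loss and congestion control, and the 8 Gbps link rate, the 64 KB window and the function names are illustrative assumptions. The optional `streams` parameter models parallel data streams as independent windows.

```python
def window_limited_throughput_bps(window_bytes: float, rtt_seconds: float,
                                  link_bps: float, streams: int = 1) -> float:
    """Upper bound on throughput when each stream has at most one window in flight."""
    if rtt_seconds <= 0:
        return float(link_bps)
    per_stream_bps = window_bytes * 8 / rtt_seconds
    # The aggregate can never exceed the raw link rate.
    return min(float(link_bps), streams * per_stream_bps)


def bandwidth_delay_product_bytes(link_bps: float, rtt_seconds: float) -> float:
    """Bytes that must be in flight to keep the link full at a given RTT."""
    return link_bps * rtt_seconds / 8


if __name__ == "__main__":
    LINK_BPS = 8e9        # assumed link data rate, for illustration only
    WINDOW = 64 * 1024    # assumed per-stream window in bytes
    for rtt_us in (10, 100, 1000, 10000):
        rtt = rtt_us * 1e-6
        one = window_limited_throughput_bps(WINDOW, rtt, LINK_BPS)
        eight = window_limited_throughput_bps(WINDOW, rtt, LINK_BPS, streams=8)
        bdp = bandwidth_delay_product_bytes(LINK_BPS, rtt)
        print(f"RTT {rtt_us:>6} us: 1 stream {one / 1e6:9.1f} Mbps, "
              f"8 streams {eight / 1e6:9.1f} Mbps, BDP {bdp / 1024:9.1f} KB")
```

In this model, a fixed window caps throughput once the RTT exceeds window / link rate, which is why larger buffers or multiple parallel streams help in high delay scenarios.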

Zero-Copy Mechanisms for High Performance Data Transfers over WAN

FTP has been the most popular method to transfer large files for data-staging, replication, and the like. While existing FTP mechanisms have improved gradually with newer networks, they still inherit the fundamental limitations imposed by the underlying networking protocols (TCP/UDP) they use. These include limited network bandwidth utilization and the high memory bandwidth and CPU utilization that TCP/UDP cause on the end-nodes. Thus both the performance and the scalability of such systems are limited. The advent of InfiniBand (IB) WAN has enabled the use of high performance transport protocols in WAN scenarios, which can be leveraged for designing FTP mechanisms. Enabling IB-based FTP capabilities and providing good efficiency for such transfers presents a considerable challenge.

In this work, we present an Advanced Data Transfer Service (ADTS) to enable efficient data transfers over WAN. We leverage the capabilities of ADTS to design high performance file transfer mechanisms (FTP based on ADTS). Our ADTS layer improves data transfer performance by optimizing several aspects, including efficient buffer management, a memory registration cache, pipelining of data transfers, reducing TCP/IP related data copies, and maintaining persistent FTP data sessions. Further, we reduce the CPU utilization required for the data transfers (by up to a factor of 6) and demonstrate significantly higher FTP server scalability. In our experimental results, we observe that our FTP-ADTS design outperforms existing TCP and UDP based approaches by more than 80% in transferring large volumes of data. In addition, we use the WAN emulation capabilities of Obsidian InfiniBand WAN routers to study the impact of our designs in a wide range of WAN scenarios, leading to solutions that enable the design of highly capable WAN communication protocols required to power next-generation high performance parallel and distributed environments.
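As an illustration of the memory registration cache idea listed above, the following is a minimal sketch and not the ADTS implementation. Registering a buffer with an RDMA adapter is comparatively expensive, so a cache can reuse an existing registration when the same buffer is transferred again. The class name, the LRU eviction policy, the capacity, and the `register` and `deregister` callbacks are hypothetical stand-ins; in a real verbs program the registration itself would be done through calls such as `ibv_reg_mr`.

```python
from collections import OrderedDict
from typing import Any, Callable, Tuple


class RegistrationCache:
    """Hypothetical LRU cache of buffer registrations, keyed by (address, length)."""

    def __init__(self, register: Callable[[int, int], Any],
                 deregister: Callable[[Any], None], capacity: int = 4) -> None:
        self._register = register      # assumed callback: pins and registers a buffer
        self._deregister = deregister  # assumed callback: releases a registration
        self._capacity = capacity
        self._entries: "OrderedDict[Tuple[int, int], Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, addr: int, length: int) -> Any:
        key = (addr, length)
        if key in self._entries:
            # Cache hit: reuse the handle and mark it as most recently used.
            self._entries.move_to_end(key)
            self.hits += 1
            return self._entries[key]
        # Cache miss: pay the registration cost once and remember the handle.
        self.misses += 1
        handle = self._register(addr, length)
        self._entries[key] = handle
        if len(self._entries) > self._capacity:
            # Evict the least recently used registration.
            _, evicted = self._entries.popitem(last=False)
            self._deregister(evicted)
        return handle


if __name__ == "__main__":
    cache = RegistrationCache(register=lambda a, n: ("mr", a, n),
                              deregister=lambda h: None, capacity=2)
    for _ in range(3):
        cache.get(0x1000, 4096)
    print("hits:", cache.hits, "misses:", cache.misses)
```

This sketch matches only exact (address, length) pairs; a production cache would also have to handle overlapping regions and buffers that the application frees while a registration is still cached.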

Results:

We compare the performance of our design (FTP-ADTS), GridFTP and FTP over UDP with varying WAN delays in the figure on the left. We observe that FTP-ADTS sustains its performance quite well for larger WAN delays, while GridFTP shows a steep latency increase when the WAN delay is 10000 us. GridFTP offers a number of optimizations that can improve its performance further; the numbers shown here were taken without enabling any of them. In the high delay scenario, our FTP delivers six times better performance, which shows significant promise for our FTP-ADTS design. The improvement arises not only because the underlying zero-copy operations are faster than TCP/UDP, but also because network throughput is the bottleneck for IPoIB over WAN, where factors such as RTT, MTU size and buffer size can severely impact performance.

The figure on the right shows the server-side CPU utilization of the various FTP schemes we considered. FTP-ADTS clearly takes up much less CPU time than the other two FTP versions, thereby freeing up valuable CPU time for use by other applications.

Currently, we are also extending our research focus in several directions:

  • Design a comprehensive suite of benchmarks to study the various performance characteristics of InfiniBand WAN.
  • Leverage the features of InfiniBand WAN and iWARP-capable interconnects such as 10-GigE to design scalable geographically distributed data-centers.
  • Design high performance WAN protocols for HTTP and FTP leveraging the advanced capabilities of InfiniBand WAN.

Sponsors

NSF
Intel
Mellanox
Obsidian