OpenSolaris NFS over RDMA Project

Overview

The aim of this project is to design and develop a high-performance RDMA-enabled RPC transport for the NFS client and server in OpenSolaris. The existing TCP- and UDP-based transports are limited by data-copying overhead, resulting in low aggregate bandwidth and high CPU utilization. In addition, the current design, which is based on RDMA Reads (the Read-Read design), has performance and security limitations.

Objectives

  • Security: The current Read-Read design allows the client to RDMA Read data from the server's buffers. This creates a number of security and resource vulnerabilities: a malicious client may attempt to read data it is not authorized to access, and because the client controls the server's buffers, it may tie up several of them for an inordinate amount of time, denying service to other clients. The solution to both problems is to disallow the client from issuing any RDMA operations to the server. The server instead uses a combination of RDMA Read and Write operations (the Read-Write design) to fulfill bulk data transfers, as sketched after this list. This effectively eliminates the threat from malicious clients: the server never exposes its buffers and retains complete control over their allocation and use.
  • Performance: Moving from a Read-Read to a Read-Write design lets us address the security and resource vulnerabilities of the current design. One limitation of InfiniBand RDMA operations is that buffers must be registered with the HCA before an operation can be issued. Because of the nature of the NFS protocol, these registrations fall in the critical path of each NFS operation and therefore constitute a substantial overhead. In this research, we have explored reducing this registration overhead through Fast Memory Registration (FMR) and a buffer registration cache (a sketch of the cache follows this list).
  • Interoperability: The final objective of this project is interoperability with other RDMA-enabled implementations such as Linux NFS over RDMA. The Linux implementation uses additional techniques to further reduce registration overhead and to place data directly into its target buffers, including physical registration combined with a larger number of NFS operations in the long-reply phase of the RPC/RDMA protocol (the chunk encoding both sides must agree on is shown below). Accommodating these differences efficiently and effectively in the Read-Write design requires design changes, and connection-setup differences between the InfiniBand stacks on the two systems require additional mechanisms to manage connections seamlessly.
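
To make the Read-Write design concrete, the sketch below shows how a server might push NFS READ reply data into a write chunk advertised by the client in its call. This is a minimal, hypothetical illustration using the user-space libibverbs API; the actual OpenSolaris transport runs in the kernel and uses different interfaces, and the names here (client_chunk, rdma_write_reply) are ours, not the project's.

    /*
     * Sketch: server-side bulk transfer in the Read-Write design.
     * The client never issues RDMA to the server; for an NFS READ the
     * server RDMA-Writes the payload into a client-advertised chunk.
     */
    #include <infiniband/verbs.h>
    #include <stdint.h>

    /* One segment of the client's advertised write chunk. */
    struct client_chunk {
        uint64_t addr;      /* client-side virtual address */
        uint32_t rkey;      /* remote key the client registered */
        uint32_t length;    /* bytes the client made writable */
    };

    /* Push 'len' bytes of READ reply data into the client's buffer. */
    static int rdma_write_reply(struct ibv_qp *qp, struct ibv_mr *mr,
                                void *data, uint32_t len,
                                const struct client_chunk *chunk)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)data,
            .length = len,
            .lkey   = mr->lkey,         /* server-local registration */
        };
        struct ibv_send_wr wr = {
            .opcode              = IBV_WR_RDMA_WRITE,
            .sg_list             = &sge,
            .num_sge             = 1,
            .send_flags          = IBV_SEND_SIGNALED,
            .wr.rdma.remote_addr = chunk->addr,
            .wr.rdma.rkey        = chunk->rkey,
        };
        struct ibv_send_wr *bad = NULL;

        if (len > chunk->length)
            return -1;                  /* would overrun the advertised chunk */
        return ibv_post_send(qp, &wr, &bad);
    }

Because the server initiates every transfer, it can validate lengths against the advertised chunk and release its own buffers on its own schedule, which is exactly the control the Read-Read design gives away.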
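
The registration cache mentioned above amounts to memoizing HCA registrations so they drop out of the per-operation critical path. The sketch below is a deliberately simplified, hypothetical version (again shown with libibverbs; a real implementation must also evict and deregister entries when buffers are freed or remapped):

    #include <infiniband/verbs.h>
    #include <stddef.h>

    #define REG_CACHE_SLOTS 64

    struct reg_cache_entry {
        void          *addr;
        size_t         len;
        struct ibv_mr *mr;
    };

    static struct reg_cache_entry reg_cache[REG_CACHE_SLOTS];

    /* Return a registration covering [addr, addr + len), reusing one if cached. */
    static struct ibv_mr *reg_cache_get(struct ibv_pd *pd, void *addr, size_t len)
    {
        int i, free_slot = -1;

        for (i = 0; i < REG_CACHE_SLOTS; i++) {
            struct reg_cache_entry *e = &reg_cache[i];
            if (e->mr == NULL) {
                if (free_slot < 0)
                    free_slot = i;
            } else if (e->addr == addr && e->len >= len) {
                return e->mr;           /* hit: no ibv_reg_mr in the fast path */
            }
        }

        /* Miss: pay the registration cost once and remember the handle. */
        struct ibv_mr *mr = ibv_reg_mr(pd, addr, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (mr != NULL && free_slot >= 0) {
            reg_cache[free_slot].addr = addr;
            reg_cache[free_slot].len  = len;
            reg_cache[free_slot].mr   = mr;
        }
        return mr;
    }

With the cache in place, an NFS operation costs a lookup rather than a registration in the common case, which is where the jump from roughly 400 MB/s to roughly 700 MB/s reported in the Results section comes from.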
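
Interoperability ultimately comes down to the wire protocol: both ends must encode chunks identically in the RPC/RDMA header. The C equivalents of the XDR definitions (as specified in the RPC/RDMA protocol, later standardized in RFC 5666) look like this:

    #include <stdint.h>

    /* A remotely accessible region: steering tag, length, remote offset. */
    struct xdr_rdma_segment {
        uint32_t handle;    /* registered memory handle (R_key) */
        uint32_t length;    /* length of the segment in bytes */
        uint64_t offset;    /* remote virtual address */
    };

    /* A read chunk ties a segment to a position in the XDR stream. */
    struct xdr_read_chunk {
        uint32_t position;  /* XDR stream offset of the chunked data */
        struct xdr_rdma_segment target;
    };

Whether the payload moves by RDMA Read (read chunks), RDMA Write (write chunks), or a long reply, both implementations interpret exactly these fields, which is what allows an OpenSolaris client and a Linux server (or vice versa) to interoperate.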

Description

Keeping in mind the goals of security, performance, and interoperability, we have designed and implemented a complete Read-Write design. This prototype is being incorporated into the OpenSolaris kernel code base. A recent thrust of this project is to develop a pNFS prototype for OpenSolaris. pNFS addresses the limited bandwidth scaling of a single NFS server shared by multiple clients. The design will also take into account the requirements of NFSv4 while providing increased aggregate bandwidth through multiple data servers that are decoupled from the metadata server.

Results

Our testbed consists of dual-Opteron x2100 systems with 2 GB of memory and Single Data Rate (SDR) x8 PCI-Express InfiniBand adapters, running OpenSolaris build 33. The back-end file system was tmpfs, a memory-based file system.

  • Read-Read versus Read-Write: This figure shows the multi-threaded NFS read bandwidth achieved with the Read-Read (RR) and Read-Write (RW) designs, with CPU utilization overlaid. The results were measured with the popular IOzone benchmark, using a 128 MB file size to accommodate reasonable multi-threaded workloads (IOzone creates a separate file for each thread) and a record size fixed at 128 KB. While there is an initial difference in bandwidth at smaller thread counts, both designs saturate at about 350 MB/s as the number of threads increases (with the Read-Write design delivering slightly better bandwidth). However, the Read-Write design shows better CPU utilization, largely because of a reduction in the number of data copies.

  • Impact of Memory Registration: This figure shows the impact of FMR and the buffer registration cache on OpenSolaris NFS/RDMA read bandwidth, again with a 128 MB IOzone file size and a record size fixed at 128 KB. FMR reduces registration overhead and increases peak multi-threaded bandwidth from about 350 MB/s to about 400 MB/s. The buffer registration cache, on the other hand, increases bandwidth to approximately 700 MB/s.

  • OpenSolaris versus Linux: Finally, this figure compares the basic Linux-to-Linux NFS/RDMA design with the OpenSolaris-to-OpenSolaris NFS/RDMA design. These numbers were measured using the basic registration mode, with no registration enhancements (FMR, the buffer registration cache, or physical registration) enabled.

Sponsors

    Sun Microsystems
    Network Appliance