The aim of this project is to design and develop
a high-performance RDMA-enabled RPC transport for
the NFS client and server in OpenSolaris.
Existing transports based
on TCP and UDP are limited by copying overhead,
resulting in low aggregate bandwidth and high
CPU utilization.
In addition, the current design, which is based on RDMA Read (the Read-Read design), has
performance and security limitations.
This project has three broad objectives.
Security:
The current Read-Read design allows the client to
issue RDMA Reads against the server's buffers. This exposes a number
of security and resource vulnerabilities. A malicious
client may attempt to read data that it is not authorized to
access. In addition, the client is in control of the server's
buffers and may tie up several of them for an inordinate
amount of time, denying service to other clients.
The solution to both these problems lies in
disallowing the client from issuing any RDMA operations to the
server. The server instead uses a combination of RDMA Read and Write
operations (Read-Write design) to fulfill bulk data transfer operations.
This effectively
eliminates the threat from malicious clients, because the server no longer
exposes its buffers and retains complete control over
the allocation and usage of its buffers.
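The direction of data movement in the Read-Write design can be illustrated with a small model. This is only a sketch in plain Python, not the OpenSolaris kernel code: the Client and Server classes, the integer buffer handles, and the operation log are all hypothetical names introduced for illustration. The point it shows is that the client only advertises its own buffers, while every RDMA operation is initiated by the server.

```python
from dataclasses import dataclass, field


@dataclass
class Client:
    """Toy client: owns buffers and advertises them; never touches server memory."""
    buffers: dict = field(default_factory=dict)  # handle -> bytearray

    def advertise(self, handle: int, size: int, data: bytes = b"") -> int:
        # Hypothetical stand-in for sending a buffer handle in the RPC call.
        buf = bytearray(size)
        buf[:len(data)] = data
        self.buffers[handle] = buf
        return handle


@dataclass
class Server:
    """Toy server: the only party that issues (simulated) RDMA operations."""
    files: dict = field(default_factory=dict)  # file name -> bytes
    log: list = field(default_factory=list)    # record of operations issued

    def nfs_read(self, client: Client, name: str, handle: int) -> int:
        # NFS READ: the server pushes file data into the client's
        # advertised buffer, modelled here as an RDMA Write.
        data = self.files[name]
        target = client.buffers[handle]
        n = min(len(data), len(target))
        target[:n] = data[:n]
        self.log.append(("RDMA_WRITE", handle, n))
        return n

    def nfs_write(self, client: Client, name: str, handle: int, length: int) -> int:
        # NFS WRITE: the server pulls data from the client's
        # advertised buffer, modelled here as an RDMA Read.
        source = client.buffers[handle]
        self.files[name] = bytes(source[:length])
        self.log.append(("RDMA_READ", handle, length))
        return length
```

In this model no method on Client ever takes a reference to server memory, which mirrors the property described above: the server's buffers are never exposed.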
Performance:
Moving from a Read-Read design to a Read-Write design allows us to tackle some of
the security and resource vulnerabilities present in the current design. One of the
limitations of InfiniBand RDMA operations is the need to register buffers with the
Host Channel Adapter (HCA) before issuing the operation. Because of the nature of the NFS protocol,
these registrations must be performed in the critical path of each NFS operation.
As a result, they constitute a substantial overhead. In this research, we have explored
reducing the overhead of these registration operations through the use of Fast Memory
Registration (FMR) and a registration cache.
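The idea behind a registration cache can be sketched as follows. This is a minimal illustration under stated assumptions, not the cache implemented in this project: the RegistrationCache class, its LRU eviction policy, and the register/deregister callables (stand-ins for the HCA driver's registration calls) are all hypothetical. A buffer that has already been registered is reused on later operations, so the registration cost is paid only on a miss.

```python
from collections import OrderedDict


class RegistrationCache:
    """Toy LRU cache of memory registrations keyed by (address, length)."""

    def __init__(self, capacity, register, deregister):
        self.capacity = capacity
        self._register = register      # hypothetical: (addr, length) -> handle
        self._deregister = deregister  # hypothetical: handle -> None
        self._entries = OrderedDict()  # (addr, length) -> handle
        self.hits = 0
        self.misses = 0

    def get(self, addr, length):
        key = (addr, length)
        if key in self._entries:
            # Hit: reuse the existing registration and mark it recently used.
            self.hits += 1
            self._entries.move_to_end(key)
            return self._entries[key]
        # Miss: pay the registration cost once and remember the handle.
        self.misses += 1
        handle = self._register(addr, length)
        self._entries[key] = handle
        if len(self._entries) > self.capacity:
            # Evict the least recently used registration.
            _, old_handle = self._entries.popitem(last=False)
            self._deregister(old_handle)
        return handle
```

A real cache would also have to invalidate entries when the underlying memory is freed or remapped; that is deliberately left out of this sketch.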
Interoperability:
The final objective of this project is to interoperate with other RDMA-enabled
implementations, such as the
Linux NFS over RDMA implementation.
The Linux implementation of NFS/RDMA uses additional techniques to further reduce
registration overhead and directly place data into its target buffers. These techniques
include the use of physical registration in combination with a larger number
of NFS operations in the long reply phase of the RPC/RDMA protocol. These techniques require
design changes so that the Read-Write design can accommodate the differences efficiently and
effectively. Finally, differences in connection setup and management between the InfiniBand stacks on the
two systems require additional mechanisms to handle them seamlessly.
Keeping in mind the goals of security, performance and interoperability, we have designed
and implemented a complete prototype of the Read-Write design. This prototype is being incorporated
into the OpenSolaris kernel code base.
A recent thrust of this project is to develop a pNFS prototype for OpenSolaris. pNFS attempts
to address the limited bandwidth scaling of a single NFS server serving multiple clients.
The design goals will also take into account the requirements of NFSv4, while providing
increased aggregate bandwidth through multiple data servers that are decoupled from the
metadata servers.
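One common way to spread a file across multiple data servers is round-robin striping. The sketch below shows how a file offset could be mapped to a data server under that scheme. It is an illustration only: the locate function, the stripe unit, and the dense round-robin layout are assumptions made here, not a description of the layout used by this prototype.

```python
def locate(offset, stripe_unit, num_data_servers):
    """Map a file offset to (data server index, offset on that server).

    Assumes a dense round-robin layout: stripe 0 goes to server 0,
    stripe 1 to server 1, and so on, wrapping around.
    """
    stripe_index = offset // stripe_unit
    server = stripe_index % num_data_servers
    # Each server stores its stripes back to back.
    local_offset = (stripe_index // num_data_servers) * stripe_unit + offset % stripe_unit
    return server, local_offset
```

With such a mapping, clients reading different stripes contact different data servers, which is what allows aggregate bandwidth to grow with the number of data servers.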
Lei Chai, Xiangyong Ouyang, Ranjit Noronha and Dhabaleswar K. Panda,
pNFS/PVFS2 over InfiniBand: Early Experiences. Petascale Data Storage Workshop 2007, to be held in
conjunction with SuperComputing (SC) 2007, Reno, Nevada.
Ranjit Noronha, Lei Chai, Thomas Talpey and Dhabaleswar K. Panda,
Designing NFS With RDMA For Security, Performance and Scalability. The
2007 International Conference on Parallel Processing (ICPP-07), Xi'an, China.
Ranjit Noronha, Lei Chai, Spencer Shepler and Dhabaleswar K. Panda,
Enhancing the Performance of NFSv4 with RDMA. International Workshop on Storage Network
Architecture and Parallel I/Os (SNAPI'07), San Diego, CA.
Ranjit Noronha, Lei Chai, Thomas Talpey and Dhabaleswar K. Panda,
Better NFS through RDMA and Efficient Memory Registration. OSU-CISRC-1/07--TR06.
Weikuan Yu, Ranjit Noronha, Lei Chai, Shuang Liang and Dhabaleswar K. Panda,
Optimizing OpenSolaris NFS over RDMA. OSU-CISRC-4/060TR43.
Our testbed consists of
dual Opteron x2100's with 2 GB of memory and Single Data Rate
(SDR) x8 PCI-Express InfiniBand adapters.
These systems were running OpenSolaris build version 33.
The back-end file system was tmpfs, a memory-based
file system.
Read-Read versus Read-Write:
This figure shows the multi-threaded NFS Read bandwidth achieved with the Read-Read (RR) and
Read-Write (RW) designs, with CPU utilization overlaid. These results
were measured with the IOzone benchmark.
The IOzone file size was 128 MB, chosen to accommodate reasonable multi-threaded
workloads (IOzone creates a separate file for each thread), and the record size was fixed at 128 KB. Although the two designs differ in bandwidth at small thread counts,
both saturate at about 350 MB/s as the number of threads increases
(with the Read-Write design delivering slightly better bandwidth). However, the Read-Write design shows lower CPU utilization, largely because of a reduction in the number of data copies.
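For a sense of scale, the workload parameters above can be turned into a few numbers. The helper names below are ours, and the calculation assumes binary units (1 KB = 1024 bytes), which the text does not state.

```python
KB = 1024          # assumption: binary units
MB = 1024 * KB

FILE_SIZE = 128 * MB    # IOzone file size per thread
RECORD_SIZE = 128 * KB  # IOzone record size


def records_per_file(file_size=FILE_SIZE, record_size=RECORD_SIZE):
    """Number of fixed-size records needed to cover one file."""
    return file_size // record_size


def total_bytes(threads, file_size=FILE_SIZE):
    """Total data touched when each thread works on its own file."""
    return threads * file_size
```

Under these assumptions each thread issues 1024 records per pass over its file, and eight threads together cover 1024 MB of data.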
Impact of Memory Registration:
We look at the impact of FMR and the buffer registration cache on
OpenSolaris NFS/RDMA Read bandwidth. The IOzone file size was 128 MB.
The IOzone record size was fixed at 128 KB.
FMR can reduce registration overhead and increase multi-threaded bandwidth from a peak of about
350 MB/s to about 400 MB/s. The buffer registration cache, on the other hand, can increase
bandwidth to approximately 700 MB/s.
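The relative improvements implied by these approximate peak figures can be worked out directly. The sketch below assumes that the 350 MB/s baseline applies to both comparisons; the function name is ours.

```python
def relative_gain(baseline, improved):
    """Percentage increase of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100.0


BASELINE_MBPS = 350.0  # approximate peak with plain registration
FMR_MBPS = 400.0       # approximate peak with FMR
CACHE_MBPS = 700.0     # approximate peak with the registration cache

fmr_gain = relative_gain(BASELINE_MBPS, FMR_MBPS)      # about 14%
cache_gain = relative_gain(BASELINE_MBPS, CACHE_MBPS)  # 100%, i.e. twice the baseline
```

On these rounded numbers, FMR gives roughly a 14% improvement, while the registration cache doubles the peak bandwidth.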
OpenSolaris versus Linux:
Finally, this figure shows a comparison of the basic Linux-Linux NFS/RDMA design and the
OpenSolaris-OpenSolaris NFS/RDMA design. These numbers were measured using the basic
register mode, and without any registration enhancements such as FMR, buffer registration
cache or physical registration turned on.