Fault Tolerance Backplane (FTB)
Network-based Computing Laboratory
Department of Computer Science and Engineering
Ohio State University



Modern high-performance computing systems have grown to thousands of nodes, with tens of thousands of processors and thousands of disks, and failure rates have risen accordingly. Fault management in such systems has traditionally been handled by individual components at different levels of the stack, with no coordination among them. As a result, components cannot benefit from knowledge of the failures that others experience. The CiFTS (Coordinated Infrastructure for Fault-Tolerant Systems) initiative aims to provide a coordinated infrastructure that enables fault tolerance systems to adapt holistically to faults occurring in the operating environment.

The Fault Tolerance Backplane (FTB) provides a common infrastructure through which the operating system, middleware, libraries, and applications exchange information about hardware and software failures in real time. Fault-aware components can subscribe to be notified of one or more events of interest from other components, and can notify other components of the faults they encounter.
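The subscribe-and-notify model described above can be illustrated with a small in-process sketch. This is not the FTB API: the class `ToyBackplane`, the `Event` fields, and the event name `PORT_DOWN` are hypothetical names chosen for illustration, and a real backplane delivers events across processes and nodes rather than through direct callbacks.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    source: str    # component reporting the fault
    name: str      # event name (hypothetical; not taken from any FTB event list)
    severity: str  # e.g. "info", "warning", "fatal"


class ToyBackplane:
    """In-process stand-in for a fault-event backplane. Not the FTB API."""

    def __init__(self):
        # Maps an event name to the callbacks subscribed to it.
        self._subscribers = defaultdict(list)

    def subscribe(self, name, callback):
        """Register `callback` to be called for every event called `name`."""
        self._subscribers[name].append(callback)

    def publish(self, event):
        """Deliver `event` to its subscribers; return how many were notified."""
        callbacks = self._subscribers.get(event.name, [])
        for callback in callbacks:
            callback(event)
        return len(callbacks)


# Usage: a checkpointing library reacts to a fault seen by a network monitor.
backplane = ToyBackplane()
seen_by_checkpointer = []
backplane.subscribe("PORT_DOWN", seen_by_checkpointer.append)
notified = backplane.publish(Event("network-monitor", "PORT_DOWN", "fatal"))
print(notified, seen_by_checkpointer)
```

The point of the design is that the publisher does not need to know who is listening: the network monitor reports the fault once, and any component that subscribed to that event name learns about it.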


Fault Tolerance Backplane (Courtesy the CiFTS Team)

The OSU team is involved in two projects in this direction: FTB-IB and MVAPICH2-FTB-CR.

An overview of these two projects, together with the list of events and the associated code for download, is given below.

FTB-IB

The FTB-IB component uses the FTB infrastructure to notify other FTB-enabled components about failures in the InfiniBand (IB) network. It detects these failures through the asynchronous event handler provided by the IB Verbs library, which is part of the OFED software. Applications that require notification of one or more of these events can subscribe to them through the FTB infrastructure.
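The monitoring loop described above can be sketched as follows. In libibverbs, asynchronous events are retrieved with `ibv_get_async_event()` and must be acknowledged with `ibv_ack_async_event()`; the Python sketch below only simulates that pattern with a standard-library queue so that it runs without InfiniBand hardware. The event type names are real `ibv_event_type` values, but the severity mapping, the function `drain_async_events`, and the shape of the published record are assumptions made for illustration and are not the FTB-IB event list or implementation.

```python
import queue

# Real libibverbs async event type names. The severities assigned here are an
# illustrative assumption, not the published FTB-IB event list.
SEVERITY_BY_IB_EVENT = {
    "IBV_EVENT_PORT_ACTIVE": "info",
    "IBV_EVENT_PORT_ERR": "warning",
    "IBV_EVENT_QP_FATAL": "error",
    "IBV_EVENT_DEVICE_FATAL": "fatal",
}


def drain_async_events(event_queue, publish):
    """Forward every queued event to `publish`; return how many were forwarded."""
    forwarded = 0
    while True:
        try:
            # Stands in for ibv_get_async_event(), which blocks in real code.
            ib_event = event_queue.get_nowait()
        except queue.Empty:
            return forwarded
        severity = SEVERITY_BY_IB_EVENT.get(ib_event, "unknown")
        publish({"name": ib_event, "severity": severity})
        # Stands in for ibv_ack_async_event(): every event must be acknowledged.
        event_queue.task_done()
        forwarded += 1


# Usage: two simulated events are translated and handed to a publish callback.
simulated = queue.Queue()
simulated.put("IBV_EVENT_PORT_ERR")
simulated.put("IBV_EVENT_LID_CHANGE")
published = []
count = drain_async_events(simulated, published.append)
print(count, published)
```

In a real monitor the `publish` callback would hand the record to the backplane, and the loop would block waiting for the next event instead of returning when the queue is empty.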

FTB-IB is supported on Linux systems that have the FTB Software (API Version 0.5) and the OFED Software packages installed. For more information about installing FTB, please visit http://www.mcs.anl.gov/research/cifts/. For more information about installing OFED, please visit http://www.openfabrics.org/.

Download the latest version of FTB-IB (Version 1.0) here. FTB-IB is also available through anonymous SVN at https://mvapich.cse.ohio-state.edu/svn/ftb-ib.

The list of FTB events related to InfiniBand network failures that FTB-IB currently publishes is available here.

The FTB-IB presentation at SC08 can be found here.


MVAPICH2-FTB-CR

The MVAPICH2 1.4 release integrates FTB support for carrying out Checkpoint-Restart (CR).

The list of FTB-CR commands supported in this MVAPICH2 release is available here.

MVAPICH2 1.4 can be downloaded from the MVAPICH web site.

Detailed guidelines for using FTB-CR support in MVAPICH2 to carry out Checkpoint-Restart are available here.


Publications

  • R. Gupta, P. Beckman, B.H. Park, E. Lusk, P. Hargrove, A. Geist, D. K. Panda, A. Lumsdaine and J. Dongarra, CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems, The 38th International Conference on Parallel Processing (ICPP'09), September 2009.

Contact: Dhabaleswar K. Panda
Copyright 2002-2009 NBCL. All rights reserved.
774 Dreese Laboratories
2015 Neil Avenue
Columbus, OH 43210