Fault Tolerance Backplane - InfiniBand (FTB-IB)
Network-based Computing Laboratory
Department of Computer Science and Engineering
Ohio State University



Modern high-performance computing systems have grown to thousands of nodes, with tens of thousands of processors and thousands of disks, and the rate of failures has risen accordingly. Fault management in such systems has traditionally been handled by individual components at different levels of the stack, with no coordination among them. As a result, components cannot benefit from knowledge of the failures experienced by others. The CiFTS initiative aims to provide a coordinated infrastructure that enables fault tolerance systems to adapt holistically to faults occurring in the operating environment.

The Fault Tolerance Backplane provides a common infrastructure for the operating system, middleware, libraries and applications to exchange information about hardware and software failures in real time. Fault-aware components can subscribe to be notified about one or more events of interest from other components, and can notify other components about the faults they encounter.
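
The subscribe-and-notify pattern described above can be illustrated with a minimal in-process sketch. This is not the FTB API: the Backplane class, its method names, and the event names PORT_DOWN and DISK_FULL are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class Backplane:
    """Toy in-process stand-in for a fault-event backplane (NOT the FTB API)."""

    def __init__(self) -> None:
        # Maps an event name to the handlers subscribed to it.
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Handler) -> None:
        """Register a handler to be called for every event with this name."""
        self._subscribers[event_name].append(handler)

    def publish(self, source: str, event_name: str, detail: str) -> int:
        """Deliver an event to its subscribers; return how many were notified."""
        event = {"source": source, "name": event_name, "detail": detail}
        # .get avoids creating an empty entry for names nobody subscribed to.
        handlers = list(self._subscribers.get(event_name, []))
        for handler in handlers:
            handler(event)
        return len(handlers)


# Hypothetical usage: one component subscribes, two others publish.
bus = Backplane()
received: List[Event] = []
bus.subscribe("PORT_DOWN", received.append)
delivered = bus.publish("network-monitor", "PORT_DOWN", "port 1")
ignored = bus.publish("disk-monitor", "DISK_FULL", "/scratch")
```

Here the second event reaches no one because no component subscribed to it, which mirrors the idea that components only hear about the events they have registered interest in.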


Fault Tolerance Backplane (Courtesy the CiFTS Team)

The FTB-IB component uses the FTB infrastructure to notify other FTB-enabled components about failures in the InfiniBand network. It relies on the asynchronous event handler provided by the IB Verbs library, which is part of the OFED software. Applications that need notification of one or more of these events can subscribe to them through the FTB infrastructure.
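
As a rough sketch of the forwarding step, the snippet below filters a stream of asynchronous events and hands the interesting ones to a publish callback. The event type names are real libibverbs identifiers (in C they would typically be read with ibv_get_async_event and acknowledged with ibv_ack_async_event), but everything else is an assumption made for illustration: the forward_events function, the severity labels, and the publish callback are hypothetical and do not reflect how FTB-IB itself is implemented or which events it reports.

```python
from typing import Callable, Iterable, List, Tuple

# Keys are real libibverbs async event type names.
# The severity labels are this sketch's own assumption, not FTB-IB's.
SEVERITY = {
    "IBV_EVENT_PORT_ERR": "error",
    "IBV_EVENT_PORT_ACTIVE": "info",
    "IBV_EVENT_DEVICE_FATAL": "fatal",
    "IBV_EVENT_QP_FATAL": "fatal",
    "IBV_EVENT_LID_CHANGE": "warning",
}


def forward_events(
    events: Iterable[Tuple[str, str]],
    publish: Callable[[str, str, str], None],
) -> int:
    """Forward known events to `publish`; return the number forwarded.

    Each event is a (event_type, element) pair, where `element` names the
    affected port, queue pair, or device. Unknown event types are skipped.
    """
    forwarded = 0
    for event_type, element in events:
        severity = SEVERITY.get(event_type)
        if severity is None:
            continue  # Not an event this sketch chooses to report.
        publish(event_type, severity, element)
        forwarded += 1
    return forwarded


# Hypothetical usage with a list standing in for the backplane.
published: List[Tuple[str, str, str]] = []
count = forward_events(
    [
        ("IBV_EVENT_PORT_ERR", "port 1"),
        ("IBV_EVENT_COMM_EST", "qp 42"),
        ("IBV_EVENT_DEVICE_FATAL", "device 0"),
    ],
    lambda event_type, severity, element: published.append(
        (event_type, severity, element)
    ),
)
```

Keeping the publish step behind a callback means the same filtering logic could feed any notification mechanism; the actual list of events FTB-IB reports is the one linked from the List of Events section.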



Requirements

FTB-IB is supported on Linux systems that have the FTB Software (API Version 0.5) and the OFED Software packages installed.

For more information about installing FTB, please visit
http://www.mcs.anl.gov/research/cifts/

For more information about installing OFED, please visit
http://www.openfabrics.org



Download

Download the latest version of FTB-IB (Version 1.0) here

FTB-IB is also available through anonymous SVN at https://mvapich.cse.ohio-state.edu/svn/ftb-ib

The FTB-IB presentation at SC08 can be found here



List of Events

The list of InfiniBand network failure events that FTB-IB currently publishes is available here



Contact: Dhabaleswar K. Panda
2002-2008 NBCL. All rights reserved.
774 Dreese Laboratories
2015 Neil Avenue
Columbus, OH 43210