Fault Tolerance Backplane - InfiniBand (FTB-IB)
Network-based Computing Laboratory, Department of Computer Science and Engineering
Modern High Performance Computing Systems have grown in size to thousands of
nodes, having tens of thousands of processors and thousands of disks, which
has resulted in an increase in the rate of failures. Fault Management in such
systems has been traditionally handled by individual components at different
levels in the stack, with no coordination among different components in the
system. As a result, different components cannot benefit from the knowledge of
the failures experienced by others. The CiFTS initiative aims to provide a
coordinated infrastructure that will enable Fault Tolerance
Systems to adapt to faults occurring in the operating environment in a holistic
manner.
The Fault Tolerance Backplane provides a common infrastructure for the Operating
System, Middleware, Libraries and Applications to exchange information related
to hardware and software failures in real time. Fault-aware components can subscribe
to be notified about one or more events of interest from other components, and
can notify other components about the faults they come across.
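The subscribe-and-notify pattern described above can be illustrated with a small in-memory sketch. This is not the FTB API: the `Backplane` class, its `subscribe` and `publish` methods, and the event and component names are all hypothetical, chosen only to show how one component's fault report reaches the components that asked for it.

```python
from typing import Callable, Dict, List

# All names here are hypothetical; this models the idea, not the FTB API.
Event = Dict[str, str]
Handler = Callable[[Event], None]


class Backplane:
    """In-memory stand-in for a shared fault-event backplane."""

    def __init__(self) -> None:
        # Maps an event name to the handlers subscribed to it.
        self._handlers: Dict[str, List[Handler]] = {}

    def subscribe(self, event_name: str, handler: Handler) -> None:
        """Register interest in one named event."""
        self._handlers.setdefault(event_name, []).append(handler)

    def publish(self, source: str, event_name: str, detail: str) -> int:
        """Notify every subscriber of event_name; return how many were notified."""
        event = {"source": source, "name": event_name, "detail": detail}
        handlers = self._handlers.get(event_name, [])
        for handler in handlers:
            handler(event)
        return len(handlers)


backplane = Backplane()
seen_by_scheduler: List[Event] = []

# A job scheduler asks to hear about link failures reported by any component.
backplane.subscribe("LINK_DOWN", seen_by_scheduler.append)

# A network monitor reports a fault it came across.
delivered = backplane.publish("network-monitor", "LINK_DOWN", "port 1")
# An event that nobody subscribed to is simply not delivered.
undelivered = backplane.publish("storage-monitor", "DISK_ERROR", "disk 7")
```

In this sketch the publisher does not need to know who is listening, which is the property that lets independent layers of the stack share failure knowledge.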
The FTB-IB component uses the FTB infrastructure to notify other FTB-enabled components about failures in the InfiniBand network. It relies on the Asynchronous Event Handler provided by the IB Verbs library, which is part of the OFED software. Applications that require notification about one or more of these events can use the FTB infrastructure to subscribe to them.
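A rough simulation of this forwarding loop is sketched below. In libibverbs, asynchronous events are fetched with the blocking call `ibv_get_async_event()` and acknowledged with `ibv_ack_async_event()`. Here a Python queue stands in for the device, so the sketch runs without InfiniBand hardware. The function `forward_async_events`, the `publish` callback, and the "error"/"info" severity split are assumptions made for illustration; the event labels are real libibverbs event type names, but how FTB-IB maps them to FTB events is not shown here.

```python
import queue
from typing import Callable, List, Optional, Tuple

# Real libibverbs async event type names, used only as string labels.
# Treating these two as failures is an assumption of this sketch.
FAILURE_EVENTS = {"IBV_EVENT_PORT_ERR", "IBV_EVENT_DEVICE_FATAL"}


def forward_async_events(
    source: "queue.Queue[Optional[str]]",
    publish: Callable[[str, str], None],
) -> int:
    """Forward events from source to publish until a None sentinel arrives.

    Returns the number of events forwarded.
    """
    forwarded = 0
    while True:
        # Stands in for the blocking ibv_get_async_event() call.
        event_type = source.get()
        if event_type is None:
            break
        severity = "error" if event_type in FAILURE_EVENTS else "info"
        # Hypothetical hand-off to the backplane.
        publish(event_type, severity)
        # A real handler would call ibv_ack_async_event() at this point.
        forwarded += 1
    return forwarded


events: "queue.Queue[Optional[str]]" = queue.Queue()
for name in ("IBV_EVENT_PORT_ERR", "IBV_EVENT_PORT_ACTIVE", None):
    events.put(name)

published: List[Tuple[str, str]] = []
count = forward_async_events(events, lambda n, s: published.append((n, s)))
```

Because the real call blocks until an event arrives, a loop of this kind would normally run in its own thread or process so that it does not stall the rest of the component.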
FTB-IB is supported on Linux systems that have the FTB Software (API Version 0.5) and the OFED Software packages installed.
For more information about installing FTB, please visit
http://www.mcs.anl.gov/research/cifts/
For more information about installing OFED, please visit
http://www.openfabrics.org
Download the latest version of FTB-IB (Version 1.0) here
FTB-IB is also available through anonymous SVN at https://mvapich.cse.ohio-state.edu/svn/ftb-ib
The list of events related to InfiniBand network failures that FTB-IB currently throws is available here
Contact:
Dhabaleswar K. Panda
2002-2008 NBCL. All rights reserved.
774 Dreese Laboratories, 2015 Neil Avenue, Columbus, OH 43210