Fault Tolerance Backplane - InfiniBand
Fault Tolerance Backplane (Courtesy the CiFTS Team)
The OSU team is involved in three important projects along this direction:
- FTB support over InfiniBand (FTB-IB)
- FTB support in MPI stack for Checkpoint-Restart (MVAPICH2-FTB-CR)
- FTB support in MPI stack for Job Pause-Migration-Restart (MVAPICH2-FTB-MIGRATION)
The FTB-IB component uses the FTB infrastructure to notify other FTB-enabled components about failures in the InfiniBand (IB) network. To detect these failures, FTB-IB uses the Asynchronous Event Handler provided by the IB Verbs library, which is part of the OFED software. Applications that require notification about one or more of these failure events can use the FTB infrastructure to subscribe to them.
FTB-IB is supported on Linux systems that have the FTB software (API Version 0.5) and the OFED software packages installed. For more information about installing FTB, please visit http://www.mcs.anl.gov/research/cifts/. For more information about installing OFED, please visit http://www.openfabrics.org/.
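The publish/subscribe pattern described above can be sketched in a few lines. This is only an illustration under stated assumptions: `ToyBackplane`, `on_async_event`, and the event names `IB_PORT_DOWN` and `IB_PORT_ACTIVE` are hypothetical stand-ins, not the FTB API or the IB Verbs interface, and the real event names are those in the FTB-IB event list.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical event names, chosen for illustration only.
PORT_DOWN = "IB_PORT_DOWN"
PORT_ACTIVE = "IB_PORT_ACTIVE"


class ToyBackplane:
    """In-memory stand-in for a publish/subscribe backplane (not the FTB API)."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str, dict], None]]] = defaultdict(list)

    def subscribe(self, event_name: str, callback: Callable[[str, dict], None]) -> None:
        # Register interest in one named event.
        self._subscribers[event_name].append(callback)

    def publish(self, event_name: str, payload: dict) -> int:
        """Deliver the event to every subscriber and return how many were notified."""
        callbacks = self._subscribers.get(event_name, [])
        for callback in callbacks:
            callback(event_name, payload)
        return len(callbacks)


def on_async_event(backplane: ToyBackplane, event_name: str, hca: str, port: int) -> int:
    """Stand-in for the monitor's handler: forward a network event to the backplane."""
    return backplane.publish(event_name, {"hca": hca, "port": port})


backplane = ToyBackplane()
received = []

# An application subscribes only to the event it cares about.
backplane.subscribe(PORT_DOWN, lambda name, payload: received.append((name, payload["port"])))

notified_down = on_async_event(backplane, PORT_DOWN, "hca0", 1)
notified_active = on_async_event(backplane, PORT_ACTIVE, "hca0", 1)
print(received)
```

The subscriber sees the port-down event but not the port-active event, because it never subscribed to the latter. The monitoring component does not need to know which applications are listening.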
The list of FTB events related to InfiniBand network failures that FTB-IB currently throws is available here.
The FTB-IB presentation at SC08 can be found here.
The Intelligent Platform Management Interface (IPMI) is a standard interface for managing a computer system independently of the operating system. The IPMI specification has been implemented by many hardware vendors. As long as the system is connected to a power source, IPMI allows out-of-band monitoring of the hardware and software status of a system and allows remote actions (such as system reboot), even when the operating system has crashed.
FTB-IPMI is an FTB interface for the Intelligent Platform Management Interface. It is software that efficiently monitors a set of compute nodes through IPMI and publishes fault events through the FTB when a problem is detected. FTB-IPMI relies on the GNU FreeIPMI library, which implements IPMI for a wide range of hardware and provides a rich set of system sensor information. FTB-IPMI monitors the sensor information provided by the FreeIPMI library, analyzes it for severity status, converts any fault information categorized as a "warning" or a "failure" into FTB messages, and publishes them to the FTB. These FTB events can be caught by any other FTB-aware component to detect and/or predict failures.
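The monitor, classify, and publish steps described above can be sketched as a small filter. This is a sketch under assumptions, not the FTB-IPMI implementation: the `SensorReading` type, the upper-threshold rule in `classify`, the sensor name, and the sample values are all hypothetical, and the code does not call FreeIPMI or FTB.

```python
from dataclasses import dataclass
from typing import Dict, List

# Severity ordering used by this sketch; the labels follow the text above.
SEVERITY_RANK = {"nominal": 0, "warning": 1, "failure": 2}


@dataclass
class SensorReading:
    """Hypothetical sensor sample with assumed upper thresholds."""
    node: str
    sensor: str
    value: float
    warn_at: float
    fail_at: float


def classify(reading: SensorReading) -> str:
    """Map a reading to a severity label using simple upper thresholds."""
    if reading.value >= reading.fail_at:
        return "failure"
    if reading.value >= reading.warn_at:
        return "warning"
    return "nominal"


def to_events(readings: List[SensorReading]) -> List[Dict[str, object]]:
    """Keep only 'warning' and 'failure' readings and turn them into event records."""
    events = []
    for reading in readings:
        severity = classify(reading)
        if SEVERITY_RANK[severity] >= SEVERITY_RANK["warning"]:
            events.append({
                "node": reading.node,
                "sensor": reading.sensor,
                "severity": severity,
                "value": reading.value,
            })
    return events


# Hypothetical sample data: one healthy node, one warm node, one overheating node.
readings = [
    SensorReading("node01", "cpu_temp_c", 55.0, 75.0, 90.0),
    SensorReading("node02", "cpu_temp_c", 80.0, 75.0, 90.0),
    SensorReading("node03", "cpu_temp_c", 95.0, 75.0, 90.0),
]
events = to_events(readings)
for event in events:
    print(event)
```

Only the readings for `node02` and `node03` become events. Nominal readings are dropped before anything is published, which keeps routine sensor traffic off the backplane.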
Download the latest version of FTB-IB (Version 1.0) here.
The MVAPICH2 library has supported BLCR-based Checkpoint-Restart (CR) since version 0.9.8 in 2006. Integrated FTB support for Checkpoint-Restart was introduced in version 1.4 in 2008. Enhanced support for fast Checkpoint-Restart with aggregation was introduced in MVAPICH2 1.6 in 2010.
The list of FTB events related to Checkpoint/Restart supported in this MVAPICH2 release is available here.
The latest MVAPICH2 can be downloaded from the MVAPICH Web site.
Detailed guidelines for using CR support in MVAPICH2 (Basic, with FTB and with aggregation) are available here.
MVAPICH2 1.6 now supports a Job Pause-Migration-Restart framework for proactive fault tolerance. The framework incorporates FTB events related to Checkpoint/Restart/Migration and achieves proactive fault tolerance by reacting to failure prediction events. When a system monitoring module predicts an imminent failure, MVAPICH2 pauses the running job, migrates processes from the node that is about to fail to a healthy spare node, and restarts the job without killing it.
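The pause, migrate, and restart sequence can be sketched as a small state machine. This is a toy model under assumptions, not MVAPICH2 code: `ToyJob`, its state names, and the node names are hypothetical, and no process state is actually checkpointed or moved.

```python
from typing import Dict, List


class ToyJob:
    """Toy model of a parallel job that reacts to a failure prediction."""

    def __init__(self, placement: Dict[int, str], spares: List[str]) -> None:
        self.placement = dict(placement)  # rank -> node name
        self.spares = list(spares)        # healthy spare nodes
        self.state = "running"
        self.history = ["running"]        # sequence of states, for inspection

    def _set_state(self, state: str) -> None:
        self.state = state
        self.history.append(state)

    def on_failure_predicted(self, failing_node: str) -> List[int]:
        """Pause, move every rank off the failing node, then resume.

        Returns the ranks that were moved.
        """
        ranks = sorted(r for r, node in self.placement.items() if node == failing_node)
        if not ranks:
            return []  # nothing runs on that node, so the job keeps running
        if not self.spares:
            raise RuntimeError("no healthy spare node available")
        spare = self.spares.pop(0)

        self._set_state("paused")      # the whole job stops making progress
        self._set_state("migrating")   # ranks on the failing node are moved
        for rank in ranks:
            self.placement[rank] = spare
        self._set_state("running")     # the same job continues; it was never killed
        return ranks


job = ToyJob({0: "node01", 1: "node01", 2: "node02", 3: "node02"}, ["spare01"])
moved = job.on_failure_predicted("node02")
print(moved, job.placement, job.history)
```

Ranks on healthy nodes keep their placement throughout. Only the ranks on the node predicted to fail are reassigned, and the job returns to the running state under the same identity.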
Detailed guidelines for using FTB-MIGRATION support in MVAPICH2 to perform a process migration can be found here.
The list of FTB events related to Job Pause-Migration-Restart supported in this MVAPICH2 release is available here.
Conferences & Workshops (8)