Fault Tolerance Backplane - InfiniBand

Overview

Modern high-performance computing (HPC) systems have grown to thousands of nodes, with tens of thousands of processors and thousands of disks, and the rate of failures has risen accordingly. Fault management in such systems has traditionally been handled by individual components at different levels of the stack, with no coordination among them. As a result, components cannot benefit from knowledge of the failures experienced by others. The CiFTS initiative aims to provide a coordinated infrastructure that enables fault tolerance systems to adapt to faults occurring in the operating environment in a holistic manner.

Description

The Fault Tolerance Backplane (FTB) provides a common infrastructure for the Operating System, Middleware, Libraries and Applications to exchange information about hardware and software failures in real time. Fault-aware components can subscribe to notifications about one or more events of interest from other components, and can notify other components about the faults they encounter.
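
The subscribe-and-notify pattern described above can be illustrated with a small in-process model. This is only a sketch of the idea and not the FTB API: the `Backplane` class, its `subscribe` and `publish` methods, and the event names `PORT_DOWN` and `LINK_UP` are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, str]
Handler = Callable[[Event], None]


class Backplane:
    """Toy in-process stand-in for a fault-event backplane (NOT the FTB API)."""

    def __init__(self) -> None:
        # Maps an event name to the handlers that asked to be notified about it.
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Handler) -> None:
        """Register interest in one named event."""
        self._subscribers[event_name].append(handler)

    def publish(self, source: str, event_name: str, payload: str = "") -> int:
        """Deliver an event to its subscribers; return how many were notified."""
        event = {"source": source, "name": event_name, "payload": payload}
        handlers = list(self._subscribers.get(event_name, []))
        for handler in handlers:
            handler(event)
        return len(handlers)


def demo() -> List[Event]:
    """A subscriber interested only in PORT_DOWN (a hypothetical event name)."""
    backplane = Backplane()
    received: List[Event] = []
    backplane.subscribe("PORT_DOWN", received.append)
    backplane.publish("network-monitor", "PORT_DOWN", "node12 port 1")
    backplane.publish("network-monitor", "LINK_UP", "node12 port 1")  # no subscriber
    return received
```

Calling `demo()` returns only the `PORT_DOWN` event, since nothing subscribed to `LINK_UP`. A real backplane would additionally route events between processes and nodes, which this single-process model leaves out.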


Fault Tolerance Backplane (Courtesy the CiFTS Team)

The OSU team is involved in three important projects along this direction:

  • FTB support over InfiniBand (FTB-IB)
  • FTB support in MPI stack for Checkpoint-Restart (MVAPICH2-FTB-CR)
  • FTB support in MPI stack for Job Pause-Migration-Restart (MVAPICH2-FTB-MIGRATION)

An overview of these three projects, together with the list of events and the associated code for download, is given below.

FTB-IB

The FTB-IB component uses the FTB infrastructure to notify other FTB-enabled components about failures in the InfiniBand (IB) network. It relies on the Asynchronous Event Handler provided by the IB Verbs library, which is part of the OFED software. Applications that require notification about one or more of these events can subscribe to them through the FTB infrastructure.

FTB-IB is supported on Linux systems that have the FTB Software (API Version 0.5) and the OFED Software packages installed. For more information about installing FTB, please visit http://www.mcs.anl.gov/research/cifts/. For more information about installing OFED, please visit http://www.openfabrics.org/.

Download the latest version of FTB-IB (Version 1.0) here. FTB-IB is also available through anonymous SVN at https://mvapich.cse.ohio-state.edu/svn/ftb-ib.

The list of FTB events related to InfiniBand network failures that FTB-IB currently throws is available here.

The FTB-IB presentation at SC08 can be found here.


FTB-IPMI

The Intelligent Platform Management Interface (IPMI) is a standard interface to manage a computer system independently of the operating system. The IPMI specification has been implemented by many hardware vendors. As long as the system is connected to a power source, IPMI allows out-of-band monitoring of the hardware and software status of a system and allows remote actions (such as system reboot) - even in cases of operating system crashes.

FTB-IPMI is an FTB interface for the Intelligent Platform Management Interface. It is software that efficiently monitors a set of compute nodes through IPMI and publishes fault events via the FTB when a problem is detected. FTB-IPMI relies on the GNU FreeIPMI library, which implements IPMI for a wide range of hardware and provides a rich set of system sensor information. FTB-IPMI monitors the sensor information provided by the FreeIPMI library, analyzes it for severity, converts any fault information categorized as a 'warning' or a 'failure' into FTB messages, and publishes them to the FTB. These FTB events can be caught by any other FTB-aware component to detect and/or predict failures.
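
The monitor, classify, and publish flow described above can be sketched as follows. This is a simplified model under stated assumptions, not FTB-IPMI code and not the FreeIPMI API: the `SensorReading` type, the numeric thresholds, and the event dictionary layout are hypothetical, and the real software derives severity from the sensor information that FreeIPMI reports.

```python
from typing import Dict, Iterable, List, NamedTuple


class SensorReading(NamedTuple):
    """Hypothetical sensor sample; thresholds are illustrative assumptions."""
    node: str
    sensor: str
    value: float
    warn_at: float
    fail_at: float


def classify(reading: SensorReading) -> str:
    """Map a reading to 'nominal', 'warning', or 'failure'."""
    if reading.value >= reading.fail_at:
        return "failure"
    if reading.value >= reading.warn_at:
        return "warning"
    return "nominal"


def to_fault_events(readings: Iterable[SensorReading]) -> List[Dict[str, str]]:
    """Keep only warning/failure readings and convert them to event records."""
    events: List[Dict[str, str]] = []
    for reading in readings:
        severity = classify(reading)
        if severity == "nominal":
            continue  # healthy readings are not published
        events.append({
            "severity": severity,
            "node": reading.node,
            "payload": f"{reading.sensor}={reading.value}",
        })
    return events
```

For example, given three CPU temperature readings of 45, 75, and 95 with a warning threshold of 70 and a failure threshold of 90, `to_fault_events` yields one 'warning' record and one 'failure' record and drops the healthy reading.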

Download the latest version of FTB-IB (Version 1.0) here.


MVAPICH2-FTB-CR

The MVAPICH2 library has supported BLCR-based Checkpoint-Restart (CR) since version 0.9.8 in 2006. Integrated FTB support for Checkpoint-Restart was introduced in version 1.4 in 2008. Enhanced support for fast Checkpoint-Restart with aggregation was introduced in MVAPICH2 1.6 in 2010.

The list of FTB events related to Checkpoint/Restart supported in this MVAPICH2 release is available here.

The latest MVAPICH2 release can be downloaded from the MVAPICH web site.

Detailed guidelines for using CR support in MVAPICH2 (Basic, with FTB and with aggregation) are available here.


MVAPICH2-FTB-MIGRATION

MVAPICH2 1.6 supports a Job Pause-Migration-Restart framework for proactive fault tolerance. The framework incorporates FTB events related to Checkpoint/Restart/Migration and achieves proactive fault tolerance by reacting to failure prediction events. When a system monitoring module predicts an imminent failure, MVAPICH2 pauses the running job, migrates processes from the node that is about to fail to a healthy spare node, and restarts the job without killing it.
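
The pause, migrate, restart sequence described above can be modeled in a few lines. This is a toy illustration of the control flow only, not MVAPICH2 code: the function name `migrate_on_prediction`, the rank-to-node `placement` mapping, and the step strings are hypothetical.

```python
from typing import Dict, List


def migrate_on_prediction(
    placement: Dict[int, str], failing_node: str, spare_node: str
) -> List[str]:
    """React to a predicted failure: pause, move ranks off the node, restart.

    `placement` maps an MPI rank to the node hosting it and is updated in
    place. Returns the ordered list of steps taken.
    """
    if spare_node in placement.values():
        raise ValueError("spare node is already hosting processes")

    steps = ["pause"]  # the job is suspended, not killed
    for rank, node in sorted(placement.items()):
        if node == failing_node:
            placement[rank] = spare_node
            steps.append(f"migrate rank {rank}: {failing_node} -> {spare_node}")
    steps.append("restart")  # the same job continues on the new placement
    return steps
```

With ranks 0 and 1 on `node1` and ranks 2 and 3 on `node2`, a predicted failure of `node2` moves ranks 2 and 3 to the spare while ranks 0 and 1 stay where they are. The sketch omits everything that makes the real operation hard, such as checkpointing process state and re-establishing network connections.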

Detailed guidelines for using FTB-MIGRATION support in MVAPICH2 to perform a process migration can be found here.
The list of FTB events related to Job Pause-Migration-Restart supported in this MVAPICH2 release is available here.
