Fault Tolerance Backplane (FTB)
Network-based Computing Laboratory
Department of Computer Science and Engineering
Ohio State University



Modern high-performance computing (HPC) systems have grown to thousands of nodes, with tens of thousands of processors and thousands of disks, and the rate of failures has increased accordingly. Fault management in such systems has traditionally been handled by individual components at different levels of the stack, with no coordination among them. As a result, components cannot benefit from knowledge of the failures experienced by others. The CiFTS (Coordinated Infrastructure for Fault-Tolerant Systems) initiative aims to provide a coordinated infrastructure that enables fault tolerance systems to adapt, in a holistic manner, to faults occurring in the operating environment.

The Fault Tolerance Backplane provides a common infrastructure for the operating system, middleware, libraries and applications to exchange information about hardware and software failures in real time. Fault-aware components can subscribe to be notified about one or more events of interest from other components, and can notify other components about the faults they come across.
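The publish/subscribe pattern described above can be illustrated with a small in-memory sketch. This is not the FTB API: the class `ToyBackplane`, its methods, and the event names `PORT_DOWN` and `FAN_SLOW` are all hypothetical and exist only to show how one component's fault report reaches another component that asked for it.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Event = Dict[str, str]  # e.g. {"name": ..., "severity": ..., "source": ...}
Handler = Callable[[Event], None]


class ToyBackplane:
    """In-memory stand-in for a fault-event backplane (hypothetical, not the FTB API)."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Handler) -> None:
        """Register a handler to be called for every event with this name."""
        self._handlers[event_name].append(handler)

    def publish(self, event: Event) -> int:
        """Deliver the event to matching subscribers; return how many were notified."""
        handlers = self._handlers.get(event["name"], [])
        for handler in handlers:
            handler(event)
        return len(handlers)


# A fault-aware library subscribes to a network fault reported by a monitor.
backplane = ToyBackplane()
received: List[Event] = []
backplane.subscribe("PORT_DOWN", received.append)

# One event has a subscriber, the other has none.
notified = backplane.publish(
    {"name": "PORT_DOWN", "severity": "fatal", "source": "network-monitor"}
)
ignored = backplane.publish(
    {"name": "FAN_SLOW", "severity": "warning", "source": "node-monitor"}
)
```

In this sketch the publisher does not need to know who is listening, which is the property that lets components at different levels of the stack share fault information without being coupled to one another.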


Fault Tolerance Backplane (Courtesy the CiFTS Team)

The OSU team is involved in three important projects along this direction:

An overview of these three projects, together with the list of events and the associated code for download, is given below.

FTB-IB

The FTB-IB component uses the FTB infrastructure to notify other FTB-enabled components about failures in the InfiniBand (IB) network. It relies on the asynchronous event handler provided by the IB Verbs library, which is part of the OFED software. Applications that require notification about one or more of these events can use the FTB infrastructure to subscribe to them.
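As a rough illustration of the translation step, the sketch below maps InfiniBand asynchronous event types to fault-event records. The keys are event-type names from the libibverbs `ibv_event_type` enum; the fault-event names, the severities, and the `translate` helper are assumptions made for this example and are not taken from the FTB-IB event list.

```python
from typing import Dict, Optional, Tuple

# Keys: event-type names from the libibverbs `ibv_event_type` enum.
# Values: (name, severity) labels that are hypothetical, not FTB-IB's actual events.
IB_TO_FAULT_EVENT: Dict[str, Tuple[str, str]] = {
    "IBV_EVENT_PORT_ERR": ("IB_PORT_DOWN", "fatal"),
    "IBV_EVENT_PORT_ACTIVE": ("IB_PORT_ACTIVE", "info"),
    "IBV_EVENT_DEVICE_FATAL": ("IB_ADAPTER_FAILURE", "fatal"),
    "IBV_EVENT_LID_CHANGE": ("IB_LID_CHANGE", "warning"),
}


def translate(ib_event_type: str, node: str) -> Optional[Dict[str, str]]:
    """Turn an IB async event type into a fault-event record, or None if unmapped."""
    mapped = IB_TO_FAULT_EVENT.get(ib_event_type)
    if mapped is None:
        return None
    name, severity = mapped
    return {"name": name, "severity": severity, "source": node}


# A port error is translated; an event type outside the table is skipped.
port_down = translate("IBV_EVENT_PORT_ERR", "node042")
unmapped = translate("IBV_EVENT_COMM_EST", "node042")
```

A real component would obtain the event type from the Verbs asynchronous event interface and hand the resulting record to the backplane; both of those steps are left out here so that the sketch runs without InfiniBand hardware.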

FTB-IB is supported on Linux systems that have the FTB Software (API Version 0.5) and the OFED Software packages installed. For more information about installing FTB, please visit http://www.mcs.anl.gov/research/cifts/. For more information about installing OFED, please visit http://www.openfabrics.org/.

Download the latest version of FTB-IB (Version 1.0) here. FTB-IB is also available through anonymous SVN at https://mvapich.cse.ohio-state.edu/svn/ftb-ib.

The list of FTB events related to InfiniBand network failures that FTB-IB currently throws is available here.

The FTB-IB presentation at SC08 can be found here.


FTB-IPMI

The Intelligent Platform Management Interface (IPMI) is a standard interface for managing a computer system independently of the operating system. The IPMI specification has been implemented by many hardware vendors. As long as the system is connected to a power source, IPMI allows out-of-band monitoring of the hardware and software status of a system and allows remote actions (such as a system reboot), even when the operating system has crashed.

FTB-IPMI is an FTB interface for the Intelligent Platform Management Interface. It is software that efficiently monitors a set of compute nodes through IPMI and publishes fault events through the FTB when a problem is detected. FTB-IPMI relies on the GNU FreeIPMI library, which implements IPMI for a wide range of hardware and provides a rich set of system sensor information. FTB-IPMI monitors the sensor information provided by the FreeIPMI library, analyzes it for severity, converts any fault information categorized as a 'warning' or a 'failure' into FTB messages, and publishes them to the FTB. These FTB events can be caught by any other FTB-aware component to detect and/or predict failures.
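The monitor, analyze, convert and publish flow described above can be sketched as a simple filter over sensor readings. The reading format, the sensor names, and the `to_fault_events` helper are hypothetical; the sketch does not call FreeIPMI or the FTB and only shows the severity filtering step.

```python
from typing import Dict, Iterable, List

# The two categories the text says are converted into FTB messages.
PUBLISHED_SEVERITIES = {"warning", "failure"}


def to_fault_events(readings: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only readings whose severity should be published, as event records."""
    events = []
    for reading in readings:
        if reading["severity"] in PUBLISHED_SEVERITIES:
            events.append(
                {
                    # Hypothetical naming scheme: SENSOR_SEVERITY in upper case.
                    "name": f"{reading['sensor']}_{reading['severity']}".upper(),
                    "severity": reading["severity"],
                    "source": reading["node"],
                }
            )
    return events


# Made-up readings for two nodes; only the non-nominal ones become events.
readings = [
    {"node": "node01", "sensor": "fan1", "severity": "nominal"},
    {"node": "node01", "sensor": "cpu_temp", "severity": "warning"},
    {"node": "node02", "sensor": "psu", "severity": "failure"},
]
events = to_fault_events(readings)
```

Publishing 'warning' events in addition to 'failure' events is what would allow a subscriber to act on a prediction before the hardware actually fails.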

Download the latest version of FTB-IPMI (Version 1.0) here.


MVAPICH2-FTB-CR

The MVAPICH2 library has supported BLCR-based Checkpoint-Restart (CR) since version 0.9.8 in 2006. Integrated FTB support for carrying out Checkpoint-Restart was introduced in version 1.4 in 2008. Enhanced support for fast Checkpoint-Restart with aggregation was introduced in MVAPICH2 1.6 in 2010.

The list of FTB events related to Checkpoint/Restart supported in this MVAPICH2 release is available here.

The latest MVAPICH2 release can be downloaded from the MVAPICH website.

Detailed guidelines for using CR support in MVAPICH2 (Basic, with FTB and with aggregation) are available here.


MVAPICH2-FTB-MIGRATION

MVAPICH2 1.6 supports a Job Pause-Migration-Restart framework for proactive fault tolerance. The framework incorporates FTB events related to Checkpoint/Restart/Migration, and achieves proactive fault tolerance by reacting to failure prediction events. When a system monitoring module predicts an imminent failure, MVAPICH2 pauses the running job, migrates processes from the node that is about to fail to a healthy spare node, and restarts the job without killing it.
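The pause, migrate and restart sequence can be sketched as follows. This is a toy model under stated assumptions, not MVAPICH2 code: the job is represented only as a hypothetical rank-to-node placement table, and `migrate_on_prediction` simply records the steps that a failure prediction would trigger.

```python
from typing import Dict, List


def migrate_on_prediction(
    placement: Dict[int, str], failing_node: str, spare_node: str
) -> List[str]:
    """Move ranks off failing_node onto spare_node; return the ordered steps taken."""
    affected = sorted(rank for rank, node in placement.items() if node == failing_node)
    if not affected:
        return []  # nothing runs on that node, so the job is left untouched

    steps = ["pause"]  # the whole job is paused, not killed
    for rank in affected:
        placement[rank] = spare_node  # update the placement table in place
        steps.append(f"migrate rank {rank}: {failing_node} -> {spare_node}")
    steps.append("restart")  # the job resumes with the new placement
    return steps


# Four ranks on two nodes; a monitor predicts that node02 will fail.
placement = {0: "node01", 1: "node01", 2: "node02", 3: "node02"}
steps = migrate_on_prediction(placement, failing_node="node02", spare_node="spare01")
```

Only the ranks on the affected node move, while the other ranks wait in the paused state; this is what distinguishes migration from a full checkpoint and restart of every process.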

Detailed guidelines for using FTB-MIGRATION support in MVAPICH2 to perform a process migration can be found here.
The list of FTB events related to Job Pause-Migration-Restart supported in this MVAPICH2 release is available here.


Publications

  • R. Rajachandrasekar, X. Besseron and D. K. Panda, Monitoring and Predicting Hardware Failures in HPC Clusters with FTB-IPMI, Int'l Workshop on System Management Techniques, Processes, and Services (SMTPS), in conjunction with Int'l Parallel and Distributed Processing Symposium (IPDPS '12), May 2012.
  • X. Ouyang, R. Rajachandrasekar, X. Besseron and D. K. Panda, High Performance Pipelined Process Migration with RDMA, The 11th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing (CCGrid 2011), May 2011. Conference Slides.
  • X. Ouyang, S. Marcarelli, R. Rajachandrasekar and D. K. Panda, RDMA-Based Job Migration Framework for MPI over InfiniBand, IEEE International Conference on Cluster Computing 2010 (Cluster '10), Sept. 2010. Conference Slides.

  • X. Ouyang, S. Marcarelli and D. K. Panda, Enhancing Checkpoint Performance with Staging IO and SSD, IEEE International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI), May 2010. Conference Slides.
  • X. Ouyang, K. Gopalakrishnan, T. Gangadharappa and D. K. Panda, Fast Checkpointing by Write Aggregation with Dynamic Buffer and Interleaving on Multicore Architecture, Int'l Conference on High Performance Computing (HiPC'09), Dec. 2009. Conference Slides.
  • X. Ouyang, K. Gopalakrishnan and D. K. Panda, Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems, Int'l Conference on Parallel Processing (ICPP'09), Sept. 2009. Conference Slides.
  • R. Gupta, P. Beckman, B.H. Park, E. Lusk, P. Hargrove, A. Geist, D. K. Panda, A. Lumsdaine and J. Dongarra, CIFTS: A Coordinated Infrastructure for Fault-Tolerant Systems, The 38th International Conference on Parallel Processing (ICPP'09), September 2009.

  • Q. Gao, W. Huang, M. Koop and D. K. Panda, Group-based Coordinated Checkpointing for MPI: A Case Study on InfiniBand, Int'l Conference on Parallel Processing, Xi'an, China, September 2007. Conference Slides.
  • Q. Gao, W. Yu, W. Huang and D. K. Panda, Application-Transparent Checkpoint/Restart for MPI Programs over InfiniBand, Int'l Conference on Parallel Processing (ICPP), August 2006. Conference Slides.

Contact: Dhabaleswar K. Panda
774 Dreese Laboratories
2015 Neil Avenue
Columbus, OH 43210

2002-2012 NBCL. All rights reserved.