NIC-level Support for Collective Communication and Synchronization with Myrinet and Quadrics

Overview

Collective operations such as barrier, broadcast, and reduction are central to many parallel and distributed programs. An efficient barrier implementation matters because processors waiting at a barrier generally cannot perform any computation, which limits parallel speedup. The efficiency of barrier operations also affects how fine-grained a parallel computation can be. Some Network Interface Cards (NICs), such as those used with Myrinet, have programmable processors that can be customized to support collective communication.

Objectives

Current research along this direction investigates the following issues:

  • Host-Bypass NIC-Level Collective Operations on Programmable NIC
  • NIC-Based Offload of Dynamic User-Defined Modules

Description

We take on the challenge of efficiently designing and implementing NIC-level collective operations, in which the protocol is processed at the NIC rather than at the host. With the protocol at the NIC, a collective message need not be passed up to the host just so that the host can send the next message back down to the NIC. Instead, upon receiving a message, the NIC can immediately send the next one.
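The difference between the two approaches can be illustrated with a small simulation. The sketch below is not the project's implementation: it assumes a dissemination barrier (the text above does not name a specific algorithm), and the names Nic, run_nic_barrier, and host_based_crossings are hypothetical. It counts how many times a barrier crosses the host/NIC boundary on each node. In the simplified host-based model, every round costs one send handed down to the NIC and one received message handed up to the host. In the NIC-based model, the host posts the barrier once and is notified once, while the NIC's receive handler sends the next message itself.

```python
from collections import deque


def barrier_rounds(size):
    """Rounds in a dissemination barrier: ceil(log2(size)), 0 for one node."""
    return (size - 1).bit_length()


def host_based_crossings(size):
    """Assumed host-based model: per round, one send down and one receive up."""
    return 2 * barrier_rounds(size)


class Nic:
    """Hypothetical NIC that runs the barrier protocol without the host."""

    def __init__(self, rank, size):
        self.rank = rank
        self.size = size
        self.rounds = barrier_rounds(size)
        self.round = 0          # round whose incoming message we wait for
        self.arrived = set()    # rounds already received (may arrive early)
        self.host_crossings = 0
        self.done = False

    def peer(self, rnd):
        # In round k, node i sends to node (i + 2^k) mod size.
        return (self.rank + (1 << rnd)) % self.size

    def complete(self):
        self.host_crossings += 1  # single completion notice up to the host
        self.done = True

    def start(self, network):
        self.host_crossings += 1  # host posts the barrier to the NIC once
        if self.rounds == 0:
            self.complete()
        else:
            network.append((self.peer(0), 0))

    def on_receive(self, rnd, network):
        # Receive handler: advance and send the next message immediately,
        # without passing anything up to the host.
        self.arrived.add(rnd)
        while not self.done and self.round in self.arrived:
            self.round += 1
            if self.round == self.rounds:
                self.complete()
            else:
                network.append((self.peer(self.round), self.round))


def run_nic_barrier(size):
    """Simulate one barrier; all hosts post it before any message is delivered."""
    network = deque()  # in-flight messages as (destination rank, round)
    nics = [Nic(rank, size) for rank in range(size)]
    for nic in nics:
        nic.start(network)
    while network:
        dest, rnd = network.popleft()
        nics[dest].on_receive(rnd, network)
    return nics


if __name__ == "__main__":
    for n in (2, 8, 64):
        nics = run_nic_barrier(n)
        assert all(nic.done for nic in nics)
        print(n, "nodes: host-based crossings per node =",
              host_based_crossings(n),
              "| NIC-based =", nics[0].host_crossings)
```

Under these assumptions the host-based count grows with the number of rounds, while the NIC-based count stays at two per node regardless of cluster size. The model ignores timing, message loss, and NIC processing cost, so it illustrates only the structure of the protocol, not measured performance.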

Journals (1)

1 A. Wagner, D. Buntinas, R. Brightwell, and D. K. Panda, Application-Bypass Reduction for Large-Scale Clusters. International Journal of High Performance Computing and Networking, Cluster 2003 Special Issue. In Press, Dec 2003.

Conferences & Workshops (12)


M.S. Theses (2)

1 A. Wagner, Static and Dynamic Processing Offload on Myrinet Clusters with Programmable NIC Support, Jun 2004
2 A. Moody, NIC-based Reduction on Large-Scale Quadrics Clusters, Dec 2003