# Network challenges and directions for the Exascale era

ExaComm workshop, ISC '18, 28 June 2018



## **General network scaling challenges**



- Cost
- Energy
- Scalable performance
- Reliability

## Performance-related challenges



- Message rate, efficiency
- All-to-all performance
- Low latency
- Offload

#### **Next generation HPC network nirvana**



- Large HPC customers (e.g., U.S. national labs) have said they desire 0.1 byte/flop or op
  - Total of ingress and egress bandwidth (0.05 + 0.05)
  - Although "fat" nodes make lower ratios acceptable
    - Surface to volume ratio decreases with memory size and computational capability
- For an exa-op system, this would be (50 + 50) PB/s total bandwidth to/from endpoints
- Is nirvana achievable?
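The bandwidth target above is a direct multiplication. A minimal sketch of that arithmetic, assuming an exa-op system means 10^18 operations per second; the function and constant names are illustrative only:

```python
EXA = 1e18   # operations per second for an exa-op system (assumed definition)
PETA = 1e15  # bytes per petabyte


def endpoint_bandwidth(ops_per_s: float, bytes_per_op: float) -> float:
    """Aggregate endpoint bandwidth in bytes/s for one direction."""
    return ops_per_s * bytes_per_op


# 0.1 byte/op in total, split evenly into ingress and egress.
ingress = endpoint_bandwidth(EXA, 0.05)
egress = endpoint_bandwidth(EXA, 0.05)

print(f"ingress: {ingress / PETA:.0f} PB/s, egress: {egress / PETA:.0f} PB/s")
```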

## **HPC** network grand challenge #1: Cost



- 15% rule
  - Supercomputer customers are typically willing to pay up to 15% of the system cost for the interconnection network
- Assume \$400M for an exaflop system: \$60M max for the interconnection network
- Assume highly-scalable, bandwidth- and cost-efficient topology:
  - Dragonfly: 2.5 bidirectional links/cables per endpoint
  - In the best case, only 0.5 links @ 50 Gb/s signaling per endpoint are optical
  - But this will increase to 1.5 links @ 100 Gb/s signaling
- Assume all network cost is from optical links (a very optimistic assumption)
  - Requires optical link cost ≤ 15¢/Gb/s for 50 Gb/s signaling
  - And ≤ 5¢/Gb/s for 100 Gb/s signaling
  - But today's optical links cost > \$1/Gb/s, resulting in \$400M for optics alone
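The cost targets can be checked with a short calculation. This is a sketch under stated assumptions, not the presenter's worksheet: it assumes 50 PB/s of aggregate endpoint bandwidth per direction, that every link runs at the endpoint injection rate, and that cost per Gb/s is counted over both directions of each optical link. That accounting is one that reproduces the figures on the slide; all names are illustrative.

```python
SYSTEM_COST_USD = 400e6       # assumed exaflop system price (from the slide)
NETWORK_SHARE = 0.15          # the "15% rule"
INGRESS_BYTES_PER_S = 50e15   # assumed aggregate one-directional endpoint bandwidth


def optical_capacity_gbps(optical_links_per_endpoint: float) -> float:
    """Total optical capacity in Gb/s, counting both directions of each link."""
    ingress_gbps = INGRESS_BYTES_PER_S * 8 / 1e9
    return 2 * optical_links_per_endpoint * ingress_gbps


def max_cents_per_gbps(optical_links_per_endpoint: float) -> float:
    """Highest optical cost (cents per Gb/s) that fits the network budget."""
    budget_usd = SYSTEM_COST_USD * NETWORK_SHARE
    return 100 * budget_usd / optical_capacity_gbps(optical_links_per_endpoint)


for links in (0.5, 1.5):
    print(f"{links} optical links/endpoint: <= {max_cents_per_gbps(links):.0f} cents per Gb/s")

# Optics bill at 1 USD per Gb/s with 0.5 optical links per endpoint.
optics_usd = 1.0 * optical_capacity_gbps(0.5)
print(f"optics at 1 USD per Gb/s: {optics_usd / 1e6:.0f}M USD")
```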

## **HPC network grand challenge #2: Energy**

[Figure not recovered; axis label: System size]



- Example: U.S. Dept. of Energy desires ≤ 30 MW for an exascale system
  - Assume 15% allocation of energy to the network: 4.5 MW
- Again assume a highly-scalable, bandwidth- and cost-efficient topology:
  - Dragonfly: 2.5 links per endpoint
  - In the best case, only 0.5 links @ 50 Gb/s signaling per endpoint are optical
  - But this will increase to 1.5 links @ 100 Gb/s signaling
- Assume switches and electrical links consume half of network power
  - It is actually much more than this today for cluster networks
  - Assume other half of energy from the optical links (2.25 MW)
  - Requires 5.6 pJ/bit for the optical links, with 50 Gb/s signaling
  - Requires 1.9 pJ/bit for optical links, with 100 Gb/s signaling
  - Difficult to achieve, but cost is a much bigger challenge
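The energy-per-bit targets follow from dividing the optical power budget by the optical bit rate. A minimal sketch, assuming 50 PB/s of aggregate endpoint bandwidth per direction and counting both directions of each optical link (an accounting that reproduces the slide's figures after rounding); names are illustrative:

```python
SYSTEM_POWER_W = 30e6          # desired exascale system power
NETWORK_SHARE = 0.15           # assumed share of energy for the network
OPTICAL_SHARE = 0.5            # assumed share of network power for optical links
INGRESS_BITS_PER_S = 50e15 * 8  # assumed one-directional endpoint bandwidth in bit/s


def optical_pj_per_bit(optical_links_per_endpoint: float) -> float:
    """Energy budget per transmitted bit on the optical links, in picojoules."""
    optical_power_w = SYSTEM_POWER_W * NETWORK_SHARE * OPTICAL_SHARE
    optical_bits_per_s = 2 * optical_links_per_endpoint * INGRESS_BITS_PER_S
    return optical_power_w / optical_bits_per_s * 1e12


for links in (0.5, 1.5):
    print(f"{links} optical links/endpoint: {optical_pj_per_bit(links):.3f} pJ/bit")
```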

#### HPC network grand challenge #3: Scalable performance



- Can the interconnect scale performance linearly?
  - Are there limits or inflection points to interconnect scaling?
- Can the system scale incrementally?
- Can the system be partitioned well?
  - Job isolation, QoS, jitter reduction
- Can messaging software scale to millions of endpoints?

## **HPC** network megatrends



- Ethernet commonality and, more generally, applicability to cloud
- Higher switch radix and move to bandwidth-scalable topologies
- Increasing emphasis on offload
- Increasing level of optics integration

#### **Ethernet commonality**



- Increasing commonality with Ethernet
- PHY and I/O macro convergence
  - Signaling rate
  - SERDES, training
  - Error detection and correction techniques
- Main motivation for interface convergence is cost savings
  - Signaling at 50 Gb/s and higher is a huge challenge
  - Shared design, verification, fabrication, and testing costs

#### Ethernet commonality: one present implication



- The chosen Ethernet path for 50+ Gb/s signaling leverages PAM4 (Pulse Amplitude Modulation with 4 signal levels)
  - Doubles the signaling rate for the same Baud rate as NRZ (Non-Return to Zero)
- However, "eyes" are much narrower, and the impact of noise is relatively more pronounced
- Will typically require Forward Error Correction (FEC) to achieve an acceptable Bit Error Rate (BER)
  - Error detection and link-level retry are insufficient
- Increases latency on every hop, for checking and correction: ?? ns
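The PAM4 trade-off can be put in numbers. A rough sketch, assuming equally spaced signal levels and the same peak-to-peak swing as NRZ; the 25 GBd symbol rate is an illustrative value, not a figure from the slides:

```python
import math


def bit_rate_gbps(baud_gbd: float, levels: int) -> float:
    """Raw line rate: symbols per second times bits per symbol."""
    return baud_gbd * math.log2(levels)


def relative_eye_height(levels: int) -> float:
    """Vertical eye opening relative to NRZ at the same peak-to-peak swing."""
    return 1.0 / (levels - 1)


BAUD_GBD = 25.0  # assumed symbol rate for illustration

nrz_rate = bit_rate_gbps(BAUD_GBD, 2)
pam4_rate = bit_rate_gbps(BAUD_GBD, 4)

print(f"NRZ:  {nrz_rate:.0f} Gb/s, eye height {relative_eye_height(2):.2f}")
print(f"PAM4: {pam4_rate:.0f} Gb/s, eye height {relative_eye_height(4):.2f}")
```

With four levels there are three stacked eyes, each about a third of the NRZ eye height, which is why noise has a relatively larger impact.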

## **Interconnect Architecture: topologies**



- Desire bandwidth-scalable topologies
- Take advantage of trend towards high-radix switches to flatten network
  - Fewer hops = reduced cost, energy, and latency
  - Disadvantages tori and similar nearest neighbor topologies
- Topology choice heavily influenced by costs and available technologies
  - Optics vs. electrical tradeoffs
  - Fraction of links that are electrical is decreasing with increasing signaling rate
  - For ≥100 Gb/s signaling, almost all links may be optical
    - Link and switch counts then become a good determiner of relative cost

#### **Topology options: Fat-tree**



- Assume K-port switches, and an L-level fat-tree
- Scales to N = 2(K/2)<sup>L</sup> endpoints (K<sup>3</sup>/4 for a 3-level fat-tree)
- Switches traversed = 2L − 1 (5 switches for a 3-level fat-tree)
- Links per endpoint = L (3 links for a 3-level fat-tree)
- Switches per endpoint (full tree) = (2L − 1)/K
- Bisection bandwidth = BN, where B is the unidirectional link bandwidth
- Partitions: integer multiples of sub-trees with the same "parents"
- Easily accommodates tapering for reduced cost at reduced bandwidth
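The fat-tree formulas above are easy to evaluate for a given switch radix. A minimal sketch, assuming an even radix K and a full (untapered) tree; the function and key names are illustrative:

```python
def fat_tree_metrics(k: int, levels: int) -> dict:
    """Evaluate the full fat-tree formulas for K-port switches and L levels."""
    return {
        "max_endpoints": 2 * (k // 2) ** levels,
        "switch_traversals": 2 * levels - 1,
        "links_per_endpoint": levels,
        "switches_per_endpoint": (2 * levels - 1) / k,
    }


for k in (36, 48, 64):
    m = fat_tree_metrics(k, 3)
    print(f"K={k}: {m['max_endpoints']} endpoints, "
          f"{m['switch_traversals']} switch traversals, "
          f"{m['switches_per_endpoint']:.3f} switches per endpoint")
```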



© 2018 IBM Corporation

ExaComm 2018

## **Topology options: 2-tier Dragonfly**

K-port Dragonfly router:

- K/4 ports to endpoints
- K/2 ports to 1<sup>st</sup> tier (local group)
- K/4 ports to 2<sup>nd</sup> tier (global links)

2 tiers, with each tier fully connected

- Scalable to K<sup>4</sup>/64 endpoints
- 4 or 6 router/switch traversals
- 3 virtual channels per class
- 2.5 links per endpoint
- 4/K switches per endpoint
- Bisection bandwidth scales as BN/2 (half of Fat-tree)
- Global bandwidth comparable to Fat-tree
- Non-interfering partition sizes only up to a full local group

Figure panels: local group (1st-tier connections); full system (local groups connected via global links). Direct path shown in green, indirect path shown in blue.
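The Dragonfly counts follow from the port split on each router. A sketch under the large-K approximation that the K<sup>4</sup>/64 formula implies (group size taken as K/2 routers and group count as (K/2)(K/4), ignoring the "+1" terms of an exact construction); names are illustrative:

```python
def dragonfly_metrics(k: int) -> dict:
    """Approximate 2-tier Dragonfly metrics for K-port routers."""
    endpoint_ports = k / 4   # ports to endpoints
    local_ports = k / 2      # ports to routers in the same local group
    global_ports = k / 4     # ports to other groups

    routers_per_group = local_ports            # approximation of local_ports + 1
    groups = routers_per_group * global_ports  # approximation, one global link per group pair
    max_endpoints = endpoint_ports * routers_per_group * groups

    # Each router-to-router link is shared by two routers, so it counts half.
    links_per_router = endpoint_ports + local_ports / 2 + global_ports / 2

    return {
        "max_endpoints": int(max_endpoints),
        "links_per_endpoint": links_per_router / endpoint_ports,
        "switch_ports_per_endpoint": k / endpoint_ports,
        "switches_per_endpoint": 1 / endpoint_ports,
    }


for k in (36, 48, 64):
    print(k, dragonfly_metrics(k))
```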



## **Dragonfly routing: partitions and indirect paths**



#### Partition AD path shown in blue, Partition BCE path shown in blue



#### Topology options: stacked full mesh



- Simultaneously discovered by IBM and Fujitsu
- Named "multi-layer full mesh" by Fujitsu
- Scalable to K<sup>3</sup>/8 endpoints
- 3 or 5 switch traversals
- 2 virtual channels per class
- 2 links per endpoint
- ~3/K switches per endpoint
- Bisection bandwidth scales as ~BN/2 (half of Fat-tree)
- Global bandwidth comparable to Fat-tree
- Many isolated partition sizes possible



Figure legend: global switches; local (TOR) switches connecting endpoints

Duplicate up to K/2 times to create K/2 Groups

C switches are really one switch, just as for D and F

A, B, and E lines omitted to avoid figure complexity


## **Topology options: stacked full mesh**





Figure from Fujitsu: http://www.fujitsu.com/global/about/resources/news/press-releases/2014/0715-02.html


#### **Topology comparison table**



| Architecture        | Approximate<br>Max Scale<br>K=36, 48, 64 | Links Per<br>Endpoint | Switch Ports<br>Per Endpoint | Switch Traversals Direct, Worst Case Indirect | Range of partition sizes | Virtual<br>Channels per<br>Traffic Class |
|---------------------|------------------------------------------|-----------------------|------------------------------|-----------------------------------------------|--------------------------|------------------------------------------|
| 2-level<br>Fat-tree | 648, 1152,<br>2048                       | 2                     | 3                            | 3, NA                                         | Good                     | 1                                        |
| 3-level<br>Fat-tree | 11664, 27648,<br>64K                     | 3                     | 5                            | 5, NA                                         | Good                     | 1                                        |
| 4-level<br>Fat-tree | 205K, 648K,<br>2M                        | 4                     | 7                            | 7, NA                                         | Good                     | 1                                        |
| Stacked full mesh   | 5184, 12288,<br>29127                    | 2                     | 3                            | 3, 5                                          | Medium                   | 2                                        |
| 2-tier<br>Dragonfly | 26244, 82944,<br>256K                    | 2.5                   | 4                            | 4, 6                                          | Only within local group  | 3                                        |

- For fat-trees, 3 levels is the sweet spot balancing scale and complexity
- Stacked full mesh attractive within its scale (about half that of a 3-level fat-tree)

#### Offload: "SuperNIC" architecture





Dual protocol:

2. Ethernet

#### **Offload: Active Communications**



- Many interesting opportunities for processor offload
  - Active messages/transactions, including remote atomic operations
  - Complex collectives
  - Efficient gather/scatter
  - Message completion handling
  - Message aggregation
  - Send/receive messages without host processor involvement
    - Direct protocol hand-off to GPUs, other accelerators, smart storage, etc.

#### Active Communication: programmable vs. hardwired



- Programmable/configurable engine advantages:
  - More flexible function
  - More robust if there are design errors or oversights
  - Can support many functions with one unit
- Hardwired:
  - For the given function: more efficient, higher performance
- And FPGAs are in the middle of this spectrum

#### **Active Communication: location, location, location**



- In (or attached to) the NIC?
  - Minimizes dependence on host architecture
  - Can upgrade network independently
- Closer to the node memory?
  - Low-latency, high-bandwidth access to host memory
  - Efficient packed gather/scatter packet transfer over NIC-host bus
- In the switches?
  - Collective support provides clear advantages in latency
  - And can provide bandwidth advantages, depending on the implementation

## Remote Memory Transaction (RMT) investigation



- RMT request = Active Message
  - Initiates program execution on receiver node
  - Updates remote memory user data
- RMT engines in/near network interface
  - Many programmable engines
  - Tiny and power efficient
  - Optimized for data movement
- Network topology agnostic
- Near memory with low-latency, high-bandwidth access to entire address space
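To make the active-message idea concrete, here is a toy model of a receiver-side engine that runs a small handler against local memory when a request arrives. It is purely illustrative: the class, field, and handler names are hypothetical and do not describe the RMT design, which the slides do not specify.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Handler = Callable[[List[int], int, int], int]


@dataclass
class RmtRequest:
    handler_id: int  # selects the program to execute on the receiver
    address: int     # word offset into the receiver's memory
    operand: int     # user data carried by the message


class RmtEngine:
    """Toy receiver-side engine: executes a registered handler on local memory."""

    def __init__(self, memory_words: int) -> None:
        self.memory: List[int] = [0] * memory_words
        self.handlers: Dict[int, Handler] = {}

    def register(self, handler_id: int, handler: Handler) -> None:
        self.handlers[handler_id] = handler

    def deliver(self, request: RmtRequest) -> int:
        """Run the requested handler; no host processor code is involved."""
        handler = self.handlers[request.handler_id]
        return handler(self.memory, request.address, request.operand)


def put(memory: List[int], address: int, operand: int) -> int:
    """Overwrite one word of remote memory."""
    memory[address] = operand
    return operand


def fetch_and_add(memory: List[int], address: int, operand: int) -> int:
    """Remote atomic: add to a word and return its previous value."""
    old = memory[address]
    memory[address] = old + operand
    return old


PUT, FETCH_ADD = 0, 1
engine = RmtEngine(memory_words=8)
engine.register(PUT, put)
engine.register(FETCH_ADD, fetch_and_add)

engine.deliver(RmtRequest(PUT, address=3, operand=5))
previous = engine.deliver(RmtRequest(FETCH_ADD, address=3, operand=2))
print(previous, engine.memory[3])
```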



## Co-packaging: Changing Approach for Building Switches





- Avoid distortion, power, & cost of ASIC-interfacing electrical links
- Move beyond chip & module pin-count limits

### Path to increased network bandwidth per link



- Data Rate
  - Ultimately limited by power
  - Can be mitigated by tight packaging
- VCSELs vs. Silicon Photonics
- Number of physical lanes
  - Increase number of fibers
  - Closely packed optical waveguides increase density
  - Multicore fiber can reduce fiber count by 4x or more
- PAM4
  - 4 signaling levels
  - Doubles signaling rate compared to NRZ
- WDM
  - CWDM with multimode possible for ~2-4 wavelengths
  - Si Photonics for >4 wavelengths (also multi-km distance)









#### Summary



- Networks are aggressively targeting future HPC and Analytics challenges
- Scalable performance
  - Technology, topologies, messaging software
- Cost: public enemy number one
  - Technology & topologies
  - Leverage commodity when possible, exploit commonality with Ethernet
- Low latency, high messaging rate, offload, collectives
  - Holistic, end-to-end design philosophy with codesigned messaging stack
  - Overlap communication with computation
  - Hardwired and programmable support in NICs, switches, and near-memory
  - Location matters: compute near data to minimize data movement

#### **CORAL: Summit compute rack InfiniBand components**





## **CORAL:** complete InfiniBand EDR network





# Thanks!

# Danke schön!

