# Exploring Emerging Memory Technologies in Extreme Scale High Performance Computing

Jeffrey S. Vetter

Presented to

ISC Third International Workshop on Communication Architectures for HPC, Big Data, Deep Learning and Clouds at Extreme Scale

Frankfurt

22 Jun 2017





# Oak Ridge National Laboratory is the DOE Office of Science's Largest Lab



# ORNL is Characterized by Diverse Scientific Portfolio




## What Does 1000x Increase Provide? Mission Impact

### Cosmology



Salman Habib Argonne National Laboratory

Habib and collaborators used their HACC code on Titan's CPU–GPU system to conduct today's largest cosmological structure simulation at the resolutions needed for modern-day galactic surveys.

K. Heitmann, 2014. arXiv.org, 1411.3396

### Combustion



Jacqueline Chen Sandia National Laboratory

Chen and collaborators for the first time performed direct numerical simulation of a jet flame burning dimethyl ether (DME) at new turbulence scales over space and time.

A. Bhagatwala, et al. 2014. *Proc. Combust. Inst.* **35**.

### Superconducting Materials



Paul Kent ORNL

Paul Kent and collaborators performed the first ab initio simulation of a cuprate. They were also the first team to validate quantum Monte Carlo simulations for high-temperature superconductor simulations.

K. Foyevtsova, et al. 2014. *Phys. Rev. X* **4** 

### Molecular Science



Michael Klein Temple University

Researchers at Procter & Gamble (P&G) and Temple University delivered a comprehensive picture in full atomistic detail of the molecular properties that drive skin barrier disruption.

M. Paloncyova, et al. 2014. *Langmuir* **30** 

C. M. MacDermaid, et al. 2014. *J. Chem. Phys.* **141** 

### **Fusion**



Chang and collaborators used the XGC1 code on Titan to obtain fundamental understanding of the divertor heat-load width physics and its dependence on the plasma current in present-day tokamak devices.

C. S. Chang, et al. 2014. Proceedings of the 25th Fusion Energy Conference, IAEA, October 13–18, 2014.



## Highlights

- Recent trends in extreme-scale HPC paint an ambiguous future
  - Contemporary systems provide evidence that power constraints are driving architectures to change rapidly (e.g., Dennard, Moore)
  - Multiple architectural dimensions are being (dramatically) redesigned: Processors, node design, memory systems, I/O
- Memory systems are changing now!
  - New devices
  - New integration
  - New configurations
  - Vast (local) capacities
- Programming systems must provide performance portability (in addition to functional portability)!!
  - We need new programming systems to effectively use these architectures
  - NVL-C
  - Papyrus(KV)
- Changes in memory systems will alter communication and storage requirements dramatically



# Major Trends in Computing



# Sixth Wave of Computing







## Contemporary devices are approaching fundamental limits



Dennard scaling has already ended. Dennard observed that voltage and current should be proportional to the linear dimensions of a transistor: 2x transistor count implies 40% faster and 50% more efficient.

R.H. Dennard, F.H. Gaensslen, V.L. Rideout, E. Bassous, and A.R. LeBlanc, "Design of ion-implanted MOSFET's with very small physical dimensions," *IEEE Journal of Solid-State Circuits*, 9(5):256-68, 1974.
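The arithmetic behind the "2x transistor count" figures can be sketched as follows. This is a minimal illustration of ideal constant-field scaling, not code from the cited paper: the function name `dennard_scaling` and its dictionary keys are hypothetical. It assumes linear dimensions, voltage, and current all shrink by 1/k, so density grows by k², switching frequency by k, and power per transistor falls by 1/k². Reading "50% more efficient" as halved power per transistor is an assumption.

```python
import math

def dennard_scaling(k: float) -> dict:
    """Ideal constant-field scaling for a linear shrink factor k > 1."""
    density = k ** 2                  # transistors per unit area
    frequency = k                     # gate delay shrinks by 1/k
    power_per_transistor = 1 / k ** 2  # V * I, each scaled by 1/k
    return {
        "density": density,
        "frequency": frequency,
        "power_per_transistor": power_per_transistor,
        # Power density stays constant: k^2 transistors, each using 1/k^2 power.
        "power_density": density * power_per_transistor,
    }

# Doubling the transistor count corresponds to k = sqrt(2).
s = dennard_scaling(math.sqrt(2))
print(f"density x{s['density']:.2f}, frequency x{s['frequency']:.2f}, "
      f"power/transistor x{s['power_per_transistor']:.2f}, "
      f"power density x{s['power_density']:.2f}")
```

With k = √2 the sketch gives roughly 1.4x frequency and 0.5x power per transistor at constant power density. The end of Dennard scaling means the last term no longer holds.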



Figure 1 | As a metal oxide-semiconductor field effect transistor (MOSFET) shrinks, the gate dielectric (yellow) thickness approaches several atoms (0.5 nm at the 22-nm technology node). Atomic spacing limits the



Figure 2 | As a MOSFET transistor shrinks, the shape of its electric field departs from basic rectilinear models, and the level curves become disconnected. Atomic-level manufacturing variations, especially for dopant



# Semiconductors are taking longer and cost more to design and produce



### Intel's 'Tick-Tock' Seemingly Dead, Becomes 'Process-Architecture-Optimization'



by Ian Cutress on March 22, 2016 6:45 PM EST




As reported at The Motley Fool, Intel's latest 10-K / annual report filing would seem to suggest that the 'Tick-Tock' strategy of introducing a new lithographic process node in one product cycle (a 'tick') and then an upgraded microarchitecture the next product cycle (a 'tock') is going to fall by the wayside for the next two lithographic nodes at a minimum, to be replaced with a three element cycle known as 'Process-Architecture-Optimization'.

Intel's Tick-Tock strategy has been the bedrock of their microprocessor dominance of the last decade. Throughout the tenure, every other year Intel would upgrade their fabrication plants to be able to produce processors with a smaller feature set, improving die area, power consumption, and slight optimizations of the microarchitecture, and in the years between the upgrades would launch a new set of processors based on a wholly new (sometimes paradigm shifting) microarchitecture for large performance upgrades. However, due to the difficulty of implementing a 'tick', the ever decreasing process node size and complexity therein, as reported previously with 14nm and the introduction of Kaby Lake, Intel's latest filing would suggest that 10nm will follow a similar pattern as 14nm by introducing a third stage to the cadence.





# Semiconductor business is highly process-oriented, optimized, and growing extremely capital intensive

### Semi industry fab costs limit industry growth

Nicolas Mokhoff 10/3/2012 03:00 PM EDT











MANHASSET, N.Y. -- The fundamental economics of the semiconductor industry may start changing sooner rather than later, according to market research firm Gartner Inc.

The costs of staying at the leading edge in semiconductor manufacturing are rising. Semiconductor manufacturers need to plan on equipment costs increasing at about 15 percent for each new node, according to Gartner (Stamford, Conn.).

It's possible that 450-mm manufacturing will achieve the goal of 35 percent cost reduction. But that equates to only three or four years of increasing equipment costs, and consequently, delays the inevitable, Gartner said. It is also possible that new technologies will emerge that will slow the rate of cost increases, according to the firm.

According to Gartner, the costs of manufacturing equipment needed for leading-edge semiconductor manufacturing are increasing at a rate between 7 percent and 10 percent per year, depending on the basic process.
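The "three or four years" figure can be checked with a short compounding sketch. It assumes one particular reading of the article: that a 35 percent saving is cancelled once compound equipment-cost growth adds 35 percent back. Gartner's actual model is not given in the excerpt, and the function name `years_to_offset` is hypothetical.

```python
import math

def years_to_offset(saving: float, annual_growth: float) -> float:
    """Years of compound growth needed for costs to rise by `saving`."""
    return math.log(1 + saving) / math.log(1 + annual_growth)

# Growth rates quoted in the article: 7 to 10 percent per year.
for rate in (0.07, 0.10):
    print(f"{rate:.0%} per year: {years_to_offset(0.35, rate):.1f} years")
```

Under this reading, the quoted growth rates give a window of a little over three to a little over four years, which is in line with the article's statement.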

The Gartner report predicts that, at current spending rates, only eight companies could afford to build fabs in the next few years.

By 2020, current cost trends will lead to an average cost of between \$15 billion and \$20 billion for a leading-edge fab, according to the report. By 2016, the minimum capital expenditure budget needed to justify the building of a new fab will range from \$8 billion to \$10 billion for logic, \$3.5 billion to \$4.5 billion for DRAM and \$6 billion to \$7 billion for NAND flash, according to the report.

Context: Intel Reports Full-Year Revenue of \$55.4 Billion, Net Income of \$11.4 Billion (Intel SEC Filing for FY2015)

#### Major 2013 IC Foundries (Pure-Play and IDM)

| 2013<br>Rank | 2012<br>Rank | Company          | Foundry<br>Type | Location    | 2011 Sales<br>(\$M) | 2012 Sales<br>(\$M) | 2012/2011<br>Change (%) | 2013 Sales<br>(\$M) | 2013/2012<br>Change (%) |
|--------------|--------------|------------------|-----------------|-------------|---------------------|---------------------|-------------------------|---------------------|-------------------------|
| 1            | 1            | TSMC             | Pure-Play       | Taiwan      | 14,299              | 16,951              | 19%                     | 19,850              | 17%                     |
| 2            | 2            | GlobalFoundries  | Pure-Play       | U.S.        | 3,195               | 4,013               | 26%                     | 4,261               | 6%                      |
| 3            | 3            | UMC              | Pure-Play       | Taiwan      | 3,760               | 3,730               | -1%                     | 3,959               | 6%                      |
| 4            | 4            | Samsung          | IDM             | South Korea | 2,192               | 3,439               | 57%                     | 3,950               | 15%                     |
| 5            | 5            | SMIC*            | Pure-Play       | China       | 1,320               | 1,542               | 17%                     | 1,973               | 28%                     |
| 6            | 8            | Powerchip**      | Pure-Play       | Taiwan      | 374                 | 625                 | 67%                     | 1,175               | 88%                     |
| 7            | 9            | Vanguard         | Pure-Play       | Taiwan      | 520                 | 582                 | 12%                     | 713                 | 23%                     |
| 8            | 6            | Huahong Grace*** | Pure-Play       | China       | 619                 | 677                 | 9%                      | 710                 | 5%                      |
| 9            | 10           | Dongbu           | Pure-Play       | South Korea | 500                 | 540                 | 8%                      | 570                 | 6%                      |
| 10           | 7            | TowerJazz        | Pure-Play       | Israel      | 611                 | 639                 | 5%                      | 509                 | -20%                    |
| 11           | 11           | IBM              | IDM             | U.S.        | 420                 | 432                 | 3%                      | 485                 | 12%                     |
| 12           | 12           | MagnaChip        | IDM             | South Korea | 350                 | 400                 | 14%                     | 411                 | 3%                      |
| 13           | 13           | WIN              | Pure-Play       | Taiwan      | 304                 | 381                 | 25%                     | 354                 | -7%                     |
| _            | _            | Top 13 Total     |                 | _           | 28,464              | 33,951              | 19%                     | 38,920              | 15%                     |
|              |              | Top 13 Share     |                 |             | 89%                 | 90%                 |                         | 91%                 |                         |
|              |              | Other Foundry    |                 |             | 3,446               | 3,669               | 6%                      | 3,920               | 7%                      |
| _            | _            | Total Foundry    | _               | _           | 31,910              | 37,620              | 18%                     | 42,840              | 14%                     |

\*Does not include Wuhan Xinxin (now XMC) for 2012 or 2013.

\*\*Powerchip transitioned from an IDM foundry to a pure-play foundry in 2013.

\*\*\*Hua Hong NEC and Grace merged in 2012 (excludes Shanghai Huali).

## Business climate reflects this uncertainty, cost, complexity, consolidation



ARM Holdings accepted a company into a new, Saudi-backed \$100bn investment fund.

is extending to roughly 2.5 to 3 years.

acquired in 2015, which could at present mean a charge against profit reaching \$4 billion

### End of Moore's Law

- Device level physics will prevent much smaller feature size of current transistor technologies
- Business trends indicate asymptotic limits of both manufacturing capability and economics

- What now?







# Sixth Wave of Computing





### **Our Transition Period Predictions**

#### Optimize Software and Expose New Hierarchical Parallelism

- Redesign software to boost performance on upcoming architectures
- Exploit new levels of parallelism and efficient data movement

#### Architectural Specialization and Integration

- Use CMOS more efficiently for our workloads
- Integrate components to boost performance and eliminate inefficiencies

#### Emerging Technologies

- Investigate new computational paradigms
  - Quantum
  - Neuromorphic
  - Advanced Digital
  - Emerging Memory Devices





## Architectural specialization will accelerate

- Vendors, lacking Moore's Law, will need to continue to differentiate products (to stay in business)
- Assume that the advantage from better CMOS processes stalls
- Use the same transistors differently to enhance performance
- Architectural design will become extremely important, critical
  - Dark Silicon
  - Address new parameters for benefits/curse of Moore's Law



https://www.thebroadcastbridge.com/content/entry/1094/altera-announces-arria-10-2666mbps-ddr4-memory-fpga-interface

## Intel's Nervana AI platform takes aim at Nvidia's GPU technology

Firm claims Xeon-based chips will deliver a '100-fold increase' in deep learning performance



**CHIPMAKER** Intel has set out its plans for artificial intelligence (AI) and claimed that it will reduce the time to train a deep learning model by up to 100 times within the next three years.

At the forefront of the firm's AI ambitions is the Intel Nervana platform, which was announced on Thursday following Intel's acquisition of deep learning startup Nervana Systems earlier this year.

http://www.theinquirer.net/inquirer/news/2477796/intels-nervana-ai-platform-takes-aim-at-nvidias-gpu-techology





https://fossbytes.com/nvidia-volta-cddr6,2018/RIDGE

GOOGLE BUILT ITS VERY OWN CHIPS TO POWER ITS AI BOTS




GOOGLE HAS DESIGNED its own computer chip for driving deep neural networks, an AI technology that is reinventing the way Internet services operate.

This morning at Google I/O, the centerpiece of the company's year, CEO Sundar Pichai said that Google has designed an ASIC, or application-specific integrated circuit, that's specific to deep neural nets. These are networks of

http://www.wired.com/2016/05/google-tpu-custom-chips/

## Tighter integration and manufacturing of components will provide some benefits: components with different processes, functionality; local bandwidth





http://www.anandtech.com/show/9969/jedec-publishes-hbm2-specification





# Exploration of emerging technologies is different this time – I promise

- Three decades of alternative technologies have fallen victim to 'curse of Moore's law': general CPU performance improvements without any software changes
  - Weitek Floating Point accelerator (circa 1988)
  - Many other types of processors, e.g., ClearSpeed
  - FPGAs
- Some of these technologies found a specific market to serve
  - But most failed
- Now, the context and parameters have changed!



https://micro.magnet.fsu.edu/optics/olympusmicd/galleries/chips/weitekmathmedium.html



http://www.clearspeed.com



## Transition Period will be Disruptive

- New devices and architectures may not be hidden in traditional levels of abstraction
  - A new type of CNT transistor may be completely hidden from higher levels
  - A new paradigm like quantum may require new architectures, programming models, and algorithmic approaches

- Solutions need a co-design framework to evaluate and mature specific technologies

| Layer       | Switch, 3D | NVM | Approximate | Neuro | Quantum |
|-------------|------------|-----|-------------|-------|---------|
| Application | 1          | 1   | 2           | 2     | 3       |
| Algorithm   | 1          | 1   | 2           | 3     | 3       |
| Language    | 1          | 2   | 2           | 3     | 3       |
| API         | 1          | 2   | 2           | 3     | 3       |
| Arch        | 1          | 2   | 2           | 3     | 3       |
| ISA         | 1          | 2   | 2           | 3     | 3       |
| Microarch   | 2          | 3   | 2           | 3     | 3       |
| FU          | 2          | 3   | 2           | 3     | 3       |
| Logic       | 3          | 3   | 2           | 3     | 3       |
| Device      | 3          | 3   | 2           | 3     | 3       |

Adapted from IEEE Rebooting Computing Chart



# Quantum computing: Qubit design and fabrication have made recent progress

Science 354, 1091 (2016) - 2 December

#### A bit of the action

In the race to build a quantum computer, companies are pursuing many types of quantum bits, or qubits, each with its own strengths and weaknesses.

|                     | Superconducting loops | Trapped ions | Silicon quantum dots | Topological qubits | Diamond vacancies |
|---------------------|-----------------------|--------------|----------------------|--------------------|-------------------|
| How it works        | A resistance-free current oscillates back and forth around a circuit loop. An injected microwave signal excites the current into superposition states. | Electrically charged atoms, or ions, have quantum energies that depend on the location of electrons. Tuned lasers cool and trap the ions, and put them in superposition states. | These "artificial atoms" are made by adding an electron to a small piece of pure silicon. Microwaves control the electron's quantum state. | Quasiparticles can be seen in the behavior of electrons channeled through semiconductor structures. Their braided paths can encode quantum information. | A nitrogen atom and a vacancy add an electron to a diamond lattice. Its quantum spin state, along with those of nearby carbon nuclei, can be controlled with light. |
| Longevity (seconds) | 0.00005               | >1000        |                      |                    |                   |
| Logic success rate  |                       | 99.9%        |                      |                    | 99.2%             |
| Number entangled    |                       | 14           |                      | N/A                |                   |
| Company support     | Google, IBM, Quantum Circuits |      |                      | Microsoft          | Quantum Diamond Technologies |
| Pros                | Fast working. Build on existing semiconductor industry. | Very stable. Highest achieved gate fidelities. | Stable. Build on existing semiconductor industry. | Greatly reduce errors. | Can operate at room temperature. |
| Cons                | Collapse easily and must be kept cold. | Slow operation. Many lasers are needed. | Only a few entangled. Must be kept cold. | Existence not yet confirmed. | Difficult to entangle. |

Note: Longevity is the record coherence time for a single qubit superposition state, logic success rate is the highest reported gate fidelity for logic operations on two qubits, and number entangled is the maximum number of qubits entangled and capable of performing two-qubit operations.

- Technological progress
  - Demonstrated qubits
- IBM, Google, Microsoft, Intel, D-Wave, etc. investing in quantum
  - IBM has 17-qubit quantum computer on cloud
    - https://www.research.ibm.com/ibmq/
- DOE has released two RFPs for quantum computing


# New Memory Devices and Systems



## **New Memory Devices and Systems**

- HMC, HBM/2/3, LPDDR4, GDDR5X, WIDEIO2, etc.
- 2.5D, 3D stacking
- New devices (ReRAM, PCRAM, STT-MRAM, 3D XPoint)
- Configuration diversity
  - Fused, shared memory
  - Scratchpads
  - Write through, write back, etc.
  - Consistency and coherence protocols
  - Virtual vs. physical, paging strategies



Copyright (c) 2014 Hiroshige Goto All rights reserved.

http://gigglehd.com/zbxe/files/attach/images/1404665/988/406/011/788d3ba1967e2db3817d259d2e83c88e\_1.jpg



https://www.micron.com/~/media/track-2-images/content-images/content image hmc.ipg?la=en

|                             | SRAM             | DRAM             | eDRAM            | 2D NAND<br>Flash                 | 3D NAND<br>Flash                 | PCRAM                             | STTRAM           | 2D ReRAM                          | 3D ReRAM                          |
|-----------------------------|------------------|------------------|------------------|----------------------------------|----------------------------------|-----------------------------------|------------------|-----------------------------------|-----------------------------------|
| Data Retention              | N                | N                | N                | Y                                | Y                                | Y                                 | Y                | Y                                 | Y                                 |
| Cell Size (F<sup>2</sup>)   | 50-200           | 4-6              | 19-26            | 2-5                              | <1                               | 4-10                              | 8-40             | 4                                 | <1                                |
| Minimum F demonstrated (nm) | 14               | 25               | 22               | 16                               | 64                               | 20                                | 28               | 27                                | 24                                |
| Read Time (ns)              | < 1              | 30               | 5                | 10<sup>4</sup>                   | 10<sup>4</sup>                   | 10-50                             | 3-10             | 10-50                             | 10-50                             |
| Write Time (ns)             | < 1              | 50               | 5                | 10<sup>5</sup>                   | 10<sup>5</sup>                   | 100-300                           | 3-10             | 10-50                             | 10-50                             |
| Number of Rewrites          | 10<sup>16</sup>  | 10<sup>16</sup>  | 10<sup>16</sup>  | 10<sup>4</sup>-10<sup>5</sup>    | 10<sup>4</sup>-10<sup>5</sup>    | 10<sup>6</sup>-10<sup>10</sup>    | 10<sup>15</sup>  | 10<sup>8</sup>-10<sup>12</sup>    | 10<sup>8</sup>-10<sup>12</sup>    |
| Read Power                  | Low              | Low              | Low              | High                             | High                             | Low                               | Medium           | Medium                            | Medium                            |
| Write Power                 | Low              | Low              | Low              | High                             | High                             | High                              | Medium           | Medium                            | Medium                            |
| Power (other than R/W)      | Leakage          | Refresh          | Refresh          | None                             | None                             | None                              | None             | Sneak                             | Sneak                             |
| Maturity                    |                  |                  |                  |                                  |                                  |                                   |                  |                                   |                                   |

J.S. Vetter and S. Mittal, "Opportunities for Nonvolatile Memory Systems in Extreme-Scale High Performance Computing," CISE, 17(2):73-82, 2015.



Fig. 4. (a) A typical 1T1R structure of RRAM with HfO<sub>x</sub>; (b) HR-TEM image of the TiN/Ti/HfO<sub>x</sub>/TiN stacked layer; the thickness of the HfO<sub>2</sub> is 20 nm.

H.S.P. Wong, H.Y. Lee, S. Yu et al., "Metal-oxide RRAM," Proceedings of the IEEE, 100(6):1951-70, 2012.



# NVRAM Technology Continues to Improve – Driven by Market Forces




Original URL: http://www.theregister.co.uk/2013/11/01/hp memristor 2018/

#### HP 100TB Memristor drives by 2018 - if you're lucky, admits tech titan

Universal memory slow in coming

Blocks and Files HP has warned El Reg not to get its hopes up too high after the tech titan's CTO Martin Fink suggested StoreServ arrays could be packed with 100TB Memristor drives come 2018.

In five years, according to Fink, DRAM and NAND scaling will hit a wall, limiting the maximum capacity of the technologies: process shrinks will come to a shuddering halt when the memories' reliability drops off a cliff as a side effect of reducing the size of electronics on the silicon dies.

The HP answer to this scaling wall is Memristor, its flavour of resistive RAM technology that is supposed to have DRAM-like speed and better-than-NAND storage density. Fink claimed at an HP Discover event in Las Vegas that Memristor devices will be ready by the time flash NAND hits its limit in five years. He also showed off a Memristor wafer, adding that it could have a 1.5PB capacity by the end of the decade.

#### Samsung Debuts 3D XPoint Killer

3D NAND variant stakes out high-end SSDs

Rick Merritt 8/11/2016 00:01 AM EDT

SANTA CLARA, Calif. - Samsung lobbed a new variant of its 3D NAND flash into the gap Intel and Micron hope to fill with their emerging 3D XPoint memory. The news came one day after Micron showed at the Flash Memory Summit performance figures for its version of the XPoint solid-state drives (SSDs) under a new Quantx

Samsung announced plans for what it called Z-NAND chips that will power SSDs with similar performance but lower costs and risk than the 3D XPoint drives. However, it was secretive about the details of the technology that will appear in products sometime next year.

By contrast, a Micron engineer leading its XPoint SSD program was surprisingly candid in an interview with EE Times. She described current prototypes using early XPoint chips and an FPGA-based controller for the SSDs expected to ship in about a year.

Samsung's Z-NAND will deliver 10x faster reads than multi-level cell flash and writes that are twice as fast, the company said. At the drive level, they will support both reads and writes at about 20 microseconds, suggesting some of the write performance comes from an enhanced controller.

# Comparison of Emerging Memory Technologies

|                             | SRAM             | DRAM             | eDRAM            | 2D NAND<br>Flash                 | 3D NAND<br>Flash                 | PCRAM                             | STTRAM           | 2D ReRAM                          | 3D ReRAM                          |
|-----------------------------|------------------|------------------|------------------|----------------------------------|----------------------------------|-----------------------------------|------------------|-----------------------------------|-----------------------------------|
| Status                      | Deployed         | Deployed         | Deployed         | Deployed                         | Deployed                         | Experimental                      | Experimental     | Experimental                      | Experimental                      |
| Data Retention              | N                | N                | N                | Y                                | Y                                | Y                                 | Y                | Y                                 | Y                                 |
| Cell Size (F<sup>2</sup>)   | 50-200           | 4-6              | 19-26            | 2-5                              | <1                               | 4-10                              | 8-40             | 4                                 | <1                                |
| Minimum F demonstrated (nm) | 14               | 25               | 22               | 16                               | 64                               | 20                                | 28               | 27                                | 24                                |
| Read Time (ns)              | < 1              | 30               | 5                | 10<sup>4</sup>                   | 10<sup>4</sup>                   | 10-50                             | 3-10             | 10-50                             | 10-50                             |
| Write Time (ns)             | < 1              | 50               | 5                | 10<sup>5</sup>                   | 10<sup>5</sup>                   | 100-300                           | 3-10             | 10-50                             | 10-50                             |
| Number of Rewrites          | 10<sup>16</sup>  | 10<sup>16</sup>  | 10<sup>16</sup>  | 10<sup>4</sup>-10<sup>5</sup>    | 10<sup>4</sup>-10<sup>5</sup>    | 10<sup>8</sup>-10<sup>10</sup>    | 10<sup>15</sup>  | 10<sup>8</sup>-10<sup>12</sup>    | 10<sup>8</sup>-10<sup>12</sup>    |
| Read Power                  | Low              | Low              | Low              | High                             | High                             | Low                               | Medium           | Medium                            | Medium                            |
| Write Power                 | Low              | Low              | Low              | High                             | High                             | High                              | Medium           | Medium                            | Medium                            |
| Power (other than R/W)      | Leakage          | Refresh          | Refresh          | None                             | None                             | None                              | None             | Sneak                             | Sneak                             |
| Maturity                    |                  |                  |                  |                                  |                                  |                                   |                  |                                   |                                   |

Intel/Micron Xpoint? Samsung Z-NAND?
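One way to read the write-time and rewrite rows together is to ask how long a single cell would last if it were rewritten back to back. The sketch below is a rough, illustrative bound only: it pairs the fastest listed write time with the highest listed endurance for a few columns, and it ignores wear leveling, caching, and controller behavior. The names `CELLS` and `seconds_to_wear_out` are hypothetical.

```python
# (write time in ns, rewrite endurance), taken from the table above:
# fastest listed write time, highest listed endurance.
CELLS = {
    "DRAM": (50, 1e16),
    "3D NAND Flash": (1e5, 1e5),
    "PCRAM": (100, 1e10),
    "STTRAM": (3, 1e15),
    "3D ReRAM": (10, 1e12),
}

def seconds_to_wear_out(write_ns: float, rewrites: float) -> float:
    """Time to exhaust one cell if it is rewritten continuously."""
    return write_ns * 1e-9 * rewrites

for name, (write_ns, rewrites) in CELLS.items():
    print(f"{name:14s} {seconds_to_wear_out(write_ns, rewrites):.3g} s")
```

Even under these optimistic assumptions, the nonvolatile devices wear out a hot cell orders of magnitude sooner than DRAM, which is one reason memory systems built on them typically spread writes across cells.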



# Aggressively-pursued NVM Technologies Continue to Improve



Figure 3: Average bit density



Figure 11: Sequential read BW



# DOE HPC Architectures Are a Microcosm Reflecting These Trends



## Projections: Exascale architecture targets circa 2009

2009 Exascale Challenges Workshop in San Diego

### Attendees envisioned two possible architectural swim lanes:

1. Homogeneous many-core thin-node system
2. Heterogeneous (accelerator + CPU) fat-node system

| System attributes    | 2009     | "Pre-Exascale"      | "Exascale"             |
|----------------------|----------|---------------------|------------------------|
| System peak          | 2 PF     | 100-200 PF/s        | 1 Exaflop/s            |
| Power                | 6 MW     | 15 MW               | 20 MW                  |
| System memory        | 0.3 PB   | 5 PB                | 32-64 PB               |
| Storage              | 15 PB    | 150 PB              | 500 PB                 |
| Node performance     | 125 GF   | 0.5 TF / 7 TF       | 1 TF / 10 TF           |
| Node memory BW       | 25 GB/s  | 0.1 TB/s / 1 TB/s   | 0.4 TB/s / 4 TB/s      |
| Node concurrency     | 12       | O(100) / O(1,000)   | O(1,000) / O(10,000)   |
| System size (nodes)  | 18,700   | 500,000 / 50,000    | 1,000,000 / 100,000    |
| Node interconnect BW | 1.5 GB/s | 150 GB/s / 1 TB/s   | 250 GB/s / 2 TB/s      |
| IO Bandwidth         | 0.2 TB/s | 10 TB/s             | 30-60 TB/s             |
| MTTI                 | day      | O(1 day)            | O(0.1 day)             |
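Two derived ratios make the targets in this table easier to compare: energy efficiency (flop/s per watt) and memory balance (bytes of system memory per flop/s). The sketch below computes both for the 2009 and "Exascale" columns. The function names are hypothetical, and decimal prefixes (1 PB = 10^15 bytes) are an assumption.

```python
def gflops_per_watt(peak_flops: float, power_watts: float) -> float:
    """Energy efficiency in GF/s per watt."""
    return peak_flops / power_watts / 1e9

def bytes_per_flop(memory_bytes: float, peak_flops: float) -> float:
    """System memory capacity per unit of peak compute."""
    return memory_bytes / peak_flops

# Values from the 2009 and "Exascale" columns of the table.
systems = {
    "2009": {"peak": 2e15, "power": 6e6, "memory": 0.3e15},
    "Exascale (32 PB)": {"peak": 1e18, "power": 20e6, "memory": 32e15},
    "Exascale (64 PB)": {"peak": 1e18, "power": 20e6, "memory": 64e15},
}

for name, s in systems.items():
    print(f"{name:17s} "
          f"{gflops_per_watt(s['peak'], s['power']):6.2f} GF/W  "
          f"{bytes_per_flop(s['memory'], s['peak']):.3f} B/flop")
```

Under these assumptions the exascale target calls for roughly a 150x improvement in energy efficiency over the 2009 system, while memory capacity per flop falls by a factor of about two to five.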



# Architectural specialization and integration will create very complex platforms – no two systems alike

- Architectural specialization and integration will create very complex platforms
  - Heterogeneous computing
  - Deep memory hierarchies incl NVM
  - Plateauing I/O forces app redesign
- Implications
  - DOE and companies will need to understand design tradeoffs
  - Programming systems will need to be portable
  - Performance portability will be a stretch goal

| System attributes       | NERSC<br>Now                                  | OLCF<br>Now                            | ALCF<br>Now                   | NERSC<br>Upgrade                                                         | OLCF<br>Upgrade                                                                  | ALCF<br>Upgrade                                        | ALCF<br>Upgrade                                                             |
|-------------------------|-----------------------------------------------|----------------------------------------|-------------------------------|--------------------------------------------------------------------------|----------------------------------------------------------------------------------|--------------------------------------------------------|-----------------------------------------------------------------------------|
| Planned<br>Installation | Edison                                        | TITAN                                  | MIRA                          | Cori<br>2016                                                             | Summit<br>2017-2018                                                              | Theta<br>2016                                          | Aurora<br>2018-2019                                                         |
| System peak (PF)        | 2.6                                           | 27                                     | 10                            | > 30                                                                     | 150                                                                              | >8.5                                                   | 180                                                                         |
| Peak Power (MW)         | 2                                             | 9                                      | 4.8                           | < 3.7                                                                    | 10                                                                               | 1.7                                                    | 13                                                                          |
| Total system memory     | 357 TB                                        | 710TB                                  | 768TB                         | ~1 PB DDR4 + High Bandwidth Memory (HBM)+1.5PB persistent memory         | > 1.74 PB<br>DDR4 +<br>HBM + <del>2.8</del><br>3.7 ~7 PB<br>persistent<br>memory | >480 TB DDR4<br>+ High<br>Bandwidth<br>Memory (HBM)    | > 7 PB High Bandwidth On- Package Memory Local Memory and Persistent Memory |
| Node performance (TF)   | 0.460                                         | 1.452                                  | 0.204                         | > 3                                                                      | > 40                                                                             | > 3                                                    | > 17 times Mira                                                             |
| Node processors         | Intel<br>Ivy<br>Bridge                        | AMD<br>Opter<br>on<br>Nvidia<br>Kepler | 64-bit<br>PowerP<br>C A2      | Intel Knights Landing many core CPUs Intel Haswell CPU in data partition | Multiple IBM<br>Power9<br>CPUs &<br>multiple<br>Nvidia Voltas<br>GPUS            | Intel Knights<br>Landing Xeon<br>Phi many core<br>CPUs | Knights Hill<br>Xeon Phi many<br>core CPUs                                  |
| System size (nodes)     | 5,600<br>nodes                                | 18,68<br>8<br>nodes                    | 49,152                        | 9,300 nodes<br>1,900 nodes in<br>data partition                          | ~3,500<br>nodes                                                                  | >2,500 nodes                                           | >50,000 nodes                                                               |
| System<br>Interconnect  | Aries                                         | Gemin<br>i                             | 5D<br>Torus                   | Aries                                                                    | Dual Rail<br>EDR-IB                                                              | Aries                                                  | 2 <sup>nd</sup> Generation<br>Intel Omni-Path<br>Architecture               |
| File System             | 7.6 PB<br>168<br>GB/s,<br>Lustre <sup>®</sup> | 32 PB<br>1<br>TB/s,<br>Lugtre          | 26 PB<br>300<br>GB/s<br>GPFS™ | 28 PB<br>744 GB/s<br>Lustre <sup>®</sup>                                 | 120 PB<br>1 TB/s<br>GPFS™                                                        | 10PB, 210<br>GB/s Lustre<br>initial                    | 150 PB<br>1 TB/s<br>Lustre <sup>®</sup>                                     |
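
As a rough consistency check on the table, the listed system peak for the three "Now" systems should be close to node count times node performance. The sketch below does that arithmetic for Edison, Titan, and Mira using only the figures in the table; the function names are illustrative and not part of any tool mentioned in the talk.

```c
#include <assert.h>
#include <stdio.h>

/* Peak performance in PF implied by node count and per-node TF
   (1 PF = 1000 TF). */
double implied_peak_pf(long nodes, double node_tf) {
    assert(nodes > 0 && node_tf > 0.0);
    return (double)nodes * node_tf / 1000.0;
}

/* Print implied vs. listed peak for the three "Now" systems. */
void print_peak_check(void) {
    const char *name[] = {"Edison", "Titan", "Mira"};
    const long nodes[] = {5600, 18688, 49152};
    const double node_tf[] = {0.460, 1.452, 0.204};
    const double listed_pf[] = {2.6, 27.0, 10.0};
    for (int i = 0; i < 3; ++i)
        printf("%-6s implied %.2f PF, listed %.1f PF\n", name[i],
               implied_peak_pf(nodes[i], node_tf[i]), listed_pf[i]);
}
```

Calling `print_peak_check()` from a `main` shows that the implied values land within a few percent of the listed peaks, which suggests the per-node and per-system rows are mutually consistent for the current systems.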



# Migration up the hierarchy





# Programming NVM on a Node



## **NVM Architectural Assumptions**





- 3D XPoint technology provides the benefit in the middle
- It is considerably faster than NAND Flash
- Performance can be realized on PCIe or DDR buses
- Lower cost per bit than DRAM while being considerably more dense



## **HPC Application Scenarios for NVM**

- Burst buffers, checkpoint/restart (C/R) [Liu, et al., MSST 2012]
- In situ visualization (http://ft.ornl.gov/eavl)
- In-memory tables





# Observations: Numerous characteristics of applications are a good match for byte-addressable NVRAM

# Empirical results show many reasons...

- Lookup, index, and permutation tables
- Inverted and 'element-lagged' mass matrices
- Geometry arrays for grids
- Thermal conductivity for soils
- Strain and conductivity rates
- Boundary condition data
- Constants for transforms, interpolation
- MC Tally tables, cross-section materials tables...

### large scale NVM into applications

- Transactional protocol to preserve local and global consistency and redundancy
- Name indexing
- Object attributes allow local caching







Figure 3: Read/write ratios, memory reference rates and memory object sizes for memory objects in Nek5000

# NVL-C: extending C to support NVM





## Design Goals: Familiar programming interface

```
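#include <nvl.h>
// List node stored in NVM; next is an NVM pointer.
struct list {
  int value;
  nvl struct list *next;
};
// Insert a new node holding k after the given node.
// heap is assumed to be an NVM heap handle opened elsewhere.
void add(int k, nvl struct list *after) {
  nvl struct list *node
    = nvl_alloc_nv(heap, 1, struct list);
  node->value = k;
  node->next = after->next;
  after->next = node;
}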
```

- Small set of C language extensions:
  - Header file
  - Type qualifiers
  - Library API
  - Pragmas
- Existing memory interfaces remain:
  - NVL-C is a superset of C
  - Unqualified types as specified by C
  - Local/global variables stored in volatile memory (DRAM or registers)
  - Use existing C standard libraries for HDD



#### Design Goals: Avoiding persistent data corruption

- New categories of pointer bugs:
  - Caused by multiple memory types:
    - E.g., pointer from NVM to volatile memory will become dangling pointer
  - Prevented at compile time or run time
- Automatic reference counting:
  - No need to manually free
  - Avoids leaks and dangling pointers
- Transactions:
  - Avoids persistent data corruption across software and hardware failures

- High performance:
  - Performance penalty from memory management, pointer safety, and transactions
  - Compiler-based optimizations
  - Programmer-specified hints
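
The automatic reference counting listed above can be pictured with a plain C sketch. This is not NVL-C's implementation: everything here lives in ordinary DRAM, and the names (`node_t`, `node_retain`, `node_release`, `node_assign`) are hypothetical. It only illustrates the retain/release bookkeeping that a compiler could insert around pointer assignments so that the programmer never frees manually.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical object with an embedded reference count. */
typedef struct node {
    int refs;            /* number of live pointers to this node */
    int value;
    struct node *next;   /* counted reference */
} node_t;

static int live_nodes = 0;  /* allocation counter, for demonstration only */

node_t *node_new(int value) {
    node_t *n = malloc(sizeof *n);
    assert(n != NULL);
    n->refs = 1;
    n->value = value;
    n->next = NULL;
    ++live_nodes;
    return n;
}

void node_retain(node_t *n) { if (n) ++n->refs; }

/* Drop one reference; free the node (and cascade down the list)
   once the count reaches zero. */
void node_release(node_t *n) {
    while (n && --n->refs == 0) {
        node_t *next = n->next;  /* the dying node drops its reference */
        free(n);
        --live_nodes;
        n = next;
    }
}

/* Assign src to *slot with the retain/release a compiler would insert. */
void node_assign(node_t **slot, node_t *src) {
    node_retain(src);
    node_release(*slot);
    *slot = src;
}

int live_node_count(void) { return live_nodes; }
```

Retaining the new target before releasing the old one keeps self-assignment safe. The cost of these extra counter updates is part of the performance penalty the design goals mention.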



#### Design Goals: Modular implementation



- Core is common compiler middle-end
- Multiple compiler front ends for multiple high-level languages:
  - For now, just OpenARC for NVL-C
- Multiple runtime implementations:
  - For now, just Intel's pmem (pmemobj)



#### **Programming Model: NVM Pointers**

```
#include <nvl.h>
#include <stdlib.h>
struct list {
  int value;
  nvl struct list *next;
};
void add(int k, nvl struct list *after) {
  // node points to volatile memory (no nvl qualifier on the target type)
  struct list *node
    = malloc(sizeof(struct list));
  node->value = k;
  node->next = after->next;
  after->next = node; // compile-time error; an explicit cast won't help
}
```

- **nvl** type qualifier:
  - Indicates NVM storage
  - On target type, declares NVM pointer
  - No NVM-stored local or global variable
- Stricter type safety for NVM pointers:
  - Does not affect other C types
  - Avoids persistent data corruption
  - Facilitates compiler analysis
  - Needed for automatic reference counting
  - E.g., pointer conversions involving NVM pointers are strictly prohibited
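
NVL-C rejects most unsafe pointer stores at compile time through the `nvl` qualifier. The run-time side of the same rule can be sketched in plain C as a range check: a pointer may be stored into NVM only if its target also lies in NVM. The `nv_heap_t` descriptor and both functions below are hypothetical, and raw addresses are used for brevity; this is not how NVL-C represents NVM pointers internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor of one mapped NVM heap: [base, base + size). */
typedef struct {
    unsigned char *base;
    size_t size;
} nv_heap_t;

/* True if p points inside the heap's mapped range. */
bool nv_contains(const nv_heap_t *h, const void *p) {
    uintptr_t a = (uintptr_t)p, b = (uintptr_t)h->base;
    return a >= b && a - b < h->size;
}

/* Store target into a pointer slot that itself lives in the heap.
   Returns false and leaves the slot untouched if target is a non-NULL
   pointer to volatile memory, since that pointer would dangle after
   the program exits. */
bool nv_store_ptr(const nv_heap_t *h, void **slot, void *target) {
    assert(nv_contains(h, slot));
    if (target != NULL && !nv_contains(h, target))
        return false;
    *slot = target;
    return true;
}
```

A store of an in-heap address succeeds, while a store of the address of a stack or `malloc` object is refused. Making the qualifier part of the type lets the compiler prove most of these checks statically, which is why the slide stresses stricter type safety.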



#### Programming Model: Pointer types (like Coburn et al.)



avoids dangling pointers when memory segments close



#### Programming Model: Transactions: Undo logs

```
#include <nvl.h>
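void matmul(nvl float a[I][J],
            nvl float b[I][K],
            nvl float c[K][J],
            nvl int *i)
{
  // *i is stored in NVM, so the row counter persists across runs
  while (*i<I) {
    // the row update and the increment of *i form one transaction
    #pragma nvl atomic heap(heap)
    {
      for (int j=0; j<J; ++j) {
        float sum = 0.0;
        for (int k=0; k<K; ++k)
          sum += b[*i][k] * c[k][j];
        a[*i][j] = sum;
      }
      ++*i;
    }
  }
}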
```

- Before every NVM store, transaction creates undo log to back up old data
- Undo log contains metadata plus old data being overwritten
- Problem: large overhead because an undo log is created for every element of a (every iteration of j loop)
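
The undo-log mechanism described above can be sketched in plain C. This is a DRAM-only illustration with hypothetical names (`undo_log_t`, `tx_store`, and so on), not the NVL-C runtime: each transactional store first records the address and old value, and an abort replays the records in reverse.

```c
#include <assert.h>
#include <stddef.h>

#define LOG_CAP 64

/* One undo record: metadata (the address) plus the old data. */
typedef struct {
    float *addr;
    float old;
} undo_rec_t;

typedef struct {
    undo_rec_t rec[LOG_CAP];
    size_t n;
} undo_log_t;

void tx_begin(undo_log_t *log) { log->n = 0; }

/* Transactional store: back up the old value, then overwrite it. */
void tx_store(undo_log_t *log, float *addr, float value) {
    assert(log->n < LOG_CAP);
    log->rec[log->n].addr = addr;
    log->rec[log->n].old = *addr;
    ++log->n;
    *addr = value;
}

/* Commit: the new data stands, so the log is discarded. */
void tx_commit(undo_log_t *log) { log->n = 0; }

/* Abort or recovery: restore old values in reverse order. */
void tx_abort(undo_log_t *log) {
    while (log->n > 0) {
        --log->n;
        *log->rec[log->n].addr = log->rec[log->n].old;
    }
}
```

Writing a row of J elements through `tx_store` leaves J records in the log, one per element, which is the per-iteration overhead the last bullet points out. Logging the whole row once before the loop would shrink that to a single record.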



#### **Evaluation: LULESH**

- backup is important for performance
- clobber cannot be applied because old data is needed



- ExM = use SSD as extended DRAM
- T1 = BSR + transactions
- T2 = T1 + backup clauses
- T3 = T1 + clobber clauses
- BlockNVM = msync included
- ByteNVM = msync suppressed



# Programming Scalable NVM with Papyrus(KV)



#### Scalable NVM Architectural Assumptions

- NVM Architectures will vary dramatically -> Portability
  - Exploit persistence?
  - Where in the hierarchy?
    - Already in external storage system
    - Rack mounted appliance (Cori)
    - Chassis shared?
    - Node shared? (Summit)
- Design Goals
  - Performance
    - Must provide performance better than or equal to alternatives for providing large shared storage for HPC apps
  - Scalability (to millions of threads)
  - Interoperability with existing HPC programming models
    - Can be incrementally introduced
    - Leverage features of other programming models
  - Application customizability
    - Usage scenarios vary dramatically
    - Tunable consistency model, protection attributes (e.g., RO)





### PapyrusKV: A High-Performance Parallel Key-Value Store for Distributed NVM Architectures

- Leverage emerging NVM technologies
  - High performance
  - High capacity
  - Persistence property
- Designed for the next-generation DOE/NNSA systems
  - Portable across local NVM and dedicated NVM architectures
  - An embedded key-value store (no system-level daemons and servers)
- Designed for HPC applications
  - MPI/UPC-interoperable
  - Application customizability
    - Memory consistency models (sequential and relaxed)
    - Protection attributes (read-only, write-only, read-write)
    - Load balancing
  - Zero-copy workflow, asynchronous checkpoint/restart



Interconnection network

PapyrusKV stores keys and values in arbitrary byte arrays across multiple NVM devices in a distributed system



PapyrusKV is portable across local NVM and dedicated NVM architectures


### PapyrusKV Application API

Table 1: The PKV API.

| API Function                                                                                            | Description                                                                                                                                                                        | Collective |  |
|---------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------|--|
| (a) Environment                                                                                         |                                                                                                                                                                                    |            |  |
| pkv_init(int* argc, char*** argv, const char* repository)                                               | Initialize execution environment using repository path                                                                                                                             |            |  |
| pkv_finalize()                                                                                          | Terminate execution environment                                                                                                                                                    |            |  |
| (b) Basic                                                                                               |                                                                                                                                                                                    |            |  |
| pkv_open(const char* name, int flags, pkv_option_t* opt, pkv_db_t* db)                                  | Open or create db with name                                                                                                                                                        |            |  |
| pkv_close(pkv_db_t db)                                                                                  | Close db                                                                                                                                                                           |            |  |
| pkv_put(pkv_db_t db, const char* key, size_t keylen, const char* value, size_t valuelen)                | Insert or update a key-value pair to db                                                                                                                                            |            |  |
| pkv_get(pkv_db_t db, const char* key, size_t keylen, char** value, size_t* valuelen)                    | Retrieve value for a given key from db. If value is not allocated in memory, PKV allocates a new heap region from the PKV memory pool. Otherwise, data is copied to value directly |            |  |
| pkv_delete(pkv_db_t db, const char* key, size_t keylen)                                                 | Delete a key-value pair for a given key from db                                                                                                                                    |            |  |
| pkv_free(pkv_db_t db, char* val)                                                                        | Release a heap memory region allocated from the PKV memory pool                                                                                                                    |            |  |
| (c) Consistency                                                                                         |                                                                                                                                                                                    |            |  |
| pkv_signal_notify(int signum, int* ranks, int count)                                                    | Send signals to ranks                                                                                                                                                              |            |  |
| pkv_signal_wait(int signum, int* ranks, int count)                                                      | Wait for signals from ranks                                                                                                                                                        |            |  |
| pkv_fence(pkv_db_t db)                                                                                  | Migrate the remote MemTable and immutable MemTables to the owner ranks immediately                                                                                                 |            |  |
| pkv_barrier(pkv_db_t db, int level)                                                                     | Collective memory fence with a flushing level (PKV_MEMTABLE or PKV_SSTABLE). With PKV_SSTABLE level, the whole db data are flushed to SSTables.                                    |            |  |
| pkv_consistency(pkv_db_t db, int mode)                                                                  | Set memory consistency mode on db to mode (PKV_SEQUENTIAL or PKV_RELAXED)                                                                                                          |            |  |
| pkv_protect(pkv_db_t db, int prot)                                                                      | Set protection attribute on db to prot (PKV_RDWR, PKV_WRONLY, or PKV_RDONLY)                                                                                                       |            |  |
| (d) Persistence                                                                                         |                                                                                                                                                                                    |            |  |
| pkv_checkpoint(pkv_db_t db, const char* path, pkv_event_t* event)                                       | Generate a snapshot of db into path. It runs asynchronously if event is not NULL                                                                                                   |            |  |
| pkv_restart(const char* path, const char* name, int prot, pkv_db_t* db, pkv_event_t* event)             | Revert db with name from a snapshot stored in path. It runs asynchronously if event is not NULL                                                                                    |            |  |
| pkv_destroy(pkv_db_t db, pkv_event_t* event)                                                            | Remove db and all its data from NVM. It runs asynchronously if event is not NULL                                                                                                   | ×          |  |
| pkv_wait(pkv_db_t db, pkv_event_t event)                                                                | Wait for event to complete                                                                                                                                                         | ×          |  |
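
The basic put/get call sequence from Table 1 can be tried without the library by using a small stand-in. The code below is a mock, not PapyrusKV: it is single-process and DRAM-only, it omits `pkv_init`/`pkv_finalize`, the `int` return type and the `PKV_OK`/`PKV_ERR` codes are assumptions, and the `mock_*` types are hypothetical. Only the argument lists of the `pkv_*` functions follow the table. Allocation failures are not handled, to keep the sketch short.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* ---- Mock stand-in for PapyrusKV: single process, DRAM only ---- */
#define MOCK_CAP 16
#define PKV_OK 0      /* assumed return code */
#define PKV_ERR (-1)  /* assumed return code */
typedef struct { char *key; size_t keylen; char *val; size_t vallen; } mock_pair_t;
typedef struct { mock_pair_t pair[MOCK_CAP]; size_t n; } mock_db_t;
typedef mock_db_t *pkv_db_t;
typedef struct pkv_option pkv_option_t;  /* opaque, unused by the mock */

static mock_pair_t *mock_find(pkv_db_t db, const char *key, size_t keylen) {
    assert(db != NULL && key != NULL);
    for (size_t i = 0; i < db->n; ++i)
        if (db->pair[i].keylen == keylen &&
            memcmp(db->pair[i].key, key, keylen) == 0)
            return &db->pair[i];
    return NULL;
}

int pkv_open(const char *name, int flags, pkv_option_t *opt, pkv_db_t *db) {
    (void)name; (void)flags; (void)opt;
    *db = calloc(1, sizeof(mock_db_t));
    return *db ? PKV_OK : PKV_ERR;
}

int pkv_put(pkv_db_t db, const char *key, size_t keylen,
            const char *value, size_t valuelen) {
    mock_pair_t *p = mock_find(db, key, keylen);
    if (!p) {                              /* insert a new pair */
        if (db->n == MOCK_CAP) return PKV_ERR;
        p = &db->pair[db->n++];
        p->key = malloc(keylen);
        memcpy(p->key, key, keylen);
        p->keylen = keylen;
        p->val = NULL;
    }
    free(p->val);                          /* update replaces the old value */
    p->val = malloc(valuelen);
    memcpy(p->val, value, valuelen);
    p->vallen = valuelen;
    return PKV_OK;
}

int pkv_get(pkv_db_t db, const char *key, size_t keylen,
            char **value, size_t *valuelen) {
    mock_pair_t *p = mock_find(db, key, keylen);
    if (!p) return PKV_ERR;
    if (*value == NULL) *value = malloc(p->vallen);  /* store allocates */
    memcpy(*value, p->val, p->vallen);     /* otherwise copy into caller's buffer */
    *valuelen = p->vallen;
    return PKV_OK;
}

int pkv_free(pkv_db_t db, char *val) { (void)db; free(val); return PKV_OK; }

int pkv_close(pkv_db_t db) {
    for (size_t i = 0; i < db->n; ++i) {
        free(db->pair[i].key);
        free(db->pair[i].val);
    }
    free(db);
    return PKV_OK;
}

/* ---- Usage: the call sequence an application would follow ---- */
int demo_put_get(void) {
    pkv_db_t db;
    if (pkv_open("mydb", 0, NULL, &db) != PKV_OK) return PKV_ERR;
    pkv_put(db, "alpha", 5, "one", 4);  /* value length includes '\0' */
    pkv_put(db, "alpha", 5, "two", 4);  /* same key: update */
    char *val = NULL;                   /* NULL asks the store to allocate */
    size_t len = 0;
    int rc = pkv_get(db, "alpha", 5, &val, &len);
    int ok = (rc == PKV_OK && len == 4 && strcmp(val, "two") == 0);
    pkv_free(db, val);
    pkv_close(db);
    return ok ? PKV_OK : PKV_ERR;
}
```

Because keys and values are passed as byte arrays with explicit lengths, the same calls work for binary keys such as packed k-mers or particle IDs. In the real store the buffer returned by `pkv_get` comes from the PKV memory pool, which is why it is released with `pkv_free` rather than `free`.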

#### PapyrusKV Example Get operations





#### **Evaluation**

 Evaluation results on OLCF's SummitDev, TACC's Stampede (KNL), and NERSC's Cori



Figure 7: Put operation performance in relaxed (Rel) and sequential (Seq) consistency modes. B refers to Barrier.



Figure 8: Get operation performance. SG and B refer to Storage Group and SSTable Binary search, respectively.



Figure 10: Checkpoint, restart, and restart with redistribution (RD) performance.



Figure 11: Performance comparisons with MDHIM on SummitDev. NVMe (N) and Lustre (L) are used for data storage.

#### ECP Application Case Study 1: Meraculous (UPC)

- A parallel de Bruijn graph construction and traversal for de novo genome assembly
  - ExaBiome, Exascale Solutions for Microbiome Analysis, LBNL



Graphic from ExaBiome: Exascale Solutions to Microbiome Analysis (LBNL, LANL, JGI), 2017

Table 1: Source lines of code.

| Source file          | UPC  | UPC+PapyrusKV |
|----------------------|------|---------------|
| meraculous.c         | 469  | 475 (+6)      |
| buildUFXhashBinary.h | 315  | 173 (-142)    |
| kmer_hash.h          | 457  | 129 (-328)    |
| UU_traversal_final.h | 1754 | 1724 (-30)    |
| Modified Total       | 2995 | 2501 (-494)   |
| Grand Total          | 5971 | 5477 (-494)   |



Figure 5: Distributed hash table implementations in UPC and PapyrusKV. \*The same user hash function in the UPC application can be used in PapyrusKV.


#### ECP Application Case Study 2: HACC (MPI)

- An N-body cosmology code framework
  - ExaSky, Computing the Sky at Extreme Scales, ANL



Graphic from HACCing the Universe on the BG/Q (ANL), 2014



Figure 7: Two-phase checkpointing. PapyrusKV reduces the I/O overhead by exploiting fast NVM access. Asynchronous checkpointing hides the I/O overhead between NVM and the parallel file system from the application.



## Implications for Interconnects and Storage



#### Predictions for Interconnects and Storage

1. Device and architecture trends will have major impacts on HPC in the coming decade
   1. NVM in HPC systems is real!
2. Performance trends of system components will create new opportunities
3. Sea of NVM allows applications to operate differently
   1. Sea of NVM will permit applications to run for weeks without doing I/O to the external storage system
   2. Applications will simply access local/remote NVM
   3. Longer term, productive I/O will be 'occasionally' written to Lustre, GPFS
   4. Checkpointing (as we know it) will disappear
4. Requirements for interconnection networks will change
   1. Increase in byte-addressable, memory-like message sizes and frequencies
   2. Reduced traditional I/O demands
   3. KV traffic could have considerable impact; more application evidence is needed



#### Summary

- Recent trends in extreme-scale HPC paint an ambiguous future
  - Contemporary systems provide evidence that power constraints are driving architectures to change rapidly (e.g., Dennard, Moore)
  - Multiple architectural dimensions are being (dramatically) redesigned: Processors, node design, memory systems, I/O
- Memory systems are changing now!
  - New devices
  - New integration
  - New configurations
  - Vast (local) capacities
- Programming systems must provide performance portability (in addition to functional portability)!!
  - We need new programming systems to effectively use these architectures
  - NVL-C
  - Papyrus(KV)
- Changes in memory systems will alter communication and storage requirements dramatically



#### Acknowledgements



#### Contributors and Sponsors

- Future Technologies Group: http://ft.ornl.gov
- US Department of Energy Office of Science
  - DOE Vancouver Project: https://ft.ornl.gov/trac/vancouver
  - DOE Blackcomb Project: https://ft.ornl.gov/trac/blackcomb
  - DOE ExMatEx Codesign Center: http://codesign.lanl.gov
  - DOE Cesar Codesign Center: http://cesar.mcs.anl.gov/
  - DOE Exascale Efforts: http://science.energy.gov/ascr/research/computer-science/
- Scalable Heterogeneous Computing Benchmark team: http://bit.ly/shocmarx
- US National Science Foundation Keeneland Project: http://keeneland.gatech.edu
- US DARPA
- NVIDIA CUDA Center of Excellence





### **Bonus Material**

