#### GPU TECHNOLOGY CONFERENCE


# **INSIDE VOLTA**

Olivier Giroux and Luke Durant, NVIDIA, May 10, 2017

## **VOLTA: A GIANT LEAP FOR DEEP LEARNING**

#### **ResNet-50 Training**

#### **ResNet-50 Inference**

TensorRT - 7ms Latency



V100 measured on pre-production hardware.

## **ROAD TO EXASCALE**

Volta to Fuel Most Powerful US Supercomputers



Volta HPC Application Performance

System Config Info: 2X Xeon E5-2690 v4, 2.6GHz, w/ 1X Tesla P100 or V100. V100 measured on pre-production hardware.



### **INTRODUCING TESLA V100**



#### The Fastest and Most Productive GPU for Deep Learning and HPC

### **TESLA V100**

21B transistors, 815 mm<sup>2</sup>

80 SMs, 5120 CUDA Cores, 640 Tensor Cores

16 GB HBM2, 900 GB/s HBM2, 300 GB/s NVLink



\*full GV100 chip contains 84 SMs
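The chip-level totals on this slide can be cross-checked against the per-SM unit counts listed in the GV100 SM table later in the deck (64 FP32 units and 8 Tensor Cores per SM). The sketch below is only an arithmetic check; it assumes that "CUDA Cores" counts FP32 units, and the names `chip_totals`, `ENABLED_SMS`, and `FULL_CHIP_SMS` are illustrative, not from the deck.

```python
# Per-SM counts are taken from the GV100 SM table in this deck.
FP32_UNITS_PER_SM = 64      # assumed to be what the slide calls "CUDA Cores"
TENSOR_CORES_PER_SM = 8

ENABLED_SMS = 80            # Tesla V100 as announced
FULL_CHIP_SMS = 84          # full GV100 chip, per the slide footnote


def chip_totals(sm_count):
    """Scale the per-SM unit counts up to a whole chip."""
    return {
        "cuda_cores": sm_count * FP32_UNITS_PER_SM,
        "tensor_cores": sm_count * TENSOR_CORES_PER_SM,
    }


v100 = chip_totals(ENABLED_SMS)
full_gv100 = chip_totals(FULL_CHIP_SMS)

print("Tesla V100:", v100)
print("Full GV100:", full_gv100)
```

With 80 SMs the totals match the 5120 CUDA Cores and 640 Tensor Cores quoted above; the 84-SM figures are what the same arithmetic gives for the full chip, not numbers stated in the deck.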

### **GPU PERFORMANCE COMPARISON**

|                        | P100        | V100          | Ratio      |
|------------------------|-------------|---------------|------------|
| Training acceleration  | 10 TOPS     | 120 TOPS      | 12x        |
| Inference acceleration | 21 TFLOPS   | 120 TOPS      | 6x         |
| FP64/FP32              | 5/10 TFLOPS | 7.5/15 TFLOPS | 1.5x       |
| HBM2 Bandwidth         | 720 GB/s    | 900 GB/s      | 1.2x       |
| NVLink Bandwidth       | 160 GB/s    | 300 GB/s      | 1.9x       |
| L2 Cache               | 4 MB        | 6 MB          | 1.5x       |
| L1 Caches              | 1.3 MB      | 10 MB         | 7.7x       |
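The Ratio column can be recomputed directly from the P100 and V100 columns. The following sketch does that; it treats TOPS and TFLOPS as directly comparable, as the slide itself does in the inference row, and splits the FP64/FP32 row into two entries. The `rows` dictionary is simply a transcription of the table.

```python
# (P100 value, V100 value) pairs transcribed from the comparison table.
rows = {
    "Training acceleration": (10, 120),    # TOPS
    "Inference acceleration": (21, 120),   # TFLOPS vs TOPS, compared as-is
    "FP64": (5, 7.5),                      # TFLOPS
    "FP32": (10, 15),                      # TFLOPS
    "HBM2 Bandwidth": (720, 900),          # GB/s
    "NVLink Bandwidth": (160, 300),        # GB/s
    "L2 Cache": (4, 6),                    # MB
    "L1 Caches": (1.3, 10),                # MB
}

# V100 / P100 for every row.
ratios = {name: v100 / p100 for name, (p100, v100) in rows.items()}

for name, ratio in ratios.items():
    print(f"{name:24s} {ratio:.2f}x")
```

The unrounded values show where the slide rounds: the inference ratio is about 5.7x (shown as 6x), HBM2 bandwidth is 1.25x (shown as 1.2x), and NVLink bandwidth is 1.875x (shown as 1.9x).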

#### **NEW HBM2 MEMORY ARCHITECTURE**

1.5x Delivered



V100 measured on pre-production hardware.

# **VOLTA NVLINK**

300 GB/sec, 50% more links, 28% faster signaling
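The two improvements quoted here can be combined and compared with the NVLink row of the performance comparison table (160 GB/s on P100, 300 GB/s on V100). This is a rough consistency check, assuming aggregate bandwidth scales as link count times per-link signaling rate; the variable names are illustrative.

```python
# Figures from the deck.
P100_NVLINK_GBPS = 160
V100_NVLINK_GBPS = 300
LINK_COUNT_GAIN = 1.50   # "50% more links"
SIGNALING_GAIN = 1.28    # "28% faster signaling"

# Assumption: aggregate bandwidth = links x per-link rate,
# so the two gains multiply.
predicted_gain = LINK_COUNT_GAIN * SIGNALING_GAIN
quoted_gain = V100_NVLINK_GBPS / P100_NVLINK_GBPS

print(f"predicted from links x signaling: {predicted_gain:.2f}x")
print(f"from quoted bandwidths:           {quoted_gain:.3f}x")
```

The product (about 1.92x) and the ratio of the quoted bandwidths (1.875x) agree to within the rounding of the slide figures, and both are consistent with the 1.9x shown in the comparison table.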



#### PROGRAMMABILITY

# **PROGRAMMABILITY DRIVES DEEP LEARNING**

Deep Learning methods developed using CUDA



New solvers, new layers, new scaling techniques, new applications for old techniques, and much more...

# STATE OF UNIFIED MEMORY

#### High performance, low effort

Performance vs no Unified Memory

#### PGI OpenACC on Pascal P100

Geometric mean across all 15 SPEC ACCEL™ benchmarks

86% PCI-E, 91% NVLink

\*S7285 - Unified Memory on the Latest GPU Architectures




PGI 17.1 Compilers OpenACC SPEC ACCEL™ 1.1 performance measured March, 2017. SPEC® and the benchmark name SPEC ACCEL™ are registered trademarks of the Standard Performance Evaluation Corporation.

# PASCAL UNIFIED MEMORY





# **VOLTA + PCIE CPU UNIFIED MEMORY**





# **VOLTA + NVLINK CPU UNIFIED MEMORY**





## **GPU MULTI-PROCESS SCHEDULING**

Background



#### **Timeslice Scheduling**

Single-process throughput optimized

#### **Multi-Process Service**

Multi-process throughput optimized













Full process isolation, peak throughput optimized for each process


## PASCAL MULTI-PROCESS SERVICE



Opt-in: Limited isolation, peak throughput optimized across processes

## **VOLTA MULTI-PROCESS SERVICE**

#### Volta MPS Enhancements:

- Reduced launch latency
- Improved launch throughput
- Improved quality of service with scheduler partitioning
  - More reliable performance
- 3x more clients than Pascal



# **VOLTA MPS FOR INFERENCE**

Efficient inference deployment without batching system




V100 measured on pre-production hardware.

#### NEW SM MICROARCHITECTURE

# VOLTA GV100 SM

|                             | GV100  |
|-----------------------------|--------|
| FP32 units                  | 64     |
| FP64 units                  | 32     |
| INT32 units                 | 64     |
| Tensor Cores                | 8      |
| Register File               | 256 KB |
| Unified L1/Shared memory    | 128 KB |
| Active Threads              | 2048   |
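Several of the per-SM figures in this table can be tied back to other numbers in the deck. The sketch below works through them, assuming the four processing blocks per SM and the 16,384 x 32-bit register file per block shown in the SM block diagram, the 80 SMs of Tesla V100, and the standard CUDA warp size of 32 threads. All variable names are illustrative.

```python
# From the SM block diagram: four processing blocks per SM,
# each with a 16,384 x 32-bit register file and one warp scheduler.
PROCESSING_BLOCKS_PER_SM = 4
REGISTERS_PER_BLOCK = 16_384
BYTES_PER_REGISTER = 4          # 32-bit registers

# From the tables in this deck.
SM_COUNT = 80
L1_SHARED_KB_PER_SM = 128
MAX_THREADS_PER_SM = 2048

WARP_SIZE = 32                  # CUDA warp size (not stated in the table)

registers_per_sm = PROCESSING_BLOCKS_PER_SM * REGISTERS_PER_BLOCK
register_file_kb = registers_per_sm * BYTES_PER_REGISTER // 1024

# Chip-wide L1, to compare with the "L1 Caches" row of the comparison table.
l1_total_mb = SM_COUNT * L1_SHARED_KB_PER_SM / 1024

warps_per_sm = MAX_THREADS_PER_SM // WARP_SIZE
warps_per_scheduler = warps_per_sm // PROCESSING_BLOCKS_PER_SM

# Registers available to each thread when all 2048 threads are resident.
registers_per_thread_full = registers_per_sm // MAX_THREADS_PER_SM

print(f"register file per SM:        {register_file_kb} KB")
print(f"L1/shared across the chip:   {l1_total_mb} MB")
print(f"resident warps per SM:       {warps_per_sm}")
print(f"resident warps per scheduler:{warps_per_scheduler:>3}")
print(f"registers/thread at 2048:    {registers_per_thread_full}")
```

Under these assumptions the register file comes to 256 KB per SM and the L1 total to 10 MB, matching the tables; the warp and per-thread register figures are derived values rather than numbers quoted in the deck.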

*Figure: GV100 SM block diagram. A shared L1 instruction cache feeds four processing blocks. Each block has an L0 instruction cache, a warp scheduler (32 thread/clk), a dispatch unit (32 thread/clk), a 16,384 x 32-bit register file, 8 FP64 units, 16 INT units, 16 FP32 units, 2 Tensor Cores, 8 LD/ST units, and 1 SFU. The four blocks share a 128 KB L1 data cache / shared memory and four Tex units.*


# VOLTA GV100 SM

**Redesigned for Productivity**

- Completely new ISA
- Twice the schedulers
- Simplified issue logic
- Large, fast L1 cache
- Improved SIMT model
- Tensor acceleration

The easiest SM to program yet
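The "tensor acceleration" item can be put in per-unit terms using figures quoted elsewhere in the deck (5120 CUDA Cores, 640 Tensor Cores, 15 TFLOPS FP32, 120 TOPS for deep learning). The sketch below is a back-of-the-envelope estimate, not a statement from the slides: it assumes both unit types run at the same clock, that one FP32 unit retires one fused multiply-add (2 FLOPs) per clock, and that the 120 TOPS figure is entirely Tensor Core throughput.

```python
# Figures quoted in the deck.
CUDA_CORES = 5120
TENSOR_CORES = 640
FP32_TFLOPS = 15
TENSOR_TOPS = 120

# Chip-level gain of Tensor Core throughput over FP32 throughput.
throughput_gain = TENSOR_TOPS / FP32_TFLOPS

# How many FP32 units exist for every Tensor Core.
fp32_units_per_tensor_core = CUDA_CORES / TENSOR_CORES

# Per-clock work of one Tensor Core relative to one FP32 unit
# (assumes the same clock for both).
per_unit_gain = throughput_gain * fp32_units_per_tensor_core

# Clock implied by the FP32 figure, assuming 1 FMA = 2 FLOPs per unit per clock.
# This is a derived value; the deck does not state a clock speed.
implied_clock_ghz = FP32_TFLOPS * 1e12 / (CUDA_CORES * 2) / 1e9

print(f"chip-level gain:            {throughput_gain:.0f}x")
print(f"FP32 units per Tensor Core: {fp32_units_per_tensor_core:.0f}")
print(f"per-unit gain:              {per_unit_gain:.0f}x")
print(f"implied clock:              {implied_clock_ghz:.3f} GHz")
```

Under those assumptions each Tensor Core does the per-clock work of 64 FP32 units, that is, 64 fused multiply-adds per clock.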

*Figure: GV100 SM block diagram (the same diagram as on the previous slide).*
| FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>LD/ LC | INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>INT | INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>LD/ | FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32 | FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>LD/ | LD/<br>ST ST                                                                                                                                                                                                                           | CORE           | FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>FP64<br>LD/<br>ST ST | INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>LD/<br>ST | INT<br>INT<br>INT<br>INT<br>INT<br>INT<br>LD/<br>ST | FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>LD/ | FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>FP32<br>LD/ | LD/ LD/                  | CORE          |


## **RECAP: PASCAL L1 AND SHARED MEMORY**



#### **UNIFYING KEY TECHNOLOGIES**



## **VOLTA L1 AND SHARED MEMORY**

Volta Streaming L1\$:

- Unlimited cache misses in flight
- Low cache hit latency
- 4x more bandwidth
- 5x more capacity

Volta Shared Memory:

- Unified storage with L1
- Configurable up to 96KB
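As a rough illustration of the "configurable up to 96KB" point, the sketch below opts a kernel in to a 96 KB dynamic shared memory allocation. It assumes the CUDA 9 runtime attributes `cudaFuncAttributeMaxDynamicSharedMemorySize` and `cudaFuncAttributePreferredSharedMemoryCarveout`; the kernel `reverse_through_shared` and the helper `run_with_96kb_shared` are hypothetical names invented for this example, not part of the talk.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: reverses n floats by staging them in dynamic shared memory.
__global__ void reverse_through_shared(const float *in, float *out, int n)
{
    extern __shared__ float tile[];
    for (int i = threadIdx.x; i < n; i += blockDim.x) tile[i] = in[i];
    __syncthreads();
    for (int i = threadIdx.x; i < n; i += blockDim.x) out[i] = tile[n - 1 - i];
}

// Hypothetical helper: returns true if the 96 KB launch ran and reversed the data.
bool run_with_96kb_shared()
{
    const int bytes = 96 * 1024;
    const int n = bytes / (int)sizeof(float);

    // More than 48 KB of dynamic shared memory per block needs an explicit opt-in.
    if (cudaFuncSetAttribute(reverse_through_shared,
                             cudaFuncAttributeMaxDynamicSharedMemorySize,
                             bytes) != cudaSuccess) return false;
    // Hint only: prefer the largest shared memory share of the unified storage.
    cudaFuncSetAttribute(reverse_through_shared,
                         cudaFuncAttributePreferredSharedMemoryCarveout,
                         cudaSharedmemCarveoutMaxShared);

    float *in = nullptr, *out = nullptr;
    if (cudaMallocManaged(&in, bytes) != cudaSuccess) return false;
    if (cudaMallocManaged(&out, bytes) != cudaSuccess) { cudaFree(in); return false; }
    for (int i = 0; i < n; ++i) in[i] = (float)i;

    reverse_through_shared<<<1, 256, bytes>>>(in, out, n);
    bool ok = cudaDeviceSynchronize() == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) ok = (out[i] == (float)(n - 1 - i));

    cudaFree(in);
    cudaFree(out);
    return ok;
}
```

On a device that cannot provide 96 KB per block, the opt-in call is expected to fail and the helper returns false.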



# NARROWING THE SHARED MEMORY GAP

#### with the GV100 L1 cache

Cache: vs shared

- Easier to use
- 90%+ as good

Shared: vs cache

- Faster atomics
- More banks
- More predictable



#### INDEPENDENT THREAD SCHEDULING

### **VOLTA: INDEPENDENT THREAD SCHEDULING**

#### **Communicating Algorithms**





**Pascal: Lock-Free Algorithms**

Threads cannot wait for messages

**Volta: Starvation-Free Algorithms**

Threads may wait for messages



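```
__device__ void insert_after(Node *a, Node *b)
{
    Node *c;
    // Lock a and its successor so no other thread can relink either node.
    lock(a); lock(a->next);
    c = a->next;
    // Splice b between a and c.
    a->next = b;
    b->prev = a;
    b->next = c;
    c->prev = b;
    // Release both locks.
    unlock(c); unlock(a);
}
```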

#### Doubly-Linked List with Fine Grained Lock






\*Not shown: lock() implementation
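Since the slides leave `lock()` out, here is one possible spin-lock sketch built on `atomicCAS`, together with a small harness around the slide's `insert_after`. Everything beyond `insert_after` is an assumption made for illustration: the `mutex` field, the `lock`/`unlock` bodies, the head and tail sentinel nodes, and the helper names `insert_all` and `count_after_parallel_inserts`. It is meant for Volta or later (for example, compiled for `sm_70`); on earlier GPUs, threads of the same warp spinning on one lock may never make progress.

```cuda
#include <cuda_runtime.h>

struct Node {
    Node *prev;
    Node *next;
    int   mutex;  // assumed field: 0 = free, 1 = held
};

// Assumed implementation: spin until the flag flips from 0 to 1.
__device__ void lock(Node *n)
{
    while (atomicCAS(&n->mutex, 0, 1) != 0) { /* wait for the holder */ }
    __threadfence();  // order our later reads after the acquisition
}

__device__ void unlock(Node *n)
{
    __threadfence();  // publish our writes before releasing
    atomicExch(&n->mutex, 0);
}

// Body as shown on the slide.
__device__ void insert_after(Node *a, Node *b)
{
    Node *c;
    lock(a); lock(a->next);
    c = a->next;
    a->next = b;
    b->prev = a;
    b->next = c;
    c->prev = b;
    unlock(c); unlock(a);
}

// nodes[0] is the head sentinel, nodes[1] the tail; every thread inserts its
// own node right after the head, so all threads contend for the same locks.
__global__ void insert_all(Node *nodes, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) insert_after(&nodes[0], &nodes[2 + i]);
}

// Returns the number of nodes found between head and tail, or -1 on error.
int count_after_parallel_inserts(int count)
{
    Node *nodes = nullptr;
    if (cudaMallocManaged(&nodes, (count + 2) * sizeof(Node)) != cudaSuccess) return -1;
    for (int i = 0; i < count + 2; ++i) nodes[i] = Node{nullptr, nullptr, 0};
    nodes[0].next = &nodes[1];
    nodes[1].prev = &nodes[0];

    insert_all<<<(count + 127) / 128, 128>>>(nodes, count);
    if (cudaDeviceSynchronize() != cudaSuccess) { cudaFree(nodes); return -1; }

    int seen = 0;
    for (Node *n = nodes[0].next; n != &nodes[1]; n = n->next) {
        if (n->next->prev != n) { seen = -1; break; }  // broken back link
        ++seen;
    }
    cudaFree(nodes);
    return seen;
}
```

Because every thread locks the head first, the lock order is consistent and the harness cannot deadlock through lock ordering; it is also a worst case for contention, which is what the tip about minimizing contention warns against.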




Tip! Volta can run <u>163,840</u> threads simultaneously

Minimize Contention!
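The 163,840 figure matches 80 SMs times 2,048 resident threads per SM. A minimal sketch of how the same number could be queried at run time, assuming the standard `cudaGetDeviceProperties` fields `multiProcessorCount` and `maxThreadsPerMultiProcessor`; the helper name `max_resident_threads` is hypothetical.

```cuda
#include <cuda_runtime.h>

// V100 figure from the tip: 80 SMs x 2048 resident threads per SM.
static_assert(80 * 2048 == 163840, "80 SMs x 2048 threads per SM");

// Returns SM count x maximum resident threads per SM, or -1 on error.
long long max_resident_threads(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return (long long)prop.multiProcessorCount * prop.maxThreadsPerMultiProcessor;
}
```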

#### WARP IMPLEMENTATION

#### Pre-Volta



32 thread warp



#### PASCAL WARP EXECUTION MODEL

```
if (threadIdx.x < 4) {
    A;
    B;
} else {
    X;
    Y;
}
```








#### WARP IMPLEMENTATION

#### **Pre-Volta**



32 thread warp

#### Volta



32 thread warp with independent scheduling

# **VOLTA WARP EXECUTION MODEL**






Software synchronization also supported, e.g. locks for doubly-linked list!
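A small sketch of explicit warp-level synchronization after a divergent branch, using the CUDA 9 `__syncwarp()` intrinsic. The branch mirrors the `threadIdx.x < 4` example; the kernel name `diverge_then_exchange`, the helper `exchange_result`, and the values written are assumptions made for illustration.

```cuda
#include <cuda_runtime.h>

__global__ void diverge_then_exchange(int *out)
{
    __shared__ int slot[32];
    const int lane = threadIdx.x % 32;

    if (lane < 4) {
        slot[lane] = 100 + lane;  // stands in for the "A; B;" path
    } else {
        slot[lane] = 200 + lane;  // stands in for the "X; Y;" path
    }
    // Both paths must have finished writing before any lane reads a neighbor.
    __syncwarp();

    out[lane] = slot[(lane + 1) % 32];
}

// Hypothetical helper: runs one warp and returns the value read by `lane`.
int exchange_result(int lane)
{
    int *out = nullptr;
    if (cudaMallocManaged(&out, 32 * sizeof(int)) != cudaSuccess) return -1;
    diverge_then_exchange<<<1, 32>>>(out);
    int value = (cudaDeviceSynchronize() == cudaSuccess) ? out[lane] : -1;
    cudaFree(out);
    return value;
}
```

Lane 3 took the first path but reads the slot written by lane 4 on the second path, so the exchange only works if the warp synchronizes between the two steps.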

# **VOLTA'S EXTENDED SIMT MODEL**

The SIMT model enables thread-parallel programs to execute with vector efficiency.

|                    | CPU  | Pascal GPU          | Volta GPU |
|--------------------|------|---------------------|-----------|
| Thread-parallelism | MIMD | SIMT<br>(lock-free) | SIMT      |
| Data-parallelism   | SIMD | SIMT                | SIMT      |

#### COOPERATION -> SYNCHRONIZATION

See also:

CUDA 9.0, S7132, Wednesday 2:30pm

Cooperative Groups, S7622, Wednesday 4pm

# SHUFFLE SYNCHRONIZATION

\_\_shfl\_sync - deprecates \_\_shfl
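To make the synchronizing shuffle concrete, here is a minimal warp reduction sketch. It uses `__shfl_down_sync` to sum the lane indices and `__shfl_sync` to broadcast the total from lane 0. The kernel and helper names are hypothetical, and the full-warp mask assumes all 32 lanes take part.

```cuda
#include <cuda_runtime.h>

__global__ void warp_sum(int *out)
{
    const unsigned full_mask = 0xffffffffu;  // all 32 lanes participate
    int v = threadIdx.x;                     // each lane contributes its index

    // Tree reduction: after the loop, lane 0 holds the sum over the warp.
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(full_mask, v, offset);

    // Broadcast lane 0's total to every lane.
    v = __shfl_sync(full_mask, v, 0);
    out[threadIdx.x] = v;
}

// Hypothetical helper: returns the value seen by lane 17, or -1 on error.
int warp_sum_of_lane_ids()
{
    int *out = nullptr;
    if (cudaMallocManaged(&out, 32 * sizeof(int)) != cudaSuccess) return -1;
    warp_sum<<<1, 32>>>(out);
    int value = (cudaDeviceSynchronize() == cudaSuccess) ? out[17] : -1;
    cudaFree(out);
    return value;
}
```

The mask argument names the lanes that must arrive at the shuffle, which is what lets the operation stay well defined under independent thread scheduling.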



## **COMPARE SYNCHRONIZATION**

\_\_ballot\_sync, \_\_[any|all]\_sync - deprecate namesakes

\_\_match\_[any|all]\_sync, \_\_activemask - new
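The following sketch exercises the compare intrinsics on a single warp. The predicates, the `VoteResult` struct, and the helper `run_vote_demo` are assumptions made for this example; `__match_any_sync` requires compute capability 7.0 or higher.

```cuda
#include <cuda_runtime.h>

struct VoteResult {
    unsigned even_lanes;  // from __ballot_sync
    int      any_last;    // from __any_sync
    int      all_last;    // from __all_sync
    unsigned same_key;    // from __match_any_sync
    unsigned active;      // from __activemask
};

__global__ void vote_demo(VoteResult *r)
{
    const unsigned full_mask = 0xffffffffu;
    const int lane = threadIdx.x % 32;

    // Bit i is set when lane i's predicate is true: even lanes here.
    unsigned even_lanes = __ballot_sync(full_mask, lane % 2 == 0);
    // Non-zero if any / all lanes in the mask satisfy the predicate.
    int any_last = __any_sync(full_mask, lane == 31);
    int all_last = __all_sync(full_mask, lane == 31);
    // Mask of lanes holding the same value: groups of eight lanes share a key.
    unsigned same_key = __match_any_sync(full_mask, lane / 8);

    // __activemask reports lanes that happen to be executing together; it does
    // not synchronize, so only the caller's own bit is guaranteed to be set.
    unsigned active = 0;
    if (lane < 4) active = __activemask();

    if (lane == 0) {
        r->even_lanes = even_lanes;
        r->any_last = any_last;
        r->all_last = all_last;
        r->same_key = same_key;
        r->active = active;
    }
}

// Hypothetical helper: returns lane 0's view of the results (zeros on error).
VoteResult run_vote_demo()
{
    VoteResult result = {0, 0, 0, 0, 0};
    VoteResult *r = nullptr;
    if (cudaMallocManaged(&r, sizeof(VoteResult)) != cudaSuccess) return result;
    vote_demo<<<1, 32>>>(r);
    if (cudaDeviceSynchronize() == cudaSuccess) result = *r;
    cudaFree(r);
    return result;
}
```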



### **VOLTA TENSOR CORE**



# **TENSOR CORE**

#### Mixed Precision Matrix Math

4x4 matrices



D = AB + C
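Written out per element for the 4x4 case, the operation above reads as follows. The index names are introduced here for illustration, and the precision note (FP16 inputs $A$ and $B$, with $C$ and $D$ in FP16 or FP32) is stated as background on the mixed-precision mode rather than taken from this slide.

$$
D_{ij} = \sum_{k=1}^{4} A_{ik}\,B_{kj} + C_{ij}, \qquad i, j \in \{1, \dots, 4\}
$$

Each of the 16 outputs needs 4 multiply-adds, so one 4x4 operation amounts to $4 \cdot 4 \cdot 4 = 64$ fused multiply-adds.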

## **TENSOR SYNCHRONIZATION**

#### Full Warp 16x16 Matrix Math



Warp-synchronizing operation

Composed Matrix Multiply and Accumulate for **16x16** matrices

Result distributed across warp

#### **VOLTA TENSOR OPERATION**



Also supports FP16 accumulator mode for inferencing

#### **USING TENSOR CORES**

NVIDIA cuDNN, cuBLAS, TensorRT

#### Volta Optimized Frameworks and Libraries

```
__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
{
    wmma::fragment<matrix_a, ...> Amat;
    wmma::fragment<matrix_b, ...> Bmat;
    wmma::fragment<matrix_c, ...> Cmat;

    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);

    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);
}
```

#### CUDA C++ Warp-Level Matrix Operations
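For readers who want something that compiles, here is a sketch of the same 16x16x16 operation against the `nvcuda::wmma` API as it shipped in CUDA 9 (`<mma.h>`). Note the differences from the slide, which are assumptions of this sketch: the accumulator fragment is spelled `wmma::accumulator` rather than `matrix_c`, the elided template arguments are filled with one plausible choice (row-major A, column-major B, FP32 accumulation), and the result is written back with `wmma::store_matrix_sync`. The kernel names and the all-ones test data are hypothetical. It needs a device with Tensor Cores (for example, compiled for `sm_70`).

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <mma.h>

using namespace nvcuda;

// Fills a half-precision buffer on the device.
__global__ void fill_half(half *p, int n, float value)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = __float2half(value);
}

// One warp computes D = A * B + 0 for a single 16x16x16 tile.
__global__ void tensor_op_kernel(float *d, const half *a, const half *b)
{
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> a_frag;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> b_frag;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> acc_frag;

    wmma::load_matrix_sync(a_frag, a, 16);   // leading dimension 16
    wmma::load_matrix_sync(b_frag, b, 16);
    wmma::fill_fragment(acc_frag, 0.0f);

    wmma::mma_sync(acc_frag, a_frag, b_frag, acc_frag);

    wmma::store_matrix_sync(d, acc_frag, 16, wmma::mem_row_major);
}

// Hypothetical check: with A and B all ones, every element of D should be 16.
bool run_wmma_all_ones()
{
    const int n = 16 * 16;
    half *a = nullptr, *b = nullptr;
    float *d = nullptr;
    if (cudaMalloc(&a, n * sizeof(half)) != cudaSuccess) return false;
    if (cudaMalloc(&b, n * sizeof(half)) != cudaSuccess) { cudaFree(a); return false; }
    if (cudaMallocManaged(&d, n * sizeof(float)) != cudaSuccess) {
        cudaFree(a); cudaFree(b); return false;
    }

    fill_half<<<1, n>>>(a, n, 1.0f);
    fill_half<<<1, n>>>(b, n, 1.0f);
    tensor_op_kernel<<<1, 32>>>(d, a, b);    // exactly one warp

    bool ok = cudaDeviceSynchronize() == cudaSuccess;
    for (int i = 0; ok && i < n; ++i) ok = (d[i] == 16.0f);

    cudaFree(a); cudaFree(b); cudaFree(d);
    return ok;
}
```

All 32 threads of the warp must call each `_sync` function together; the fragments are opaque per-thread pieces of the tiles, which is why the result is described as distributed across the warp.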

#### A GIANT LEAP FOR DEEP LEARNING



## **INTRODUCING TESLA V100**



More V100 Features: 2x L2 atomics, int8, new memory model, copy engine page migration, and more ...

The Fastest and Most Productive GPU for Deep Learning and HPC