

#### ADDING CUSTOM CUDA C++ OPERATIONS IN TENSORFLOW FOR BOOSTING BERT INFERENCE

Minseok Lee, Developer Technology Engineer, 2<sup>nd</sup> July

#### **CUDA** NVIDIA's Parallel Computing Platform and Programming Model

DOMAIN-SPECIFIC

VISUAL

PROCESSING

• Language Integration - C/C++, Fortran, ...

Integrated Development Environment

• Doman Specific Libraries

LINEAR ALGEBRA MATH ALGORITHMS CUDA Math library CUDA Math library

M also and

🕅 + 🕅

High Performance

#### **TENSORFLOW** An End-to-End Open Source Deep Learning Framework

• Easy/Flexible Model Building based on Python and Keras

• Robust ML Production Anywhere

GPU Accelerated Performance based on CUDA



### CUDA KNOWLEDGE + TENSORFLOW

**Customized Performance Synergy** 

• Help analyze and understand GPU-related behavior

e.g., Am I fully utilizing my GPU(s)? If not, what is the bottleneck?

- Enable to tune and squeeze training/inference performance
  - e.g., Increase the parallelism of CUDA kernel mapped to a TF Op
  - e.g., Implement a new optimized operation for your case

• It sounds great, but how can it be enabled?



## AGENDA

• What is TensorFlow Custom Op

• Case Study: BERT SQuAD Inference

• Tips and Other Options

### **TENSORFLOW CUSTOM C++ OP**

Interface to Add New Operations beyond Existing TensorFlow Library

Motivation:

- Difficult/Impossible to express your operation as a composition of existing ones
- The composite one doesn't have decent performance
- The existing op is not efficient for your use case

Custom C++ Op is one of the sensible options to customize TF's feature and performance

#### CUSTOM C++ OP INCORPORATION Bob Ross Style Guideline - That Easy, Right?

1. Define (or Register) Op's interface in C++ (op's name, input/output and their shapes, ...)

2. Implement Op (or Kernel) in CUDA C++ (override OpKernel::Compute to call the kernel)

3. Implement Gradient in Python (not necessary if you only focus on Inference)

4. Build its shared library and use it in your Python code

# CASE STUDY: BERT SQUAD INFERENCE

#### WHY JUMPING INTO BERT SUDDENLY? To Apply Custom Ops to BERT

• To provide a pragmatic example rather than a boring "Hello, World!" style example

• Transformer and BERT are being hyped everywhere nowadays

#### WHAT IS BERT? Bidirectional Encoder Representations from Transformers

A new method of pre-training language representations for a wide array of NLP tasks

Model Architecture is a multi-layer bidirectional Transformer encoder which embraces

- Multi-Head Attention
- Fully Connected Feed Forward with a GELU activation

"Intermediate" sub-layer in the code

Residual Connections



#### TARGET CONFIGURATION Let's focus on BERT SQuAD Inference Case

Batch size and sequence length can be varied across difference tasks and environments

• Based on what you want, the best optimization approach can be varied

BERT-Large checkpoint fine tuned for SQuAD is used

- 24-layer, 1024-hidden, 16-head
- max\_seq\_length: 384, batch\_size: 8 (default from <u>NVIDIA GitHub repo</u>)

For the sake of simplicity, only the inference case is covered

# FIRST CUSTOM OP: GELU

## **GELU ACTIVATION FUNCTION**

Why and How to Make its Custom Op

Google's Implementation in modeling.py -

```
def gelu(x):
    cdf = 0.5 * (1.0 + tf.tanh(
        (np.sqrt(2 / np.pi) * (x + 0.044715 * tf.pow(x, 3)))))
    return x * cdf
```

Single input, single output function, e.g., out[4] = gelu(in[4])

• Easy to write in Python by compositing existing TF ops

But how about its performance? How many CUDA kernels does it execute?

Let's trying profiling!

#### **PROFILING GELU**

#### Result based on NVIDIA Visual Profiler (NVVP)



GELU activation in Python results in 8 CUDA kernels in C++

• Their aggregated runtime is almost similar to W\*x+b!

#### **PERFORMANCE ANALYSIS**

#### Why Multi-Kernel GELU is So Slow?



Each kernel reads the input array x and writes the output array y

- Total 8 reads and 8 writes for the same arrays!
- What if we can read and write once? Kernel fusion

## STEP 1. REGISTER OP'S INTERFACE IN C++

Specify Name, Inputs, Outputs, Attributes and etc

gelu\_op.cc

```
#include "tensorflow/core/framework/op.h"
#include "tensorflow/core/framework/shape inference.h"
namespace tensorflow {
using GPUDevice = Eigen::GpuDevice;
REGISTER OP( "Gelu0p") Op name
  .Attr "T: type" Attr name: "T", type: "type"
  . Input "in: T" Input name: "in", type: "T"
  .Output ["out: T"] Output name: "out", type: "T"
  .SetShapeFn([](shape inference::InferenceContext *c) {
       c->set output(0, c->input(0)); set 0<sup>th</sup> output shape to 0<sup>th</sup> input shape
       return Status::OK();
   });
```

## STEP 2.1. INHERIT OPKERNEL IN C++

Override Compute function to do CUDA calls

gelu\_op.cc



### **STEP 2.1. INHERIT OPKERNEL IN C++**

Override Compute function to do CUDA calls

template <typename Device, typename T>
class GeluOp : public OpKernel {
 public:
 explicit GeluOp(OpKernelConstruction\* context): OpKernel(context) {}
 void Compute(OpKernelContext\* context) override {

```
functor_.run(
    context->eigen_device<Device>(),
    input_ptr,
    output_ptr,
    num_elements);
```

Do what you want by using the device, in/out pointers and parameters

#### private:

```
functor::Gelu0pFunctor<Device, T> functor_;
```

### **STEP 2.2. REGISTER IMPLEMENTATION**

Specify Device and Type Constraints

gelu\_op.cc

#ifdef GOOGLE\_CUDA
#define REGISTER\_GPU(T) 
 REGISTER\_KERNEL\_BUILDER(
 Name("GeluOp").Device(DEVICE\_GPU).TypeConstraint<T>("T"), \
 GeluOp<GPUDevice, T>)
REGISTER\_GPU(float);
#undef REGISTER\_GPU

#endif // GOOGLE\_CUDA

- Device = Eigen::GPUDevice
- TypeContraint<T>("T"): attr "T" must be T
- REGISTER\_GPU(float): it works only when T is float

#### CUDA Kernel Configuration and Launch

| gelu_op.cu.cc                                                                                                                                                 |  |  |  |  |
|---------------------------------------------------------------------------------------------------------------------------------------------------------------|--|--|--|--|
| #ifdef GOOGLE_CUDA<br>#define EIGEN_USE_GPU                                                                                                                   |  |  |  |  |
| <pre>#include "gelu_op.h" The template class declaration of GeluOpFunctor</pre>                                                                               |  |  |  |  |
| <pre>#include "gelu_kernel.h"</pre>                                                                                                                           |  |  |  |  |
| <pre>#include "tensorflow/core/framework/op_kernel.h"</pre>                                                                                                   |  |  |  |  |
| <pre>namespace tensorflow {</pre>                                                                                                                             |  |  |  |  |
| namespace functor {                                                                                                                                           |  |  |  |  |
| <pre>using GPUDevice = Eigen::GpuDevice;</pre>                                                                                                                |  |  |  |  |
| <pre>template <typename t=""> Partial specialization for GPUDevice struct Gelu0pFunctor<gpudevice, t=""> {    Gelu0pFunctor() {</gpudevice,></typename></pre> |  |  |  |  |
| <pre>int device;<br/>cudaGetDevice(&amp;device);<br/>cudaDeviceGetAttribute(#_devices_, cudaDevAttrMultiProcessorCount, device);</pre>                        |  |  |  |  |
| }<br>;;                                                                                                                                                       |  |  |  |  |

**CUDA Kernel Configuration and Launch** 

gelu\_op.cu.cc template <typename T> struct GeluOpFunctor<GPUDevice, T> { . . . void run(const GPUDevice& d, const T\* in, T\* out, int n\_elements) { GeluKernelLauncher(in, out, n\_elements, num\_devices , d.stream(); Pass # SMs Pass CUDA Stream int num devices ; }; template struct GeluOpFunctor<GPUDevice, float>; Template instantiation for float //namespace functor } //namespace tensorflow #endif

Slight Optimization: Balance between Parallelism and Iteration

gelu\_op.cu

Why We decide # threads based on # SMs, not # elements?

• A single GPU can run (# MAX threads per SM \* # SMs) threads concurrently

e.g., V100 has 80 SMs and each SM can run up to 2048 threads (=163,840 threads) If a kernel has more threads, it runs the first 163,840 threads with the others pended

• To minimize the inter-thread redundant operations, e.g., np.sqrt(2 / np.pi)

Let's make each thread handle multiple elements

#### **CUDA Kernel Implementation**

| $\leftarrow$ # Threads $\longrightarrow$             |                                 |                                                                                         |             |  |
|------------------------------------------------------|---------------------------------|-----------------------------------------------------------------------------------------|-------------|--|
| iteration 0                                          | iteration 1                     | iteration 2                                                                             | iteration 3 |  |
| gelu_op.cu                                           |                                 |                                                                                         |             |  |
| <pre>int gid = blockIdx.x int chunk_size = blo</pre> | <pre>* blockDim.x + threa</pre> |                                                                                         |             |  |
| T x = in[i];                                         |                                 | e) { 4715) * (x * x * x)); Efficient than pow(x, 3) ble calculation is done in register |             |  |

## STEP 3.1. BUILD OP SHARED LIBRARY

Generate a SO file from \*.cc and \*.cu

No matter how you build the code, e.g., CMake, clearly specify the following information

- Tensorflow header/library file location
  - e.g., tf.sysconfig.get\_include() or /usr/local/lib/python3.5/dist-packages/tensorflow/include
- Library dependencies
  - e.g., -lcublas, -lcudart, -tensorflow\_framework
- -D\_GLIBCXX\_USE\_CXX11\_ABI=0 (Omitting it leads to undefined symbol error for GCC >= 5.0)
- -DGOOGLE\_CUDA=1
- --std=C++11 --expt-relaxed-constexpr --expt-extended-lambda

## STEP 3.2. LOAD AND USE OP IN PYTHON

How to Bind C++ code to Python code



The relationship between Python op name and C++ op name

- CamelCase in C++ to snake\_case in Python
- e.g., if C++ op name is GeluOp, Python op name is gelu\_op

#### **PROFILING FUSED GELU**

#### Result based on NVIDIA Visual Profiler (NVVP)



# SECOND CUSTOM OP: MULTI-HEAD ATTENTION

## **MULTI-HEAD ATTENTION**

Most Important Function in Transformer



Legends:

- B: batch size (number of sequences)
- N: number of attention heads
- S: sequence length
- H: size of each attention head

How many CUDA kernels it leads to?

### **PROFILING MULTI-HEAD ATTENTION**

Result based on NVIDIA Visual Profiler (NVVP)



scale, mask and soft max results in 4 CUDA kernels in C++

- Their aggregated runtime is even longer than the two gemm kernels!
- Kernel fusion can help again

#### STEP 1. REGISTER OP'S INTERFACE IN C++

Specify Name, Inputs, Outputs, Attributes



• How to implement its shape function is in Appendix

## **STEP 2.1. INHERIT OPKERNEL CLASS**

#### Initialization and Finalization



## STEP 2.2. FUSE SCALE WITH GEMM

#### How to Use CUBLAS API



cublas General Matrix Multiplication (GEMM) APIs support in-register scaling

- $C = s \cdot (A \times B)$
- C is accessed only once for the final write

#### Data Layout After Step 2.2



A Possible Parallelization Approach (suppose B=8, S=384)



• Parallelize for sequence length (S) and batch size (B)

Data accessed by A Thread Block (suppose B=8, S=384)



• Total (B x S) thread blocks = 3072 thread blocks

#### Set Block Size to 128



• If S=384, each block has **3** iterations within the same row

Reuse Mask across N Attention Heads in Batch



• Let each thread load a mask into a local variable (stored in register)

That's why we didn't parallelize across N attention heads (will be revisited in later slide)

Parallel Row Reduction to Get Sum (1/3)

 $\leftarrow$  block size(=128)  $\rightarrow$ 



Parallel Row Reduction to Get Sum (2/3)



Parallel Row Reduction to Get Sum (3/3)

 $\leftarrow$  warp size(=32)  $\rightarrow$ 



Partial sum (in register)

### Broadcast Sum and Do Softmax



• Everything was done in register or shared memory

Read "in" and "mask" once and write "out" once to global memory

Repeat for N Attention Heads



## **STEP 3. USE CUSTOM OP IN PYTHON**

How to Substitute Original Implementation



## **PROFILING FUSED MULTI-HEAD ATTENTION**

Result based on NVIDIA Visual Profiler (NVVP)



## **END-TO-END PERFORMANCE**

### SQuAD Inference on a single V100 16G



GEMM and add bias

represent 83.9% now

#### Inference Speedup

• To get more speedup, GEMM-centric optimization is required

# WHAT WE CAN DO MORE

### Further Optimization Options

More Fusion and/or custom kernels in "Feed Forward" and "Multi-Head Attention"

- GEMM + add bias + layer norm + activation
- GEMM + scale + mask + softmax + GEMM

Apply Quantization or Use lower precision, e.g., FP16, INT8

FasterTransformer will be released soon

- Highly optimized BERT Transformer for Inference based on custom CUDA kernels and CUBLAS
- Will support various sequence length and multi-precisions

### MORE ABOUT TENSORFLOW CUSTOM OP Pitfalls and Tips

Use allocate\_output or allocate\_temp instead of explicit cudaMalloc()

- allocate\_output: used to allocate output tensor
- allocate\_temp: used to allocate temporary memory, not exposed as a tensor
- cudaFree() is not necessary because TF has its own memory management

If you encounter the undefined symbol error in building, check the list below

- -D\_GLIBCXX\_USE\_CXX11\_ABI=0 (if GCC >= 5.0)
- Library dependency, template specialization

## **RELATED SESSIONS IN AI CONFERENCE**

### Learn More About Inference and Profiling

Deep Learning Inference 가속화를 위한 NVIDIA의 기술 소개 - 이종환 (NVIDIA)

• 13:50 - 14:30, Track 2

TensorRT를 이용한 OCR Model Inference 성능 최적화 - 이현수 (카카오)

• 14:40 - 15:20, Track 2

GPU Profiling 기법을 통한 Deep Learning 성능 최적화 기법 소개 - 홍광수 (NVIDIA)

• 16:30 - 17:10 Track 3

### WE ARE HIRING! AI Developer Technology Engineer

Study and Develop cutting-edge techniques in DL/ML, and perform in-depth analysis and optimization to achieve the best possible performance on GPU architectures

Work directly with key developers to understand and solve the practical problems using GPUs

Collaborate closely with the architecture, libraries, tools and research teams at NVIDIA to influence the design of next-generation architectures, software, and programming models

In short, we do what was discussed in this session  $\odot$ 



# APPENDIX

## **RULES IN STEP 1**

### Things to Remember

• Multiple inputs and outputs are allowed

 .Attr("AttrName": "AttrType") is used to configure the op int32, float, double, bool, type and etc are allowed as "AttrType"

• .SetShape() defines the output shape

c->set\_output(i, shape): set ith output's shape

c->input(i): get ith input's shape

## STEP 1. REGISTER OP'S INTERFACE IN C++

### Calculate Output Shape and check errors

attention\_op.cc

```
.SetShapeFn([](shape inference::InferenceContext *c) {
   Status status = Status::OK();
   int in0 rank = shape inference::InferenceContext::Rank(c->input(0));
   int inl rank = shape inference::InferenceContext::Rank(c->input(1));
   if(in0 rank != in1 rank) {
     status.Update(errors::InvalidArgument(
           "The input ranks are mismatched(", in0 rank, "!=", in1 rank, ")"));
   std::vector<shape inference::DimensionHandle> out dims;
   int i;
   for(i=0; i<in0 rank-2; i++) {</pre>
     auto in0 dim val = shape inference::InferenceContext::Value(c->Dim(c->input(0), i));
     shape inference::DimensionHandle in1 dim;
     TF RETURN IF ERROR(c->WithValue(c->Dim(c->input(1), i), in0 dim val, &in1 dim));
     out dims.push back(in1 dim);
   out dims.push_back(c->Dim(c->input(0), i));
   out dims.push back(c->Dim(c->input(1), i));
   i++;
   auto in0 k = shape inference::InferenceContext::Value(c->Dim(c->input(0), i));
   auto in1 k = shape inference::InferenceContext::Value(c->Dim(c->input(1), i));
   if(in0 k] = in1 k
     status.Update(errors::InvalidArgument(
           "Invalid input matrices: mx", in0 k, " and nx", in1 k));
   3
   c->set output(0, c->MakeShape(out dims));
   return status;
```

Top 4 things to notice

- How to get input dim and rank
- How to make shape object
- How to handle invalid arguments
- How to set output shape