2022

Apr 07

2022-04-07 ·

gpu

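Permutations and Combinations

Stars and Bars (divider method)

Given n identical balls and k buckets (buckets may be empty), how many ways are there to distribute the balls?

This is equivalent to placing k - 1 dividers. There are n + k - 1 slots in total; choose k - 1 of them for the dividers and fill the rest with balls. The answer is therefore C(n+k-1, k-1), where C(n, r) = n! / ((n-r)! r!)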
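A minimal sketch to sanity-check the formula, assuming identical balls, distinguishable buckets, and k >= 1. The function names `stars_and_bars` and `brute_force` are made up for this note; the brute-force version simply enumerates every tuple of bucket counts and keeps those that sum to n.

```python
from itertools import product
from math import comb


def stars_and_bars(n: int, k: int) -> int:
    """Ways to put n identical balls into k >= 1 buckets (empty buckets allowed)."""
    return comb(n + k - 1, k - 1)


def brute_force(n: int, k: int) -> int:
    """Count k-tuples of non-negative integers that sum to n by enumeration."""
    return sum(1 for counts in product(range(n + 1), repeat=k) if sum(counts) == n)


if __name__ == "__main__":
    for n, k in [(3, 2), (5, 3), (4, 4)]:
        # The closed form and the enumeration should agree.
        print(n, k, stars_and_bars(n, k), brute_force(n, k))
```

The enumeration grows as (n+1)^k, so it is only useful as a check for small inputs.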

Ref: 巧用隔板法解答组合数学问题 - 橘子网

2021

Dec 31

2021-12-31 ·

gpu

Nsight analysis

nsys nvprof python benchmark.py
nv-nsight-cu-cli --launch-skip 1 --launch-count 3 python benchmark.py
nv-nsight-cu-cli --kernel-regex "cuAngularAEVs" --launch-skip 1 --launch-count 1 python benchmark.py

ncu -f -o my_report_%h_%i --set=detailed --section=SourceCounters --import-source=yes --launch-skip=1 --launch-count=3 python benchmark.py

Memory efficiency: cuda-training-series/exercises/hw8 at master · olcf/cuda-training-series

nv-nsight-cu-cli --metrics l1tex__t_sectors_pipe_lsu_mem_global_op_ld.sum,l1tex__t_requests_pipe_lsu_mem_global_op_ld.sum,l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_ld.ratio,l1tex__t_sectors_pipe_lsu_mem_global_op_st.sum,l1tex__t_requests_pipe_lsu_mem_global_op_st.sum,l1tex__average_t_sectors_per_request_pipe_lsu_mem_global_op_st.ratio,smsp__sass_average_data_bytes_per_sector_mem_global_op_ld.pct,smsp__sass_average_data_bytes_per_sector_mem_global_op_st.pct --launch-count 3 ./a.out

Ref:

2021

Aug 18

2021-08-18 ·

gpu

TorchANI Optimization

2021

Jul 05

2021-07-05 ·

gpu

GPU Number Sense | Intuition | Common Knowledge

Architecture

GA100 Ampere Specs
- 8 GPCs, 8 TPCs/GPC, 2 SMs/TPC, 16 SMs/GPC, 128 SMs/GPU
  Ref: 2021-07-05-nvidia-ampere-architecture-whitepaper.pdf
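The per-level counts multiply out to the totals in the list. A quick check, using only the figures quoted above (the constant names are just for illustration):

```python
# GA100 hierarchy figures as listed above (full chip).
GPCS_PER_GPU = 8
TPCS_PER_GPC = 8
SMS_PER_TPC = 2

sms_per_gpc = TPCS_PER_GPC * SMS_PER_TPC   # SMs in one GPC
sms_per_gpu = GPCS_PER_GPU * sms_per_gpc   # SMs in the full GPU

print(sms_per_gpc, sms_per_gpu)  # prints: 16 128
```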

FLOPS

Floating-point operations per second (FLOPS)

gigaFLOPS GFLOPS 10^9
teraFLOPS TFLOPS 10^12
NVIDIA GeForce RTX 2080 Ti: 13.45 TFLOPS
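A rough sketch of where a figure like this comes from, assuming the usual estimate of peak FP32 throughput as cores × clock × 2 (one fused multiply-add per core per cycle counts as two operations). The core count (4352) and reference boost clock (1545 MHz) are not from this note; they are the published RTX 2080 Ti specs as I recall them, so treat them as assumptions.

```python
def peak_fp32_flops(cuda_cores: int, clock_hz: float, flops_per_core_per_cycle: int = 2) -> float:
    """Theoretical peak FP32 FLOPS; an FMA counts as 2 floating-point operations."""
    return cuda_cores * clock_hz * flops_per_core_per_cycle


# Assumed specs: 4352 CUDA cores, 1545 MHz reference boost clock.
rtx_2080_ti = peak_fp32_flops(cuda_cores=4352, clock_hz=1.545e9)

print(f"{rtx_2080_ti / 1e9:.0f} GFLOPS")
print(f"{rtx_2080_ti / 1e12:.2f} TFLOPS")
```

This is a theoretical ceiling; real kernels reach it only when they are compute-bound and issue mostly FMAs.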

Memory

1 million floats take 4 MB; 1 billion floats take 4 GB.
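The same rule of thumb as arithmetic, assuming 4-byte (32-bit) floats and decimal units (1 MB = 10^6 bytes). `array_bytes` is a helper name invented here.

```python
FLOAT32_BYTES = 4


def array_bytes(num_elements: int, bytes_per_element: int = FLOAT32_BYTES) -> int:
    """Raw storage for a dense array, ignoring any allocator overhead."""
    return num_elements * bytes_per_element


for label, count in [("1 million", 10**6), ("1 billion", 10**9)]:
    size = array_bytes(count)
    # Show decimal megabytes next to binary mebibytes, since tools differ.
    print(f"{label} floats: {size / 1e6:.0f} MB = {size / 2**20:.1f} MiB")
```

For half precision pass `bytes_per_element=2`, for double precision `8`.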

GPU compute capability vs architecture list:

Compute Capability	Architecture Name
8	Ampere
7.5	Turing
7	Volta
6	Pascal
5	Maxwell
3	Kepler
2	Fermi
1	Tesla

Ref:

2021

Mar 18

2021-03-18 ·

gpu

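CUDA Memory Transaction

When global memory is read through L1 (the default), one transaction is 32 floats (128 bytes). Loads can instead be routed through L2 only, in which case a transaction is 1, 2, or 4 segments of 8 floats (32 bytes) each. So when memory access is not coalesced, going through L2 wastes less bandwidth. L2 is also fast when every thread in a warp reads the same element.

Ref: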

2013 GTC Programming-Guidelines-GPU-Architecture - backup
2012 GTC GPU-Performance-Analysis - backup
2010 GTC Better Performance at Lower Occupancy - backup
2018 GTC volta architecture and performance optimization
2011 register_spilling
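To make the L1 versus L2 transaction sizes above concrete, here is a simplified model: a warp of 32 threads each loads one float, and every touched chunk (a 128-byte line for the L1 path, a 32-byte segment for the L2 path) is fetched once. It ignores caching between warps and real hardware details; the function names and access patterns are made up for illustration.

```python
FLOAT_BYTES = 4
WARP_SIZE = 32
LINE_BYTES = 128     # L1 path: one transaction = 32 floats
SEGMENT_BYTES = 32   # L2 path: one segment = 8 floats


def bytes_moved(float_indices, granularity):
    """Bytes transferred if each touched chunk of `granularity` bytes is fetched once."""
    chunks = {(i * FLOAT_BYTES) // granularity for i in float_indices}
    return len(chunks) * granularity


def efficiency(float_indices, granularity):
    """Useful bytes (distinct floats requested) divided by bytes moved."""
    useful = len(set(float_indices)) * FLOAT_BYTES
    return useful / bytes_moved(float_indices, granularity)


patterns = {
    "coalesced": list(range(WARP_SIZE)),                  # thread i reads float i
    "stride 32": [i * 32 for i in range(WARP_SIZE)],      # each thread hits its own line
    "same element": [0] * WARP_SIZE,                      # whole warp reads float 0
}

for name, idx in patterns.items():
    print(
        f"{name:>12}: L1 {bytes_moved(idx, LINE_BYTES):5d} B "
        f"({efficiency(idx, LINE_BYTES):.3f}), "
        f"L2 {bytes_moved(idx, SEGMENT_BYTES):5d} B "
        f"({efficiency(idx, SEGMENT_BYTES):.3f})"
    )
```

In this model the coalesced case moves 128 bytes either way, while the strided and same-element cases move four times fewer bytes through 32-byte segments.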

2021

Feb 11

2021-02-11 ·

gpu

Scan (cumsum)

Blockwise communication

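// Wait until the previous block has finished its partial sum.
// atomicAdd(..., 0) serves as an atomic read of the flag.
while (atomicAdd(&sum[sbid], 0) == 0) {}
// ... do this block's work ...
// Make this block's writes visible, then signal the next block.
__threadfence();
atomicAdd(&sum[sbid + 1], 1);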

Dynamic blockIdx assignment

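The scheme above needs blocks to run in order. If the scheduler hands out blocks backward, later blocks may already be running and waiting on the previous block's result while the SMs have no resources left to schedule that previous block, which can easily deadlock. blockIdx.x is nominally the block's ID, but do not use it here, because nothing guarantees that the scheduler assigns it in order. The fix is to define a block ID counter in global memory: when a block starts executing, thread 0 does an atomicAdd of one on the counter and uses the returned value as the block's ID. The IDs are then handed out in the order the blocks actually start running.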
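A CPU-side model of the two ideas above (chained partial sums plus a dynamically assigned block ID), written in Python with threads standing in for blocks. This is only a sketch of the control flow, not GPU code: the lock-protected counter plays the role of atomicAdd on a global counter, and `threading.Event` plays the role of the spin-wait flag. All names (`chained_scan`, `carry`, `ready`) are hypothetical.

```python
import threading
from itertools import accumulate


def chained_scan(data, block_size):
    """Inclusive prefix sum computed by 'blocks' that pass a running total down a chain."""
    num_blocks = (len(data) + block_size - 1) // block_size
    out = [0] * len(data)
    carry = [0] * (num_blocks + 1)   # carry[b] = sum of all elements before block b
    ready = [threading.Event() for _ in range(num_blocks + 1)]
    ready[0].set()                   # block 0 has no predecessor to wait for

    counter = [0]
    counter_lock = threading.Lock()  # stands in for atomicAdd on a global counter

    def block():
        # Take a ticket when the block actually starts; do not rely on a preassigned index.
        with counter_lock:
            bid = counter[0]
            counter[0] += 1

        start = bid * block_size
        local = list(accumulate(data[start:start + block_size]))  # scan within the block

        ready[bid].wait()            # wait for the previous block's running total
        base = carry[bid]
        for i, value in enumerate(local):
            out[start + i] = base + value

        carry[bid + 1] = base + local[-1]
        ready[bid + 1].set()         # signal the next block

    threads = [threading.Thread(target=block) for _ in range(num_blocks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


if __name__ == "__main__":
    print(chained_scan(list(range(1, 11)), block_size=3))
```

Because a block only gets ID b after b earlier blocks have already started, every block it waits on is guaranteed to be running or finished, which is the property that avoids the deadlock.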

2020

Dec 22

2020-12-22 ·

gpu

CUDA Optimization | GTC