WeiNote

yyrcd

2021-08-18

TorchANI Optimization

email: yueyericardo@gmail.com

The benchmark results shown here cover pure TorchANI inference only; the MD interface (e.g. Amber) is still a work in progress.

CUAEV Optimization

Benchmark result on RTX 2080 Ti, comparing the original and the current implementation (benchmark figures not preserved in this export).

Some optimization details: the timings below are for a system of 10k atoms (a different PDB than above) running on a GTX 1080, averaged over 200 iterations.


NN Optimization

Background

For the original ensemble, the NN part is CPU-bound for small systems: 8 models * 7 networks (H, C, N, O, F, S, Cl) * 7 layers (4 Linear + 3 CELU) = 392 kernel calls (doubled if the backward pass is also counted).
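The kernel-launch arithmetic above, and the reduction achieved by fusing the ensemble (described in the next section), works out as:

```python
# Kernel launches per forward pass for the original ensemble:
n_models = 8            # ensemble members
n_species_nets = 7      # one network per element: H, C, N, O, F, S, Cl
layers_per_net = 7      # 4 Linear + 3 CELU

original = n_models * n_species_nets * layers_per_net
# After fusing identical networks across the ensemble, the model
# dimension collapses into a single batched call per layer:
fused = n_species_nets * layers_per_net

print(original, fused)  # 392 49
```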

Method

  • Infer Model: fuse all identical networks of an ensemble into one BmmNetwork; for example, the 8 H networks are fused into a single BmmNetwork. This leaves 7 BmmNetworks (H, C, N, O, F, S, Cl) * 7 layers (4 BatchLinear + 3 CELU) = 49 kernel calls.
  • MNP: parallelize across the different species networks (H, C, N, O, …) using C++ and OpenMP. For small systems the NN is CPU-bound, so using multiple CPU threads (OpenMP) reduces the CUDA kernel-launch overhead.
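The Infer-Model fusion can be sketched in plain PyTorch. The class and layer sizes below are illustrative (not TorchANI's actual `BmmNetwork` code): each "BatchLinear" holds the weights of all ensemble members and applies them in one `torch.bmm` call instead of 8 separate `Linear` calls.

```python
import torch

class BatchLinear(torch.nn.Module):
    """Linear layer for all ensemble members at once: one torch.bmm
    replaces n_models separate nn.Linear kernel launches."""
    def __init__(self, n_models, in_features, out_features):
        super().__init__()
        self.weight = torch.nn.Parameter(
            torch.randn(n_models, in_features, out_features) * 0.1)
        self.bias = torch.nn.Parameter(torch.zeros(n_models, 1, out_features))

    def forward(self, x):  # x: (n_models, n_atoms, in_features)
        return torch.bmm(x, self.weight) + self.bias

# A fused "BmmNetwork": e.g. the 8 identical H networks as one module.
# Layer widths here are placeholders, not the real ANI architecture.
n_models, n_atoms = 8, 100
net = torch.nn.Sequential(
    BatchLinear(n_models, 384, 160), torch.nn.CELU(0.1),
    BatchLinear(n_models, 160, 128), torch.nn.CELU(0.1),
    BatchLinear(n_models, 128, 96),  torch.nn.CELU(0.1),
    BatchLinear(n_models, 96, 1),
)
aev = torch.randn(n_atoms, 384)            # same AEV input for every model
out = net(aev.expand(n_models, -1, -1))    # (8, 100, 1) in 4 bmm launches
energy = out.mean(0).sum()                 # ensemble-averaged atomic energies
```

With this fusion, the per-layer launch count no longer scales with the ensemble size, which is exactly where the 392 → 49 reduction comes from.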

Ensemble Benchmark

The ensemble benchmark compares three configurations (benchmark figures not preserved in this export): Original; Infer Model (ON) + MNP (OFF); Infer Model (ON) + MNP (ON).