2021-08-18

TorchANI Optimization

email: yueyericardo@gmail.com

The benchmark results shown here are purely TorchANI inference; the MD interface (e.g. Amber) is still WIP.

CUAEV Optimization

Benchmark Result on RTX 2080 Ti (original):


Some optimization details: the timings below are for a system of 10k atoms (a different PDB than the one above) running on a GTX 1080 (averaged over 200 iterations).


NN Optimization


For the original ensemble, the NN part is CPU-bound for small systems: 8 models * 7 networks (HCNOFSCl) * 7 layers (4 Linear + 3 CELU) = 392 kernel calls (doubled if the backward pass is also counted).
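The kernel-call arithmetic above can be checked directly (a toy count using the model, network, and layer numbers from the text):

```python
# Kernel launches per forward pass for the original ensemble:
# each of the 8 models has 7 per-species networks, each network has
# 4 Linear + 3 CELU layers, and every layer is a separate kernel call.
models = 8
species_networks = 7   # H, C, N, O, F, S, Cl
layers = 4 + 3         # 4 Linear + 3 CELU

original_calls = models * species_networks * layers
print(original_calls)      # 392
print(2 * original_calls)  # 784 if the backward pass is counted too

# After fusing the 8 same-species networks into one BmmNetwork each,
# only 7 networks * 7 layers remain:
fused_calls = species_networks * layers
print(fused_calls)         # 49
```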


  • Infer Model: fuse all identical networks of an ensemble into BmmNetworks; for example, the 8 identical H networks are fused into 1 BmmNetwork. This leaves 7 BmmNetworks (HCNOFSCl) * 7 layers (4 BatchLinear + 3 CELU) = 49 kernel calls.
  • MNP: parallelize across the different species networks (HCNO…) using C++ and OpenMP. For small systems the NN is CPU-bound, so running multiple CPU threads (OpenMP) reduces the CUDA kernel-launch overhead.
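The batched-Linear idea behind the fusion can be sketched as follows (the `BmmLinear` class and the layer sizes here are illustrative assumptions, not TorchANI's actual API): the weight matrices of the 8 same-species networks are stacked into one 3-D tensor, so a single `torch.bmm` launch replaces 8 separate `Linear` launches per layer.

```python
import torch

class BmmLinear(torch.nn.Module):
    """Apply E independent Linear layers with one batched matmul.

    Illustrative stand-in for the BatchLinear layer described above:
    the E ensemble members' (in, out) weight matrices are stacked into
    one (E, in, out) tensor, so one torch.bmm call replaces E Linear calls.
    """

    def __init__(self, ensemble_size: int, in_features: int, out_features: int):
        super().__init__()
        self.weight = torch.nn.Parameter(
            torch.randn(ensemble_size, in_features, out_features) * 0.1
        )
        self.bias = torch.nn.Parameter(torch.zeros(ensemble_size, 1, out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (E, n_atoms, in_features) -> (E, n_atoms, out_features)
        return torch.bmm(x, self.weight) + self.bias


# One fused H "network": every layer now runs once for all 8 models.
ensemble, n_atoms, aev_dim = 8, 100, 384
h_network = torch.nn.Sequential(
    BmmLinear(ensemble, aev_dim, 160), torch.nn.CELU(0.1),
    BmmLinear(ensemble, 160, 1),
)
aev = torch.randn(n_atoms, aev_dim)
# Broadcast the same AEVs to all ensemble members and run one fused pass.
out = h_network(aev.expand(ensemble, -1, -1))
print(out.shape)  # torch.Size([8, 100, 1])
```

Per-atom outputs can then be averaged over the ensemble dimension, matching what 8 separately evaluated H networks would produce.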

Ensemble Benchmark


Infer Model (ON) + MNP (OFF):

Infer Model (ON) + MNP (ON):