TorchANI Optimization
email: yueyericardo@gmail.com
The benchmark results shown here are for pure TorchANI inference; the MD interface (e.g. Amber) is still WIP.
CUAEV Optimization
Benchmark Result on RTX 2080 Ti
Original:

Current:

Some optimization details:
The timings below are for a system of 10k atoms (a different PDB than the one above) running on a GTX 1080 (averaged over 200 iterations).

NN Optimization
Background
For the original ensemble, the NN is CPU-bound for small systems: 8 models * 7 networks (H, C, N, O, F, S, Cl) * 7 layers (4 Linear + 3 CELU) = 392 kernel calls (doubled if the backward pass is also counted).
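The launch-count arithmetic can be spelled out directly (all numbers are taken from the text above):

```python
# Forward-pass kernel launches for the original ensemble:
# 8 models, 7 species networks each, 7 layers per network (4 Linear + 3 CELU).
models, networks, layers = 8, 7, 7
forward_launches = models * networks * layers
print(forward_launches)      # 392
print(2 * forward_launches)  # 784 when the backward pass is also counted
```

For a small system each of these kernels finishes almost instantly on the GPU, so the fixed CPU-side cost of launching 392+ kernels per step dominates the wall time.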
Method
- Infer Model: fuse all identical networks of an ensemble into BmmNetworks; for example, the 8 H networks are fused into 1 BmmNetwork. 7 BmmNetworks (HCNOFSCl) * 7 layers (4 BatchLinear + 3 CELU) = 49 kernel calls.
- MNP: parallelize across the different species networks (HCNO…) using C++ and OpenMP. For small systems the NN is CPU-bound, so using multiple CPU threads (OpenMP) reduces the CUDA kernel-launch overhead.
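The Infer Model fusion can be sketched in plain PyTorch. This is a minimal illustration of the idea, not the actual TorchANI implementation: the 8 identical H networks are stacked so each layer runs as one batched matmul (`torch.bmm`) instead of 8 separate `Linear` calls. The class and variable names here are illustrative.

```python
import torch

class BmmLinear(torch.nn.Module):
    """One Linear layer for all ensemble members, evaluated with a single bmm."""
    def __init__(self, linears):
        super().__init__()
        # Stack the members' weights: (models, in_features, out_features)
        self.weight = torch.nn.Parameter(
            torch.stack([l.weight.detach().t() for l in linears]))
        # Biases: (models, 1, out_features), broadcast over atoms
        self.bias = torch.nn.Parameter(
            torch.stack([l.bias.detach().unsqueeze(0) for l in linears]))

    def forward(self, x):
        # x: (models, atoms, in) -> (models, atoms, out) in one kernel call
        return torch.bmm(x, self.weight) + self.bias

# 8 ensemble members' H networks, each mapping a 384-dim AEV to 160 features
# (the layer sizes are illustrative)
linears = [torch.nn.Linear(384, 160) for _ in range(8)]
fused = BmmLinear(linears)

aev = torch.randn(100, 384)             # AEVs of 100 H atoms
x = aev.unsqueeze(0).expand(8, -1, -1)  # same input broadcast to all 8 models
out = fused(x)                          # (8, 100, 160)

# Agrees with running the 8 Linear layers one by one
ref = torch.stack([l(aev) for l in linears])
assert torch.allclose(out, ref, atol=1e-5)
```

One `bmm` launch per layer replaces 8 `Linear` launches, which is where the drop from 392 to 49 kernel calls comes from.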
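The MNP idea can also be approximated from Python, as a hedged sketch only: TorchANI's MNP does this in C++ with OpenMP, but since PyTorch releases the GIL inside its C++ ops, CPU threads can already overlap the launch work for different species networks. All names and layer sizes below are illustrative.

```python
import torch
from concurrent.futures import ThreadPoolExecutor

# One small MLP per species (sizes are illustrative, not TorchANI's)
species_nets = {
    s: torch.nn.Sequential(
        torch.nn.Linear(384, 160),
        torch.nn.CELU(0.1),
        torch.nn.Linear(160, 1),
    )
    for s in "HCNO"
}

# Per-species AEV batches, e.g. 50 atoms of each element
aevs = {s: torch.randn(50, 384) for s in "HCNO"}

def run(s):
    # Each thread launches the kernels for one species network; the
    # CPU-side launch overhead of the four networks overlaps.
    return s, species_nets[s](aevs[s])

with ThreadPoolExecutor(max_workers=4) as pool:
    energies = dict(pool.map(run, "HCNO"))

total = sum(e.sum() for e in energies.values())
```

The C++/OpenMP version avoids Python entirely, so the threads spend all their time in kernel launches rather than interpreter bookkeeping.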
Ensemble Benchmark
Original:

Infer Model (ON) + MNP (OFF):

Infer Model (ON) + MNP (ON):
