[AMBER-Developers] New GB simulation engine: mdgx.cuda

From: David Cerutti <dscerutti.gmail.com>
Date: Thu, 23 May 2019 13:23:43 -0400

Dear Developers,

I am pleased to announce the beta release of my latest contribution to the
Amber fleet of simulators, mdgx.cuda in the mdgxCuda branch of the
repository. The new &pptd module is intended to simulate small systems, 928
atoms or fewer, in GB or vacuum conditions. As with pmemd.cuda GB, there
is no cutoff to be concerned about: all particles interact with all other
particles. The twist is that the program simulates more than one system at
a time: dozens or even hundreds. It devotes one block of the GPU grid to
each system and runs all of the dynamics in just one kernel, which can proceed
for thousands of steps before either moving on to a different system in the
master list or at last shutting down.
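
The "one block per system, then move on to the next system in the master
list" idea can be pictured with a toy queue model. This is only a sketch under
loud assumptions: the function name, the notion of a fixed number of block
"slots", and the per-system costs are all hypothetical, and the real GPU block
scheduler is hardware-managed and not exposed like this.

```python
import heapq

def schedule(costs, n_slots):
    """Toy greedy dispatch: each system, taken in list order, goes to the
    block slot that becomes free first. Hypothetical model, not mdgx code."""
    # Each heap entry is (time at which the slot is free, slot index).
    slots = [(0.0, i) for i in range(n_slots)]
    heapq.heapify(slots)
    assignment = []
    for sys_id, cost in enumerate(costs):
        free_at, slot = heapq.heappop(slots)
        assignment.append((sys_id, slot, free_at))
        heapq.heappush(slots, (free_at + cost, slot))
    # The card is busy until the last slot finishes its final system.
    makespan = max(t for t, _ in slots)
    return assignment, makespan

# Four systems with made-up relative costs, sharing two block slots.
assignment, makespan = schedule([3.0, 1.0, 2.0, 2.0], n_slots=2)
for sys_id, slot, start in assignment:
    print(f"system {sys_id} -> slot {slot}, starts at t={start}")
print("all systems done at t =", makespan)
```

In this picture a slot that finishes a cheap system picks up the next one at
once, which is how a long master list can keep every slot occupied.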

For individual simulations, the engine can proceed at a significant
fraction of the speed of pmemd, and for very small systems it can even push
many simulations at a faster pace than pmemd can use the entire card to
push just one. An RTX-6000 (similar to a 2080Ti) running mdgx.cuda can
push 72 copies of a 900 atom system with igb=8 at about 15% of the pace
that pmemd can push each of them (total throughput 10x greater than
pmemd.cuda). That same RTX-6000 can push 72 15-residue, 225-atom systems
each at the pace pmemd.cuda can do just one (72x greater throughput), and
for tiny oligopeptides the speedup is even greater (the card can produce
hundreds of microseconds of aggregate trajectory per day). The thing is
also designed to "gear down" when hundreds of copies are in play, devoting
smaller blocks to each system in order to get the best overall output (this
feature works as intended, but will get more polishing soon).
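
The aggregate-throughput figures above follow from multiplying the number of
concurrent copies by the per-copy pace relative to pmemd.cuda. A minimal
arithmetic sketch, using only the numbers quoted in this message (the function
name is made up for illustration):

```python
def aggregate_speedup(n_copies, relative_pace):
    """Total throughput relative to one pmemd.cuda run using the whole card.

    relative_pace is the per-copy speed as a fraction of the pmemd.cuda pace.
    """
    return n_copies * relative_pace

# 72 copies of a 900-atom system, each at about 15% of the pmemd.cuda pace.
large = aggregate_speedup(72, 0.15)
# 72 copies of a 225-atom system, each at about the full pmemd.cuda pace.
small = aggregate_speedup(72, 1.0)

print(f"900-atom case: about {large:.1f}x")
print(f"225-atom case: about {small:.0f}x")
```

The first case works out to roughly 10.8x, consistent with the "10x greater"
figure quoted above.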

There are efforts underway to add temperature and Hamiltonian REMD to the
module (replicas at different temperatures and interpolation between
end-point topologies are already supported; it is only the exchange that
isn't yet ready). The new module is not limited to many copies of a single
system, however: an investigator with dozens of small peptides or
oligosaccharides can queue them up in the same input deck. mdgx.cuda will
turn the GPU into a miniature Beowulf cluster and use the GPU block
scheduler as the queueing system to keep the entire card busy as long as

The new engine supports RATTLE (the SHAKE equivalent for mdgx's Velocity
Verlet integrator) as well as a multiple time-stepping scheme which appears
to have a slight speed advantage over RATTLE that grows with smaller system
sizes. Either method can simulate systems at a 4fs time step, with the MTS
method (which updates all bonds, angles, and 1-4 interactions as part of
the short step) appearing to have an advantage in energy conservation as
well. A Langevin thermostat is provided to cover a multitude of sins.
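
For readers unfamiliar with multiple time-stepping, the general idea can be
sketched on a one-dimensional toy problem. This is not the mdgx.cuda
implementation: it assumes a standard r-RESPA-style splitting on top of
Velocity Verlet, with a stiff "fast" force standing in for the bonded and 1-4
terms and a weak "slow" force standing in for everything else. All names and
constants below are hypothetical.

```python
def mts_velocity_verlet(x, v, mass, fast_force, slow_force, dt_outer, n_inner):
    """One outer step: slow force kicks at both ends, with n_inner
    Velocity Verlet sub-steps under the fast force in between."""
    dt_inner = dt_outer / n_inner
    v += 0.5 * dt_outer * slow_force(x) / mass      # half kick, slow force
    for _ in range(n_inner):
        v += 0.5 * dt_inner * fast_force(x) / mass  # half kick, fast force
        x += dt_inner * v                           # drift
        v += 0.5 * dt_inner * fast_force(x) / mass  # half kick, fast force
    v += 0.5 * dt_outer * slow_force(x) / mass      # half kick, slow force
    return x, v

# Toy model (assumed values): stiff spring = "fast", weak spring = "slow".
K_FAST, K_SLOW, MASS = 100.0, 1.0, 1.0

def fast_force(x):
    return -K_FAST * x

def slow_force(x):
    return -K_SLOW * x

def total_energy(x, v):
    return 0.5 * MASS * v * v + 0.5 * (K_FAST + K_SLOW) * x * x

x, v = 1.0, 0.0
e0 = total_energy(x, v)
for _ in range(1000):
    x, v = mts_velocity_verlet(x, v, MASS, fast_force, slow_force,
                               dt_outer=0.05, n_inner=5)
drift = abs(total_energy(x, v) - e0) / e0
print(f"relative energy error after 1000 outer steps: {drift:.2e}")
```

The slow force is evaluated once per outer step while the fast force is
evaluated n_inner times, which is where a speed advantage can come from when
the slow force is the expensive part.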

At present I am working out a few details regarding register usage and
kernel branching. (The dynamics kernel is pretty stuffed--if we want to add
more features I'm going to have to find ways to keep the register pressure
down so we don't have to drop our thread counts and thus overall speed.)
The company that funded the development of this software, Rubryc
Therapeutics Inc., is using the engine to great effect, getting 20x the
product out of their GTX-1080Ti cluster with 30x more results expected from
a new RTX cluster. We are also applying this in a collaboration with a
group at UC Davis to study some glycopeptides in the gas phase--in these
cases an RTX-2080Ti GPU appears to be worth about 900 CPU cores running

If I can get beta testers for the mdgxCuda branch in the repository, I
would love feedback and stress testing. Extra features are also possible,
but as mentioned the kernel is getting pretty stuffed so it will take some
care to keep the GPU performance up while adding new capabilities. To get
started, switch to the mdgxCuda branch, configure amber with -cuda and
compile, then run mdgx.cuda -PPTD to see the on-board manual describing the
inputs. The attached test case will show the operation of the program on
an array of systems. To run the test case, do:

${AMBERHOME}/bin/mdgx.cuda -O -i mdgxTest.in

As with other things mdgx, there are some niceties in there, such as system
and input sanity checking, auto-detection of available GPUs, and polite
behavior when all are taken, that can make their way into pmemd. (However, for
logistical reasons, this multi-simulation capability will be unique to
mdgx--the pmemd code would take extensive rewriting, including a new GB
engine and major changes in the Fortran layer, to support this feature.)
The CPU and GPU versions of the mdgx GB code are designed to work on the
same fp32 / int32 accumulation precision model, but there is no DPFP
version as yet.


AMBER-Developers mailing list

Received on Thu May 23 2019 - 10:30:03 PDT