Re: [AMBER-Developers] Current amber CVS test fails from Daniel Roe on 2010-03-11 (Amber Developers Archive Mar 2010)

From: Daniel Roe <daniel.r.roe.gmail.com>
Date: Thu, 11 Mar 2010 09:36:32 -0500

I've been trying to uncover the source of the dependence of the trajene_box
test case results on the number of processors, but so far I'm stumped.

When you do have a box and you run imin=5 with 1 processor, the results are
comparable to separating the trajectory into individual frames and
calculating the energy of each. There are very minor differences in the EEL
and VDW terms, but the RMS error is less than 0.001 kcal/mol (which is why
the trajene_box test passes on 1 processor). However, when you run imin=5
with more than 1 processor, large energy differences relative to the
1-processor run show up in the EEL and VDW terms: RMS errors of about 3
kcal/mol (although the RMS of Epot is about 0.1). I've been through a lot of
the code but so far I can't seem to figure out what's causing the
difference. This dependence on the number of processors doesn't happen
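
For reference, the per-term RMS comparison described above can be sketched as follows. This is only an illustration, not the comparison the test suite itself performs: the function name rms_difference and the energy values are made up, and only the 0.001 kcal/mol threshold comes from the numbers quoted above.

```python
import math

def rms_difference(reference, other):
    """Root-mean-square difference between two equal-length energy series."""
    if len(reference) != len(other):
        raise ValueError("energy series must have the same number of frames")
    if not reference:
        raise ValueError("energy series must not be empty")
    squared = [(a - b) ** 2 for a, b in zip(reference, other)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical per-frame EEL energies (kcal/mol); NOT real test output.
eel_1proc = [-100.0, -101.0, -102.0]   # e.g. from the 1-processor run
eel_nproc = [-100.0, -104.0, -98.0]    # e.g. from a multi-processor run

rms = rms_difference(eel_1proc, eel_nproc)

# Threshold taken from the 1-processor agreement quoted above (assumed
# here to be a reasonable pass/fail cutoff for illustration only).
TOLERANCE = 0.001
exceeds_tolerance = rms > TOLERANCE
print(f"RMS difference: {rms:.4f} kcal/mol, exceeds tolerance: {exceeds_tolerance}")
```

Running the same function on each energy term (EEL, VDW, Epot) for the 1-processor and N-processor outputs would show which terms diverge and by how much.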

I suspect that there may be a bunch of initialization that goes on in the
ewald routines at sander startup that I just don't reproduce between each
minimization of an imin=5 run, but why this is dependent on the number of
processors I don't know. I think I just don't know the ewald code well enough
to figure out what could be going wrong. If this doesn't get figured out
soon, maybe imin=5 + ntb>0 should be restricted to single processor only?

-Dan

On Wed, Mar 10, 2010 at 10:11 AM, Mark Williamson <mjw.sdsc.edu> wrote:

> Dear All,
>
> I am looking to land a major change to the internal ener and ene() arrays
> in the current amber11 code. This will essentially replace these with a
> module containing a nested derived type, called "state", for energy and
> other system properties, accounting within the code. This has taken a while
> to code since the existing ene() and ener() were quite esoteric and
> regularly abused.
>
> The testing has also taken some time and I still know of a few remaining
> issues with my changes that need addressing. I want to get this into the
> tree now and iron out the remaining issues. However, the show stopper on this
> has been existing parallel test cases in the tree just being plain broken
> and new issues occurring when changing from "-np 2" to "-np 4". I've put the
> fails up at http://www.wmd-lab.org/mjw/for_amber_dev/090310/amber/ :
>
> Generally with "-np 2" the following are real fails:
>
> test/ti_decomp/ti_decomp_1.out.dif
> test/ti_decomp/ti_decomp_2.out.dif
> test/gb_rna
>
> and with "-np 4", the following *extra* are real fails:
>
> test/softcore/ (pretty much all the restarts)
> dynlmb
> test/ncsu/bbmde
>
> Can anyone comment?
>
> I have a good grasp of the few remaining fails that are occurring within my
> code that I have been working on locally, but I am reluctant to check them
> in with these existing fails since it may make the problem of solving them
> harder. I am under a tight deadline, and I would really appreciate the
> relevant owners fixing these, or I may just have to commit and thus make
> solving these harder.
>
> regards,
>
> Mark
>
> _______________________________________________
> AMBER-Developers mailing list
> AMBER-Developers.ambermd.org
> http://lists.ambermd.org/mailman/listinfo/amber-developers
>
Received on Thu Mar 11 2010 - 07:00:03 PST