Re: amber-developers: cap, igb=10 (or not) and charge perturbation

From: B. Lachele Foley
Date: Mon, 6 Mar 2006 13:21:12 -0700

>Is this reproducible? It sounds like an mpi problem --

Yes, it's reproducible and yeah, it does. I'd looked through
the /var/spool/PBS/* info before I wrote, though, and didn't
see anything. The output files are similarly uninformative.

We just re-ran one of the jobs. I still saw no useful output
and there were no PBS complaints on the head node or on any of
the four execute nodes (8 processors).

The closest thing I found has to do with the myrinet
connections. From a sample job, /var/log/messages shows these
errors from the 4 nodes in question at about the time the job
was running:

Mar 6 14:40:28 node58 kernel: GM: pid 6964:fork() support limited: send_queue is not first vma
Mar 6 14:40:28 node57 kernel: GM: pid 26333:fork() support limited: send_queue is not first vma
Mar 6 14:40:28 node54 kernel: GM: pid 15664:fork() support limited: send_queue is not first vma
Mar 6 14:40:28 node58 kernel: GM: pid 6963:fork() support limited: send_queue is not first vma
Mar 6 14:40:28 node52 kernel: GM: pid 28351:fork() support limited: send_queue is not first vma

I see messages like this a lot on jobs that run without trouble,
so this might not be the issue.
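
To see whether these notices line up with the failing jobs, something
like the following could tally them per node. This is only a sketch:
the function name and the regular expression are my own assumptions,
built from the lines quoted above, and the sample data is just those
five lines. In practice you would feed it lines read from
/var/log/messages.

```python
import re
from collections import defaultdict

# Pattern for the GM notice as it appears in the syslog lines quoted above.
# Assumption: one notice per line, in the usual "Mon D HH:MM:SS host" layout.
GM_NOTICE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+kernel:\s+GM:\s+"
    r"pid\s+(?P<pid>\d+):fork\(\)\s+support\s+limited"
)

def gm_notices_by_host(lines):
    """Return {host: set of pids} for every GM fork() notice in lines."""
    hits = defaultdict(set)
    for line in lines:
        match = GM_NOTICE.match(line)
        if match:
            hits[match.group("host")].add(int(match.group("pid")))
    return dict(hits)

# The five notices from the sample job, one per line.
SAMPLE = [
    "Mar 6 14:40:28 node58 kernel: GM: pid 6964:fork() support limited: send_queue is not first vma",
    "Mar 6 14:40:28 node57 kernel: GM: pid 26333:fork() support limited: send_queue is not first vma",
    "Mar 6 14:40:28 node54 kernel: GM: pid 15664:fork() support limited: send_queue is not first vma",
    "Mar 6 14:40:28 node58 kernel: GM: pid 6963:fork() support limited: send_queue is not first vma",
    "Mar 6 14:40:28 node52 kernel: GM: pid 28351:fork() support limited: send_queue is not first vma",
]

if __name__ == "__main__":
    for host, pids in sorted(gm_notices_by_host(SAMPLE).items()):
        print(host, sorted(pids))
```

Running the same tally over a window where a job finished normally
would show whether the notices are specific to the failures.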

The myrinet site contains the notice quoted below. I guess I'll
send them the information they request. Do you think their
description is likely to apply here?

What does the GM-2 NOTICE message "pid xxxx:fork() support
limited: send_queue is not first vma" mean?

Does the MPI application use fork(), and use lots of memory?

This message is printed when an executable has a slightly
unusual layout in memory (which depends on how it was
compiled). Then there will be a small restriction on fork()
usage: if a GM application fails in the middle of the fork(),
there is a possibility that the gm ports of this application
will no longer be usable (and will need to be closed before
working again) when fork() returns -1.

In that particular case, the message should not cause any
operational problem:

    * if the application does not fork(), nothing bad can happen.
    * if it fork()s successfully, nothing bad will happen.
    * there would only be a problem if:
          o there is a memory resource starvation in the
            middle of the fork()
          o the application tries to continue using the GM
            port after this failure.

The advance warning is printed so that we would know if some
executable did not conform to some driver requirements before
it can cause a problem.

Please send the output of:

  objdump --section-headers <executable>

(with the application executable that leads to this message),
as well as what linker is used to compile the executable and
any flags passed to it, to for further assistance.
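
The one risky case the notice describes (fork() fails and the
application keeps using the GM port) can be sketched as below. This is
an illustration only, not how sander or the MPI layer actually handles
it: release_port is a hypothetical callback standing in for whatever
closes the GM port, and the fork parameter exists only so the failure
path can be exercised without a real fork. Note that in Python a failed
fork raises OSError, where the C call returns -1.

```python
import os

def fork_or_release(release_port, fork=os.fork):
    """Fork; if the fork fails, release the port instead of reusing it.

    release_port: hypothetical callback that closes the GM port.
    fork: the fork function to call (os.fork by default).
    Returns the value from fork(), or None if the fork failed.
    """
    try:
        return fork()
    except OSError:
        # The notice says the port may be unusable after a failed fork
        # and must be closed before it can work again, so close it here
        # rather than carrying on with it.
        release_port()
        return None
```

A caller would reopen the port after a None return before attempting
any further communication.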

>What happens if you set nstlim to a number shorter than the
>value at which the simulation stopped? Do you get a full
>output with reasonable looking results?

Yep. It runs all the way to:

"| wallclock() was called 360970 times"

The user says the output looks pretty normal, too.

:-) L
Received on Wed Apr 05 2006 - 23:49:41 PDT