Opened 11 years ago

Closed 11 years ago

#347 closed error (fixed)

vn6.1 atmosphere doesn't integrate for more than a month or so

Reported by: agt
Owned by: jeff
Component: UM Model
Keywords: atmos hector example job
Cc:
Platform:
UM Version: 6.1

Description

Dear all,

I'm trying to run UM version 6.1 on hector. My job is xenkb (which is closely derived from xczia). This run is not trying to do anything special: just 5 years, 1 month at a time, on 8x8 cores. I have added in the $PUM_61/hector_io.mf77 mod, as I read on another ticket that this helped. However, I have not managed to get beyond two or three months.

Somewhere in the leave file there is "qsresubmit: job not resubmitted due to NQS error".

The latest leave file is located in ~agt/um/umui_out/xenkb000.xenkb.d09322.t104944.leave
and other output in
~agt/work/xenkb/

Am I basing this on the correct example job?

In my profile I have:
TARGET_MC=pathscale_quad
and
UMSETUP=$UMDIR/vn6.1/$TARGET_MC/scripts/.umsetvars_6.1; export UMSETUP

Any advice is appreciated, thanks,

Andy

Change History (7)

comment:1 Changed 11 years ago by jeff

  • Owner changed from um_support to jeff
  • Status changed from new to accepted

Hi Andy

Something strange is going on here, but I'm not sure what. Am I right in assuming this is the 3rd 30-day chunk you have run? Looking at your output files, it seems the first 2 runs ran out of CPU time but completed anyway. The job should take about 45 minutes to run, so an hour of CPU time should be plenty. I will run your job and see if I can reproduce the problem.

Jeff.

comment:2 Changed 11 years ago by agt

hi Jeff,

There was an initial compile+run with the copied run length of 3 days.
Then I did a new compile+run (under the same job name) with a length of 5 years. This ran for just over a month, and then I put the CRUN back in, which I think got it to the stage it is at now,

thanks,

Andy

comment:3 Changed 11 years ago by jeff

Hi Andy

I've run a copy of your job and it seems fine: no CPU time limit was exceeded, and a CRUN resubmitted itself twice without problems. Why don't you resubmit your job as a CRUN and see what happens?

Jeff.

comment:4 Changed 11 years ago by agt

Jeff,

I resubmitted as a CRUN, and this is what happened after a while:
~agt/um/umui_out/xenkb000.xenkb.d09323.t180512.leave
this file includes, "export F_ERROPT1=271,271,2,2,2,2,2,2 # Stop underflow error
qsresubmit: job not resubmitted due to NQS error"

I also did a clean copy of this job, and set to compile and run. This one didn't even make it past the NRUN. See: xenzb000.xenzb.d09323.t182220.leave including such delights as, "aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 9) - process 3

ERROR!!! in reconfiguration in routine Rcf_Aux_File" etc.
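
Since the messages quoted in this ticket are buried somewhere in long leave files, a small script can pull them out with line numbers. The following is only a rough sketch, not part of the UM scripts: the function names scan_leave_text and scan_leave_file are made up for illustration, and the marker list contains only the strings quoted in this ticket, so other failures may be worded differently.

```python
from pathlib import Path

# Marker strings taken from the messages quoted in this ticket.
# Assumption: other failures may use different wording and will be missed.
MARKERS = (
    "qsresubmit: job not resubmitted due to NQS error",
    "MPI_Abort",
    "ERROR!!! in reconfiguration",
)


def scan_leave_text(text, markers=MARKERS):
    """Return (line_number, marker, line) for every line containing a marker."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for marker in markers:
            if marker in line:
                hits.append((number, marker, line.strip()))
    return hits


def scan_leave_file(path, markers=MARKERS):
    """Read a leave file (hypothetical helper) and scan its contents."""
    # errors="replace" avoids failing on stray non-UTF-8 bytes in the log.
    text = Path(path).read_text(errors="replace")
    return scan_leave_text(text, markers)
```

Calling scan_leave_file on a leave file path returns a list of hits that can be printed or compared between runs; an empty list only means none of the strings above were found, not that the run succeeded.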

I'm just trying xenzg, which is a copy of xenzb but with the reconfiguration set to compile/build instead of run from standard, which sounds like it might help.

cheers,

Andy

comment:5 Changed 11 years ago by agt

…to update: xenzg does seem to be running in CRUN (into the third month now), so touch wood it is ok and just needed a clean compilation.

comment:6 Changed 11 years ago by agt

Ok, still running the CRUN and now into July (month 11) in 1-month chunks. I think that means you can successfully close the ticket, which means I did something wrong somewhere,

thanks for your help,

Andy

comment:7 Changed 11 years ago by jeff

  • Resolution set to fixed
  • Status changed from accepted to closed

Glad to hear it's working okay now. I'm not at all sure what went wrong.

Jeff.
