Opened 4 years ago

Closed 4 years ago

#2031 closed help (fixed)

Cannot read PE0 file and atmos.xhist

Reported by: mattjbr123
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: MONSooN
UM Version: 10.3


I feel like I am missing something really obvious here, so apologies if I've just forgotten a trivial thing.

I am running suite u-af404, which I have adapted from u-ae805 into an AMIP-style climate ensemble suite with nudging and build/run switches implemented. At the moment I am running it with only 1 ensemble member as an initial test. Each ensemble member is characterised by a 'RIPcode' (I can't remember exactly what the acronym stands for), which is basically a number that sets the ENSMEMBER variable, used by the stochastic physics scheme to introduce different random noise into each member. These codes are also used in the filenames and directories for each ensemble member.

However, when I run this suite, the install_ancil, fcm_make, fcm_make2 and recon tasks all succeed, but the atmos_main task does not. On its first submission, the following appears in the job.err file:

apsched: claim exceeds reservation's node-count
[FAIL] um-atmos # return-code=1
Received signal ERR
cylc (scheduler - 2016-12-02T18:47:43Z): CRITICAL Task job script received signal ERR at 2016-12-02T18:47:43Z
cylc (scheduler - 2016-12-02T18:47:43Z): CRITICAL failed at 2016-12-02T18:47:43Z

with this near the end of the job.out file:

[INFO] command: um-atmos
[INFO] Using executable: /home/mabro/cylc-run/u-af404/share/fcm_make_um/build-atmos/bin/um-atmos.exe
[INFO] Using script: /home/mabro/cylc-run/u-af404/share/fcm_make_um/build-atmos/bin/um-atmos
[INFO] exec /opt/cray/alps/5.2.4-2.0502.9822.32.1.ari/bin/aprun -ss -n 224 -cc numa_node -N 16 -S 8 -d 2 -j 1 /home/mabro/cylc-run/u-af404/share/fcm_make_um/build-atmos/bin/um-atmos.exe
Could not find PE0 output file: pe_output/atmos.fort6.pe000

On subsequent submit-retries, the errors are different, with the job.err file showing:

[FAIL] Cannot read history file atmos.xhist
[FAIL] um-atmos # return-code=30
Received signal ERR
cylc (scheduler - 2016-12-02T18:49:17Z): CRITICAL Task job script received signal ERR at 2016-12-02T18:49:17Z
cylc (scheduler - 2016-12-02T18:49:17Z): CRITICAL failed at 2016-12-02T18:49:17Z

with no obvious error messages in the job.out file.

I understand the xhist file is used to store the namelists that the model needs to restart in a continuation run, but I find it odd that it is required even when the model is starting afresh from a dump file. As for why it can't find the PE0 file, I have no idea. Maybe that also has something to do with the model thinking it is restarting as a CRUN and expecting to find the file there.
The dump file I am using is /projects/ocean/hadgem3/initial/atmos/N96L85/ab642a.da19880901_00 on MONSooN. Is it possible that something in the dump file is telling the model to restart as a CRUN? Or is it something far more obvious than that?

The log files should be present in my suite u-af404 on MONSooN (i.e. /home/mabro/roses/u-af404).

Any suggestions?

Thanks as always!
Username: mabro
Project: solar

Change History (1)

comment:1 Changed 4 years ago by mattjbr123

  • Resolution set to fixed
  • Status changed from new to closed

Fixed! It was to do with the apsched message: there was an error in the macro that calculated the number of nodes to use.
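
For anyone else who hits the same apsched message, here is a rough sketch of the node arithmetic, not the suite's actual macro. It relies only on the aprun meaning of -n (total number of PEs) and -N (PEs per node), so the command in the job.out above needs 224 / 16 = 14 nodes. The function name nodes_required, the shortened executable path and the reserved value of 12 are made up for illustration; the ticket does not record what the faulty macro actually requested.

```python
import math
import shlex


def nodes_required(aprun_command: str) -> int:
    """Return the node count implied by aprun's -n and -N options.

    -n is the total number of PEs and -N is the number of PEs per node,
    so the job needs ceil(n / N) nodes.
    """
    tokens = shlex.split(aprun_command)
    total_pes = int(tokens[tokens.index("-n") + 1])
    pes_per_node = int(tokens[tokens.index("-N") + 1])
    return math.ceil(total_pes / pes_per_node)


# Options copied from the job.out above; the executable path is shortened.
command = "aprun -ss -n 224 -cc numa_node -N 16 -S 8 -d 2 -j 1 um-atmos.exe"
needed = nodes_required(command)

# Hypothetical value: the node count the batch job actually reserved.
reserved = 12

if needed > reserved:
    print(f"aprun needs {needed} nodes but only {reserved} are reserved")
else:
    print(f"reservation of {reserved} nodes covers the {needed} needed")
```

If the number of nodes requested from the scheduler is smaller than the number that comes out of this calculation, aprun's claim cannot fit inside the reservation, which matches the "claim exceeds reservation's node-count" failure.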
