Opened 3 years ago

Closed 3 years ago

#1961 closed help (fixed)

INCOMPASS suite fail

Reported by: amenon Owned by: um_support
Component: UM Model Keywords: Nesting suite
Cc: Platform: ARCHER
UM Version: 10.4

Description

Hi Ros,

Problem 1

Reconfiguration jobs "INCOMPASS_k4p4_2_v2p1_qcons_um_recon" and "INCOMPASS_km4p4_v2p1_qcons_um_recon" are failing with an error as follows (I have attached the screenshot of the suite cylc)

???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 10
?  Error from routine: Calc_nlookups
?  Error message: Ancillary files have not been found - Check output for details
?  Error from processor: 0
?  Error number: 0

When I check the job.out file, it shows that the following ancillary files are missing:

File No    6 /work/n02/n02/amenon/cylc-run/u-af584/share/cycle/20150601T0000Z/INCOMPASS/km4p4/v2p1_qcons/ics//ostia_seaice.anc
Ancillary File does not exist.
File : /work/n02/n02/amenon/cylc-run/u-af584/share/cycle/20150601T0000Z/INCOMPASS/km4p4/v2p1_qcons/ics//ostia_seaice.anc
Stashcode :    31
File No    7 /work/n02/n02/amenon/cylc-run/u-af584/share/cycle/20150601T0000Z/INCOMPASS/km4p4/v2p1_qcons/ics//ostia_sst.anc
Ancillary File does not exist.
File : /work/n02/n02/amenon/cylc-run/u-af584/share/cycle/20150601T0000Z/INCOMPASS/km4p4/v2p1_qcons/ics//ostia_sst.anc
Stashcode :    24

I checked those directories and these files don't exist. In the OSTIA directory in my ARCHER work folder (/work/n02/n02/amenon/suite/INC4P4/OSTIA) I see that the files are named "20150531_ostia_seaice.anc", "20150531_ostia_sst.anc", "20150601_ostia_seaice.anc" and "20150601_ostia_sst.anc".
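For a longer job.out, the missing files can be pulled out mechanically. The following is only a minimal sketch, assuming the log always uses the three-line pattern quoted above ("Ancillary File does not exist." followed by "File :" and "Stashcode :" lines); the function name `missing_ancillaries` is hypothetical and not part of the UM or Rose.

```python
import re


def missing_ancillaries(job_out_text):
    """Return (path, stashcode) pairs for ancillaries reported as missing.

    Assumes the log layout quoted in this ticket: the marker line is
    followed by a "File : <path>" line and a "Stashcode : <n>" line.
    """
    results = []
    lines = job_out_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() != "Ancillary File does not exist.":
            continue
        path, stash = None, None
        # Look only at the two lines directly after the marker.
        for follow in lines[i + 1:i + 3]:
            m = re.match(r"\s*File\s*:\s*(\S+)", follow)
            if m:
                path = m.group(1)
            m = re.match(r"\s*Stashcode\s*:\s*(\d+)", follow)
            if m:
                stash = int(m.group(1))
        results.append((path, stash))
    return results
```

Called on the text of job.out, this returns one tuple per missing file, which can then be compared against the date-prefixed names in the OSTIA directory.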

Problem 2

Forecast jobs "INCOMPASS_k4p4_2_v2p1_qcons_um_fcst_000" and "INCOMPASS_km4p4_v2p1_qcons_um_fcst_000" fail with an error from routine io:buffin, as shown below:

???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
?  Error code: 25
?  Error from routine: io:buffin
?  Error message: Error in buffin errorCode= 0.00 len=524288/937984
?  Error from processor: 0
?  Error number: 20

Suite id: u-af584

Any thoughts on this?

Cheers,
Arathy

Change History (6)

comment:1 Changed 3 years ago by ros

  • Reporter changed from ros to amenon

comment:2 Changed 3 years ago by ros

Update from Arathy:

Stu is back. I contacted him regarding the error. This was his reply

So the error is actually in the task that should create the missing files, i.e. it's in the *_um_surf_ostia task.

The error is in the job.out file (on PUMA it's in the file `/home/amenon/cylc-run/uaf584/log/job/20150601T0000Z/INCOMPASS_km4p4_v2p1_qcons_um_surf_ostia/01/job.out`) and is:

 "XALT Error: unable to find aprun".

So, can you try the following?

(1) On PUMA, edit your site/ncas-cray-xc30/suite-adds.rc file, deleting the 4 lines below the line [[SURF_OSTIA]], i.e. these ones…

[[[environment]]]
ROSE_LAUNCHER = {{SERIAL_RUN_CMD}}
ROSE_LAUNCHER_PREOPTS = -n 1
NPROC = 1

(2) I presume that your suite will have shut down by then, so run `rose suite-run --restart` and then retrigger one of the failed um_surf_ostia tasks.

(3) Let me know what that yields!

I did this, but ended up with the same error. Stu then suggested that I contact NCAS-CMS, as this might be a more technical problem.

Thanks,
Arathy

comment:3 Changed 3 years ago by ros

Hi Arathy,

I found the offending aprun line in one of the SURF scripts, so I’ve removed it. SURF then fails because I built it for the compute nodes rather than the serial nodes, as I didn’t realise it was run there. I'm rebuilding it now and I’ll let you know when it's good to try again.

Cheers,
Ros.

comment:4 Changed 3 years ago by ros

Hi Arathy,

After a few fights with ARCHER I think we now have a working SURF executable. At least, my copy of your INCOMPASS_km4p4_v2p1_qcons_um_surf_ostia and its sibling have both supposedly succeeded.

The directory for SURF has now changed as it's built for a different part of ARCHER with a different architecture, so you'll need to change the SURF source in

~/roses/u-af584/app/install_cold/opt/rose-app-ncas-cray-xc30.conf

It's just a change in the name from ivybridge to x86_64.

[file:$ROSE_SUITE_DIR/share/fcm_make_surf]
mode=symlink
source=/work/n02/n02/ros/SURF/SURF31.2.0/share/fcm_make_surf_xc30_x86_64_ifort_opt
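If the same change has to be made in more than one suite, it could be scripted. This is a rough sketch only: it assumes the old `source=` value differs from the new one solely in the token `ivybridge` (as the comment above suggests), and the helper name `switch_surf_arch` is made up for illustration.

```python
def switch_surf_arch(conf_text, old="ivybridge", new="x86_64"):
    """Swap the architecture token in the fcm_make_surf source line.

    Only 'source=' lines inside a section whose header ends with
    'fcm_make_surf]' are touched; everything else is left unchanged.
    """
    out = []
    in_surf_section = False
    for line in conf_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("["):
            # Track whether we are inside the SURF file section.
            in_surf_section = stripped.endswith("fcm_make_surf]")
        elif in_surf_section and stripped.startswith("source="):
            line = line.replace(old, new)
        out.append(line)
    return "\n".join(out)
```

The function works on the text of rose-app-ncas-cray-xc30.conf and returns the edited text, so the result can be inspected before anything is written back to disk.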

I also had to change a couple of wallclock times as they were exceeded, and I would recommend that you do the same.

~/roses/u-af584/site/ncas-cray-xc30/suite-adds.rc

Changed from 00:10:00 to 00:20:00 under:

[[HOST_HPC]]
....
        -l walltime = 00:20:00

Changed from 01:00:00 to 02:00:00
[[BUILD_HPC]]
....
        -l walltime = 02:00:00
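Both limits were simply doubled. As an illustration of that arithmetic, here is a small sketch of a helper (the name `scale_walltime` is hypothetical) that scales an HH:MM:SS walltime string by a factor; it is not something the suite itself provides.

```python
def scale_walltime(walltime, factor):
    """Scale an 'HH:MM:SS' walltime string by the given factor."""
    hours, minutes, seconds = (int(part) for part in walltime.split(":"))
    total = int(round((hours * 3600 + minutes * 60 + seconds) * factor))
    # Convert the scaled number of seconds back to HH:MM:SS.
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f"{h:02d}:{m:02d}:{s:02d}"
```

With a factor of 2 this reproduces the two changes above: 00:10:00 becomes 00:20:00 and 01:00:00 becomes 02:00:00.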

Cheers,
Ros.

comment:5 Changed 3 years ago by ros

Hi Ros,

Excellent! Thanks a lot. My suite was not running when I got your mail. I made all these changes, changed the wall clock time too and restarted the suite. The INCOMPASS_km4p4_v2p1_qcons_um_surf_ostia succeeded for me too. Thanks.

Cheers,
Arathy

comment:6 Changed 3 years ago by ros

  • Resolution set to fixed
  • Status changed from new to closed