Opened 9 months ago

Closed 9 months ago

#2674 closed help (answered)

Problem with re-submission

Reported by: aschurer
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: Monsoon2
UM Version: 11.0

Description

Hello,

I am running a UKESM experiment: u-ba056

It completes the first cycle OK but does not start the second.

I am currently trying to complete 1 model year.
I first set the cycling frequency to 3 months: it ran for 3 months and then failed before starting the fourth month.
I then set the cycling frequency to 1 month, and it ran for only 1 month before stopping. So the problem does not seem to be related to any particular month.

The error message from the second "coupled" process contained the following lines:

Rank 1248 [Tue Nov 13 15:13:16 2018] [c11-2c1s7n2] application called MPI_Abort(comm=0xC4000006, 1) - process 1248
Application 45579353 is crashing. ATP analysis proceeding…
Rank 1152 [Tue Nov 13 15:13:16 2018] [c11-2c1s6n3] application called MPI_Abort(comm=0xC4000009, 1) - process 1152
………
………
Rank 1221 [Tue Nov 13 15:13:19 2018] [c11-2c1s7n1] application called MPI_Abort(comm=0xC4000003, 1) - process 1221
Rank 1222 [Tue Nov 13 15:13:19 2018] [c11-2c1s7n1] application called MPI_Abort(comm=0xC4000003, 1) - process 1222
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 07582] [c11-2c1s7n2] [Tue Nov 13 15:17:18 2018] PE RANK 1244 exit signal Aborted
[NID 07582] 2018-11-13 15:17:18 Apid 45579353: initiated application termination
[FAIL] run_model # return-code=137
2018-11-13T15:17:28Z CRITICAL - failed/EXIT
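
A note on the return code in the log above: by the common shell convention, a return code above 128 means the process was terminated by a signal whose number is the code minus 128, so 137 would correspond to signal 9. The snippet below is a minimal sketch of that decoding. The function name `describe_return_code` is hypothetical and not part of Rose, Cylc or the UM, and it is an assumption that run_model reports its status following this convention.

```python
import signal


def describe_return_code(code: int) -> str:
    """Describe a shell-style return code.

    Assumes the usual shell convention that a code above 128 means
    'terminated by signal number (code - 128)'.
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = "unknown signal"
        return f"{code}: terminated by signal {signum} ({name})"
    return f"{code}: ordinary exit status"


print(describe_return_code(137))
```

On Linux this reports signal 9 (SIGKILL) for a return code of 137. That only says how the process ended, not why it was killed.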

I'm not sure what the problem is so any help will be gratefully received.

Thanks,
Andrew

Change History (6)

comment:1 Changed 9 months ago by grenville

Andrew

The ocean drivers get confused if the work directory contains data from runs with different cycle lengths. Delete everything in /home/d05/aschurer/cylc-run/u-ba056/work and run again. It may still fail, but if so, diagnose the problem at that point rather than rerunning at a different cycle length.
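
For anyone wanting to script the clean-up, the following is a minimal sketch that empties a directory while keeping the directory itself. The helper name `clear_directory` and its `dry_run` flag are hypothetical, introduced only for illustration, and the sketch assumes the suite is not running while the files are removed.

```python
import shutil
from pathlib import Path
from typing import List


def clear_directory(directory: Path, dry_run: bool = True) -> List[Path]:
    """Remove everything inside `directory`, keeping the directory itself.

    With dry_run=True (the default) nothing is deleted; the entries that
    would be removed are only listed.
    """
    entries = sorted(directory.iterdir())
    if not dry_run:
        for entry in entries:
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)  # real sub-directory: remove recursively
            else:
                entry.unlink()  # regular file or symlink
    return entries


# Example usage (path taken from this ticket):
#   work = Path("/home/d05/aschurer/cylc-run/u-ba056/work")
#   print(clear_directory(work))           # list what would go
#   clear_directory(work, dry_run=False)   # actually delete
```

Defaulting to a dry run is a deliberate choice here, so that the list of entries can be checked before anything is deleted.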

Grenville

comment:2 Changed 9 months ago by aschurer

Hi Grenville,

Thanks for your suggestion.

I decided to delete the whole directory /home/d05/aschurer/cylc-run/u-ba056 to avoid any potential conflicts and start with a clean job.

I've resubmitted the experiment and now get an error in install_ancil which I have not seen before:

[FAIL] [Errno 13] Permission denied: '/home/d05/aschurer/cylc-run/u-ba056/share/data/etc/nemo_ancils_gl'
[FAIL] install: /home/d05/aschurer/cylc-run/u-ba056/share/data/etc/nemo_ancils_gl
[FAIL] source: /projects/ocean/hadgem3/ancil/ancil_versions/GC3.1/GC3.1_eORCA1v2.2x_nemo_ancils_vn10p9
2018-11-14T11:02:01Z CRITICAL - failed/EXIT

I've checked and I can definitely read the file /home/d05/aschurer/cylc-run/u-ba056/share/data/etc/nemo_ancils_gl from the command line on monsoon.
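
Errno 13 is the "permission denied" error (EACCES), and being able to read a path does not show whether it, or the directory containing it, can be written. The snippet below is a minimal sketch for checking both; `access_report` is a hypothetical helper, and it is only an assumption that install_ancil needs write access to this location.

```python
import os
from pathlib import Path


def access_report(path):
    """Report the current user's access to `path` and to its parent directory."""
    target = Path(path)
    report = {}
    for label, p in (("target", target), ("parent", target.parent)):
        report[label] = {
            "exists": p.exists(),
            "read": os.access(p, os.R_OK),
            "write": os.access(p, os.W_OK),
            "execute": os.access(p, os.X_OK),
        }
    return report


# Example usage (path taken from the error message above):
#   print(access_report(
#       "/home/d05/aschurer/cylc-run/u-ba056/share/data/etc/nemo_ancils_gl"))
```

The check reflects the user and host it is run on, so the result may differ from what the task sees if the task runs elsewhere.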

Can you advise as to what I've done wrong?

Many thanks,
Andrew

comment:3 Changed 9 months ago by grenville

Andrew

Please try to re-trigger the failed task - install_ancil worked previously; I can't see why it would fail now.

Grenville

comment:4 Changed 9 months ago by grenville

Andrew

Did re-triggering work?

comment:5 Changed 9 months ago by aschurer

Hi Grenville,
Sorry for not replying sooner.
I tried to re-trigger the task several times but always ran into the same problem.
I decided to copy the whole experiment to start afresh (now u-bd077).
Not sure what the problem was but this seemed to solve it.
Thanks,
Andrew

comment:6 Changed 9 months ago by grenville

  • Resolution set to answered
  • Status changed from new to closed

Thanks for letting us know.
