Opened 9 months ago

Closed 9 months ago

#2702 closed help (fixed)

Model failing after ARCHER maintenance

Reported by: ha392
Owned by: um_support
Component: Coupled model
Keywords:
Cc:
Platform:
UM Version: 10.7

Description

Hello,

I am running a coupled model with suite u-bc185. After yesterday's maintenance, I tried to re-trigger my model from where it left off, and it is failing each time.

The error reads:
Rank 203 [Thu Dec 13 10:12:13 2018] [c7-2c0s15n0] application called MPI_Abort(comm=0xC4000003, 1) - process 197
Rank 198 [Thu Dec 13 10:12:13 2018] [c7-2c0s15n0] application called MPI_Abort(comm=0xC4000009, 1) - process 192
Rank 246 [Thu Dec 13 10:12:13 2018] [c7-2c0s15n2] application called MPI_Abort(comm=0xC4000009, 1) - process 240
_pmiu_daemon(SIGCHLD): [NID 04476] [c7-2c0s15n0] [Thu Dec 13 10:12:13 2018] PE RANK 203 exit signal Aborted
Rank 222 [Thu Dec 13 10:12:13 2018] [c7-2c0s15n1] application called MPI_Abort(comm=0xC4000009, 1) - process 216
_pmiu_daemon(SIGCHLD): [NID 04478] [c7-2c0s15n2] [Thu Dec 13 10:12:13 2018] PE RANK 246 exit signal Aborted
Rank 212 [Thu Dec 13 10:12:13 2018] [c7-2c0s15n0] application called MPI_Abort(comm=0xC4000003, 1) - process 206
[NID 04476] 2018-12-13 10:12:13 Apid 33037658: initiated application termination
[FAIL] run_model # return-code=137
Received signal ERR
cylc (scheduler - 2018-12-13T10:12:17Z): CRITICAL Task job script received signal ERR at 2018-12-13T10:12:17Z
cylc (scheduler - 2018-12-13T10:12:17Z): CRITICAL failed at 2018-12-13T10:12:17Z

This has not happened before and I do not believe it to be a problem with the model itself. How should I approach this?

Thank you
Holly

Change History (18)

comment:1 Changed 9 months ago by grenville

Holly

Please change the permissions on your spaces so that we can read them:

chmod -R g+rX /home/n02/n02/<username>
chmod -R g+rX /work/n02/n02/<username>

Grenville

comment:2 Changed 9 months ago by ha392

Hi Grenville

Done.

Holly

comment:3 Changed 9 months ago by grenville

Holly

Something strange has happened here - the coupled job succeeded on its first attempt (see /home/n02/n02/ha392/cylc-run/u-bc185/log/job/18781001T0000Z/coupled/01/job.status), but then cylc seemed to forget?

I suggest you stop the suite. Then rose suite-run --restart, and set the coupled task to succeeded.

If you're lucky this will work - there is a possibility that the history files may be out of sync.
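
Roughly, that sequence would look like this (a sketch only, assuming cylc 7 / rose commands run from PUMA, and the task and cycle names visible in the job log path above):

cylc stop u-bc185                                              # shut the suite down cleanly
cd ~/roses/u-bc185
rose suite-run --restart                                       # restart the suite from its stored state
cylc reset --state=succeeded u-bc185 coupled.18781001T0000Z    # mark the already-run coupled task as done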

Grenville

comment:4 Changed 9 months ago by grenville

Holly

You might want to change the cycling frequency - 1-month cycling for a 200-year integration represents a very large I/O overhead.
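
(For scale: a 200-year run at 1-month cycling is 2,400 coupled cycles, each with its own restart dumps and post-processing, versus 200 cycles at 1-year cycling.)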

Grenville

comment:5 Changed 9 months ago by ha392

Grenville,

I have just tried this and it has worked okay for post-processing the atmosphere and CICE, but it is not working for NEMO or the following coupled month (multiple submission retries).
The output files do exist in the NEMOhist directory.

This is the error I am getting from the postproc_nemo process.
[ERROR] icb_pp: Error=1

Traceback (most recent call last):
  File "/work/n02/n02/ha392/cylc-run/u-bc185/share/fcm_make_pp/build/bin/icb_pp.py", line 82, in <module>
    icu = np.concatenate(icu)
ValueError: need at least one array to concatenate
→ Failed to rebuild file: trajectory_icebergs_18781001-18781101

[FAIL] Command Terminated
[FAIL] Terminating PostProc
[FAIL] main_pp.py nemo # return-code=1
Received signal ERR
cylc (scheduler - 2018-12-13T12:41:59Z): CRITICAL Task job script received signal ERR at 2018-12-13T12:41:59Z
cylc (scheduler - 2018-12-13T12:41:59Z): CRITICAL failed at 2018-12-13T12:41:59Z

What would be the next best way to proceed?

Also, thank you for this suggestion - what would you suggest as a more suitable cycling frequency? 1 year? And if I change this, will I be able to continue (--restart) from where I left off? (I am still fairly new to the UM.)

Holly

comment:6 Changed 9 months ago by grenville

(we've seen this before in ticket #2422 - please do search the helpdesk tickets)

To fix the postproc problem:

Please copy the file ~ros/temp/icb_pp.py into your ~/cylc-run/u-au022/share/fcm_make_pp/build/bin directory on ARCHER, then re-trigger the failed postproc task; hopefully that will fix the problem.
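
As a concrete sketch (assuming the suite in question is the u-bc185 one from this ticket, and that the failed task is the postproc_nemo task shown in your log - adjust the names if not):

cp ~ros/temp/icb_pp.py ~/cylc-run/u-bc185/share/fcm_make_pp/build/bin/    # on ARCHER
cylc trigger u-bc185 postproc_nemo.18781001T0000Z                         # on PUMA, or re-trigger from the cylc GUI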

If you want to include this fix in future suites you will need to set the following in fcm_make_pp → Configuration

config_base: fcm:moci.xm-br/dev/davestorkey/postproc_2.2_iceberg_update@2477
config_rev: @2477
pp_rev: 2477
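
For background on the traceback above: that ValueError is what numpy raises when np.concatenate is handed an empty sequence, i.e. icb_pp.py found no iceberg trajectory arrays to join for that period. A minimal illustration (not the script itself):

python -c "import numpy as np; np.concatenate([])"    # ValueError: need at least one array to concatenate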

Investigating the coupled job failure

Grenville

comment:7 Changed 9 months ago by grenville

Holly

You appear to have run 18781001T0000 again - it's difficult for us to help fix a moving target. The 18781001T0000 cycle has not left start files to begin the 18781101T0000 cycle. Were you trying to generate those start files? Did you perform a warm start on the 18781001T0000 cycle?

Grenville

comment:8 Changed 9 months ago by ha392

Hi Grenville,

18781001T0000 was the job that was failing before because it had already run, so I set it to succeeded as above. I have not touched the model since then (apart from fixing the post-processing issue above).

Holly

comment:9 Changed 9 months ago by grenville

Ahhhh - my mistake sorry.

I think it's worth re-running 18781001T0000 so it generates the start files for 18781101T0000. I don't understand why the ocean files are missing for 187811 - the atmosphere looks OK.

Please stop the suite, then at the PUMA command line (in roses/u-bc185)

rose suite-run --warm u-bc185 18781001T0000

This should fully rerun the cycle and hopefully generate the ocean files. You probably don't need to do the post-processing again - put these tasks in the held state, and release them later or just set them to succeeded if you're happy with the re-run.

Grenville

comment:10 Changed 9 months ago by ha392

Thank you for this, Grenville. I seem to be getting an error:

cylc-run: error: Wrong number of arguments (too many)

I copied this directly - am I doing something wrong here?

Holly

comment:11 Changed 9 months ago by grenville

Holly
more haste - less speed

rose suite-run -- --warm 18781001T0000Z
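
(If I've read the rose CLI correctly, the bare -- makes rose suite-run stop parsing options itself and pass --warm 18781001T0000Z straight through to the underlying cylc command.)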

Grenville

comment:12 Changed 9 months ago by ha392

Hi Grenville,

The 18781001T0000Z coupled job appears to be failing again. Are there any files that I need to clear out for this to work, seeing as it partially ran before (which seems to be what got the jobs into this mix-up)?

Holly

comment:13 Changed 9 months ago by ha392

Here is the error message - it appears that the model has not created the history_archive directory for the following month.

Traceback (most recent call last):
  File "./link_drivers", line 183, in <module>
    envinsts, launchcmds = _run_drivers(common_envars, mode)
  File "./link_drivers", line 66, in _run_drivers
    '(common_envars,\'%s\')' % (drivername, mode)
  File "<string>", line 1, in <module>
  File "/fs2/n02/n02/ha392/cylc-run/u-bc185/work/18781001T0000Z/coupled/um_driver.py", line 320, in run_driver
    exe_envar = _setup_executable(common_envar)
  File "/fs2/n02/n02/ha392/cylc-run/u-bc185/work/18781001T0000Z/coupled/um_driver.py", line 195, in _setup_executable
    common_envar['CYLC_TASK_WORK_DIR'])
  File "/fs2/n02/n02/ha392/cylc-run/u-bc185/work/18781001T0000Z/coupled/um_driver.py", line 80, in _verify_fix_rst
    old_hist_files = [f for f in os.listdir(old_hist_path) if
OSError: [Errno 2] No such file or directory: '/work/n02/n02/ha392/cylc-run/u-bc185/work/18781101T0000Z/coupled/history_archive'
[FAIL] run_model # return-code=1
Received signal ERR
cylc (scheduler - 2018-12-17T10:48:17Z): CRITICAL Task job script received signal ERR at 2018-12-17T10:48:17Z
cylc (scheduler - 2018-12-17T10:48:17Z): CRITICAL failed at 2018-12-17T10:48:17Z

comment:14 Changed 9 months ago by grenville

Holly

Please remove /work/n02/n02/ha392/cylc-run/u-bc185/work/18781101T0000Z entirely (it's not needed), then repeat the warm start as before.
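
As a sketch (the rm path is the one quoted above; the warm-start command is the corrected form from comment:11):

rm -rf /work/n02/n02/ha392/cylc-run/u-bc185/work/18781101T0000Z
# then, from roses/u-bc185 on PUMA:
rose suite-run -- --warm 18781001T0000Z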

Grenville

comment:15 Changed 9 months ago by ha392

Hi Grenville,

I appear to be back to the original error.

Rank 235 [Mon Dec 17 19:03:56 2018] [c0-3c2s8n3] application called MPI_Abort(comm=0xC4000003, 1) - process 229
Rank 198 [Mon Dec 17 19:03:56 2018] [c0-3c2s7n2] application called MPI_Abort(comm=0xC4000009, 1) - process 192
Rank 206 [Mon Dec 17 19:03:56 2018] [c0-3c2s7n2] application called MPI_Abort(comm=0xC4000003, 1) - process 200
Rank 262 [Mon Dec 17 19:03:56 2018] [c0-3c2s9n0] application called MPI_Abort(comm=0xC4000005, 1) - process 256
Rank 252 [Mon Dec 17 19:03:56 2018] [c0-3c2s9n0] application called MPI_Abort(comm=0xC4000003, 1) - process 246
Rank 257 [Mon Dec 17 19:03:56 2018] [c0-3c2s9n0] application called MPI_Abort(comm=0xC4000003, 1) - process 251
Rank 208 [Mon Dec 17 19:03:56 2018] [c0-3c2s7n2] application called MPI_Abort(comm=0xC4000003, 1) - process 202
Rank 236 [Mon Dec 17 19:03:56 2018] [c0-3c2s8n3] application called MPI_Abort(comm=0xC4000003, 1) - process 230
_pmiu_daemon(SIGCHLD): [NID 04771] [c0-3c2s8n3] [Mon Dec 17 19:03:56 2018] PE RANK 235 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 04772] [c0-3c2s9n0] [Mon Dec 17 19:03:56 2018] PE RANK 262 exit signal Aborted
Rank 220 [Mon Dec 17 19:03:56 2018] [c0-3c2s7n2] application called MPI_Abort(comm=0xC4000003, 1) - process 214
_pmiu_daemon(SIGCHLD): [NID 04766] [c0-3c2s7n2] [Mon Dec 17 19:03:56 2018] PE RANK 206 exit signal Aborted
Rank 204 [Mon Dec 17 19:03:56 2018] [c0-3c2s7n2] application called MPI_Abort(comm=0xC4000003, 1) - process 198
[NID 04766] 2018-12-17 19:03:56 Apid 33050610: initiated application termination
[FAIL] run_model # return-code=137
Received signal ERR
cylc (scheduler - 2018-12-17T19:04:00Z): CRITICAL Task job script received signal ERR at 2018-12-17T19:04:00Z
cylc (scheduler - 2018-12-17T19:04:00Z): CRITICAL failed at 2018-12-17T19:04:00Z

comment:16 Changed 9 months ago by grenville

Hi Holly

That did fix one problem - please copy

/home/n02/n02/ha392/cylc-run/u-bc185/work/18780901T0000Z/coupled/namelist_cfg to
/home/n02/n02/ha392/cylc-run/u-bc185/share/data/History_Data/NEMOhist/ (overwrite the namelist_cfg file that's there already)
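
The copy as a single command (a sketch, using the paths exactly as given above):

cp /home/n02/n02/ha392/cylc-run/u-bc185/work/18780901T0000Z/coupled/namelist_cfg \
   /home/n02/n02/ha392/cylc-run/u-bc185/share/data/History_Data/NEMOhist/namelist_cfg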

then warm start again.

Grenville

comment:17 Changed 9 months ago by ha392

Hi Grenville,

Thank you for this, the model appears to be running smoothly now.

Holly

comment:18 Changed 9 months ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed

Glad it's working now.
