#2648 closed help (fixed)

u-bc158 (copy of u-ba915) Job Timing Out

Reported by: cwc46
Owned by: um_support
Component: UKCA
Keywords: Caught signal Terminated
Cc: Luke, Abraham
Platform: ARCHER
UM Version: 11.0

Description

I am running u-bc158, a local copy of u-ba915, but I am encountering this error:

=>> PBS: job killed: walltime 6002 exceeded limit 6000
aprun: Apid 32421499: Caught signal Terminated, sending to application
Terminated
Received signal TERM
/work/n02/n02/glenchua/cylc-run/u-bc158/share/fcm_make_um/build-atmos/bin/um-atmos: line 135: 21803 Terminated rose mpi-launch -v $COMMAND
_pmiu_daemon(SIGCHLD): [NID 02508] [c5-1c0s3n0] [Mon Oct 22 11:59:55 2018] PE RANK 289 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 01442] [c7-0c1s8n2] [Mon Oct 22 11:59:55 2018] PE RANK 51 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 01444] [c7-0c1s9n0] [Mon Oct 22 11:59:55 2018] PE RANK 82 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 01459] [c7-0c1s12n3] [Mon Oct 22 11:59:55 2018] PE RANK 147 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 02510] [c5-1c0s3n2] [Mon Oct 22 11:59:55 2018] PE RANK 338 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 01441] [c7-0c1s8n1] [Mon Oct 22 11:59:55 2018] PE RANK 26 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 01100] [c5-0c2s3n0] [Mon Oct 22 11:59:55 2018] PE RANK 1 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 02507] [c5-1c0s2n3] [Mon Oct 22 11:59:55 2018] PE RANK 268 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 01462] [c7-0c1s13n2] [Mon Oct 22 11:59:55 2018] PE RANK 200 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 01457] [c7-0c1s12n1] [Mon Oct 22 11:59:55 2018] PE RANK 123 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 01460] [c7-0c1s13n0] [Mon Oct 22 11:59:55 2018] PE RANK 168 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 01540] [c0-1c0s1n0] [Mon Oct 22 11:59:55 2018] PE RANK 217 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 02509] [c5-1c0s3n1] [Mon Oct 22 11:59:55 2018] PE RANK 318 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 01541] [c0-1c0s1n1] [Mon Oct 22 11:59:55 2018] PE RANK 244 exit signal Terminated
_pmiu_daemon(SIGCHLD): [NID 01455] [c7-0c1s11n3] [Mon Oct 22 11:59:56 2018] PE RANK 107 exit signal Terminated
cylc (scheduler - 2018-10-22T11:59:56Z): CRITICAL Task job script received signal TERM at 2018-10-22T11:59:56Z
cylc (scheduler - 2018-10-22T11:59:56Z): CRITICAL failed at 2018-10-22T11:59:56Z

I can't seem to find what is causing the error. When someone else (Dr Luke Abraham) ran it, it managed to run to completion.

ARCHER Path to error log: /home/n02/n02/glenchua/cylc-run/u-bc158/log/job/19880901T0000Z/atmos_main/01
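
The first line of the log gives the immediate cause: the scheduler killed the job once its walltime (6002 s) passed the requested limit (6000 s), and the `Terminated` messages that follow are a consequence of that. A minimal Python sketch for pulling those two numbers out of a job log is given here. The function name `parse_walltime_kill` is hypothetical, and the pattern is based only on the single PBS line quoted above, so other PBS versions may need a different pattern.

```python
import re

# Pattern based on the one PBS line quoted in this ticket; other PBS
# versions may word the message differently (assumption).
WALLTIME_RE = re.compile(r"PBS: job killed: walltime (\d+) exceeded limit (\d+)")


def parse_walltime_kill(log_text):
    """Return (used_seconds, limit_seconds) if the log reports a walltime kill, else None."""
    match = WALLTIME_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = "=>> PBS: job killed: walltime 6002 exceeded limit 6000"
print(parse_walltime_kill(sample))
```

A `None` result would suggest the job died for some other reason and the rest of the log needs reading.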

Change History (11)

comment:1 Changed 12 months ago by luke

To add to this, Glen copied a suite of mine and had this error. I copied his suite back and it ran successfully for me. It seems to be more than three times slower for Glen: it only reaches 720 timesteps (10 days) in the 1 hour 40 minutes allocated.
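
The figures quoted here can be turned into a rough throughput estimate. The sketch below is plain arithmetic on numbers from this ticket (720 timesteps, 10 model days, the 6000 s walltime limit). The variable names are illustrative, and the figure for the working run is only the bound implied by "more than three times slower", not a measured value.

```python
timesteps_done = 720      # timesteps reached before the job was killed
model_days = 10           # model days those timesteps cover
walltime_limit_s = 6000   # PBS limit from the log (1 hour 40 minutes)

timesteps_per_day = timesteps_done / model_days            # 72.0
seconds_per_timestep = walltime_limit_s / timesteps_done   # about 8.33
model_timestep_min = 24 * 60 / timesteps_per_day           # 20.0 minutes

# "More than three times slower" implies the working run needed
# less than a third of this walltime per timestep.
fast_run_bound_s = seconds_per_timestep / 3

print(f"{timesteps_per_day:.0f} timesteps per model day")
print(f"{seconds_per_timestep:.2f} s of walltime per timestep")
print(f"working run: under {fast_run_bound_s:.2f} s per timestep (implied bound)")
```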

comment:2 Changed 12 months ago by grenville

Could you try running with Luke's executable (/work/n02/n02/luke/cylc-run/u-ba915/share/fcm_make_um/build-atmos/bin/um-atmos.exe)?

Luke - what is the suite id of your copy of Glen's suite?

comment:3 Changed 12 months ago by luke

My suite ID is u-bc219.

comment:4 Changed 12 months ago by cwc46

Dear Grenville and Luke,

I tried to run the executable but got this error:

[Mon Oct 22 15:46:29 2018] [unknown] Fatal error in PMPI_Init_thread: Other MPI error, error stack:
MPIR_Init_thread(547):
MPID_Init(203).......: channel initialization failed
MPID_Init(584).......: PMI2 init failed: 1
Aborted

comment:5 Changed 12 months ago by grenville

Looks like it failed while writing the dump at timestep 720. Does this always happen?

(I'd mistaken Luke's working job)

comment:6 Changed 12 months ago by cwc46

Dear Grenville,

Luke just walked me through how to run the executable properly and I have just run it on u-bc286.

comment:7 Changed 12 months ago by cwc46

Dear Grenville,

I got this error after trying to run the executable:

[FAIL] login.archer.ac.uk:/home2/n02/n02/glenchua/cylc-run/u-bc286/share/fcm_make_pp <- /home/cwc46/cylc-run/u-bc286/share/fcm_make_pp/extract: mirror failed
[FAIL] rsync -a --exclude=.* --delete-excluded --timeout=900 --rsh=ssh\ -oBatchMode=yes /home/cwc46/cylc-run/u-bc286/share/fcm_make_pp/extract login.archer.ac.uk:/home2/n02/n02/glenchua/cylc-run/u-bc286/share/fcm_make_pp # rc=12
[FAIL] --------------------------------------------------------------------------------
[FAIL] This is a private computing facility. Access to this service is limited to those
[FAIL] who have been granted access by the operating service provider on behalf of the
[FAIL] contracting authority and use is restricted to the purposes for which access was
[FAIL] granted. All access and usage are governed by the terms and conditions of access
[FAIL] agreed to by all registered users and are thus subject to the provisions of the
[FAIL] Computer Misuse Act, 1990 under which unauthorised use is a criminal offence.
[FAIL]
[FAIL] If you are not authorised to use this service you must disconnect immediately.
[FAIL] --------------------------------------------------------------------------------
[FAIL]
[FAIL] rsync: write failed on "/home2/n02/n02/glenchua/cylc-run/u-bc286/share/fcm_make_pp/extract/pp/Postprocessing/common/utils.py": Disk quota exceeded (122)
[FAIL] rsync error: error in file IO (code 11) at receiver.c(298) [receiver=3.0.4]
[FAIL] rsync: connection unexpectedly closed (1131 bytes received so far) [sender]
[FAIL] rsync error: error in rsync protocol data stream (code 12) at io.c(632) [sender=3.0.4]

[FAIL] fcm make -f /home/cwc46/cylc-run/u-bc286/work/19880901T0000Z/fcm_make_pp/fcm-make.cfg -C /home/cwc46/cylc-run/u-bc286/share/fcm_make_pp -j 4 mirror.target=login.archer.ac.uk:cylc-run/u-bc286/share/fcm_make_pp mirror.prop{config-file.name}=2 # return-code=2
Received signal ERR
cylc (scheduler - 2018-10-22T15:00:15Z): CRITICAL Task job script received signal ERR at 2018-10-22T15:00:15Z
cylc (scheduler - 2018-10-22T15:00:15Z): CRITICAL failed at 2018-10-22T15:00:15Z

Could I ask for more disk space on ARCHER please? Thanks!
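
The line that matters in the output above is the rsync `write failed` message. The trailing `(122)` is the system error number, which on Linux corresponds to `EDQUOT` (disk quota exceeded), so the mirror step failed on the quota for the target directory on ARCHER rather than on anything in the suite itself. A small sketch for picking that line out of a long `[FAIL]` block follows. The helper name `find_write_failures` is hypothetical, and the pattern is based only on the message format seen here.

```python
import re

# Matches rsync lines of the form seen in this ticket:
#   rsync: write failed on "<path>": <reason> (<errno>)
WRITE_FAILED_RE = re.compile(r'rsync: write failed on "([^"]+)": (.+) \((\d+)\)')


def find_write_failures(log_text):
    """Return a list of (path, reason, errno) tuples for rsync write failures."""
    return [
        (path, reason, int(code))
        for path, reason, code in WRITE_FAILED_RE.findall(log_text)
    ]


sample = (
    '[FAIL] rsync: write failed on "/home2/n02/n02/glenchua/cylc-run/u-bc286/'
    'share/fcm_make_pp/extract/pp/Postprocessing/common/utils.py": '
    "Disk quota exceeded (122)"
)
for path, reason, code in find_write_failures(sample):
    print(f"{reason} (errno {code}) while writing {path}")
```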

comment:8 Changed 12 months ago by grenville

Glen

Increased to 100GB - it may take a short time to be usable.

Grenville

comment:9 Changed 12 months ago by cwc46

Dear Grenville,

I am just trying to run u-bc158 again but it's failing at atmos_main. The 'job activity log' gives this error message:

[job-submit cmd] cylc jobs-submit --host=login.archer.ac.uk --remote-mode -- '$HOME/cylc-run/u-bc158/log/job' 19880901T0000Z/atmos_main/10
[job-submit ret_code] 191
[job-submit out] 2018-10-29T23:28:50Z|19880901T0000Z/atmos_main/10|191|None
(login.archer.ac.uk) 2018-10-29T23:28:50Z [STDERR] qsub: Job rejected by all possible destinations
[(('event-mail', 'submission failed'), 10) ret_code] 0

Hmm..

comment:10 Changed 12 months ago by cwc46

Dear Grenville,

I changed the queue from 'short' to 'standard' and it is ok now, thank you!

Best wishes
Glen

comment:11 Changed 12 months ago by willie

  • Resolution set to fixed
  • Status changed from new to closed