Opened 5 weeks ago

Closed 2 weeks ago

#2934 closed help (fixed)

recon error in suite u-bj538

Reported by: xd904476 Owned by: um_support
Component: UM Model Keywords:
Cc: Platform:
UM Version:

Description

Hi, I have copied this suite from u-be699 but I have changed a setting in the landuse, setting a stash request to false.
I keep getting an error in the reconfiguration, and I am not sure how to fix it. The error I get is:

[15] exceptions: An non-exception application exit occured.
[15] exceptions: whilst in a serial region
[15] exceptions: Task had pid=36227 on host nid04601
[15] exceptions: Program is "/work/n02/n02/dflocco/cylc-run/u-bj538/share/fcm_make_um/build-recon/bin/um-recon.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
Rank 15 [Tue Jun 11 15:44:24 2019] [c7-2c2s14n1] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 15
_pmiu_daemon(SIGCHLD): [NID 04601] [c7-2c2s14n1] [Tue Jun 11 15:44:24 2019] PE RANK 5 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 04661] [c0-3c0s13n1] [Tue Jun 11 15:44:24 2019] PE RANK 24 exit signal Aborted
[NID 04601] 2019-06-11 16:44:24 Apid 36158162: initiated application termination
[FAIL] um-recon # return-code=134
Received signal ERR
cylc (scheduler - 2019-06-11T15:44:28Z): CRITICAL Task job script received signal ERR at 2019-06-11T15:44:28Z
cylc (scheduler - 2019-06-11T15:44:28Z): CRITICAL failed at 2019-06-11T15:44:28Z

Could you help, please?

Thanks,
Dani

Change History (14)

comment:1 Changed 5 weeks ago by grenville

Dani

job.err says:

Error code: 30
? Error from routine: Rcf_Set_Data_Source
? Error message: Section 34 Item 8 : Required field is not in input dump!
? Error from processor: 4
? Error number: 1

so you need to add that field (a UKCA field), or switch off the bit of the model that thinks it's needed (if appropriate).
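
For anyone reading along, here is a minimal Python sketch of pulling error blocks like the one above out of job.err, so the routine and message are easy to spot. The function name extract_um_errors is made up for illustration; it only assumes the "Error code / Error from routine / Error message" layout quoted in this ticket and is not part of the UM or Rose.

```python
def extract_um_errors(text):
    """Return one dict per 'Error code:' block found in job.err-style text."""
    errors = []
    current = None
    for raw in text.splitlines():
        # Drop the leading '?' banner characters and surrounding spaces.
        line = raw.lstrip("? ").strip()
        if line.startswith("Error code:"):
            current = {"code": int(line.split(":", 1)[1])}
            errors.append(current)
        elif current is not None:
            if line.startswith("Error from routine:"):
                current["routine"] = line.split(":", 1)[1].strip()
            elif line.startswith("Error message:"):
                # Only the text on the same line is captured; messages that
                # continue on following lines are not joined in this sketch.
                current["message"] = line.split(":", 1)[1].strip()
    return errors
```

Running it on the text of job.err (for example the contents returned by open("job.err").read()) gives a short list to scan instead of the full log.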

Grenville

comment:2 Changed 5 weeks ago by xd904476

Hi Grenville, thanks. I have probably found a way around that issue by adding another path to some ancil files, but now I get another error in the recon task.
The error says:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 4001
? Error from routine: CHECK_IOSTAT
? Error message:
? Error reading namelist RUN_GWD
? IoMsg: A READ operation tried to read past the end-of-file.
? Please check input list against code.
? Error from processor: 0
? Error number: 1
????????????????????????????????????????????????????????????????????????????????

[0] exceptions: An non-exception application exit occured.
[0] exceptions: whilst in a serial region
[0] exceptions: Task had pid=9519 on host nid04765
[0] exceptions: Program is "/work/n02/n02/dflocco/cylc-run/u-bj538/share/fcm_make_um/build-recon/bin/um-recon.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
Rank 0 [Thu Jun 13 19:10:02 2019] [c0-3c2s7n1] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0
_pmiu_daemon(SIGCHLD): [NID 04765] [c0-3c2s7n1] [Thu Jun 13 19:10:02 2019] PE RANK 0 exit signal Aborted
[NID 04765] 2019-06-13 20:10:02 Apid 36178906: initiated application termination
[FAIL] um-recon # return-code=137
Received signal ERR
cylc (scheduler - 2019-06-13T19:10:04Z): CRITICAL Task job script received signal ERR at 2019-06-13T19:10:04Z
cylc (scheduler - 2019-06-13T19:10:04Z): CRITICAL failed at 2019-06-13T19:10:04Z

I don't know what this means though. Any ideas?
Thanks,
Dani

comment:3 Changed 4 weeks ago by grenville

Dani

I'm a bit confused - what exactly did you change from u-be699?

Grenville

comment:4 Changed 4 weeks ago by xd904476

I changed the landuse setting to false: I commented out these lines in ./app/um/rose-app.conf

[namelist:items(4c515841)]
ancilfilename='$CMIP6….

I set the start files to the initial startdumps of suite u-be699 for 2014 and reactivated all the rebuild and recon switches in the suite configuration to rebuild and rerun the model as new.

perhaps the error is somewhere there?
thanks

comment:5 Changed 4 weeks ago by grenville

Dani

You have ainitial and astart pointing to the same file, which is a possible cause of the error. astart is what the reconfiguration writes; ainitial is the input to the reconfiguration. My copy of u-be699 reconfigured OK with different file names. I see no reason why GWD should appear. Please fix this and run rose suite-run --new.

Which landuse setting? Please tell us the variable name.
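
The ainitial/astart point above can be checked before submitting. This is only a rough sketch: the helper names read_settings and same_dump are hypothetical, and the parser is a simplified reading of rose-app.conf (plain key=value lines, with lines starting with ! treated as ignored settings), not the real Rose parser.

```python
def read_settings(conf_text, keys=("ainitial", "astart")):
    """Collect the requested key=value settings from rose-app.conf-style text."""
    found = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, section headers and ignored (!) settings.
        if not line or line.startswith(("#", "[", "!")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        if key.strip() in keys:
            found[key.strip()] = value.strip().strip("'\"")
    return found


def same_dump(settings):
    """True when ainitial and astart are both set and name the same file."""
    a, b = settings.get("ainitial"), settings.get("astart")
    return a is not None and a == b
```

If same_dump returns True for the text of ./app/um/rose-app.conf, the reconfiguration would be reading and writing the same file.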

Grenville

comment:6 Changed 4 weeks ago by xd904476

Hi Grenville, thanks. I was confused as to why there were 2 initial conditions for the atmosphere. In any case, I set them equal because when I restarted the suite last time there was a jump in the results.
In this case, should I set ainitial to an empty value?

About the landuse, the lines I wrote above, which can be found in ./app/um/rose-app.conf, need to be commented out so that the landuse file is read from the startdump only and is not updated (like in suite ar766).
This is the variable name: [namelist:items(4c515841)]

thanks
dani

comment:7 Changed 4 weeks ago by xd904476

Hi Grenville,
I set up another suite copied from u-be699 (u-bj799), adding the landuse file back in the namelist, but I get the same GWD error:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 4001
? Error from routine: CHECK_IOSTAT
? Error message:
? Error reading namelist RUN_GWD
? IoMsg: A READ operation tried to read past the end-of-file.
? Please check input list against code.
? Error from processor: 0
? Error number: 1
????????????????????????????????????????????????????????????????????????????????

[0] exceptions: An non-exception application exit occured.
[0] exceptions: whilst in a serial region
[0] exceptions: Task had pid=20275 on host nid00055
[0] exceptions: Program is "/work/n02/n02/dflocco/cylc-run/u-bj538/share/fcm_make_um/build-recon/bin/um-recon.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
Rank 0 [Mon Jun 17 20:05:54 2019] [c0-0c0s13n3] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0
_pmiu_daemon(SIGCHLD): [NID 00055] [c0-0c0s13n3] [Mon Jun 17 20:05:54 2019] PE RANK 0 exit signal Aborted
[NID 00055] 2019-06-17 21:05:54 Apid 36214320: initiated application termination
[FAIL] um-recon # return-code=137
Received signal ERR
cylc (scheduler - 2019-06-17T20:05:56Z): CRITICAL Task job script received signal ERR at 2019-06-17T20:05:56Z
cylc (scheduler - 2019-06-17T20:05:56Z): CRITICAL failed at 2019-06-17T20:05:56Z

I'm lost

comment:8 Changed 4 weeks ago by xd904476

Hi Grenville,
I corrected the astart and ainitial settings to

astart='${ROSE_DATA}/${RUNID}.astart'
ainitial='/nerc/n02/n02/dflocco/archive/u-be699/pert_dump/as371a.da20140101_00_pert1'

One suite (u-bj538) has the different landuse, while the other (u-bj799) has the same setting as u-be699. Both suites still give me the same RUN_GWD error as above.

Any other places I could check?
thanks,
dani

comment:9 Changed 4 weeks ago by grenville

Dani

My copy of your u-bj799 reconfigures (I fixed the errors showing up in Rose edit). I used
astart='${ROSE_DATA}/${RUNID}.astart'
ainitial=/work/n02/n02/dflocco/startdump/as371a.da20140101_00

You can't use files on /nerc - the compute nodes cannot see /nerc.
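
A small Python sketch of that rule, for checking input paths before a run. The name visible_on_compute and the HIDDEN_ROOTS tuple are assumptions for illustration; only /nerc is listed because that is the only filesystem named in this ticket.

```python
from pathlib import PurePosixPath

# Filesystems the compute nodes cannot see, per the comment above.
HIDDEN_ROOTS = ("/nerc",)


def visible_on_compute(path, hidden_roots=HIDDEN_ROOTS):
    """Return False when path is a hidden root or lives underneath one."""
    p = PurePosixPath(path)
    for root in hidden_roots:
        r = PurePosixPath(root)
        if p == r or r in p.parents:
            return False
    return True
```

Paths that still contain unexpanded variables such as ${ROSE_DATA} cannot be judged this way and are reported as visible.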

Grenville

comment:10 Changed 4 weeks ago by grenville

Please set RCF_PRINTSTATUS to Extra diagnostic messages (um→env→runtime…→reconfiguration only) and run the reconfiguration again.

Grenville

comment:11 Changed 4 weeks ago by xd904476

Hi Grenville, done that. In any case the suite u-bj538 had the startdump in the /work directory and it was giving me the same problem.
I did find a missing quote in a UKCA file when opening the suite in Rose, though; maybe that was the problem. I have just set both suites to run again: fingers crossed…
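
In case it is useful, a rough Python sketch for spotting that kind of missing quote without opening Rose edit. The function name unbalanced_quote_lines is made up, and the check is only a heuristic: it flags any key=value line with an odd number of single or double quotes, so a value with an apostrophe inside double quotes would be a false alarm.

```python
def unbalanced_quote_lines(conf_text):
    """Return (line_number, line) pairs whose value has an odd quote count."""
    problems = []
    for number, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.strip()
        # Only look at setting lines, not comments or section headers.
        if "=" not in line or line.startswith(("#", "[")):
            continue
        value = line.split("=", 1)[1]
        if value.count("'") % 2 or value.count('"') % 2:
            problems.append((number, line))
    return problems
```

An empty list means no line looked suspicious under this simple rule.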

thanks

comment:13 Changed 4 weeks ago by xd904476

Hi Grenville, I tracked down the error to this.

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!
? Error code: 800
? Error from routine: Rcf_Ancil_Atmos
? Error message: replanca_rcf_replanca:ERR:LAND FRAC & MASK ARE INCONSISTENT
? Error from processor: 0
? Error number: 1
????????????????????????????????????????????????????????????????????????????????

I'm now checking on the differences with Till's PI control ar766 to see whether I find errors.
If anything obvious springs to mind, please let me know.
Best,
Dani

Last edited 4 weeks ago by xd904476

comment:14 Changed 2 weeks ago by xd904476

  • Resolution set to fixed
  • Status changed from new to closed

Hi, the suite works with the 2015 startdump from u-be699 forcings.

thanks,
dani
