Opened 11 months ago

Last modified 6 months ago

#1933 pending help

UM Rose 10.4 job with extra CASIM code fails on extract

Reported by: earhg
Owned by: um_support
Priority: normal
Component: UM Model
Keywords:
Cc:
Platform:
UM Version: <select version>

Description

I'm trying to make a nesting suite in Rose (job ID on PUMA is u-af357) to run on ARCHER, but am having trouble extracting and building CASIM code to go with this. The job is a copy of one that was run successfully on Met Office machines by Paul Field.

In Rose, BUILD_LOCAL fails; clicking on it shows that fcm_make fails with the following stderr:
[FAIL] casim: name-spaces declared but not used

[FAIL] fcm make -f /home/earhg/cylc-run/u-af357/work/20080830T1200Z/fcm_make/fcm-make.cfg -C /home/earhg/cylc-run/u-af357/share/fcm_make -j 4 --ignore-lock mirror.target=login.archer.ac.uk:cylc-run/u-af357/share/fcm_make mirror.prop{config-file.name}=2 # return-code=2
Received signal ERR
cylc (scheduler - 2016-08-03T13:41:53Z): CRITICAL Task job script received signal ERR at 2016-08-03T13:41:53Z
cylc (scheduler - 2016-08-03T13:41:53Z): CRITICAL failed at 2016-08-03T13:41:53Z

Paul and I thought the lines in the file earhg@puma:/home/earhg/roses/u-af357/app/fcm_make/rose-app.conf that start with
casim_sources=fcm:casim.xm_br/dev/jonathanwilkinson/r893_precip_and_substep_pass@1194
might be the ones that are giving us trouble.
If we comment them out, and also comment out the line in earhg@puma:/home/earhg/roses/u-af357/app/fcm_make/file/fcm-make.cfg that reads extract.location{diff}[casim] = $casim_sources, then the build is OK on puma, but it fails after Rose sends it to ARCHER.
I was able to run fcm checkout on puma, on the casim directory I point to above, to get the source code. However, replacing casim_sources with local copies in rose-app.conf (e.g. casim_sources=/home/earhg/casim/r893_precip_and_substep_pass) doesn’t help.
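
The commenting-out step described above can be sketched in Python. This is only an illustration, assuming a leading # marks a comment in both files; the helper name comment_out is hypothetical and is not part of Rose or FCM.

```python
def comment_out(text, prefixes):
    """Return text with '#' put in front of every line starting with a prefix."""
    prefixes = tuple(prefixes)
    lines = []
    for line in text.splitlines():
        if line.lstrip().startswith(prefixes):
            lines.append("#" + line)
        else:
            lines.append(line)
    return "\n".join(lines)


# Lines quoted in this ticket; in the suite they live in two different files.
rose_app_conf = "casim_sources=fcm:casim.xm_br/dev/jonathanwilkinson/r893_precip_and_substep_pass@1194"
fcm_make_cfg = "extract.location{diff}[casim] = $casim_sources"

print(comment_out(rose_app_conf, ["casim_sources="]))
print(comment_out(fcm_make_cfg, ["extract.location{diff}[casim]"]))
```

Both lines have to be disabled together: leaving the extract.location line active while casim_sources is commented out would leave $casim_sources undefined.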

So it looks like there is some issue with checking out, packaging up the code and sending it to ARCHER. Does anyone know how I might get this to work? Many thanks for any help.

Change History (8)

comment:1 Changed 11 months ago by earhg

this is Hamish by the way…

comment:2 Changed 11 months ago by grenville

Hamish

Looks like you have fixed this — please let us know how for future reference.

Grenville

comment:3 Changed 11 months ago by earhg

hi Grenville,
I have certainly got further along with this, though I am not sure the simulations will actually run yet. What I did:
  • checked out a local copy of the UM that contained CASIM;
  • merged in Annette's changes to fcm-make/ncas-xc30-cce from /dev/annetteosprey/vn10.4_archer;
  • made sure it would merge with the other branches my job uses;
  • manually edited config_root_path in app/fcm_make/opt/rose-app-ncas-cray-xc30.conf and in app/fcm_make/rose-app.conf;
  • added /home/earhg/casim/vn10.4_CASIM-archer (my local copy) to the list of um_sources in app/fcm_make/rose-app.conf.
I am not sure whether all of these steps are necessary, or whether some could have been done from the GUI, but the job now builds both locally and on ARCHER.
I'll keep updating this ticket if anything interesting happens.
Many thanks
Hamish

comment:4 Changed 11 months ago by earhg

hi,
So this job u-af357 fails on the first timestep when running the global model (I don't think it gets to the nested part, or to any specific CASIM code). The failure is copied below. I found a kind of hand-edit in Rose that relates to glue_conv (glm_um→namelist→UM science settings→short term logicals→l_glue_conv5a) and tried turning it on, but it didn't help; as far as I can see the error is identical.
Please could someone have a look? Sorry to make so much noise on this forum in the last week…
Hamish

[0]
[0] ????????????????????????????????????????????????????????????????????????????????
[0] ???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
[0] ?  Error code: 3
[0] ?  Error from routine: GLUE_CONV_5A
[0] ?  Error message: Mid conv went to the top of the model at point           11 in seg on call  1
[0] ?  Error from processor: 260
[0] ?  Error number: 17
[0] ????????????????????????????????????????????????????????????????????????????????
[0]
[260] exceptions: An non-exception application exit occured.
[260] exceptions: whilst in a parallel region, by thread 0
[260] exceptions: Task had pid=11707 on host nid03701
[260] exceptions: Program is "/work/n02/n02/earhg/cylc-run/u-af357/share/fcm_make/build-atmos/bin/um-atmos.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked
Rank 260 [Mon Aug  8 15:42:26 2016] [c3-2c0s13n1] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 260
[0]
[0] ????????????????????????????????????????????????????????????????????????????????
[0] ???!!!???!!!???!!!???!!!???!!!       ERROR        ???!!!???!!!???!!!???!!!???!!!
[0] ?  Error code: 3
[0] ?  Error from routine: GLUE_CONV_5A
[0] ?  Error message: Mid conv went to the top of the model at point           13 in seg on call  1
[0] ?  Error from processor: 359
[0] ?  Error number: 17
[0] ????????????????????????????????????????????????????????????????????????????????
[0]
[359] exceptions: An non-exception application exit occured.
[359] exceptions: whilst in a parallel region, by thread 0
[359] exceptions: Task had pid=48382 on host nid04206
[359] exceptions: Program is "/work/n02/n02/earhg/cylc-run/u-af357/share/fcm_make/build-atmos/bin/um-atmos.exe"
Warning in umPrintMgr: umPrintExceptionHandler : Handler Invoked

lib-4212 : UNRECOVERABLE library error 
  An internal WRITE tried to write beyond the end of an internal file.

Encountered during a list-directed WRITE to an internal file (character variable)
Rank 359 [Mon Aug  8 15:42:26 2016] [c5-2c2s11n2] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 359

lib-4212 : UNRECOVERABLE library error 
  An internal WRITE tried to write beyond the end of an internal file.

Encountered during a list-directed WRITE to an internal file (character variable)

lib-4212 : UNRECOVERABLE library error 
  An internal WRITE tried to write beyond the end of an internal file.

Encountered during a list-directed WRITE to an internal file (character variable)
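
Since the same banner repeats for several processors, a short script can summarise which routine and processor each error came from. The following is a minimal sketch, assuming the banner fields always look like the lines pasted above; the function name parse_um_errors is hypothetical and is not a UM or Rose utility.

```python
import re

# Matches banner field lines such as "[0] ?  Error code: 3".
FIELD = re.compile(r"^\[\d+\]\s+\?\s+(Error [^:]+):\s*(.*)$")


def parse_um_errors(log_text):
    """Return one dict per error banner found in the log text."""
    errors = []
    current = None
    for line in log_text.splitlines():
        match = FIELD.match(line)
        if not match:
            continue
        key = match.group(1)
        value = " ".join(match.group(2).split())  # collapse padded spaces
        if key == "Error code":  # first field of each banner
            current = {}
            errors.append(current)
        if current is not None:
            current[key] = value
    return errors


LOG = """
[0] ?  Error code: 3
[0] ?  Error from routine: GLUE_CONV_5A
[0] ?  Error message: Mid conv went to the top of the model at point           11 in seg on call  1
[0] ?  Error from processor: 260
[0] ?  Error number: 17
[0] ?  Error code: 3
[0] ?  Error from routine: GLUE_CONV_5A
[0] ?  Error message: Mid conv went to the top of the model at point           13 in seg on call  1
[0] ?  Error from processor: 359
[0] ?  Error number: 17
"""

for err in parse_um_errors(LOG):
    print(err["Error from processor"], err["Error from routine"], err["Error message"])
```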
Last edited 6 months ago by ros

comment:5 Changed 11 months ago by earhg

Paul suggested reducing the UM timestep for this run; I am trying this to see if it fixes the issue.

I note also that if jobs sit in the queue for 1 day or more, the suite times out and they all die. On ARCHER this is presumably going to happen pretty often, so I have changed the timeout in suite.rc from P1D to P4D.
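
The timeout change can be sketched as a text substitution. This is an illustration only: it assumes the setting appears in suite.rc as a line of the form "timeout = P1D" (the section it sits in is not shown in this ticket), and the helper name set_timeout is hypothetical.

```python
import re

# A line consisting of optional indentation, "timeout", "=", and a value.
TIMEOUT_LINE = re.compile(r"^([ \t]*timeout[ \t]*=[ \t]*)\S+", re.MULTILINE)


def set_timeout(suite_rc_text, new_value):
    """Replace the value on every 'timeout = ...' line, keeping indentation."""
    return TIMEOUT_LINE.sub(lambda m: m.group(1) + new_value, suite_rc_text)


# Hypothetical fragment; the indentation is an assumption.
fragment = "        timeout = P1D\n"
print(set_timeout(fragment, "P4D"), end="")
```

P1D and P4D are ISO 8601 durations (one day and four days). The pattern only matches lines that begin with the word timeout, so settings whose names merely end in "timeout" are left alone.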

comment:6 Changed 10 months ago by ros

Hi Hamish,

Did reducing the timestep help with this problem?

Regards,
Ros.

comment:7 Changed 10 months ago by ros

  • Status changed from new to pending

comment:8 Changed 10 months ago by earhg

hi Ros,
Thanks for looking at this again. Neither reducing the timestep nor changing the l_convection_vn convection version type from 5 to 6 helped, unfortunately. Can you think of anything else to try?
