Opened 8 months ago

Closed 8 months ago

#3353 closed help (worksforme)

Back-end never delivered its pid (the same error as some other people had)

Reported by: ggxmy
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: Monsoon2
UM Version: 10.7

Description

Dear CMS,

My GC3.1 suite (u-bw985) failed with an error:

atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
_pmiu_daemon(SIGCHLD): [NID 05998] [c3-2c0s11n2] [Mon Aug 24 08:34:03 2020] PE RANK 1081 exit signal Aborted
atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.
[NID 05998] 2020-08-24 08:34:03 Apid 113531865: initiated application termination
[FAIL] run_model # return-code=137
2020-08-24T08:34:09Z CRITICAL - failed/EXIT

ocean.output has these lines near its end:

 ===>>> : E R R O R
         ===========

  stpctl: the zonal velocity is larger than 20 m/s

I can see this is a common problem. In #3310 it was solved by perturbing the atmospheric dump file, while in #3251 it was solved by changing the number of convection calls. Currently I have n_conv_calls=02 in app/um/rose-app.conf. Would you suggest increasing this to 03? Would it affect the results in any significant way?

Or would you recommend perturbing the atmospheric dump file? If so, could you please tell me how to perturb it? I also wonder whether we could increase the number of initial dump files, and hence the size of an ensemble, by perturbing an atmospheric dump file.

Thanks,
Masaru

Change History (9)

comment:1 Changed 8 months ago by dcase

Masaru,

If you want to go down the route of perturbing the atmospheric dump, I would look at these instructions: https://code.metoffice.gov.uk/trac/moci/wiki/tips_CRgeneral , as your problem is discussed there (see "Restarting after a model blows up").
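In outline, the procedure amounts to something like the following. This is a sketch only: the History_Data path and dump name are illustrative, and obtaining perturb_theta.py is discussed further below.

cd ~/cylc-run/u-bw985/share/data/History_Data    # assumed location of the suite's restart dumps
mv bw985a.da20310101_00 bw985a.da20310101_00_bu  # keep a backup of the failed cycle's dump
python2.7 perturb_theta.py bw985a.da20310101_00_bu --output ./bw985a.da20310101_00
# then retrigger the failed task so the model restarts from the perturbed dump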

Further to this, if you'd like to see how this can produce an ensemble, look at suite u-bd149 for an example which may be similar to what you're after.

Hope this helps,

Dave

comment:2 Changed 8 months ago by ggxmy

Thanks, Dave.
Should I do this on xcslc0 or exppostproc01? Doing "module load um_tools" on xcslc0 doesn't return anything, while on exppostproc01 it returns this:

ModuleCmd_Load.c(200):ERROR:105: Unable to locate a modulefile for 'um_tools'

Either way, I cannot find perturb_theta.py:

ll ~moci/bin/perturb_theta.py
ls: cannot access /home/d00/moci/bin/perturb_theta.py: No such file or directory

Is it located anywhere else on Monsoon, or in some other place that is accessible?

Masaru

comment:3 Changed 8 months ago by dcase

Ok,

If you want to get the script, I would just check it out of the repository, with:

fcm co fcm:moci.x-tr [name of directory you want to copy it to]

and find it in the Utilities/lib directory.

As for your module load, I think that if you do "module load um_tools/2019.01.1" you will get an environment with Python 2.7 and mule, so hopefully that's good. If the Python version isn't right, look at older versions with "module avail um_tools"; maybe one of them will work.
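Putting the two suggestions together, the sequence would look roughly like this (the ~/lib target directory is just an example, and the repository keyword is the one that is reported to work on Monsoon later in this ticket):

fcm co fcm:moci.xm_tr/Utilities/lib ~/lib   # check out the whole lib directory; checking out a single file fails
module load um_tools/2019.01.1              # provides python2.7 with the mule library
module avail um_tools                       # lists other available versions if that one is unsuitable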

Dave

comment:4 Changed 8 months ago by ggxmy

  • Resolution set to answered
  • Status changed from new to closed

OK. Although I didn't mention it, I had done

fcm co fcm:moci.xm_tr/Utilities/lib/perturb_theta.py

and it had failed. It looks like I actually needed to do

fcm co fcm:moci.xm_tr/Utilities/lib/

Although I forgot to add '2019.01.1', it seems to have worked:

myosh@xcslc0:History_Data $ module load um_tools
myosh@xcslc0:History_Data $ python2.7 ~/lib/perturb_theta.py bw985a.da20310101_00_bu --output ./bw985a.da20310101_00
Perturbing Field - Sec 0, Item 388: ThetaVD After Timestep
myosh@xcslc0:History_Data $ ll -tr bw985a.da20310101_00*
-rw-r--r-- 1 myosh ukca-leeds 6240882688 Aug 22 10:03 bw985a.da20310101_00_bu
-rw-r--r-- 1 myosh ukca-leeds 6240882688 Aug 25 14:09 bw985a.da20310101_00

I'll try retriggering the run with this. Update: this seems to have worked!
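(For reference, a failed task can be retriggered from the cylc GUI or from the command line with something like the following; the task name and cycle point here are assumptions for illustration only:

cylc trigger u-bw985 coupled.20310101T0000Z   # hypothetical task name and cycle point
)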

Thank you so much for your help, Dave.
Masaru

comment:5 Changed 8 months ago by ggxmy

  • Resolution answered deleted
  • Status changed from closed to reopened

I have reopened this ticket because I got the same problem with another suite, and it keeps failing for the same reason even after I perturbed the atmospheric start dump. The only difference between this suite (u-bx009) and the previous one (u-bw985) is the SO2 emissions in UKCA. Now my second attempt has failed.

The run fails at 20500101. bx009a.da20500101_00 is the latest dump file in the History_Data/ directory, so I perturbed it and got this:

-rw-r--r-- 1 myosh ukca-leeds 6240882688 Aug 29 08:07 bx009a.da20500101_00_bu
-rw-r--r-- 1 myosh ukca-leeds 6240882688 Aug 31 10:18 bx009a.da20500101_00
$ diff bx009a.da20500101_00_bu bx009a.da20500101_00
Files bx009a.da20500101_00_bu and bx009a.da20500101_00 differ
-rw-r--r-- 1 myosh mo_users    136233 Aug 31 10:23 ocean.output
 ===>>> : E R R O R
         ===========

  stpctl: the zonal velocity is larger than 20 m/s
  ======
 kt=403243 max abs(U):  1.3553E+10, i j k:   118  289   10

The latest date in ocean.output is Y/M/D = 2050/01/02.
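One quick sanity check is to confirm that the perturbation really changed the ThetaVD field rather than just a header. Below is a sketch using the mule library from the um_tools environment mentioned above (file names as above; STASH section 0, item 388 is the field reported by perturb_theta.py):

from __future__ import print_function
import mule

def theta_vd_fields(path):
    # lbuser4 holds the STASH code (section*1000 + item), so ThetaVD is 388
    ff = mule.FieldsFile.from_file(path)
    return [f for f in ff.fields if f.lbuser4 == 388]   # one field per model level

orig = theta_vd_fields("bx009a.da20500101_00_bu")
pert = theta_vd_fields("bx009a.da20500101_00")
diffs = [abs(p.get_data() - o.get_data()).max() for o, p in zip(orig, pert)]
print("max abs ThetaVD difference:", max(diffs))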

Can you see what's wrong?

Masaru

comment:6 Changed 8 months ago by dcase

Masaru,

Looking at your files, it appears that you've managed to launch a job with model time 20541001 this morning, so I'm not looking into this more closely at the moment. Hopefully you have solved your problem. If not, let me know and I'll wait for this job to crash and look again.

Dave

comment:7 Changed 8 months ago by ggxmy

Hi Dave,

This is about u-bx009, and there has been no change in the situation so far. It is still stopped at 205001.

Masaru

comment:8 Changed 8 months ago by dcase

Sorry, I looked at the wrong one yesterday.

You are correct: u-bx009 ran for 8 months, then crashed at the suspiciously round date of 20500101. I've looked at the metadata for your files, such as /projects/ukca-leeds/myosh/ancils/n96e/eclipse/SO2_low_anthropogenic_ECLIPSE_V6b_CLE_base_ship100_2014_2101_time_series.nc, but not at the data itself. If I were you, I would recheck the input file data as a first step. Have these files been used in other simulations, or is this the first suite that includes them?

As for things like perturbing the system, this is OK if your system is generally stable and has drifted into an unstable region after a long time, but it won't make a system stable if you have made large changes to the input data. Possibly you could shorten the timestep to improve stability? You could also perturb and rerun from an earlier dump, although, as I said, you should definitely check the effect of the input data first.
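If the shorter-timestep route is taken for the ocean, the NEMO timestep is controlled by rn_rdt in the namdom namelist; exactly where that is set varies between suites, so the app location and values below are purely illustrative:

# e.g. in the NEMO app's rose-app.conf (app name differs between GC3.1 suites)
[namelist:namdom]
# halving an assumed 2700 s ocean timestep; coupling and SBC frequencies would
# need to be kept consistent with this
rn_rdt=1350.0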

I'm sorry that I can't be more helpful, but I hope that this helps a little.

Dave

comment:9 Changed 8 months ago by ggxmy

  • Resolution set to worksforme
  • Status changed from reopened to closed

All of the input files and start dumps are used in other simulations as well, although the combination is unique. Following #3251, I changed n_conv_calls to 3 and ran the suite for a month, which went OK. Then I changed it back to 2 and the run is continuing.
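For reference, the convection-call change is a one-line edit in app/um/rose-app.conf; the namelist section name below is an assumption about where n_conv_calls sits:

[namelist:run_convection]
n_conv_calls=3

The suite then needs to be reloaded or restarted for the change to be picked up by subsequent cycles.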

Now the suite has run for a few months, so hopefully the issue has at least been avoided.

Many thanks for your help.
Masaru
