Opened 4 years ago

Closed 3 years ago

#1765 closed help (fixed)

Error in job when changing only STASH diagnostics

Reported by: dilshadshawki
Owned by: annette
Component: UM Model
Keywords: UMUI, STASH
Cc:
Platform: MONSooN
UM Version: 8.4

Description

Dear Helpdesk,

I have a job called xlzyc, copied from another job xlyta, that I tried to run after changing only the STASH diagnostics. The job produces some output, but then I get the following error message in this .leave file:

/home/dshawk/output/xlzyc000.xlzyc.d15338.t151655.leave
ATP Stack walkback for Rank 131 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  flumemain_@flumeMain.f90:48
  um_shell_@um_shell.f90:1865
  u_model_@u_model.f90:2688
  atm_step_@atm_step.f90:10120
  atmos_physics2_@atmos_physics2.f90:3965
  ni_conv_ctl_@ni_conv_ctl.f90:2384
  _cray$mt_execute_parallel_with_proc_bind@0x1d7ee64
  _cray$mt_start_one_code_parallel@0x1d7eac9
  ni_conv_ctl__cray$mt$p0001@ni_conv_ctl.f90:2465
  glue_conv$glue_conv_mod_@glue_conv-gconv5a.f90:1838
  ereport64$ereport_mod_@ereport_mod.f90:107
  gc_abort_@gc_abort.F90:136
  mpl_abort_@mpl_abort.F90:46
  pmpi_abort@0x1d89b3c
  MPI_Abort@0x1db3984
  MPID_Abort@0x1ddeea1
  abort@abort.c:92
  raise@pt-raise.c:42
ATP Stack walkback for Rank 131 done
Process died with signal 6: 'Aborted'
Forcing core dumps of ranks 131, 12, 0
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

_pmiu_daemon(SIGCHLD): [NID 00092] [c0-0c1s7n0] [Fri Dec  4 16:09:16 2015] PE RANK 95 exit signal Killed
[NID 00092] 2015-12-04 16:09:16 Apid 251553: initiated application termination
_pmiu_daemon(SIGCHLD): [NID 00110] [c0-0c1s11n2] [Fri Dec  4 16:09:16 2015] PE RANK 134 exit signal Killed
xlzyc: Run failed

I then wanted to double-check the differences, so I made a copy of the original job xlyta (now called xlzyd) and compared it to the modified job xlzyc. The diff can be found on puma:

/home/dilshadshawk/umui_jobs/diff.xlzyc.xlzyd

Is there something strange that I am not spotting? I am quite baffled as not much has been changed from one job to the other.

Any help would be much appreciated.

Best wishes,
Dill

Change History (14)

comment:1 Changed 4 years ago by annette

Hi Dill,

You can look at the pe-output file of the rank that caused the crash (131):

/projects/ukca-imp/dshawk/xlzyc/pe_output/xlzyc.fort6.pe131

This shows an error message:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: glue_conv
? Error Code:     2
? Error Message: Deep conv went to model top at point           65 in seg   1 on
 call  1
? Error generated from processor:   131
? This run generated 150 warnings
????????????????????????????????????????????????????????????????????????????????

This is a model instability, and above this you will see NaNs have appeared in the data.
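If it helps for future crashes, here is a minimal sketch (Python) for scanning all of the pe_output files for this kind of ereport error, so you can quickly find which ranks failed. The directory is the one above; the fort6.pe* filename pattern is an assumption based on the pe131 file.

import glob

# Assumed location and naming of the per-PE output files for this job.
pe_output_dir = "/projects/ukca-imp/dshawk/xlzyc/pe_output"

for path in sorted(glob.glob(pe_output_dir + "/*.fort6.pe*")):
    with open(path, errors="replace") as f:
        for line in f:
            # ereport writes lines like "? Error in routine: glue_conv".
            if "Error in routine" in line or "Error Message" in line:
                print(path + ": " + line.strip())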

To fix this, you could try reducing the timestep length.

Annette

comment:2 Changed 4 years ago by dilshadshawki

Hi Annette,

Thanks for getting back to me. Where can I reduce the time step length on the UMUI?

Dill

comment:3 Changed 4 years ago by annette

Dill,

Atmosphere → Scientific Parameters and Settings → Time-stepping

Are you working with Sunil Varma? They seem to be having the same problem: ticket:1748#comment:4

Annette

comment:4 Changed 4 years ago by dilshadshawki

Hi Annette,

Thanks for showing me where the Time-stepping section can be found.

Yes, I am working with Sunil. I copied one of his jobs to see if I could help him get this model working, as we are under some time pressure :-(

Just to clarify, I will change the number of time steps per period to 96 (given that the number of days per period is set to 1), which would give me a timestep length of (24*60)/96 = 15 minutes.

I worked this out from the fact that 72 time steps per period gives 20 minutes, which I think is the standard timestep length for the global model configuration.
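As a sanity check on my arithmetic, here is a quick sketch (Python), assuming the number of days per period stays at 1:

# Timestep length implied by the "time steps per period" setting, for a 1-day period.
minutes_per_day = 24 * 60

def timestep_minutes(steps_per_period, days_per_period=1):
    return minutes_per_day * days_per_period / steps_per_period

print(timestep_minutes(72))   # 20.0 minutes - the standard global timestep
print(timestep_minutes(96))   # 15.0 minutes
print(timestep_minutes(144))  # 10.0 minutes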

Hope what I said made sense! I'll let you know how this goes. Thanks again,

Dill


comment:5 Changed 4 years ago by dilshadshawki

Hi Annette,

The new timestep seemed to work for a while, but it never managed to finish the NRUN as the walltime was exceeded. It managed to run and output 17 days' worth of data, but then the .leave file gives the following error:

/home/dshawk/output/xlzyc000.xlzyc.d15341.t111929.leave
mkdir:: File exists
=>> PBS: job killed: walltime 10842 exceeded limit 10800
aprun: Apid 259173: Caught signal Terminated, sending to application
Application 259173 is crashing. ATP analysis proceeding...

There was nothing above this to suggest why the walltime limit was exceeded.

I have tried reducing the timestep length even further, by setting the time steps per period to 144, bringing the timestep length down to 10 minutes. I hope that this will work.

In the meantime, do you have any other ideas as to what might be going on?

Cheers,
Dill

comment:6 Changed 4 years ago by annette

Dill,

It will be slower as it has to do more timesteps to get to the same run length - if you're reducing the timestep from 20 mins to 15 mins, that is a third more steps, so you could expect the run to take roughly 33% longer.

Did you save the pe_output files from that run, or have they been overwritten by your next attempt? Since there was nothing in the leave file, we need to look in the pe_output to see if there was an error message.

Annette

comment:7 Changed 4 years ago by annette

  • Owner changed from um_support to annette
  • Status changed from new to assigned

comment:8 Changed 4 years ago by dilshadshawki

Unfortunately I did not save the files, and yes, they have been overwritten now that I have tried again with a 10-minute timestep, so this run will take 50% longer than the last one! I also changed the re-submission period to 15 days instead of 1 month.

I will remember not to delete them; next time I will just make another copy of the job to try a different test.

Let's see how far it gets, and tomorrow I will take a look in pe_output to see if there is an error. I will be in touch again then; hopefully I can change the timestep back to 20 minutes if it turns out that the timestep length is not the issue.

Many thanks for your help so far,
Dill

comment:9 Changed 4 years ago by dilshadshawki

Good Morning Annette,

The model managed to run for 14 days, but the latest .leave file shows that the walltime was exceeded again, with no further info. As you mentioned, we can now look in the pe_output directory for this job, but I am not sure which file I should be looking at. I looked through the files ending in .pe0 and .pe1 and they seem to have a lot of warnings but no errors.

Please could you take a look?

/projects/ukca-imp/dshawk/xlzyc/pe_output

Cheers,
Dill

comment:10 Changed 4 years ago by annette

Dill,

Looking at the output from pe0, the model ran 1939 timesteps in 3 hours. The number of timesteps it needs to complete is 2160 (144 timesteps per day × 15 days), so I think it has simply run out of time.

Going back to the original error, the leave file suggest the crash occurred pretty quickly after only 54 timesteps (so within the first model day). The recent run has got past the crash point, which is good news.

What I would suggest is reducing the run length to a comfortable limit, say 5 days for a 10-minute timestep or 10 days for a 15-minute timestep (assuming that run just timed out as well), and seeing if it will run for a couple of months. Remember to change the dump frequency as well.
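As a rough guide, you can estimate whether a resubmission chunk will fit inside the walltime from the throughput of the last attempt. A minimal sketch (Python) using the numbers above (1939 timesteps within the 3-hour / 10800 s PBS limit, 144 timesteps per model day):

# Rough walltime check based on the last run's throughput.
steps_completed = 1939        # timesteps reached before the 10800 s walltime ran out
steps_per_model_day = 144     # 10-minute timestep
resubmission_days = 15

model_days_completed = steps_completed / steps_per_model_day
print(round(model_days_completed, 1))                               # ~13.5 model days per 3-hour window
print(steps_completed >= steps_per_model_day * resubmission_days)   # False: a 15-day chunk will not fit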

Then you could experiment with restarting from a later dump with the timestep back at 20 minutes. However, be aware that changing the timestep length affects the results, and changing the dump frequency also alters the results - you won't get identical results with different dump frequencies.

Annette

comment:11 Changed 4 years ago by dilshadshawki

Hey Annette,

Thanks for this, I will try the suggestions above and I will get back to you.

I also wanted to let you know that my colleague Matthew Kasoar had another job xlytb (a copy of a UKCA release job) which did work a month ago, but now has exactly the same errors as my job. Could this mean that the UKCA 8.4 release job may not be working? If so, do you know if the UKCA team (e.g. Mohit or Luke) are aware of this?

Best,
Dill

comment:12 Changed 4 years ago by annette

Dill,

I don't know if there are any issues with the UKCA release job. I will forward this to Luke offline, but you should encourage Matthew to raise a separate ticket if there is an issue with the release job. If it has UKCA in the title it will be picked up by the UKCA support.

Annette

comment:13 Changed 4 years ago by ros

  • Status changed from assigned to pending

Hi Dill,

We have put a fix into the UMUI and UM scripts which should solve the slow running of your jobs.

If you are not recompiling your model executable, you will need to go into the UMUI window Compilation and Run options → UM Scripts Build and switch on "Enable build of UM scripts" to pick up the required script changes. Save, Process and Submit as usual. This only needs to be done once and can be switched off for subsequent submissions. Please also make sure that you are not specifying a revision number for the branch fcm:um-br/pkg/Config/vn8.4_ncas, so that the new changes are picked up.

Regards,
Ros.

(Helpdesk note: See also #1766)

comment:14 Changed 3 years ago by annette

  • Resolution set to fixed
  • Status changed from pending to closed