Opened 2 years ago

Closed 16 months ago

#1748 closed help (answered)

My job has stopped as I have unexpectedly exceeded the time limit 10800

Reported by: s.varma13 Owned by: annette
Priority: high Component: UM Model
Keywords: UM Job Time Limit Cc:
Platform: MONSooN UM Version: 8.4

Description

Hello

I am having problems after I have submitted a run to Monsoon.

The job is xmbxa and is called “cloud reference run 2008 clouds 1” and I am running this on version 8.4.

I have a start date of 1 December 2006 with a run time of 2 years and 1 month.
I have selected 16 diagnostics (a mixture of both 2d and 3d).
My time profile is T1H which is every hour for the run time.
My domain profile is either DALLH or DIAG depending on whether it is 2d or 3d.
My usage profile is UPD, stream 60, override size of 32,000, period of 1 day (given the hourly output).
Resubmission pattern is one month.

When I submit the run for the first time, it outputs 30 files for December 2006. When I resubmit to do the continuous runs, it stops on the 19th day of January 2007.
You can see this here
cd /projects/ukca-imp/suvar/xmbxa
ls -ltr
xmbxaa.pa20070119

In the .leave file it says I have exceeded the time limit of the job:
less /home/suvar/output/xmbxa000.xmbxa.d15329.t204741.leave
⇒> PBS: job killed: walltime 10826 exceeded limit 10800
aprun: Apid 229471: Caught signal Terminated, sending to application
Application 229471 is crashing. ATP analysis proceeding…

Something is causing the job to take a lot longer to run as it only completes 1 month and 19 days of a 2 years and 1 month run.

I tried changing the stream in the usage profile to 63, as that is the one usually associated with UPD, but when I do, a window pops up saying “disk quota exceeded”. It is still happy with stream 60, which has the same packing profile.

Since the run stops at day 19, I also tried reducing the resubmission pattern from one month to 15 days under "Input/Output Control and Resources → Resubmission pattern", but the same disk quota exceeded window came up.

I also tried turning off climate meaning under "Control → Post processing → Dumping and meaning", as I do not need it, but the same window came up.

Many thanks in advance for your help

Change History (18)

comment:1 Changed 23 months ago by annette

Hi Sunil,

If you are getting "disk quota" errors you may have exceeded your quota on puma. Run the quota command to see. To clear space, you can safely delete the directories under ~/um/um_extracts for example.
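
For example, on puma something like the following would show your usage and free up space (a sketch only; the exact quota report format may differ):

quota                        # show current usage against your puma disk quota
du -sh ~/um/um_extracts/*    # see which extract directories are using the space
rm -rf ~/um/um_extracts/*    # safe to delete; they are regenerated at the next extract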

If you are reducing the resubmission frequency to 15 days, you should also adjust the dumping frequency to be 15 days. This will affect the climate meaning, so it's a good idea to switch it off if you don't need it.

If your run is slow because of the diagnostics you are outputting, you may benefit from using the I/O server. To see if this is slowing down your job, switch on timer diagnostics:
Input/Output Control and Resources → Output choices, then select "Subroutine timer diagnostics".
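
With the timer on, the timings are written to the .leave file. A rough way to pull out the summary afterwards (a sketch only; the exact wording of the timer report varies between UM versions, so the search strings here are an assumption):

cd /home/suvar/output
latest=$(ls -t xmbxa000.xmbxa.*.leave | head -1)   # most recent .leave file for the job
grep -n -i -E "timer|wallclock" "$latest"          # search strings are assumed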

Annette

comment:2 Changed 23 months ago by annette

  • Status changed from new to pending

comment:3 Changed 23 months ago by s.varma13

Hi Annette

Thank you for your reply.

I deleted the directories under um/um_extracts, turned off climate meaning and reduced the resubmission period to 15 and then 10 days, but the job still failed. I will select "Subroutine timer diagnostics" as you suggest and see what happens.

Sunil

comment:4 Changed 23 months ago by s.varma13

Hi Annette

I followed your instructions but the job is still crashing and I have no idea why.
In the new .leave file, xmbxb000.xmbxb.d15339.t175829.leave in [xcml00]/home/suvar/output it states:

"/projects/ukca-imp/suvar/xmbxb/bin/qsatmos: Executing model run

*
UM Executable : /projects/ukca-imp/suvar/xmbxb/bin/xmbxb.exe
*

Application 255213 is crashing. ATP analysis proceeding…
Rank 124 [Sat Dec 5 18:25:11 2015] [c0-0c2s5n0] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 124

ATP Stack walkback for Rank 124 starting:

_start@…:113
libc_start_main@…:242
flumemain_@…:48
um_shell_@…:1865
u_model_@…:2688

atm_step_@…:10120

atmos_physics2_@…:3965
ni_conv_ctl_@…:2384
_cray$mt_execute_parallel_with_proc_bind@0x1d7ee24
_cray$mt_start_one_code_parallel@0x1d7ea89
ni_conv_ctlcray$mt$p0001@…:2465
glue_conv$glue_conv_mod_@…:3692
ereport64$ereport_mod_@…:107
gc_abort_@…:136
mpl_abort_@…:46
pmpi_abort@0x1d89afc
MPI_Abort@0x1db3944
MPID_Abort@0x1ddee61
abort@…:92
raise@…:42

ATP Stack walkback for Rank 124 done
Process died with signal 6: 'Aborted'
Forcing core dumps of ranks 124, 0, 21
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

_pmiu_daemon(SIGCHLD): [NID 00149] [c0-0c2s5n1] [Sat Dec 5 18:26:24 2015] PE RANK 152 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00147] [c0-0c2s4n3] [Sat Dec 5 18:26:24 2015] PE RANK 11 exit signal Killed
[NID 00149] 2015-12-05 18:26:24 Apid 255213: initiated application termination
_pmiu_daemon(SIGCHLD): [NID 00148] [c0-0c2s5n0] [Sat Dec 5 18:26:24 2015] PE RANK 71 exit signal Killed
⇒> PBS: job killed: walltime 11262 exceeded limit 10800
============================= PBS epilogue =============================

End of Job Report
Run at 2015-12-05 21:21:37 for job 333356.xcm00
Submitted : 2015-12-05 18:13:55
Queued : 2015-12-05 18:13:55
Queued Time :
Elapsed Time : 03:07:42
Wallclock limit : 03:00:00
Requested Node Hours :
State : Running
Job Name : xmbxb_run
Owner : suvar
Group : users
Project : ukca-imp
STDOUT : nid00013:/home/suvar/output/xmbxb000.xmbxb.d15339.t175829.leave
STDERR : nid00013:/scratch/jtmp/pbs.333355.xcm00.x8z/xmbxb_run.e333356
Job Directory : /scratch/jtmp/pbs.333356.xcm00.x8z
Sandbox : private
Queue : normal
Job Arch :
Total Nodes : 7
Total Tasks : 193
Pset : pair=""

Executable Nodes Duration CPU Time Read (MB) Write (MB) RSS Memory Power (kWh)
============ ========== ========== ========== ========== ========== ========== ================
xmbxb.exe 3 751 138394.0 19119 1354 1125260 0.21
============ ========== ========== ========== ========== ========== ========== ================
total - 751 138394.0 19119 1354 1125260 0.21"

In the output file under [xcml00]/projects/ukca-imp/suvar/xmbxb it lists the following files:

atpMergedBT.dot core.atp.255213.0 pe_output/ xmbxba.pa20061201 xmbxb.out
atpMergedBT_line.dot core.atp.255213.124 umatmos/ xmbxba.pa20061202 xmbxb.stash
baserepos/ core.atp.255213.21 umrecon/ xmbxb.astart xmbxb.umui.nl
bin/ history_archive/ umscripts/ xmbxb.list xmbxb.xhist

xmbxba.pa20061201 and xmbxba.pa20061202 contain output when I open them with xconv, but the run does not produce the 13 files requested under the resubmission pattern.

In addition to the changes we discussed, the resubmission pattern is now 15 days. I have reduced the diagnostics to 5 (all 3D), and my usage profile is UPA, stream 60, override size of 32,000, period of 1 day (given the hourly output).

I have spent nearly a month trying to get this output - Please help!

Thanks

Sunil

comment:5 Changed 23 months ago by annette

  • Owner changed from um_support to annette
  • Status changed from pending to assigned

Sunil,

If you look at the pe_output file of the rank that crashed (124):

/projects/ukca-imp/suvar/xmbxb/pe_output/xmbxb.fort6.pe124

You will see the error message:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: glue_conv
? Error Code:     3
? Error Message: Mid conv went to the top of the model at point           28 in seg on call  1
? Error generated from processor:   124
? This run generated 351 warnings
????????????????????????????????????????????????????????????????????????????????

This is because the model has become unstable, and you will see that there are NaNs in the data points printed out above the error message.

To resolve this you can try reducing the time step length. You should also check that any input data files you have added recently look OK.
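
If it is useful, a quick way to see which ranks hit this error and whether NaNs appear (a sketch; the fort6 file naming follows the path above):

cd /projects/ukca-imp/suvar/xmbxb/pe_output
grep -l "Error in routine: glue_conv" xmbxb.fort6.pe*   # which PE output files contain the ereport
grep -c -i "nan" xmbxb.fort6.pe124                      # count of NaN occurrences in the failing rank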

Annette

comment:6 Changed 23 months ago by ros

  • Status changed from assigned to pending

Hi Sunil,

We have put a fix into the UMUI which should solve the slow running of your jobs. Please try running your job again.

If you are not recompiling your model executable, you will need to go into the UMUI window Compilation and Run options → UM Scripts Build and switch on "Enable build UM scripts". Save, Process and Submit as usual. This only needs to be done once and can be switched off for subsequent submissions.

Regards,
Ros.

Note for helpdesk: The fix adds export OMP_NUM_THREADS, calculates NTASKS_PER_NODE and specifies -N option to aprun command.
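
For illustration only, the relevant lines in the generated job script look something like the sketch below; the node size and task counts are assumed values, not what the UMUI actually writes for this job:

export OMP_NUM_THREADS=1                                # OpenMP threads per MPI task (assumed)
CORES_PER_NODE=32                                       # cores on a MONSooN compute node (assumed)
NTASKS_PER_NODE=$((CORES_PER_NODE / OMP_NUM_THREADS))   # MPI tasks to place on each node
TOTAL_TASKS=192                                         # total MPI tasks for the atmosphere (assumed)
aprun -n $TOTAL_TASKS -N $NTASKS_PER_NODE -d $OMP_NUM_THREADS ./xmbxb.exe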

comment:7 Changed 22 months ago by s.varma13

Hi Ros and Annette

I ran the job again, having enabled build UM Scripts, and it has failed for the same reason. In the .leave file, xmbxb000.xmbxb.d15351.t221453.leave, it says:

"⇒> PBS: job killed: walltime 11808 exceeded limit 10800
opt/modules/default/init/bash: line 11: 31181 Terminated /var/spo
ol/PBS/mom_priv/jobs/357231.xcm00.SC
============================= PBS epilogue =============================

End of Job Report
Run at 2015-12-18 02:54:32 for job 357231.xcm00
Submitted : 2015-12-17 22:20:10
Queued : 2015-12-17 22:20:10
Started : 2015-12-17 23:37:44
Queued Time :
Elapsed Time : 03:16:48
Wallclock limit : 03:00:00
Requested Node Hours :
State : Running
Job Name : xmbxb_run
Owner : suvar
Group : users
Project : ukca-imp
STDOUT : nid00013:/home/suvar/output/xmbxb000.xmbxb.d15351.t221453.leave
STDERR : nid00013:/scratch/jtmp/pbs.357230.xcm00.x8z/xmbxb_run.e357231
Job Directory : /scratch/jtmp/pbs.357231.xcm00.x8z
Sandbox : private
Queue : normal
Job Arch :
Total Nodes : 7
Total Tasks : 193
Pset : pair=""

Executable        Nodes   Duration   CPU Time  Read (MB) Write (MB) RSS Memory      Power (kWh)
============ ========== ========== ========== ========== ========== ========== ================
xmbxb.exe             6       4332  1655194.4      29983      39980     611296             2.46
============ ========== ========== ========== ========== ========== ========== ================
total                 -       4332  1655194.4      29983      39980     611296             2.46"

Could you please re-investigate?

Many thanks

Sunil

comment:8 Changed 22 months ago by s.varma13

Hi Ros and Annette

Further to my last comment, I ran a new job, copied from xmbxb but with a resubmission period of just one day instead of 15, and it also failed for the same reason. In the .leave file, xmbxf000.xmbxf.d15353.t233205.leave, it says:

*****************************************************************
*****************************************************************
     Job started at : Sun Dec 20 00:36:59 GMT 2015
*****************************************************************
*****************************************************************
     Run started from UMUI
Cloud reference run 2008 meaning off 10 diags 1 day
This job is using UM directory /projects/um1,
-------------------------------
Processing STASHC file for ROSE
-------------------------------
Backup of STASHC file created! 
/scratch/suvar/xmbxf.stashc_preROSE
-------------------------------
Processing STASHC file complete
-------------------------------
***************************************************************
   Starting script :   qsatmos
   Starting time   :   Sun Dec 20 00:36:59 GMT 2015
***************************************************************


/projects/ukca-imp/suvar/xmbxf/bin/qsatmos: Executing model run

*********************************************************
UM Executable : /projects/ukca-imp/suvar/xmbxf/bin/xmbxf.exe
*********************************************************


????????????????????????????????????????????????????????????????????????????????
!!???????????????????????????????? ATTENTION ????????????????????????????????!!
? This run generated 168 warnings
????????????????????????????????????????????????????????????????????????????????

=>> PBS: job killed: walltime 10861 exceeded limit 10800
============================= PBS epilogue =============================

End of Job Report
Run at 2015-12-20 03:37:58 for job 362696.xcm00
Submitted              : 2015-12-20 00:13:58
Queued                 : 2015-12-20 00:13:58
Started                : 2015-12-20 00:36:56
Queued Time            : 
Elapsed Time           : 03:01:01
Wallclock limit        : 03:00:00
Requested Node Hours   : 
State                  : Running
Job Name               : xmbxf_run
Owner                  : suvar
Group                  : users
Project                : ukca-imp
STDOUT                 : nid00013:/home/suvar/output/xmbxf000.xmbxf.d15353.t233205.leave
STDERR                 : nid00013:/scratch/jtmp/pbs.362693.xcm00.x8z/xmbxf_run.e362696
Job Directory          : /scratch/jtmp/pbs.362696.xcm00.x8z
Sandbox                : private
Queue                  : normal
Job Arch               : 
Total Nodes            : 7
Total Tasks            : 193
Pset                   : pair=""

Executable        Nodes   Duration   CPU Time  Read (MB) Write (MB) RSS Memory      Power (kWh)
============ ========== ========== ========== ========== ========== ========== ================
xmbxf.exe             6        328   121711.7      19206       2652     610184             0.18
============ ========== ========== ========== ========== ========== ========== ================
total                 -        328   121711.7      19206       2652     610184             0.18
Last edited 22 months ago by annette

comment:9 Changed 22 months ago by annette

  • Status changed from pending to assigned

Sunil,

If you look at the pe_output files it seems as though the model run has finished, so it's just the scripts that are hanging… One thing I noticed is that the job is set as a CRUN. What happens if you run it as an NRUN? Also change the time limit to 20 mins rather than 3 hours so that it times out sooner.

Annette

comment:10 Changed 21 months ago by annette

  • Status changed from assigned to pending

comment:11 Changed 21 months ago by annette

Hi Sunil,

The problem is that you have switched on post-processing but are missing the required branch:

fcm:um-br/dev/ros/vn8.4_MetoCray_arch/src

See here for full instructions:

http://collab.metoffice.gov.uk/twiki/bin/view/Support/CrayUMInstall#Archiving
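
If it helps, you can confirm the branch is visible from puma before adding it to the job (a sketch; this assumes the usual fcm URL keywords are set up):

fcm info fcm:um-br/dev/ros/vn8.4_MetoCray_arch       # basic details of the branch
fcm ls fcm:um-br/dev/ros/vn8.4_MetoCray_arch/src     # list the source it provides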

If you have compilation switched off, you can re-build the scripts by selecting the option "Enable build UM scripts" in Compilation and Run options → UM Scripts Build.

I have tested this in a copy of your job (xlrdn) and it does seem to fix the timeout issue.

Best regards,
Annette

comment:12 Changed 21 months ago by annette

  • Resolution set to worksforme
  • Status changed from pending to closed

comment:13 Changed 20 months ago by s.varma13

  • Resolution worksforme deleted
  • Status changed from closed to reopened

Hi Annette

Thanks a lot for your message above. I have been away for the last two months so apologies for the delay in getting back to you. I will copy your job and let you know what happens.

Best wishes

Sunil

comment:14 Changed 19 months ago by annette

  • Resolution set to answered
  • Status changed from reopened to closed

comment:15 Changed 18 months ago by s.varma13

Hi Annette

I copied your job xlrdn to xmnjb.

I need to run the model for 2 years and one month. In your job, you had a resubmission plan of 1 day and just ran the model. How do I get 2 years and one month's worth of output using your method?

I amended your run to a 5-day resubmission plan and then selected to compile the model, with a view to doing a continuous run once the model had compiled. Using this method, 5 daily output files were created, but on reviewing the leave file - xmnjb000.xmnjb.d16116.t132524.leave - the job was killed because walltime 1235 exceeded limit 1200. Could you please let me know what you think the problem is?

Many thanks

Sunil

comment:16 Changed 18 months ago by s.varma13

  • Resolution answered deleted
  • Status changed from closed to reopened

Please note that I had initially thought the 5-day (xmnjb) run had worked, so I tried compiling the model with a one-month time frame - xmnjc - which failed for the same reason.

comment:17 Changed 18 months ago by grenville

Sunil

This web page explains how to do automatic resubmission:

http://cms.ncas.ac.uk/wiki/Docs/AutomaticResubmission

Grenville

comment:18 Changed 16 months ago by annette

  • Resolution set to answered
  • Status changed from reopened to closed