Opened 3 months ago

Closed 6 weeks ago

#2532 closed help (fixed)

UP_BOUND : Reached end of atmosphere LBC file

Reported by: ggxmy
Owned by: um_support
Priority: highest
Component: UM Model
Keywords:
Cc:
Platform: ARCHER
UM Version: 8.2

Description

My UM vn8.2 limited area job, tewnf, ran for 30 days but crashed just before it finished the 1-month simulation. /home/n02/n02/masara/output/tewnf000.tewnf.d18169.t154803.leave.20180623-003207 has these messages near the top:

???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: LBC_UPDATE
? Error Code:    11
Application 31211784 is crashing. ATP analysis proceeding...
? Error Message: UP_BOUND : Reached end of atmosphere LBC file
? Error generated from processor:     0
? This run generated *** warnings
????????????????????????????????????????????????????????????????????????????????

ATP Stack walkback for Rank 92 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  main@flumeMain.f90:48
  um_shell_@um_shell.f90:2349
  u_model_@u_model.f90:28
  lbc_coup_update_@lbc_coup_update.f90:10
  ereport64$ereport_mod_@ereport_mod.f90:53
  gc_abort_@gc_abort.F90:137
  mpl_abort_@mpl_abort.F90:46
  pmpi_abort__@0x162e34c
  MPI_Abort@0x16a0efd
  MPID_Abort@0x16ea291
  abort@abort.c:92
  raise@pt-raise.c:42
ATP Stack walkback for Rank 92 done
Process died with signal 6: 'Aborted'
Forcing core dumps of ranks 92, 24, 25, 0, 28, 26
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

_pmiu_daemon(SIGCHLD): [NID 01556] [c0-1c0s5n0] [Sat Jun 23 23:40:24 2018] PE RANK 26 exit signal Quit
_pmiu_daemon(SIGCHLD): [NID 01551] [c0-1c0s3n3] [Sat Jun 23 23:40:20 2018] PE RANK 93 exit signal Killed
[NID 01551] 2018-06-23 23:40:24 Apid 31211784: initiated application termination
tewnf: Run failed

Similar messages appear near the bottom as well, following these lines:

 REPLANCA: UPDATE REQUIRED FOR FIELD 85
  REPLANCA - time interpolation for field  85
  time,time1,time2  732.,  336.,  1080.
  hours,int,period  723,  1,  12
  Information used in checking ancillary data set: position of lookup table in dataset: 46
  Position of first lookup table referring to data type  6
  Interval between lookup tables referring to data type  10  Number of steps 4
  STASH code in dataset  218   STASH code requested  218
 'start' position of lookup tables for dataset in overall lookup array  1462

So the problem might have happened while, or after, processor 0 tried to update the lateral boundary conditions? Could I have some advice on this, please?

Thanks,
Masaru

Change History (35)

comment:1 Changed 3 months ago by grenville

Masaru

You have run out of LBCs - you only have LBCs until 2011/05/31 - so you need to extend the time over which LBCs are available.
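One way to confirm how far the LBCs extend is to dump the file headers and check the last validity time. A minimal sketch, assuming the UM file-print utility pumf is on your path (the file path below is a placeholder, not your actual LBC file; depending on UM version, pumf may write its listing to separate output files rather than to the screen):

# Dump the headers of an LBC file; the validity times in the lookup
# listing show where the boundary data runs out.
pumf /path/to/your_area.lbc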

Grenville

comment:2 Changed 3 months ago by grenville

Have you changed the UMUI setup since submitting the job? tewnf says to run for 1 day and 1 hour, but it has run for 1 month.

comment:3 Changed 3 months ago by ggxmy

Hi Grenville,

How do I extend the time over which LBCs are available?

I can't remember for sure, but I think I had submitted the job for 1 day, and because it ran OK I made the run length longer on the ARCHER side by modifying CNTLALL, SUBMIT and umuisubmit_run. Should I have set the run length in the UMUI rather than on ARCHER? Or is there anything else I need to do on the ARCHER side?

Masaru

comment:4 Changed 3 months ago by willie

Hi Masaru,

The original job was designed to run for 153 days in 30-day chunks, with an LBC file for each chunk. The bottom script (Model Selection → Input/Output Control … → Script inserts …) handles the transition from one LBC file to the next. Looking back at the original xlhub, it was taking nearly 42 hours in the ARCHER long queue to do the 30 days.
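In outline, the insert script keeps a list of the chunk LBC files and, at each resubmission, points the model's boundary-input variable at the next one. A minimal sketch of the idea only (the chunk counter and the ALABCIN1 variable name are assumptions here, not a copy of the actual lbc_update_v3.scr):

# Sketch of an LBC-switching insert script: select the LBC file for the
# current 30-day chunk and hand it to the model.
LBC_dir=/work/n02/n02/masara/xklhf_makebc      # assumed location of the files
LBC_name[1]="xklhf_1.lbc"
LBC_name[2]="xklhf_2.lbc"
# ... one entry per 30-day chunk, up to LBC_name[6] ...
chunk=$RUN_NUMBER                              # hypothetical resubmission counter
export ALABCIN1=$LBC_dir/${LBC_name[$chunk]}   # boundary-file variable (assumed name)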

Regards
Willie

comment:5 Changed 3 months ago by ggxmy

Hi Willie,

Thanks for the note. It takes only about 24 hours to run 30 days now, but I submit the job to the long queue anyway.

The UMUI page you point to shows dummy_script and lbc_update_v3.scr, and I have both of these in my /work/n02/n02/masara/ directory. lbc_update_v3.scr contains the lines:

LBC_name[1]="xklhf_1.lbc"
LBC_name[2]="xklhf_2.lbc"
LBC_name[3]="xklhf_3.lbc"
LBC_name[4]="xklhf_4.lbc"
LBC_name[5]="xklhf_5.lbc"
LBC_name[6]="xklhf_6.lbc"

and these files are in /work/n02/n02/masara/xklhf_makebc/. I'm not sure whether the model knows where to look for them, but when the xklhf_makebc directory was missing it gave me an error, so I guess it does. So things look OK. Can you think of anything else I should do or check?

Masaru

comment:6 Changed 3 months ago by willie

Hi Masaru,

The LBC update script is custom-designed for the SWAMMA 4 km runs as originally envisaged. If you are just doing a repeat, it should work without problems. If you are modifying the run lengths, the update scripts might not be appropriate.

What are you trying to do?

Regards
Willie

comment:7 Changed 3 months ago by ggxmy

Hi Willie,

I just wanted to run the job for 5 months, but it failed at day 30 (0530); the outputs for that date have no values. I think in this case I submitted the run for 1 month and 1 day. (I usually do this to make sure I get monthly mean outputs. In this case, though, I haven't requested any monthly mean fields, so doing this might not have any benefit.) Should I have stuck to 30 days? Could I use automatic resubmission after the initial 30 days?

If possible I want to run the 153 days as four 40-day runs instead of six 30-day runs, as that would save queuing time. Can't this be done simply by changing the resubmission pattern?

Masaru

comment:8 Changed 3 months ago by willie

Hi Masaru,

You should stick to 30 days. I didn't have a problem with the monthly means in the past. With hindsight, I should have made the LBCs a bit longer to allow for short runs.

So do a 30-day NRUN, then switch reconfiguration off and do the continuation run. Barring ARCHER issues, it should run straight through.

Regards
Willie

comment:9 Changed 3 months ago by ggxmy

Hi Willie,

I tried to run a CRUN, hoping it would start from where it crashed, but it didn't work. I got this:

? Error in routine: inbounda
? Error Code:   101
? Error Message:  Boundary data starts after start of current boundary data interval
? Error generated from processor:     0

Do I need to redo the first month as NRUN?

Thanks,
Masaru

comment:10 Changed 3 months ago by willie

Hi Masaru,

The model is now in the wrong state, so you have to do the NRUN from the start for exactly 30 days.

Willie

comment:11 Changed 3 months ago by ggxmy

OK. Thanks.

So I ran tewnh for the first 30 days as an NRUN again. That seemed successful and completed yesterday. Then should I go back to the UMUI and submit the job as a CRUN for another 30 days? That's what I did, but on checking the outputs I was a bit confused.

-rw-r--r-- 1 masara n02    8199264 Jul  4 13:41 tewnha.pg20110629
-rw-r--r-- 1 masara n02    8199264 Jul  4 13:41 tewnha.pa20110629_12
-rw-r--r-- 1 masara n02   70329112 Jul  4 13:41 tewnha.pe20110629_18.nc
-rw-r--r-- 1 masara n02    8199264 Jul  4 13:41 tewnha.pe20110629_18
-rw-r--r-- 1 masara n02 1357354715 Jul  4 13:41 tewnha.pa20110629_12.nc
-rw-r--r-- 1 masara n02     302486 Jul  4 13:41 ioserver_stash_log.0024
-rw-r--r-- 1 masara n02 1691083970 Jul  4 13:41 tewnha.pj20110530.nc
-rw-r--r-- 1 masara n02 2780038601 Jul  4 13:41 tewnha.pi20110530.nc
-rw-r--r-- 1 masara n02 2786508116 Jul  4 13:41 tewnha.ph20110530.nc
-rw-r--r-- 1 masara n02 3240019171 Jul  4 13:41 tewnha.pg20110629.nc
-rw-r--r-- 1 masara n02 3289918439 Jul  4 13:41 tewnha.pg20110530.nc
-rw-r--r-- 1 masara n02   51687406 Jul  4 13:41 tewnha.pf20110530_18.nc
-rw-r--r-- 1 masara n02   84990642 Jul  4 13:41 tewnha.pe20110530_18.nc
-rw-r--r-- 1 masara n02 3232133400 Jul  4 13:41 tewnha.pd20110530.nc
-rw-r--r-- 1 masara n02  765539097 Jul  4 13:41 tewnha.pc20110530_12.nc
-rw-r--r-- 1 masara n02 3315531775 Jul  4 13:41 tewnha.pb20110530.nc
-rw-r--r-- 1 masara n02 1360680811 Jul  4 13:41 tewnha.pa20110530_12.nc

At first I thought the job had run May again, but it actually produced outputs for June. Getting outputs for 0530 after 0629 looks strange, but is it OK?

The run stopped there successfully, but the automatic resubmission failed and left a message like the one above:

? Error in routine: inbounda
? Error Code:   102
? Error Message:  Boundary data ends before end of current boundary data interval
? Error generated from processor:     0

Is this a problem or not? Do I just need to go back to the UMUI and submit a CRUN again? Do I need to repeat this for every 30 days of simulation?

Thanks.
Masaru

comment:12 Changed 3 months ago by ggxmy

Oh no! I resubmitted tewnh as a CRUN from the UMUI, but it crashed leaving the same message as above. Maybe I simply don't know how to submit a CRUN, because I haven't done one in the last few years. Could I have some advice, please?

Thank you.
Masaru

comment:13 Changed 3 months ago by willie

Hi Masaru,

It looks like the model is in a confused state. You should delete the working directory and do an NRUN again.

To do an NRUN, you need to switch off automatic resubmission (in the Resubmission pattern panel) and switch reconfiguration on.

To do a CRUN, when the NRUN is complete, you need to switch off reconfiguration and switch on automatic resubmission to allow the run to be continued. It will then run repeatedly until the end date (CRUN).

Regards
Willie

comment:14 Changed 3 months ago by ggxmy

Hi Willie,

By working directory, did you mean ARCHER:/work/n02/n02/masara/um/tewnh? Should I delete this directory?

Masaru

comment:15 Changed 3 months ago by willie

Hi Masaru,
Yes, the idea is to delete any incorrect history and give it a fresh start.
Willie

comment:16 Changed 3 months ago by ggxmy

Thank you, Willie.

Because transferring the outputs for June was taking a while, I copied the job to a different jobid (tewnl) and submitted it. Let's see how it goes.

In the meantime I kept tewnf going, hoping only to get results for June. It is in exactly the same situation as tewnh, so I had expected it to crash at the end of June. But it actually seems to have been resubmitted OK and is now simulating the third month (July). Lucky, isn't it?

Masaru

comment:17 Changed 2 months ago by ggxmy

I was waiting for a reply to the comment I made yesterday, but my comment itself does not seem to be here… It might not have gone through. The situation remains the same today.

I'm still struggling with this problem. Automatic resubmission fails with tewnh and another job again and again. I gave it another try after deleting the directories archer:/work/n02/n02/masara/um/tewnh and puma:/home/ggxmy/um/um_extracts/tewnh entirely. At this point it is still running the NRUN, so I will have to wait another day or two.

On the other hand, as I mentioned, tewnf was going OK up until 0927, when the final resubmission failed. It left the same error message as above (error code = 102). I manually resubmitted it as a CRUN from the UMUI, but it didn't work and left the same error again. Could you please help me with tewnf now?

Thank you.
Masaru

comment:18 Changed 2 months ago by ggxmy

A couple of NRUNs have finished now. I was expecting that if they failed they would do so at the end of the first CRUN (end of June), so I wasn't worried about the NRUNs. But I actually found that these NRUNs had not finished without problems… All of tewnh, tewnm and tewnn crashed on their 30th day of simulation, leaving the same error as at the top of this ticket (Error Code: 11). This is a really shocking result to me… Can you help me, please?

Masaru

comment:19 Changed 2 months ago by ggxmy

  • Priority changed from high to highest

comment:20 Changed 2 months ago by willie

Hi Masaru,

This ticket is really confusing now. You are running too many models (tewnf, h, l, m and n) for me to see what's going on.

I think tewnl was the one that you ran from the start with 30 day CRUNs. Is this still working?

All the others are, I think, irrelevant, there being no difference between them except how they were initiated. It is really important, if a run fails, that you do not resubmit it without first finding out the reason for the failure. Just resubmitting will fail again and muddy the waters.

Regards
Willie

comment:21 Changed 2 months ago by ggxmy

Even though I ran a few jobs, all of them other than tewnf have encountered the same problem. The problem with tewnf is basically the same too; it just happened at a different stage. If the problem with tewnh is resolved, the problem with all the other jobs will likely be resolved too. So while the situation may appear complicated at first glance, it actually is not.

Because I had deleted all the relevant directories, I thought the jobs were quite likely to run, and I expected that they would all run for at least 60 days. If they run for 60 days I can make some progress with my scientific analysis. I have spent too much time on trial and error in running the UM jobs, and I have no time to wait until I see tewnh run OK before submitting other jobs. That's why I ran multiple jobs at the same time.

The ticket is very long now, but the message is simple: I'm still getting errors 11, 101 and 102. The job I want checked is tewnh. Can you do that?

Masaru

comment:22 Changed 2 months ago by willie

Hi Masaru,

I have taken a copy of your tewnh job and will try to run it for 35 days (see my job xobtc).

As I was doing this I realised that the instructions I gave for NRUNs/CRUNs were not complete. I am sorry about that.

It turns out that there is a key switch on the Compilation and Run Options → Compile and run options for Atmosphere page.

To do an NRUN, deselect "run the reconfiguration". This activates the "Type of Run" section. Then tick NRUN. You can then reselect "run the reconfiguration". Then Save, Process and Submit.

To do a CRUN, deselect "run the reconfiguration" and then select CRUN. Then Save, Process and Submit.

So on the Input/Output Control and Resources → Re-submission page, the "automatic resubmission" button should be selected throughout NRUNs and CRUNs.

To run for 35 days, set this on the Start date and run length options page. This never changes.

I expect xobtc to complete in a couple of days.

Once again, I am sorry for confusing you.

One other thing I noticed was that the Ancil Headers had been set to 100,000 (in the Ancillary → Infile related options → Header record sizes page). In my run I have changed this back to 50,000, as it was in xlhub.

Regards
Willie

comment:23 Changed 2 months ago by ggxmy

Thanks, Willie. Okay, so tewnf might have happened to run because I had submitted it before following your previous instructions. I have just resubmitted tewnh. I hope it runs OK this time. Fingers crossed.

Masaru

comment:24 Changed 2 months ago by willie

Hi Masaru,

The job xobtc has completed the 35 days successfully: one NRUN followed by one CRUN. See the leave files at /home/n02/n02/wmcginty/output/xobtc*.

There are some netCDF errors reported but these can be ignored since the variables in question don't need to be output.

Regards
Willie

comment:25 Changed 2 months ago by ggxmy

Thanks, Willie. If the 30-day simulation finishes successfully, the next hurdle is the first automatic resubmission, which takes place after finishing day 60 and before starting day 61. That is when I got error 101 above.

My tewnh also finished the 30-day simulation successfully, but the CRUN might have somehow been killed without leaving any output at all…

-rw------- 1 masara n02         0 Jul 13 03:05 tewnh000.tewnh.d18193.t185022.leave

tewnh.fort6.pe0 has the newest timestamp of all the PE outputs; it appears to me to end abruptly, and I don't see any suggestion of a problem. Its last few lines are like this:

 Minimum theta level 1 for timestep  22442
            This timestep           This run
   Min theta1     proc      Min theta1 timestep
     286.447     112    283.771 19704
   
  Maximum vertical velocity at timestep  22442       Max w this run 
    w_max   level  proc     run w_max level timestep
   0.330E+01  30   43  0.507E+01   18 19405

I checked a couple of other PE outputs and they look similar. It seems to have been simulating day 0608 at that time. I have never seen anything like this. Are you aware of any disruption on ARCHER this morning around 3 am? http://www.archer.ac.uk/status/ doesn't show me anything.

Should I resubmit the same CRUN? I'll try it anyway.

Thanks,
Masaru

comment:26 Changed 2 months ago by ggxmy

This may be the problem…

User disk info
Volume       Usage        Quota        Files
work (fs2)   20,001 GiB   20,000 GiB   55,999

I'll delete some of the files before resubmitting the job.
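As a quick way to see what is taking the space, something like the following should work (the paths are assumed from earlier comments, and GNU du/sort/ls are assumed to be available):

# List the biggest job directories, then the biggest files inside one:
du -sh /work/n02/n02/masara/um/*/ | sort -rh | head
ls -lS /work/n02/n02/masara/um/tewnh | head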

comment:27 Changed 2 months ago by ggxmy

I never had a chance to ask this: is it OK to delete the /work/n02/n02/masara/um/[jobid]/core.* files? I think I saw somebody using core files a long time ago, but I've never used them myself and never learnt how. Also, is it OK to delete the [jobid]a.p?2011mmdd(_hh) files (?=a-j)? I know I can delete many of the .da files, although I want to keep some of them.

Thanks,
Masaru

comment:28 Changed 2 months ago by ggxmy

tewnh crashed on day 0629, leaving the same error as at the top of this ticket (error code 11)… The situation is very similar to before, if not exactly the same; I got the same error at a slightly different point. It crashed at the end of the first CRUN period instead of at the end of the NRUN period.

Masaru

comment:29 Changed 2 months ago by ggxmy

Willie, can you test your job for another two months as a CRUN with automatic resubmission turned on?

Masaru

comment:30 Changed 2 months ago by ggxmy

I found the following line in SCRIPT and umuisubmit_run:

export lbc_increment=0,1,0,0,0,0

Is this OK? Shouldn't it be "export lbc_increment=0,0,30,0,0,0", because I use a 365-day calendar?
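If the six comma-separated fields are years, months, days, hours, minutes and seconds (my assumption, based on 0,1,0,0,0,0 looking like one calendar month), the two settings would compare like this:

# Assumed field order: years,months,days,hours,minutes,seconds.
# 0,1,0,0,0,0  -> advance the LBC pointer by one calendar month
# 0,0,30,0,0,0 -> advance it by exactly 30 days (one chunk)
export lbc_increment=0,0,30,0,0,0
IFS=, read yy mo dd hh mn ss <<EOF
$lbc_increment
EOF
echo "LBC increment: ${yy}y ${mo}mo ${dd}d ${hh}h ${mn}min ${ss}s"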

Thanks.
Masaru

comment:31 Changed 2 months ago by willie

Hi Masaru,

xobtc is now running for a total of 65 days. This will take it past 29th June. I expect it to finish normally. It should not be necessary to edit any parameters to make it run.

The CRUNs rely on the existence of the daily dumps, as well as other history files, to restart from. For this reason you should keep all the dumps until you are sure that the entire run has worked.

The core* files are sometimes created when the model crashes. They can be important for determining the cause of a failure and correcting it before deciding how to restart the model. You can delete them if you wish.
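If you do want to look inside one, a minimal sketch, assuming gdb is available in your environment and that the executable path below matches the binary your run actually used (both the path and the executable name are assumptions):

# Print the stack trace recorded in a core file; substitute the real
# executable and core-file names from your job directory:
gdb --batch -ex bt /work/n02/n02/masara/um/tewnh/bin/tewnh.exe core.0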

Regards
Willie

comment:32 Changed 2 months ago by ggxmy

Keeping all the dump files did not solve the problem… The simulation seems to have finished 60 days (two 30-day periods, i.e. through the first CRUN period), but then the first automatic resubmission failed. The error code is 102, so this is not a new problem.

This happened regardless of lbc_increment: tewnh with "export lbc_increment=0,0,30,0,0,0" and tewni with "export lbc_increment=0,1,0,0,0,0" failed in the same way.

If your (Willie's) copy of tewnh runs OK but mine does not, what could the problem be? Could it be my personal profile? Was there anything I needed to do before running these jobs?

Masaru

Last edited 2 months ago by ggxmy

comment:33 Changed 2 months ago by ggxmy

Willie,
Have you let your job run until the end? How did it go? If it finished OK, could you please let me use the outputs? Also, would it be possible for you to run two other jobs for me? Would it cost you a significant amount of MAU (budget)?
Masaru

Last edited 2 months ago by ggxmy

comment:34 Changed 7 weeks ago by ggxmy

Mohit Dalvi modified the lbc_update_*.scr script a little for me, and I tried running tewnh with it. It finally went beyond the 60-day barrier and is now running month 09! So I'm running two other jobs in the same way as well. I hope they all run OK until the end.

I will wait a little longer and see how tewni and tewnn go before I close this ticket.

Masaru

comment:35 Changed 6 weeks ago by ggxmy

  • Resolution set to fixed
  • Status changed from new to closed

Now 3 of the 4 jobs have finished. One crashed, but that was because the disk quota was reached, not because of this problem, so I'm closing this ticket. Thank you for your support.

Masaru
