Opened 5 years ago

Closed 5 years ago

#1339 closed help (fixed)

.leave files empty during crun on ARCHER

Reported by: ih280 Owned by: um_support
Component: UM Model Keywords:
Cc: Platform: ARCHER
UM Version: 7.3

Description

Hello,

I am running a CheST job with the UM@vn7.3.
I have no problems compiling the job and the nrun works fine, as does the crun for the first resubmission. But from the second resubmission onwards (jobid003 and later files) the .leave files are empty. The job continues to run and produces pp-files.

ih280@eslogin006:~/output> ls -l xjmxi0*
10543831 Aug 7 16:25 xjmxi000.xjmxi.d14219.t160227.comp.leave
99309307 Aug 8 05:33 xjmxi000.xjmxi.d14219.t160227.leave
99572043 Aug 8 18:19 xjmxi000.xjmxi.d14220.t093022.leave
99549184 Aug 9 03:48 xjmxi002.xjmxi.d14220.t181949.leave
0 Aug 9 12:08 xjmxi003.xjmxi.d14221.t034832.leave

Can you please tell me what may be the reason for this?
And can I trust the output from the .pp-files even though the .leave files are empty?
This also happened for the last job I submitted, and at first glance the data produced look reasonable.

Thank you in advance.
Best,
Ines

Change History (7)

comment:1 Changed 5 years ago by willie

Hi Ines,

You have only completed 77680 time steps at 20 min, or ~3 years. The xjmxi002 leave file has been terminated abruptly, perhaps because you ran out of disk space. I am not sure what is happening, but it would be a good idea to review your CRUN setup - see http://cms.ncas.ac.uk/wiki/Docs/AutomaticResubmission.
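The ~3 year figure can be checked with a little arithmetic. The helper below is only a sketch: the function name steps_to_years is made up for illustration, and the 360-day model calendar is an assumption (pass 365 as the third argument if your job uses a Gregorian-style calendar).

```shell
# Convert a count of model time steps into simulated years.
# Usage: steps_to_years STEPS MINUTES_PER_STEP [DAYS_PER_YEAR]
# Assumption: DAYS_PER_YEAR defaults to 360 (a 360-day model calendar).
steps_to_years() {
    awk -v s="$1" -v m="$2" -v d="${3:-360}" \
        'BEGIN { printf "%.2f\n", s * m / (60 * 24 * d) }'
}

steps_to_years 77680 20        # prints 3.00
steps_to_years 77680 20 365    # prints 2.96
```

Either way the run reached roughly three model years, which is the figure quoted above.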

I notice also that the model eventually crashes producing a 'core' file in /work/n02/n02/ih280/um/xjmxi. If you let me have read permissions on this file

  chmod g+rX /work/n02/n02/ih280/um/xjmxi/core

I might be able to get more information.

Regards

Willie

comment:2 Changed 5 years ago by ih280

Dear Willie,

Thanks for this reply.
I changed the permissions for the file.

How can there be .pm files up until the 7th year if the model terminated abruptly after time step 77680?

Best,
Ines

comment:3 Changed 5 years ago by willie

Hi Ines,

The core file was unhelpful - it was truncated too. If you do

 du -mshc /work/n02/n02/ih280/*

you will see that you have run out of quota. I think this is the primary cause of the problem. Your tmp directory is unusually large at 192GB, so you could delete all the contents of that. You'll probably have to log out and back in again afterwards. Your job xjmxi is currently taking 486GB and will be larger by the time it finishes: you can estimate its final size from the fraction of the run completed so far.
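One rough way to make that estimate is to scale the current size linearly with model time. This is only a sketch: project_size_gb is a hypothetical helper, it assumes output grows at a constant rate per model year, and the 7-of-10-years figures in the example are placeholders rather than values taken from this job.

```shell
# Linearly project the final size of a job's output directory.
# Usage: project_size_gb CURRENT_GB YEARS_DONE YEARS_TOTAL
# Assumption: output volume grows at a constant rate per model year.
project_size_gb() {
    awk -v cur="$1" -v done="$2" -v total="$3" \
        'BEGIN { printf "%.0f\n", cur * total / done }'
}

# 486 GB is the current size of xjmxi; 7 and 10 are placeholder run lengths.
project_size_gb 486 7 10    # prints 694
```

Compare the projected figure with your quota before resubmitting, and leave some headroom for restart dumps and core files.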

On /home you may have run out of space too. This would account for the model still running but being unable to output the leave files.

So, delete some files and try again. Let us know if you need more disk space.

Regards,

Willie

comment:4 Changed 5 years ago by ih280

Hi Willie,

Thank you! I will free up some space.

Can I trust the data from the job that has been written before I ran out of disk space?

Best,
Ines

comment:5 Changed 5 years ago by willie

Hi Ines,

The .leave files are important for gaining trust in the output. If these contain errors then obviously the data is untrustworthy. In their absence you would have to base your confidence on previous runs of the same model. However, running the same model with different data can produce new errors.
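A quick way to triage a directory of leave files is to list the empty ones and then search the rest for a message of interest. The two functions below are a sketch with made-up names, and the default search pattern "error" is only a placeholder, not a list of the messages the UM actually writes.

```shell
# List zero-byte .leave files in a directory (default: current directory).
empty_leave_files() {
    find "${1:-.}" -maxdepth 1 -type f -name '*.leave' -size 0 | sort
}

# List non-empty .leave files containing a pattern, ignoring case.
# Assumption: "error" is a placeholder; substitute the messages you care about.
leave_files_matching() {
    find "${1:-.}" -maxdepth 1 -type f -name '*.leave' -size +0 \
        -exec grep -l -i -e "${2:-error}" {} + | sort
}

# Example calls (the directory is wherever your job writes its leave files):
#   empty_leave_files "$HOME/output"
#   leave_files_matching "$HOME/output" "error"
```

A leave file that appears in neither list is not proof that the run was clean; it only means it is non-empty and does not contain the pattern searched for.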

Regards,

Willie

comment:6 Changed 5 years ago by ih280

Hi Willie,

Thank you for all this help!
I will have a look into it.

Best,
Ines

comment:7 Changed 5 years ago by annette

  • Resolution set to fixed
  • Status changed from new to closed

Hi Ines,

I assume that this resolved your issue, so I am closing the ticket. If you have any further queries do please get in touch.

Annette
