Opened 8 years ago
Closed 8 years ago
#1029 closed help (fixed)
CRUN problem
| Reported by: | emxin | Owned by: | willie |
|---|---|---|---|
| Component: | UM Model | Keywords: | Automatic resubmission, NRUN, CRUN |
| Cc: | | Platform: | HECToR |
| UM Version: | 7.3 | | |
Description
Dear helpdesker:
Recently I have had some problems with CRUN, although the NRUN seemed to work. The target length of each job is set to 9 months (for a 12-hour walltime), but when I used CRUN to resubmit jobs, the job seemed to ignore this target length and kept running until it hit the 12-hour limit, which means it cannot be continued after that run. One of my job IDs is xhjgy; the NRUN and CRUN reports (.leave files) are below (note: the job ran successfully for about 17 years, but then stopped for an unknown reason. All the problems appeared after that time):
/home/n02/n02/emxin/um/umui_out/xhjgy000.xhjgy.d13056.t170012.leave
/home/n02/n02/emxin/um/umui_out/xhjgy000.xhjgy.d13053.t143342.leave
Bests
Xin
Change History (9)
comment:1 Changed 8 years ago by willie
- Owner changed from um_support to willie
- Platform changed from <select platform> to HECToR
- Status changed from new to accepted
comment:2 Changed 8 years ago by emxin
Hi there,
Thanks for the reply, but can you explain in more detail how to solve the problem? E.g. how do I switch on the FCM branch for HECToR archiving? To be honest, I do not know why or how I got this problem.
thanks
Xin
comment:3 Changed 8 years ago by willie
Hi Xin,
If you don't need the archiving, then in Post Processing > Main switch, select "No" for "Is automatic archiving required?".
If you do need archiving, then go to FCM configuration > FCM options for the UM and set the branch fcm:um_br/dev/jeff/VN7.3_HadGEM3-A_r2.0_hector_monsoon_archiving/src to be included.
Regards,
Willie
comment:4 Changed 8 years ago by emxin
Hi Willie,
After I switched on hector_monsoon_archiving, the problem is still there. I even made a completely new run (by removing the old directory on HECToR) and then let it do a CRUN. The job seemed to ignore the target length (9 months in this case) for a 12-hour walltime and kept running until it reached the 12-hour limit and stopped.
Bests
Xin
comment:5 Changed 8 years ago by willie
Hi Xin,
Is this working now? I notice you've been running xhjgy in the past couple of days.
Regards,
Willie
comment:6 Changed 8 years ago by emxin
Hi Willie,
Sadly, it is not working. I have tried many things, e.g. making a completely new run or reducing the time limit of the target job, but all failed. I am wondering if this problem is because my job is too slow and cannot finish a whole year of integration in the 12-hour walltime? I used a target length of 9 months for a job limit of 43200 seconds. It is very strange, as the job did not stop immediately after it finished the 9-10 months of integration. It kept running until it reached the 12-hour limit. Thus, no job was resubmitted. Can you give further help?
Bests
Xin
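The question of whether a 9-month target fits inside a 43200-second job limit can be checked with simple arithmetic. The sketch below is only an illustration and not part of the UM or the UMUI: the function name `fits_in_walltime`, the safety margin and the cost per model month are all hypothetical, since the ticket does not give a measured throughput. Only the 9-month target and the 43200-second limit come from the ticket.

```python
def fits_in_walltime(target_months, seconds_per_model_month,
                     walltime_seconds, margin=0.1):
    """Estimate whether one resubmission chunk fits in the job time limit.

    seconds_per_model_month is a measured or guessed cost (hypothetical here).
    margin reserves a fraction of the walltime for start-up and dump writing.
    Returns (fits, estimated_seconds).
    """
    if not 0.0 <= margin < 1.0:
        raise ValueError("margin must be in [0, 1)")
    estimated = target_months * seconds_per_model_month
    budget = walltime_seconds * (1.0 - margin)
    return estimated <= budget, estimated


# 9 months and 43200 s are from the ticket; 4000 s per model month is a
# made-up placeholder and should be replaced with a value from a real run.
fits, estimated = fits_in_walltime(9, 4000.0, 43200)
print(fits, estimated)
```

If the estimate does not fit, reducing the target length per job is the obvious adjustment. If it fits comfortably and the job still runs to the limit, as described above, the cause is more likely to lie elsewhere in the setup.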
comment:7 Changed 8 years ago by willie
- Keywords Automatic resubmission, NRUN, CRUN added
Hi Xin,
I took a copy of your job and modified it so that it ran for three days in chunks of one day. I also set it so that it output a dump at the end of each day. This ran fine. Each run takes a few minutes, so this is a good way to test your setup.
I found that once the initial NRUN is completed, then you have to edit the SUBMIT file on PUMA and then press the submit button on the UMUI - it is important not to SAVE or PROCESS.
Regards,
Willie
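The short test described above (three days in chunks of one day, starting with an NRUN and continuing with CRUNs) can be pictured as a simple schedule. The following is a minimal sketch of that idea only; `plan_chunks` is a hypothetical helper and does not reproduce how the UM scripts or the UMUI actually implement automatic resubmission.

```python
def plan_chunks(total_days, chunk_days):
    """Split a run into resubmission chunks.

    The first chunk is labelled NRUN (new run) and the rest CRUN
    (continuation run). Returns a list of (kind, start_day, end_day).
    """
    if total_days <= 0 or chunk_days <= 0:
        raise ValueError("total_days and chunk_days must be positive")
    chunks = []
    start = 0
    while start < total_days:
        end = min(start + chunk_days, total_days)
        kind = "NRUN" if start == 0 else "CRUN"
        chunks.append((kind, start, end))
        start = end
    return chunks


# The test case from the comment above: three days in one-day chunks.
for kind, start, end in plan_chunks(3, 1):
    print(f"{kind}: day {start} to day {end}")
```

A schedule this short takes only a few minutes per chunk, which is why it is a convenient way to confirm that each chunk stops at its target length and the next one is submitted.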
comment:8 Changed 8 years ago by emxin
Hi, the problem has been solved by copying a new .profile and .bashrc from a colleague. It seems the compiler I used on HECToR was for the pTOMCAT model and not for UKCA.
Xin
comment:9 Changed 8 years ago by willie
- Resolution set to fixed
- Status changed from accepted to closed
Hi Xin,
You have selected HECToR archiving, so you need to have the FCM branch hector_monsoon_archiving switched on.
I hope that helps.
Regards,
Willie