Opened 7 years ago

Closed 6 years ago

#1029 closed help (fixed)

CRUN problem

Reported by: emxin
Owned by: willie
Component: UM Model
Keywords: Automatic resubmission, NRUN, CRUN
Cc:
Platform: HECToR
UM Version: 7.3

Description

Dear helpdesker:

Recently I have experienced some problems with CRUN, though the NRUN seemed to work successfully. The target length of each job is set to 9 months (for a 12-hour walltime), but when I used CRUN to resubmit jobs, the job seemed to ignore this target length and kept running until it hit the 12-hour limit, which means it cannot make a continuation run afterwards. One of my job IDs is xhjgy; the NRUN and CRUN reports (.leave files) are below (note: the job had run successfully for about 17 years, but then stopped for an unknown reason. All the problems appeared after that time):
/home/n02/n02/emxin/um/umui_out/xhjgy000.xhjgy.d13056.t170012.leave
/home/n02/n02/emxin/um/umui_out/xhjgy000.xhjgy.d13053.t143342.leave

Bests
Xin

Change History (9)

comment:1 Changed 7 years ago by willie

  • Owner changed from um_support to willie
  • Platform changed from <select platform> to HECToR
  • Status changed from new to accepted

Hi Xin,

You have selected HECToR archiving, so you need to have the FCM branch hector_monsoon_archiving switched on.

I hope that helps.

Regards,

Willie

comment:2 Changed 7 years ago by emxin

Hi there,

Thanks for the reply, but can you instruct me in more detail on how to solve the problem, e.g. how to switch on the FCM branch hector_monsoon_archiving? To be honest, I do not know why or how I got this problem.

thanks
Xin

comment:3 Changed 7 years ago by willie

Hi Xin,

If you don't need the archiving, then in Post Processing > Main switch, select "No" for "Is automatic archiving required?".

If you do need archiving, then go to FCM configuration > FCM options for the UM and set the branch fcm:um_br/dev/jeff/VN7.3_HadGEM3-A_r2.0_hector_monsoon_archiving/src to be included.

Regards,

Willie

comment:4 Changed 7 years ago by emxin

Hi Willie,

After I switched on hector_monsoon_archiving, the problem is still there. I even made a completely new run (by removing the old directory on HECToR) and then let it do a CRUN. The job seems to have ignored the target length (which is 9 months in this case) for a 12-hour walltime and kept running until it reached the 12-hour limit and stopped.

Bests
Xin

comment:5 Changed 7 years ago by willie

Hi Xin,

Is this working now? I notice you've been running xhjgy in the past couple of days.

Regards,

Willie

comment:6 Changed 7 years ago by emxin

Hi Willie,

Sadly, it is not working. I have tried many things, e.g. making a completely new run or reducing the time limit of the target job, but all failed. I am wondering whether this problem is because my job is too slow and cannot finish a whole year of integration in the 12-hour walltime. I used a target length of 9 months for a job limit of 43200 seconds. It is very strange, as the job did not stop immediately after it finished the 9 10 months integration. It kept running until it reached the 12-hour limit. Thus, no job was resubmitted. Can you give further help?
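
As a rough check of the "too slow" hypothesis, the arithmetic can be sketched as below. This is only an illustration using the numbers quoted above (9 model months, 43200 seconds); the function names and the example timings are hypothetical and nothing here comes from the UM itself.

```python
def max_seconds_per_month(walltime_s: float, target_months: int) -> float:
    """Wallclock budget per model month if the whole chunk must fit in one job."""
    return walltime_s / target_months


def chunk_fits(walltime_s: float, target_months: int, measured_s_per_month: float) -> bool:
    """True if a chunk of target_months should finish before the walltime limit.

    measured_s_per_month is a hypothetical timing taken from an earlier run.
    """
    return measured_s_per_month * target_months <= walltime_s


# Numbers from this ticket: 9 months target length, 43200 s job limit.
budget = max_seconds_per_month(43200, 9)
print(budget)  # 4800.0 seconds, i.e. 80 minutes per model month
```

If the measured cost per model month is above this budget, the job would be expected to hit the walltime limit before reaching the target length, and a shorter target length would be needed.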

Bests
Xin

comment:7 Changed 7 years ago by willie

  • Keywords Automatic resubmission, NRUN, CRUN added

Hi Xin,

I took a copy of your job and modified it so that it ran for three days in chunks of one day. I also set it so that it output a dump at the end of each day. This ran fine. Each run takes a few minutes, so this is a good way to test your setup.

I found that once the initial NRUN is completed, then you have to edit the SUBMIT file on PUMA and then press the submit button on the UMUI - it is important not to SAVE or PROCESS.
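
For illustration only, the edit to the SUBMIT file could be scripted along these lines. The sketch assumes that the run type is set by a line of the form TYPE=NRUN and that it should become TYPE=CRUN; this ticket does not state which line is edited, so check your own SUBMIT file first. The function name and the sample text are hypothetical.

```python
import re


def switch_to_crun(submit_text: str) -> str:
    """Return submit_text with the run type changed from NRUN to CRUN.

    Assumption (not stated in this ticket): the SUBMIT file sets the run
    type with a line such as TYPE=NRUN.
    """
    new_text, count = re.subn(
        r"^(\s*TYPE=)NRUN\b", r"\1CRUN", submit_text, flags=re.MULTILINE
    )
    if count == 0:
        # Fail loudly rather than silently leaving the file unchanged.
        raise ValueError("no TYPE=NRUN line found")
    return new_text


# Hypothetical fragment, not copied from a real SUBMIT file.
sample = "STEP=4\nTYPE=NRUN\n"
print(switch_to_crun(sample))
```

The function works on text rather than on the file itself, so it can be tried on a copy before touching the real SUBMIT file.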

Regards,

Willie

comment:8 Changed 6 years ago by emxin

Hi, the problem has been solved by copying a new .profile and .bashrc from a colleague. It seems the compiler I used on HECToR was for the pTOMCAT model and not for UKCA.
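
For anyone hitting something similar, one way to see what differs between two versions of a start-up file is a plain text diff. The sketch below is only an illustration: the function name is hypothetical and the file contents are placeholders, not the actual settings from this ticket.

```python
import difflib


def dotfile_diff(old_text: str, new_text: str, name: str = ".profile") -> list:
    """Return unified-diff lines between two versions of a start-up file."""
    return list(
        difflib.unified_diff(
            old_text.splitlines(),
            new_text.splitlines(),
            fromfile=name + " (old)",
            tofile=name + " (new)",
            lineterm="",
        )
    )


# Placeholder contents; the real settings are not given in this ticket.
old = "export EXAMPLE_SETTING=for_model_a\n"
new = "export EXAMPLE_SETTING=for_model_b\n"
for line in dotfile_diff(old, new):
    print(line)
```

Lines starting with "-" appear only in the old file and lines starting with "+" only in the new one, which helps narrow down which setting changed.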

Xin

comment:9 Changed 6 years ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed