Opened 10 years ago

Closed 8 years ago

#517 closed help (fixed)

Ensemble run using HiGEM1.2 is doing the job for one member at a time and getting terminated

Reported by: d.marathayil Owned by: simon
Component: UM Model Keywords: HiGEM, Ensemble run
Cc: Platform:
UM Version: 6.1

Description (last modified by ros)

Hi,

This is regarding an ensemble experiment using HiGEM1.2. I followed the instructions given in http://ncas-cms.nerc.ac.uk/index.php/um-documentation/ncas-user-guides/1435-umcet to set up my ensemble run. I successfully compiled the code and set up the run job. For this job I requested 43200 sec of wall time in the queue for a 2-month run. The job was terminated with a message saying the run time exceeded the wall time. When I looked at the output files generated by the run, I could see that it had completed the 2-month run for 1 ensemble member but not for the other members. Could you please help me find out why it is different for each ensemble member? My job id is xfiug and you can find the ensemble members in /work/n02/n02/marathayil/xfiug/iod_2042. The .leave file for this job is in my umui_out folder and is named xfiug000.xfiug.d10277.t155222.leave.

Many thanks,
Deepthi

Change History (24)

comment:1 follow-up: Changed 10 years ago by simon

Hi,

Could you give me read permission for /work/n02/n02/marathayil/

Simon.

comment:2 in reply to: ↑ 1 ; follow-up: Changed 10 years ago by d.marathayil

Hi Simon,

I am sorry about that. Right now I am not able to access HECToR. I will change the permission as soon as I can access HECToR again.

Many Thanks,
Deepthi

comment:3 in reply to: ↑ 2 Changed 10 years ago by d.marathayil

Hi Simon,

I have changed the permission now.

Many thanks,
Deepthi

comment:4 Changed 10 years ago by simon

  • Owner changed from um_support to simon
  • Status changed from new to accepted

comment:5 Changed 10 years ago by simon

Hi,

I think I know what the problem is, but I need to recompile the model to check. I will try once the compiler is working again on HECToR.

Simon.

comment:6 Changed 10 years ago by simon

Hi Deepthi,

I think I've solved the problem. You need to remove the mod
$SIMON_MODS/async_filter
and recompile your executable. This is an optimisation mod and unfortunately it is incompatible with UMCET. I've run a test for a day using your setup and all 8 members ran OK.

I'm sorry for the delay in fixing this, but for most of the last month either the XT4 hasn't been available or the compiler has been broken.

Simon.

comment:7 Changed 10 years ago by d.marathayil

Hi Simon,
I have created another ensemble job with the mod file you mentioned turned off. Initially I created a run job named xfmib, in which I used the .exec file compiled by the xfmic job. It gave me error messages such as history file not found and Load module cannot find executable. When I discussed this with my supervisor (Andy Turner), he found that the error is associated with the aprun command, where no number is specified. So I created a new run job xfmid (using the executable from xfmic) and changed the path $UMCET_SCRIPT to /work/n02/n02/agt/mods/umcet/ for gen_env.mu and pum_full_6.1_umcet.mu. But now I get a different error message saying there is an error in the nupdate command.
The output for my compilation run is xfmic000.xfmic.d10311.t11512.comp.leave
The output file for the xfmib job is xfmib000.xfmib.d10311.t132831.leave
And that for the xfmid job is xfmid000.xfmid.d10312.t092352.leave
You can find all these files in my umui_out directory.
Many thanks,
Deepthi

comment:8 Changed 10 years ago by simon

Hi,

In the UMUI, open the "Job submission" window, click on the "Use non-default number of cores per node" button and set the value to 4. Then revert the original script mods and submit the model.

Simon.

comment:9 Changed 10 years ago by d.marathayil

Hi Simon,

It worked. Many thanks for your help.

Deepthi

comment:10 Changed 10 years ago by d.marathayil

Hi Simon,

The run was terminated again. It ran for just over 1 minute. The errors it shows are now different. It shows Error code 202 with Routine generating error Initial. It is also showing a segmentation error. The other error is Deck qsexecuted not found in program Library (Modset GENENV)

My run job is xfmid and the output is saved in my umui_out directory (xfmid000.xfmid.d10312.t175122.leave).

comment:11 Changed 10 years ago by d.marathayil

Hi Simon,

I think I figured out the problem. It is working now.

Thanks again for your help.

Deepthi

comment:12 Changed 10 years ago by d.marathayil

Hi Simon,

In my ensemble run (xfmid) I set the walltime to 43200 and asked for a 2-month run. When I checked after 12 hrs, it had only finished April for all of my ensemble members and 20 days of May. The .leave file showed that the job was terminated because it exceeded the walltime. I ran qsubmit after editing the SUBMIT file in my Ensemble_control directory. It then started running again and finished the simulation for May but didn't go any further. Do you know why this is happening? I need to run the ensemble for 1 year and a few months.

Thanks,
Deepthi
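
As a rough check on the figures in this comment: 43200 s is 12 hours, and the run covered April plus 20 days of May in that time. The sketch below extrapolates from that rate. It assumes 30-day model months (the ticket does not state the calendar) and a constant cost per model day; the function name is illustrative only.

```python
def walltime_needed_hours(walltime_s, days_done, days_target):
    """Extrapolate the wall time (in hours) needed for days_target model days,
    given that days_done model days completed in walltime_s seconds.
    Assumes a constant cost per model day, which is only an approximation."""
    hours_used = walltime_s / 3600.0
    return hours_used * days_target / days_done

# Figures from the comment above; 30-day months are an assumption.
days_done = 30 + 20      # April plus 20 days of May
days_target = 2 * 30     # the requested 2-month run
print(walltime_needed_hours(43200, days_done, days_target))
```

Under these assumptions a 2-month run needs roughly 14.4 hours, which is more than the 12 hours requested, so hitting the walltime limit during May is what this rate would predict.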

comment:13 Changed 10 years ago by simon

Hi,

I can't see your output files due to their permissions. One thing you could do
is restart the run, this time using 50 resubmission rather than 2 months. Hopefully
then the resubmission should work.

Sorry for all the problems, this code has never been run with HiGEM before.

Simon.

comment:14 Changed 10 years ago by d.marathayil

Hi Simon,

I tried a new run (xfmie) with 1 month for resubmission. It finished the N run and I submitted the C run. The model ran for another month and didn't go any further.

You can find xfmie000.xfmie.d10313.t174743.leave in my umui_out directory.

Thank you,
Deepthi

comment:15 follow-up: Changed 10 years ago by simon

Hi Deepthi,

Try changing gsm.ll.mu to gsm.qsub.mu in your script mods and
re-running.

Simon.

comment:16 in reply to: ↑ 15 Changed 10 years ago by d.marathayil

Hi Simon,

I have made a new job 'xfmig' with the gsm.qsub.mu modification. Now the problem is that even the N run exceeds the walltime. I have looked at all the ensemble members and found that the 2nd and 4th didn't finish completely. The resubmission time I set for this job is 1 month.

You can find the output in xfmig000.xfmig.d10319.t111310.leave.

thanks,
Deepthi

comment:17 Changed 10 years ago by simon

Hi,

Can you give me read permission for that file?

Simon.

comment:18 Changed 10 years ago by simon

Hi again,

Don't bother, I can read the file now.

Simon.

comment:19 Changed 10 years ago by d.marathayil

Hi Simon,

Yesterday I tried to run the ensemble members of one of my experiments individually. So I changed the modifications I had made to set up an ensemble run and set the resubmission to two months. It finished the first month of my run (April) within 4 hrs, but it couldn't finish the second month (May) within the remaining 8 hrs and stopped, saying it had exceeded the walltime. When I checked the setup I found that I had forgotten to switch on $SIMON_MODS/async_filter, which I had kept switched off for the ensemble run. After resetting that, my job finished in the expected time. I thought I should let you know so that you can check whether the problem I am facing is related to that. I am not sure whether that is the real problem.

Thank you,
Deepthi

comment:20 Changed 10 years ago by simon

Hi Deepthi,

I took a copy of your ensemble and ran it myself overnight; all members completed a month and the run stopped cleanly. I will try to resubmit it once HECToR is back up. The async_filter mod is an optimisation mod and it can be used when you're not using UMCET. If I understand correctly, your individual run didn't finish and exceeded the walltime in a similar way to your ensemble members 2 and 4. It could be that something else, not connected with UMCET, is causing your jobs to freeze. I will continue to investigate.

Simon.

comment:21 Changed 10 years ago by simon

Hello again,

I left my version of your experiment running last week and over the weekend and it has now just finished December. My runid is xezsd. I think it is the same as your job. I could let it continue if the experimental set up is the same as your current one and then you can use my output data. You can use the UMUI to do a difference between my experiment and yours. If there are differences, let me know and I'll kill my job.

Simon.

comment:22 Changed 10 years ago by d.marathayil

Hi Simon,

I had a look at the difference between the two jobs. The only difference I could find is the number of processors:
Job xezsd: Entry is set to '8'
Job xfmig: Entry is set to '16'.

But when I compared the SST for the first month (Apr) from the ensemble run xezsd with the one I ran individually, they were not exactly the same, even though they have the same start dump data. Do you know why that is?
Many thanks for your help.

Deepthi

comment:23 Changed 10 years ago by simon

Hi,

This is one of the few cases in which changing the number of processors changes the results. This is due to an ocean optimisation. Without this optimisation the ocean runs 2-3 times slower and it was decided to include it as otherwise the model runs far too slowly. The optimisation is in a global sum in the ocean solver and the results change at the bit level when you change the number of processors. It has been shown that this has no effect on the science of the model.

Simon.
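
The general mechanism behind this can be illustrated outside the model: floating-point addition is not associative, so a global sum built from per-processor partial sums can change when the data is split differently. The sketch below is not the UM ocean solver code; the function names and the input values are invented for illustration, and the values are deliberately extreme so the effect is easy to see (in the model the differences are at the bit level).

```python
def naive_sum(values):
    """Plain left-to-right floating-point sum.
    An explicit loop is used because the built-in sum() in recent
    Python versions compensates for rounding error on floats."""
    total = 0.0
    for v in values:
        total += v
    return total

def partitioned_sum(values, nproc):
    """Mimic a global sum: split values into contiguous chunks
    (one per hypothetical processor), sum each chunk, then add the partials."""
    size = -(-len(values) // nproc)  # ceiling division
    partials = [naive_sum(values[i:i + size])
                for i in range(0, len(values), size)]
    return naive_sum(partials)

# Hypothetical data chosen so that rounding differs with the decomposition.
values = [1.0, 1e16, -1e16, 1.0]
for nproc in (1, 2):
    print(nproc, partitioned_sum(values, nproc))
```

The same four numbers give different totals for 1 and 2 partitions, purely because the order of the additions changes.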

comment:24 Changed 8 years ago by ros

  • Description modified (diff)
  • Resolution set to fixed
  • Status changed from accepted to closed