Opened 3 months ago

Last modified 2 months ago

#3447 assigned help

KPP on Archer2

Reported by: EmmaHoward
Owned by: annette
Component: UM Model
Keywords: Archer2, KPP
Cc: ros
Platform: ARCHER2
UM Version: 11.4

Description

Hi Annette,

Hope you’re doing well and had a lovely break! I’m looking at moving the Terramaris KPP simulations over to Archer2. Is CMS planning to install KPP there? If so, can you keep me updated on the process? I’ve followed the instructions to copy most of the suite, and the atmosphere installs okay, but I think I need a new makefile for KPP.

Also, has eccodes or grib-api been installed on Archer2 yet?

I'm also having trouble running my 2km suite on Archer - I'll submit a different ticket about that, hope that's okay.

Best wishes,
Emma

Change History (13)

comment:1 Changed 3 months ago by annette

  • Owner changed from um_support to annette
  • Status changed from new to assigned

Hi Emma,

I have run the KPP code on Archer2 but not from within a Rose suite. I will have a look at that next week and then I can give you some instructions.

Best wishes,

Annette

comment:2 Changed 3 months ago by annette

Hi Emma,

This is taking a bit longer than I thought, unfortunately, but I will update you later in the week.

Glad you are making progress with the other issue.

Annette

comment:3 Changed 3 months ago by EmmaHoward

Hi Annette,

No worries, I appreciate it's probably quite a nontrivial task.
Thanks for your help with this!

Emma

comment:4 Changed 2 months ago by EmmaHoward

Hi Annette,

Hope you're doing well. Have you had any luck with this?

Best,
Emma

comment:5 Changed 2 months ago by annette

Hi Emma,

Sorry this took a little while to sort out, plus Archer had to update the NCO tools package for us.

I have ported my regional India suite (u-bu108). You can see the differences here:

https://code.metoffice.gov.uk/trac/roses-u/changeset/186678/

Hopefully you can use this as a guide to port your suite, but let me know if you have any questions or issues.

Also eccodes is installed on Archer2.

Best wishes,

Annette

comment:6 Changed 2 months ago by EmmaHoward

Hi Annette,

That's brilliant, thank you! I'll give it a shot and let you know if I run into any problems.

Best,
Emma

comment:7 Changed 2 months ago by EmmaHoward

Hi Annette,

Those changes seem to run fine and I've got the job running and outputting correctly, except cylc doesn't seem to be able to track the coupled task and reports it as 'submit-failed' or 'failed' when it's running fine over on Archer. I think the problem is that it's tracking the parent job ID rather than the child heterogeneous ones.

I'm using version 6.11.4 of cylc on PUMA, whereas the following links suggest that more recent versions of cylc can handle this. Does pumatest have a newer version of cylc, and do you think requesting and using a pumatest account might fix this issue?

https://cylc.discourse.group/t/slurm-heterogeneous-job-support-in-cylc/290/5
https://github.com/cylc/cylc-flow/issues/3964

Best wishes,
Emma

comment:8 Changed 2 months ago by ros

Hi Emma,

The cylc-7 fix for hetjobs was back-ported to cylc-6.11.4. Can you give me your suite id please and I'll take a look.

Cheers,
Ros.

comment:9 Changed 2 months ago by EmmaHoward

Hi Ros,

My suite ids are u-cb263 and u-cc339. (I've marked the task as complete in the last run-through of the former to try the next tasks so it may be easier to look at the latter).

Thanks!
Emma

comment:10 Changed 2 months ago by ros

  • Cc ros added

Hi Emma,

Sorry, somehow in the ARCHER2 fog I managed to miss out the polling routine modification so it wasn't filtering out the job ID extensions for the heterogeneous jobs (ID+0, ID+1 etc.) in job poll (query) output. :-(
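To illustrate the kind of filtering involved, here is a minimal Python sketch. It is not the actual cylc code: the function names and the sample poll output are hypothetical, and it assumes only that heterogeneous job components show up in the poll output with a `+N` suffix on the job ID, as described above.

```python
import re

# Matches the "+0", "+1", ... extension on a heterogeneous job component ID.
HET_SUFFIX = re.compile(r"\+\d+$")


def base_job_id(job_id):
    """Return the parent job ID with any het-job extension removed."""
    return HET_SUFFIX.sub("", job_id.strip())


def job_in_poll_output(job_id, poll_output):
    """Return True if job_id (or one of its het components) is listed.

    Assumes the job ID is the first whitespace-separated field on each line.
    """
    for line in poll_output.splitlines():
        fields = line.split()
        if fields and base_job_id(fields[0]) == job_id:
            return True
    return False


# Hypothetical poll output for a heterogeneous job with two components.
sample = "JOBID PARTITION\n1234+0 standard\n1234+1 standard\n"
print(job_in_poll_output("1234", sample))
```

Without the suffix stripping, a lookup for the parent ID `1234` would not match `1234+0` or `1234+1`, so the job would appear to be missing from the queue.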

It seems to be working ok for me now - I'm still stuck in the queue but cylc is getting the submitted status right and not indicating failure. Before I copy it out live can you double check it works ok for you please? You just need to change the job submission method in site/ncas-cray-ex/suite-adds.rc from method = slurm to method = slurm_ros.

Cheers,
Ros.

comment:11 Changed 2 months ago by EmmaHoward

Hi Ros,

I've resubmitted using slurm_ros and that seems to have worked: my job is also still in the queue, but the status remains "submitted" when I poll it, whereas before it changed to "submit-failed" on polling.

Thanks for this!
Emma

comment:12 Changed 2 months ago by EmmaHoward

Hi Ros,

I submitted a job with a 3 hour wall-clock and it ran successfully to completion in 1 hour 40 minutes according to

/work/n02/n02/emmah/cylc-run/u-cc339/work/20151101T0000Z/tm_ra2t_um_fcst1/pe_output/tm.fort6.pe.stdout.

However, the exit signal doesn't seem to have gone through correctly and the log files indicate that the job timed out after 3 hours.

Is this slurm related or something else?

Emma

comment:13 Changed 2 months ago by ros

Hi Emma,

Thanks for letting me know the cylc changes worked. I've now copied them out live so you can go back to using method = slurm when convenient.

As for the exit signal… looking at the cylc log files for that task it looks like SLURM has exited badly.

I would suggest resubmitting it to see whether it works now, following the maintenance work yesterday.

Regards,
Ros.
