Opened 3 weeks ago

Last modified 2 weeks ago

#2661 new help

Submitting batch scripts via exvmsrose

Reported by: m.couldrey Owned by: um_support
Priority: normal Component: UM Model
Keywords: Cc:
Platform: Monsoon2 UM Version:

Description

My aim is to run HadGEM3 GC3.1 on the xcs-c machine, and I would like some help doing so. I’ve found a suite (u-ar766) that looks like a good starting point for what I’d like to run. Running my copy of it (u-az004) in simulation mode on exvmsrose returns no errors, but I’m not sure how to go about running the suite on xcs-c. I believe the original run was performed on ARCHER, although the suite does include options for using xcs.


When I try to run the suite on xcs-c, it fails at the submission stage. The suite generates a job script for fcm_make_drivers that includes some SBATCH directives, and the cylc GUI shows the suite running up to a submit-fail on the fcm_make_drivers job, which is run on the local host (which I assume is exvmsrose). The job-activity.log seems to suggest that the sbatch command isn’t available:


2018-10-31T11:27:00Z|18500101T0000Z/fcm_make_drivers/01|[STDERR] [Errno 2] No such file or directory: 'sbatch'
[TASK JOB SUMMARY]2018-10-31T11:27:00Z|18500101T0000Z/fcm_make_ocean/01|1|


I had a look in the site configuration file for the Cray machine (meto_cray.rc) and saw that the batch system for EXTRACT_RESOURCE is set to ‘slurm’. I wondered whether the ‘No such file or directory’ message appears because exvmsrose doesn’t use Slurm. The Met Office pages say that Monsoon2 uses a PBS batch system, but setting the EXTRACT_RESOURCE batch system in meto_cray.rc to pbs leaves me with a similar problem: the local host doesn’t seem to accept PBS commands either.


I’m not really sure how to approach this problem. I see that Rose suites should be submitted from the exvmsrose machine, but I don’t understand how batch scripts for jobs like fcm_make_drivers are meant to be submitted if exvmsrose doesn’t recognise batch system commands. Or am I completely misunderstanding what’s going on?


Many thanks for any help with this

Matt
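The log line `[Errno 2] No such file or directory: 'sbatch'` is the standard error text Python gives when it tries to run a command that is not on the PATH of the submitting host, which is consistent with exvmsrose simply having no `sbatch`. The sketch below is a hypothetical diagnostic, not part of cylc or Rose: it assumes Python 3 and that `sbatch` (Slurm) and `qsub` (PBS) are the submit commands in question, checks which of them exist on the current host, and reproduces the error text for any that are missing.

```python
import shutil
import subprocess

# Submit commands for the two batch systems discussed in this ticket.
# Assumption: the scheduler runs these by name, so they must be on the PATH
# of whichever host actually submits the job.
BATCH_COMMANDS = {
    "slurm": "sbatch",
    "pbs": "qsub",
}


def available_batch_systems(commands=BATCH_COMMANDS):
    """Return {batch_system: bool} saying whether each submit command is on PATH."""
    return {system: shutil.which(cmd) is not None for system, cmd in commands.items()}


def explain_missing(cmd):
    """For a command that is not on PATH, return the OS error text for running it.

    Returns None if the command exists, so a real sbatch/qsub is never invoked.
    """
    if shutil.which(cmd) is not None:
        return None
    try:
        subprocess.run([cmd])
    except OSError as exc:
        # For a missing binary this is "[Errno 2] No such file or directory: '<cmd>'"
        return str(exc)
    return None


if __name__ == "__main__":
    for system, found in available_batch_systems().items():
        cmd = BATCH_COMMANDS[system]
        detail = "found" if found else explain_missing(cmd)
        print(f"{system}: {cmd!r} -> {detail}")
```

Running it on exvmsrose and on an xcs login node would show which host can actually submit to which batch system.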

Change History (1)

comment:1 Changed 2 weeks ago by grenville

Matt

n02-FAFMIP is an ARCHER project code; on XCS you should use nexcs-n02. The submission also rejected 'climate' as a project (see /home/d00/macou/cylc-run/u-az004/log/job/18500101T0000Z/fcm_make_um/01/job-activity.log, for example):

[STDERR] qsub: error: [PBSInvalidProject] 'climate' is not valid for collaboration trustzone on XCS

I got round that by setting "Use default account" to true, setting the sub-project name to "Other", and setting the Other subproject name to nexcs-n02.

However, I don't know why it's using Slurm to make the drivers and run postproc (they must use Slurm at the Met Office). Look in …site/meto_cray.rc: you'll need to change all references to slurm to pbs and use the appropriate PBS directives. It might have been better to choose a job with a MONSooN.rc site file.

Grenville
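Grenville's suggested fix, switching the Slurm references in meto_cray.rc to PBS, can be partly scripted. The sketch below is hypothetical and assumes Cylc 7 suite.rc syntax, with `batch system = slurm` under each `[[[job]]]` section and Slurm directives written as `--option = value`; check the real site file before relying on it. It rewrites the batch system lines and only lists the Slurm directives, because choosing equivalent PBS directives still has to be done by hand.

```python
import re
import sys

# Assumption: Cylc 7 syntax, i.e. "batch system = slurm" inside [[[job]]]
# sections and Slurm directives such as "--partition = serial" inside
# [[[directives]]] sections.
BATCH_LINE = re.compile(r"^(\s*batch system\s*=\s*)slurm\s*$")
SLURM_DIRECTIVE = re.compile(r"^\s*--[\w-]+\s*=")


def convert_site_file(text):
    """Switch 'batch system = slurm' to 'pbs'; return (new_text, directives_to_review)."""
    out_lines = []
    to_review = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = BATCH_LINE.match(line)
        if match:
            # Keep the original indentation and spacing, replace only the value.
            line = match.group(1) + "pbs"
        elif SLURM_DIRECTIVE.match(line):
            # No automatic mapping: the PBS equivalent must be chosen by hand.
            to_review.append((lineno, line.strip()))
        out_lines.append(line)
    return "\n".join(out_lines) + "\n", to_review


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as fh:
        new_text, review = convert_site_file(fh.read())
    sys.stdout.write(new_text)
    for lineno, directive in review:
        sys.stderr.write(f"line {lineno}: needs a PBS equivalent: {directive}\n")
```

For example, `python convert_site.py meto_cray.rc > meto_cray_pbs.rc` (script name hypothetical) leaves the original file untouched and reports on stderr each Slurm directive that still needs a PBS replacement.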
