Opened 3 years ago
Closed 3 years ago
#2460 closed help (fixed)
Porting MONSooN suite to Archer
| Reported by: | admg26 | Owned by: | luke |
|---|---|---|---|
| Component: | UKCA | Keywords: | MONSooN, Archer, porting |
| Cc: | | Platform: | ARCHER |
| UM Version: | 10.9 | | |
Description
Hello,
I am running a UKCA job on MONSooN which I would like to port to Archer and was wondering if it had been done already for my configuration.
Here are the details:
u-av541
GA7.1 N96 UM10.9 Atmosphere only run.
I have found the document on porting jobs to Archer and I am in the process of trying this.
Cheers,
Alison
Change History (32)
comment:1 Changed 3 years ago by admg26
comment:2 Changed 3 years ago by luke
- Owner changed from um_support to luke
- Status changed from new to accepted
These don't seem to have been copied to the UKCA space on ARCHER.
I'm copying these across now, and I'll let you know the ARCHER paths when they are in place.
Thanks,
Luke
comment:3 Changed 3 years ago by admg26
Hi,
Thanks! I am digging around and realise I also need the initial dump file from the Met Office:
/projects/ocean/hadgem3/initial/atmos/N96L85/ab642a.da19880901_00
Cheers,
Alison
comment:4 Changed 3 years ago by ros
Hi Alison,
The atmos dump is on ARCHER under /work/y07/y07/umshared/hadgem3/initial/atmos/N96L85.
Cheers,
Ros.
comment:5 Changed 3 years ago by luke
Hi Alison,
The dump file is available here:
/work/n02/n02/ukca/initial/N96eL85/aj670a.da20080901_00
with the SST and sea-ice files here:
/work/n02/n02/ukca/ancil/n96e/sstice/
Thanks,
Luke
comment:6 Changed 3 years ago by admg26
Hi Ros,
I am having a generic archer login problem. fcm_make_um fails with:
...
[FAIL] [WARN] login8.archer.ac.uk: (ssh failed)
[FAIL]
[FAIL] No hosts selected.
Received signal ERR
ssh-agent is running and I can ssh into archer without a password.
I have run
~um/um-training/setup-archer-hosts
The following works:
admg26@puma:/home/admg26/roses> rose host-select archer
login1.archer.ac.uk
I am confused.
Cheers,
Alison
comment:7 Changed 3 years ago by grenville
Alison - as a stop gap, try changing
host = $(rose host-select archer)
to
host = login1.archer.ac.uk
in archer.rc
Grenville
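The stop gap above hard-codes one login node. As a rough sketch of the same idea in plain shell, a script could try the automatic selection first and fall back to a fixed host only if that fails. The function name `select_host` is hypothetical and not part of Rose or the suite; the sketch assumes only that the selection command prints a host name on success.

```shell
# Hypothetical helper (not part of Rose): run a host-selection command and
# fall back to a fixed host if the command fails or prints nothing.
select_host() {
    fallback="$1"
    shift
    # "$@" is the selection command, e.g.: rose host-select archer
    if host="$("$@" 2>/dev/null)" && [ -n "$host" ]; then
        printf '%s\n' "$host"
    else
        printf '%s\n' "$fallback"
    fi
}

# Usage (assumes rose is on PATH, as it is on PUMA):
#   select_host login1.archer.ac.uk rose host-select archer
```

The fallback keeps the task submittable when selection fails, at the cost of losing load balancing across login nodes, so it is best treated as temporary.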
comment:8 Changed 3 years ago by admg26
Hi,
Login1.archer.ac.uk worked but it has now failed with disk quota exceeded. This is probably because I am using
export SCRATCH=/export/puma/data-01/training/$USER
Could I get a /export/puma/data-01/admg26 directory please?
Alison
comment:9 Changed 3 years ago by admg26
Actually, I am not sure which disk quota I am exceeding. It is failing to find space to tar a log file.
comment:10 Changed 3 years ago by admg26
Ah it is my home directory. Please ignore my stream of consciousness.
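When it is unclear which directory is filling a quota, listing the largest entries under a suspect directory can narrow it down. A minimal sketch using the standard `du`, `sort` and `tail` tools; the function name `largest_entries` is illustrative, and the output shows disk usage only, not the quota limits themselves, which are reported by site-specific tools.

```shell
# Illustrative helper: list the N largest entries (sizes in KiB) directly
# under a directory, with the biggest entry printed last.
largest_entries() {
    dir="${1:-.}"
    count="${2:-5}"
    du -sk "$dir"/* 2>/dev/null | sort -n | tail -n "$count"
}

# Example:
#   largest_entries "$HOME/cylc-run" 5
```

Note that the `*` glob skips hidden files and directories, so a large dot-directory would need to be checked separately.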
comment:11 Changed 3 years ago by admg26
Hello,
Suite: u-ax600
Reconfiguration is currently failing with
[FAIL] namelist:items(51075233)=ancilfilename: CHEM_INIT_FILE: unbound variable
[FAIL] namelist:run_ukca=ukca_em_files: CMIP6_CHEM_EMS: unbound variable
Both of these variables are set in my suite's site/archer.rc file.
Cheers,
Alison
comment:12 Changed 3 years ago by grenville
Alison
You need to have environment variables - something like
PLATFORM = cce
UMDIR = /work/y07/y07/umshared
CHEM_INIT_FILE = CHEM_INIT_FILE
CMIP6_CHEM_EMS = CMIP6_CHEM_EMS
in archer.rc, so that $CHEM_INIT_FILE is resolved in /app/um/rose-app.conf, but why not just specify CHEM_INIT_FILE in rose-app.conf?
Grenville
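The "unbound variable" failure can be imitated in plain shell. This is an analogy only, since Rose does its own substitution when it processes rose-app.conf, but the principle is the same: a reference such as $CHEM_INIT_FILE can only be resolved if the variable exists in the environment of the task doing the resolving. The function name and the file path below are made-up placeholders.

```shell
# Analogy only: try to resolve $CHEM_INIT_FILE the way a strict shell would.
resolve_chem_init() {
    # ${VAR:?message} makes the subshell exit with an error if VAR is
    # unset or empty; otherwise the value is printed.
    ( printf '%s\n' "${CHEM_INIT_FILE:?unbound variable}" ) 2>/dev/null
}

unset CHEM_INIT_FILE
resolve_chem_init || echo "CHEM_INIT_FILE: unbound variable"

# Hypothetical placeholder path; substitute the real location of the file.
export CHEM_INIT_FILE=/path/to/chem_init_file
resolve_chem_init
```

Keeping the variable in the site file rather than hard-coding the path in rose-app.conf means the same app configuration can resolve to different paths on different machines.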
comment:13 Changed 3 years ago by admg26
Hi,
Thanks Grenville. I've added them to the environment.
> why not just specify CHEM_INIT_FILE in rose-app.conf?
I guess I would like to be able to move the suite back to MONSOON and have it still work.
Recon failed with the error
Error Message:- No Field Calculations specified for section 34
For some reason, Section 34, Item 75 (DMSO mass mixing ratio after TSTEP) appeared in um > Reconfiguration and Anc.. > Configure ancils and .. and had Source 4 (Initialise field via recon calculation routines). I've ignored this entry since it was not in my original suite on MONSooN.
Recon now works.
atmos_main is currently failing with
_pmiu_daemon(SIGCHLD): [NID 02950] [c7-1c1s1n2] [Tue May 15 09:00:21 2018] PE RANK 195 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 02953] [c7-1c1s2n1] [Tue May 15 09:00:21 2018] PE RANK 281 exit signal Aborted
[NID 02952] 2018-05-15 10:00:21 Apid 30858921: initiated application termination
[FAIL] um-atmos # return-code=137
Received signal ERR
cylc (scheduler - 2018-05-15T09:00:26Z): CRITICAL Task job script received signal ERR at 2018-05-15T09:00:26Z
cylc (scheduler - 2018-05-15T09:00:26Z): CRITICAL failed at 2018-05-15T09:00:26Z
Cheers,
Alison
comment:14 Changed 3 years ago by grenville
Alison
Stopping this quickly usually indicates a problem with the start data or an ancillary file. Please switch on extra diagnostic messages and run again:
in um→namelist→IO System Settings→ Print Manager XControl
set
prnt_writers to " All tasks write"
prnt_force_flush to "true"
in um→env→Runtime Controls→Atmosphere only
set PRINT_STATUS to "Extra diagnostic messages"
Grenville
comment:15 Changed 3 years ago by admg26
Hello,
I turned on the extra error messages and the suite has just failed again with this in job.out.
conv CALL 1 has 7900 convecting points
--------------------------------------------------------------------------------
Resources requested: ncpus=432,place=free,walltime=02:40:00
Resources allocated: cpupercent=0,cput=00:00:15,mem=41880kb,ncpus=432,vmem=327440kb,walltime=00:00:45
*** admg26 Job: 5336020.sdb ended: 17/05/18 10:37:20 queue: standard ***
*** admg26 Job: 5336020.sdb ended: 17/05/18 10:37:20 queue: standard ***
*** admg26 Job: 5336020.sdb ended: 17/05/18 10:37:20 queue: standard ***
*** admg26 Job: 5336020.sdb ended: 17/05/18 10:37:20 queue: standard ***
--------------------------------------------------------------------------------
Further details in /home/admg26/cylc-run/u-ax600
Could this be a problem with me accessing resources on Archer?
I am running in the standard queue with the isolice account group.
Cheers,
Alison
comment:16 Changed 3 years ago by grenville
Alison
Does this run without your photolysis branch? It looks like there is a memory management problem.
Grenville
comment:17 Changed 3 years ago by grenville
Alison
Please add
ATP_ENABLED = 1
in the environment section (same as in comment 13 above) and rerun.
Grenville
comment:18 Changed 3 years ago by admg26
Hi,
I am just waiting to see if it runs without photolysis. Will then add the ATP_ENABLED and run again.
Cheers,
Alison
comment:19 Changed 3 years ago by admg26
Hi,
It still crashes with photolysis off. I turned it back on to FAST-JX and added ATP_ENABLED = 1 to archer.rc. This is the last part of job.err
Application 30903188 is crashing. ATP analysis proceeding...
ATP Stack walkback for Rank 49 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  main@um_main.F90:20
  main@um_main.F90:20
  um_shell_@um_shell.F90:652
  u_model_4a_@u_model_4A.F90:370
  atm_step_4a_@atm_step_4A.F90:3135
  atmos_physics2_@atmos_physics2.F90:3917
  ni_conv_ctl$ni_conv_ctl_mod_@ni_conv_ctl.F90:2164
  _cray$mt_execute_parallel_with_proc_bind@0x2f487f1
  _cray$mt_start_one_code_parallel(int, omp_proc_bind_t, void (*)(), void*, long*, long)@0x2f47109
  ni_conv_ctl$ni_conv_ctl_mod__cray$mt$p0012@ni_conv_ctl.F90:2291
  glue_conv_6a$glue_conv_6a_mod_@glue_conv-6a.F90:4865
  _DEALLOC_POLYMORPHIC.part.0@0x2d03c73
  dealloc_cpnts@0x2d03e51
  _DEALLOC@0x2cf85fd
  free@0x332af10
  (anonymous namespace)::InvalidFree(void*)@0x3286f4d
  TCMalloc_CrashReporter::PrintfAndDie(char const*, ...)@0x328a390
  TCMalloc_CRASH_internal(bool, char const*, int, char const*, __va_list_tag*)@0x328a0f5
  abort@abort.c:92
  raise@pt-raise.c:42
ATP Stack walkback for Rank 49 done
Process died with signal 6: 'Aborted'
Forcing core dumps of ranks 49, 0
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat
_pmiu_daemon(SIGCHLD): [NID 02101] [c2-1c2s13n1] [Fri May 18 16:11:09 2018] PE RANK 274 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 02137] [c3-1c0s6n1] [Fri May 18 16:11:09 2018] PE RANK 386 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 02014] [c2-1c1s7n2] [Fri May 18 16:11:09 2018] PE RANK 144 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01887] [c1-1c2s7n3] [Fri May 18 16:11:09 2018] PE RANK 72 exit signal Killed
[NID 02101] 2018-05-18 17:11:09 Apid 30903188: initiated application termination
[FAIL] um-atmos # return-code=137
Received signal ERR
cylc (scheduler - 2018-05-18T16:11:15Z): CRITICAL Task job script received signal ERR at 2018-05-18T16:11:15Z
cylc (scheduler - 2018-05-18T16:11:15Z): CRITICAL failed at 2018-05-18T16:11:15Z
comment:20 Changed 3 years ago by grenville
Hi Alison
That's more useful - I'll try to run it.
Can you confirm that this exact model ran OK on Monsoon?
Grenville
comment:21 Changed 3 years ago by admg26
Hi,
Yes. The suite is a copy of u-av541 which runs fine on Monsoon. Both suites are making use of the same code at
branches/dev/alisonming/vn10.9_cl2_photolysis@HEAD
comment:22 Changed 3 years ago by grenville
Have you got the cylc-run directory for the successful run - or can you point me to it?
comment:23 Changed 3 years ago by admg26
On Monsoon:
/home/d01/almin/cylc-run/u-av541/
Cheers,
Alison
comment:24 Changed 3 years ago by jeff
Hi Alison
I've been looking at your DEALLOC problem. It looks like the crash happens when the code tries to deallocate a zero-sized allocated object. As far as I know this shouldn't be a problem and is perfectly legal Fortran, but there may be a problem with the particular Cray compiler version being used. As a workaround I have modified the routine glue_conv-6a.F90; see this working copy on puma: /home/jeff/um/vn10.9_debug. The fcm diff from this is
puma:jeff$ fcm diff
Index: atmosphere/convection/glue_conv-6a.F90
===================================================================
--- atmosphere/convection/glue_conv-6a.F90 (revision 55019)
+++ atmosphere/convection/glue_conv-6a.F90 (working copy)
@@ -1809,7 +1809,9 @@
 ! with different sizes for deep, shallow and mid.
 ! Note: this is allocated in all runs since the calls to convection
 ! routines below require the outer object to be indexed
-ALLOCATE( scm_convss_dg_c( MAX(n_dp, n_sh, n_md) ) )
+IF (MAX(n_dp, n_sh, n_md) > 0) THEN
+  ALLOCATE( scm_convss_dg_c( MAX(n_dp, n_sh, n_md) ) )
+END IF
 IF (l_scm_convss_dg) THEN
@@ -4860,7 +4862,9 @@
 END IF
 ! Deallocate the array of structures
-DEALLOCATE( scm_convss_dg_c )
+IF (MAX(n_dp, n_sh, n_md) > 0) THEN
+  DEALLOCATE( scm_convss_dg_c )
+END IF
 !-----------------------------------------------------------------------
Either create a new branch with this in or add it to your branch.
Jeff.
comment:25 Changed 3 years ago by admg26
Hi Jeff,
Thank you for the fix. I have made the change in my branch. Will just check that things run.
Cheers,
Alison
comment:26 Changed 3 years ago by admg26
Hello,
I am having some disk quota issues. I have deleted my log files. Would it be possible to have slightly more quota in my home directory please?
Cheers,
Alison
comment:27 Changed 3 years ago by ros
Hi Alison,
I assume this is on PUMA and have increased your quota.
Cheers,
Ros.
comment:28 Changed 3 years ago by admg26
Hi,
Thank you for the increased quota. I thought it was on Puma too, but I think I was wrong. Where is my pe_output being written to? I cleared the cylc directory on both puma and archer and re-ran. I am still getting this error message:
sys-122 : UNRECOVERABLE error on system request
Disk quota exceeded
Encountered during an I/O operation on unit 6
Fortran unit 6 is connected to a sequential formatted text file: "pe_output/ax600.fort6.pe047"
Application 31043683 is crashing. ATP analysis proceeding...
Cheers,
Alison
comment:29 Changed 3 years ago by ros
Hi Alison,
I've increased your ARCHER quota.
Cheers,
Ros.
comment:30 Changed 3 years ago by admg26
Hi,
That seems to have done the trick! It is running!
Cheers,
Alison
comment:31 Changed 3 years ago by admg26
Thank you for all the help. The model appears to be running okay on Archer. I am happy for this thread to be closed.
Cheers,
Alison
comment:32 Changed 3 years ago by luke
- Resolution set to fixed
- Status changed from accepted to closed
Many thanks Alison. Please open a new ticket for any further issues.
Best wishes,
Luke
Hello,
I am copying my emission files over. I need a chemistry init file and an sst ancil file that are located on the Met Office computer:
/projects/ukca-admin/inputs/initial/N96eL85/aj670a.da20080901_00
/projects/ukca-admin/inputs/ancil/n96e/sstice/sice_clim_1996-2005_360d.n96e and sst_clim_1996-2005_360d.n96e
I was wondering if these had already been copied over to Archer please. I had a look in /work/y07/y07/umshared/ but cannot see them.
Cheers,
Alison