Opened 5 months ago

Closed 4 months ago

#2460 closed help (fixed)

Porting MONSooN suite to Archer

Reported by: admg26
Owned by: luke
Priority: normal
Component: UKCA
Keywords: MONSooN, Archer, porting
Cc:
Platform: ARCHER
UM Version: 10.9

Description

Hello,

I am running a UKCA job on MONSooN which I would like to port to Archer and was wondering if it had been done already for my configuration.

Here are the details:

u-av541

GA7.1 N96 UM10.9 Atmosphere only run.

I have found the document on porting jobs to Archer and I am in the process of trying this.

Cheers,
Alison

Change History (32)

comment:1 Changed 5 months ago by admg26

Hello,

I am copying my emission files over. I need a chemistry init file and the SST/sea-ice ancil files that are located on the Met Office computer:

/projects/ukca-admin/inputs/initial/N96eL85/aj670a.da20080901_00
/projects/ukca-admin/inputs/ancil/n96e/sstice/sice_clim_1996-2005_360d.n96e
/projects/ukca-admin/inputs/ancil/n96e/sstice/sst_clim_1996-2005_360d.n96e

I was wondering if these had already been copied over to Archer please. I had a look in /work/y07/y07/umshared/ but cannot see them.

Cheers,
Alison

Last edited 5 months ago by admg26

comment:2 Changed 5 months ago by luke

  • Owner changed from um_support to luke
  • Status changed from new to accepted

These don't seem to have been copied to the UKCA space on ARCHER.

I'm copying these across now, and I'll let you know the ARCHER paths when they are in place.

Thanks,
Luke

comment:3 Changed 5 months ago by admg26

Hi,

Thanks! I am digging around and realise I also need the initial dump file from the Met Office:

/projects/ocean/hadgem3/initial/atmos/N96L85/ab642a.da19880901_00

Cheers,
Alison

comment:4 Changed 5 months ago by ros

Hi Alison,

The atmos dump is on ARCHER under /work/y07/y07/umshared/hadgem3/initial/atmos/N96L85.

Cheers,
Ros.

comment:5 Changed 5 months ago by luke

Hi Alison,

The dump file is available here:

/work/n02/n02/ukca/initial/N96eL85/aj670a.da20080901_00

with the SST and sea-ice files here:

/work/n02/n02/ukca/ancil/n96e/sstice/

Thanks,
Luke

comment:6 Changed 5 months ago by admg26

Hi Ros,

I am having a generic archer login problem. fcm_make_um fails with:

...
[FAIL] [WARN] login8.archer.ac.uk: (ssh failed)
[FAIL] [FAIL] No hosts selected.
Received signal ERR

ssh-agent is running and I can ssh into archer without a password.

I have run

~um/um-training/setup-archer-hosts 

The following works:

admg26@puma:/home/admg26/roses> rose host-select archer
login1.archer.ac.uk

I am confused.

Cheers,
Alison

comment:7 Changed 5 months ago by grenville

Alison - as a stopgap, try changing

host = $(rose host-select archer)

to

host = login1.archer.ac.uk

in archer.rc

Grenville

comment:8 Changed 5 months ago by admg26

Hi,

login1.archer.ac.uk worked, but it has now failed with "disk quota exceeded". This is probably because I am using

export SCRATCH=/export/puma/data-01/training/$USER

Could I get a /export/puma/data-01/admg26 directory please?
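For reference, once a personal data directory exists, the training-area setting above could simply be pointed at it; a minimal sketch, assuming the directory requested above is created with the usual $USER naming:

export SCRATCH=/export/puma/data-01/$USER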

Alison

comment:9 Changed 5 months ago by admg26

Actually, I am not sure which disk quota I am exceeding. It is failing to find space to tar a log file.

comment:10 Changed 5 months ago by admg26

Ah it is my home directory. Please ignore my stream of consciousness.

comment:11 Changed 5 months ago by admg26

Hello,

Suite: u-ax600

Reconfiguration is currently failing with

[FAIL] namelist:items(51075233)=ancilfilename: CHEM_INIT_FILE: unbound variable
[FAIL] namelist:run_ukca=ukca_em_files: CMIP6_CHEM_EMS: unbound variable

Both of these variables are set in my suite's site/archer.rc file.

Cheers,
Alison

comment:12 Changed 5 months ago by grenville

Alison

You need to have environment variables - something like

[environment?]

PLATFORM = cce
UMDIR = /work/y07/y07/umshared
CHEM_INIT_FILE = CHEM_INIT_FILE
CMIP6_CHEM_EMS = CMIP6_CHEM_EMS

in archer.rc, so that $CHEM_INIT_FILE is resolved in /app/um/rose-app.conf, but why not just specify CHEM_INIT_FILE in rose-app.conf?
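For illustration, a minimal sketch of how such a block might sit in site/archer.rc, assuming the usual Cylc runtime layout (the HPC family name and its nesting are placeholders; the CHEM_INIT_FILE path is the ARCHER location given in comment 5, and the emissions path is made up):

    [[HPC]]
        [[[environment]]]
            PLATFORM = cce
            UMDIR = /work/y07/y07/umshared
            CHEM_INIT_FILE = /work/n02/n02/ukca/initial/N96eL85/aj670a.da20080901_00
            CMIP6_CHEM_EMS = /path/to/cmip6/emissions   # placeholder - point at the copied emission files

Defined at this level the variables are inherited by the um tasks, so $CHEM_INIT_FILE and $CMIP6_CHEM_EMS in app/um/rose-app.conf resolve at run time on either machine.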

Grenville

comment:13 Changed 5 months ago by admg26

Hi,

Thanks Grenville. I've added them to the environment.

"why not just specify CHEM_INIT_FILE in rose-app.conf?"

I guess I would like to be able to move the suite back to MONSOON and have it still work.

Recon failed with the error

Error Message:-  No Field Calculations specified for section 34

For some reason, Section 34, Item 75 (DMSO mass mixing ratio after TSTEP) appeared in um > Reconfiguration and Anc.. > Configure ancils and .. with Source 4 (Initialise field via recon calculation routines). I've ignored this entry, since it was not in my original suite on MONSooN.

Recon now works.

atmos_main is currently failing with

_pmiu_daemon(SIGCHLD): [NID 02950] [c7-1c1s1n2] [Tue May 15 09:00:21 2018] PE RANK 195 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 02953] [c7-1c1s2n1] [Tue May 15 09:00:21 2018] PE RANK 281 exit signal Aborted
[NID 02952] 2018-05-15 10:00:21 Apid 30858921: initiated application termination
[FAIL] um-atmos # return-code=137
Received signal ERR
cylc (scheduler - 2018-05-15T09:00:26Z): CRITICAL Task job script received signal ERR at 2018-05-15T09:00:26Z
cylc (scheduler - 2018-05-15T09:00:26Z): CRITICAL failed at 2018-05-15T09:00:26Z

Cheers,
Alison

comment:14 Changed 5 months ago by grenville

Alison

Failing this quickly usually indicates a problem with the start data or an ancillary file - please switch on extra diagnostic messages & run again:

in um → namelist → IO System Settings → Print Manager XControl
set
prnt_writers to "All tasks write"
prnt_force_flush to "true"

in um → env → Runtime Controls → Atmosphere only
set PRINT_STATUS to "Extra diagnostic messages"
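If it is easier to edit files than to click through rose edit, the PRINT_STATUS change corresponds to an environment setting in the um app; a sketch, assuming the usual rose-app.conf layout (PrStatus_Diag is the value normally behind the "Extra diagnostic messages" label, but check the metadata for this UM version):

# app/um/rose-app.conf
[env]
PRINT_STATUS=PrStatus_Diag

The two print-manager items are namelist settings and are most easily changed through the panel above.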

Grenville

comment:15 Changed 5 months ago by admg26

Hello,

I turned on the extra diagnostic messages and the suite has just failed again, with this in job.out:

conv CALL 1 has 7900 convecting points
--------------------------------------------------------------------------------

Resources requested: ncpus=432,place=free,walltime=02:40:00
Resources allocated: cpupercent=0,cput=00:00:15,mem=41880kb,ncpus=432,vmem=327440kb,walltime=00:00:45

*** admg26   Job: 5336020.sdb   ended: 17/05/18 10:37:20   queue: standard ***
*** admg26   Job: 5336020.sdb   ended: 17/05/18 10:37:20   queue: standard ***
*** admg26   Job: 5336020.sdb   ended: 17/05/18 10:37:20   queue: standard ***
*** admg26   Job: 5336020.sdb   ended: 17/05/18 10:37:20   queue: standard ***
--------------------------------------------------------------------------------

Further details in /home/admg26/cylc-run/u-ax600

Could this be a problem with me accessing resources on Archer?

I am running in the standard queue with the isolice account group.

Cheers,
Alison

comment:16 Changed 5 months ago by grenville

Alison

Does this run without your photolysis branch? It looks like there is a memory management problem.

Grenville

comment:17 Changed 5 months ago by grenville

Alison

Please add

ATP_ENABLED = 1

in the environment section (same as in comment 13 above) and rerun.
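For context, ATP is Cray's Abnormal Termination Processing: with ATP_ENABLED=1 the runtime prints a stack walkback for the failing ranks when the application aborts, which shows where the crash originates. A sketch of where it could sit, assuming the same environment block as sketched under comment 12:

    [[[environment]]]
        ATP_ENABLED = 1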

Grenville

comment:18 Changed 5 months ago by admg26

Hi,

I am just waiting to see if it runs without photolysis. Will then add the ATP_ENABLED and run again.

Cheers,
Alison

comment:19 Changed 5 months ago by admg26

Hi,

It still crashes with photolysis off. I turned it back on to FAST-JX and added ATP_ENABLED = 1 to archer.rc. This is the last part of job.err

Application 30903188 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 49 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  main@um_main.F90:20
  main@um_main.F90:20
  um_shell_@um_shell.F90:652
  u_model_4a_@u_model_4A.F90:370
  atm_step_4a_@atm_step_4A.F90:3135
  atmos_physics2_@atmos_physics2.F90:3917
  ni_conv_ctl$ni_conv_ctl_mod_@ni_conv_ctl.F90:2164
  _cray$mt_execute_parallel_with_proc_bind@0x2f487f1
  _cray$mt_start_one_code_parallel(int, omp_proc_bind_t, void (*)(), void*, long*, long)@0x2f47109
  ni_conv_ctl$ni_conv_ctl_mod__cray$mt$p0012@ni_conv_ctl.F90:2291
  glue_conv_6a$glue_conv_6a_mod_@glue_conv-6a.F90:4865
  _DEALLOC_POLYMORPHIC.part.0@0x2d03c73
  dealloc_cpnts@0x2d03e51
  _DEALLOC@0x2cf85fd
  free@0x332af10
  (anonymous namespace)::InvalidFree(void*)@0x3286f4d
  TCMalloc_CrashReporter::PrintfAndDie(char const*, ...)@0x328a390
  TCMalloc_CRASH_internal(bool, char const*, int, char const*, __va_list_tag*)@0x328a0f5
  abort@abort.c:92
  raise@pt-raise.c:42
ATP Stack walkback for Rank 49 done
Process died with signal 6: 'Aborted'
Forcing core dumps of ranks 49, 0
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

_pmiu_daemon(SIGCHLD): [NID 02101] [c2-1c2s13n1] [Fri May 18 16:11:09 2018] PE RANK 274 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 02137] [c3-1c0s6n1] [Fri May 18 16:11:09 2018] PE RANK 386 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 02014] [c2-1c1s7n2] [Fri May 18 16:11:09 2018] PE RANK 144 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 01887] [c1-1c2s7n3] [Fri May 18 16:11:09 2018] PE RANK 72 exit signal Killed
[NID 02101] 2018-05-18 17:11:09 Apid 30903188: initiated application termination
[FAIL] um-atmos # return-code=137
Received signal ERR
cylc (scheduler - 2018-05-18T16:11:15Z): CRITICAL Task job script received signal ERR at 2018-05-18T16:11:15Z
cylc (scheduler - 2018-05-18T16:11:15Z): CRITICAL failed at 2018-05-18T16:11:15Z

comment:20 Changed 5 months ago by grenville

Hi Alison

That's more useful - I'll try to run it.

Can you confirm that this exact model ran OK on Monsoon?

Grenville

comment:21 Changed 5 months ago by admg26

Hi,

Yes. The suite is a copy of u-av541 which runs fine on Monsoon. Both suites are making use of the same code at

branches/dev/alisonming/vn10.9_cl2_photolysis@HEAD 

comment:22 Changed 5 months ago by grenville

Have you got the cylc-run directory for the successful run - or can you point me to it?

comment:23 Changed 5 months ago by admg26

On Monsoon:

/home/d01/almin/cylc-run/u-av541/

Cheers,
Alison

comment:24 Changed 5 months ago by jeff

Hi Alison

I've been looking at your DEALLOC problem. It looks like the crash happens when the code tries to deallocate a zero-sized allocated object. As far as I know this shouldn't be a problem and is perfectly legal Fortran, but there may be an issue with the particular Cray compiler version being used. As a workaround I have modified the routine glue_conv-6a.F90; see this working copy on puma, /home/jeff/um/vn10.9_debug. The fcm diff from this is:

puma:jeff$ fcm diff
Index: atmosphere/convection/glue_conv-6a.F90
===================================================================
--- atmosphere/convection/glue_conv-6a.F90      (revision 55019)
+++ atmosphere/convection/glue_conv-6a.F90      (working copy)
@@ -1809,7 +1809,9 @@
 ! with different sizes for deep, shallow and mid.
 ! Note: this is allocated in all runs since the calls to convection
 ! routines below require the outer object to be indexed
-ALLOCATE( scm_convss_dg_c( MAX(n_dp, n_sh, n_md) ) )
+IF (MAX(n_dp, n_sh, n_md) > 0) THEN
+  ALLOCATE( scm_convss_dg_c( MAX(n_dp, n_sh, n_md) ) )
+END IF
 
 IF (l_scm_convss_dg) THEN
 
@@ -4860,7 +4862,9 @@
 END IF
 
 ! Deallocate the array of structures
-DEALLOCATE( scm_convss_dg_c )
+IF (MAX(n_dp, n_sh, n_md) > 0) THEN
+  DEALLOCATE( scm_convss_dg_c )
+END IF
 
     !-----------------------------------------------------------------------

Either create a new branch with this in or add it to your branch.
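For anyone hitting the same thing, here is a minimal standalone sketch (not UM code; the type and variable names are invented) of the pattern involved - allocating and then deallocating a zero-sized array of structures is standard-conforming Fortran, which is why this looks like a compiler/runtime problem rather than a model bug:

PROGRAM zero_size_dealloc
  IMPLICIT NONE
  ! Stand-in for the scm_convss_dg_c array of structures in glue_conv-6a.F90
  TYPE :: diag_type
    REAL, ALLOCATABLE :: vals(:)
  END TYPE diag_type
  TYPE(diag_type), ALLOCATABLE :: dg(:)

  ALLOCATE( dg(0) )    ! zero-sized, but still a valid allocated object
  PRINT *, 'allocated:', ALLOCATED(dg), ' size:', SIZE(dg)
  DEALLOCATE( dg )     ! legal Fortran; the model's abort is reported on the equivalent DEALLOCATE
END PROGRAM zero_size_dealloc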

Jeff.

comment:25 Changed 5 months ago by admg26

Hi Jeff,

Thank you for the fix. I have made the change in my branch. Will just check that things run.

Cheers,
Alison

comment:26 Changed 5 months ago by admg26

Hello,

I am having some disk quota issues. I have deleted my log files. Would it be possible to have slightly more quota in my home directory please?

Cheers,
Alison

comment:27 Changed 5 months ago by ros

Hi Alison,

I assume this is on PUMA and have increased your quota.

Cheers,
Ros.

comment:28 Changed 4 months ago by admg26

Hi,

Thank you for the increased quota. I thought it was on Puma too, but I think I am wrong. Where is my pe_output being written to? I cleared the cylc directory on both puma and archer and re-ran. I am still getting this error message:

sys-122 : UNRECOVERABLE error on system request 
  Disk quota exceeded

Encountered during an I/O operation on unit 6
Fortran unit 6 is connected to a sequential formatted text file:
  "pe_output/ax600.fort6.pe047"
Application 31043683 is crashing. ATP analysis proceeding...

Cheers,
Alison

comment:29 Changed 4 months ago by ros

Hi Alison,

I've increased your ARCHER quota.

Cheers,
Ros.

comment:30 Changed 4 months ago by admg26

Hi,

That seems to have done the trick! It is running!

Cheers,
Alison

comment:31 Changed 4 months ago by admg26

Thank you for all the help. The model appears to be running okay on Archer. I am happy for this thread to be closed.

Cheers,
Alison

Last edited 4 months ago by admg26

comment:32 Changed 4 months ago by luke

  • Resolution set to fixed
  • Status changed from accepted to closed

Many thanks Alison. Please open a new ticket for any further issues.

Best wishes,
Luke
