Opened 6 weeks ago

Closed 2 days ago

#2600 closed help (answered)

Can't run after extracting from puma

Reported by: ChrisWells Owned by: willie
Priority: normal Component: UM Model
Keywords: OASIS, NEMO, CICE Cc:
Platform: Monsoon2 UM Version: 8.2

Description

Hi,

Related to http://cms.ncas.ac.uk/ticket/2599

The submission extracted the files over to xcs, but then, in
/home/d00/chwel/output/xodta000.xodta.d18248.t151944.comp.leave ,
errors begin at line 28852:

"
/include -INetCDFmodule -I/projects/um1/grib_api/cce-8.3.4/1.13.0/include -c /projects/ukca-imp/chwel/xodta/umatmos/ppsrc/UM/control/mpp/sterr_mod.f90

ftn-855 crayftn: ERROR IOS_MPI_ERROR_HANDLERS, File = ../../../../../projects/ukca-imp/chwel/xodta/umatmos/ppsrc/UM/io_services/common/ios_mpi_error_handlers.f90, Line = 22, Column = 8

The compiler has detected errors in module "IOS_MPI_ERROR_HANDLERS". No module information file will be created for this module.

"

The file
projects/ukca-imp/chwel/xodta/umatmos/ppsrc/UM/io_services/common/ios_mpi_error_handlers.f90
line 22 is

"
MODULE ios_mpi_error_handlers
"

so presumably column 8 is the i in ios, indicating that this module is the problem?
I'm unsure what this is about - do you know how I can get around this?
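
A quick check of the column arithmetic, as a minimal sketch. It assumes the compiler counts columns from 1 and that line 22 has no leading whitespace, neither of which is confirmed in this ticket:

```python
# Line 22 of ios_mpi_error_handlers.f90 as quoted above.
line = "MODULE ios_mpi_error_handlers"

# Column reported by crayftn; assumed here to be 1-based.
column = 8

# Convert the 1-based column to a 0-based string index.
char_at_column = line[column - 1]
print(char_at_column)
```

Under those assumptions the reported column points at the start of the module name, so the message flags the module as a whole rather than a particular statement inside it.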

Many thanks,
Chris

Attachments (1)

OASIS_conf (3.6 KB) - added by willie 8 days ago.


Change History (26)

comment:1 Changed 6 weeks ago by willie

  • Platform set to Monsoon2
  • UM Version set to 8.2

Hi Chris,

It's failing in the build. Try changing,

$UM_SVN_BIND/container.cfg@20056

to

$UM_SVN_BIND/container.cfg@vn8.2_cfg

in the FCM extract and build page and try again.

Regards
Willie

comment:2 Changed 6 weeks ago by ChrisWells

Hi Willie,

I tried this: I changed in the UMUI under

FCM configuration → FCM extract directories and Output levels → Container file name and location (UM_CONTAINER)

which had $UM_SVN_BIND/container.cfg@20056 in it, and I changed to $UM_SVN_BIND/container.cfg@vn8.2_cfg

I then did Process and Submit again, but got a similar result - the same type of message, but much earlier in the file (line ~2500)

"
touch /projects/ukca-imp/chwel/xodta/umatmos/done/filenamelength_mod.done

ftn-855 crayftn: ERROR IOS_MPI_ERROR_HANDLERS, File = ../../../../../projects/ukca-imp/chwel/xodta/umatmos/ppsrc/UM/io_services/common/ios_mpi_error_handlers.f90, Line = 22, Column = 8

The compiler has detected errors in module "IOS_MPI_ERROR_HANDLERS". No module information file will be created for this module.

"

Did I follow the instructions correctly? (This is my first time using PUMA.) Or is it something else?

Many thanks,
Chris

comment:3 Changed 6 weeks ago by willie

Hi Chris,

OK. Could you now add the branch

fcm:um_br/dev/willie/vn8.2_um_crun_ios/src

and switch it on and then switch off the

fcm:um_br/pkg/Config/vn8.2_ncas/src

branch and try again.

Regards

Willie

comment:4 Changed 5 weeks ago by ChrisWells

Hi Willie,

I think I followed that correctly. In the UMUI for the job, under FCM options for UM Atmosphere and Reconfiguration, I switched the vn8.2_ncas branch from Y to N, added the new branch and set it to Y, then re-processed and submitted. I still get the same error:

"
/build/include -INetCDFmodule -I/projects/um1/grib_api/cce-8.3.4/1.13.0/include -c /projects/ukca-imp/chwel/xodta/umatmos/ppsrc/jules/src/science/params/nstypes.f90

ftn-855 crayftn: ERROR IOS_MPI_ERROR_HANDLERS, File = ../../../../../projects/ukca-imp/chwel/xodta/umatmos/ppsrc/UM/io_services/common/ios_mpi_error_handlers.f90, Line = 22, Column = 8

The compiler has detected errors in module "IOS_MPI_ERROR_HANDLERS". No module information file will be created for this module.

"

Is there more I should change to get it to work?

Many thanks,
Chris

comment:5 Changed 5 weeks ago by willie

Hi Chris,

The major problem is that all the files under /projects/umadmin/ksival have vanished. This is affecting the Oasis build and the gcom path. These are set in the UMUI Time convention and SCRIPT environment page and on the UM user override page.

Originally it used gcom4.7 but I have got it to compile at least with a gcom path of /projects/um1/gcom/gcom4.6/meto_cray_xc40_mpp/build/include.

You'll need to find an alternative to

OASIS_BLDS=/projects/umadmin/ksival/oasis/oasis3_3

I'm afraid I can't help you there.

Also on the links to the NEMO model page, jwalto needs to become jwalton, I think.
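
If the jwalto to jwalton change has to be applied to several paths, a plain search and replace would also hit paths that are already correct and turn them into jwaltonn. A minimal sketch of a safer substitution follows; the example paths are hypothetical, not the real ones from the job:

```python
import re

def fix_username(path: str) -> str:
    """Replace 'jwalto' with 'jwalton' unless the 'n' is already there."""
    # The negative lookahead (?!n) skips occurrences already followed by 'n'.
    return re.sub(r"jwalto(?!n)", "jwalton", path)

# Hypothetical paths, for illustration only.
examples = [
    "/projects/example/jwalto/nemo_build",
    "/projects/example/jwalton/nemo_build",
]
for p in examples:
    print(fix_username(p))
```

Both example paths come out with jwalton, so the function can be run more than once without harm.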

Regards
Willie

comment:6 Changed 4 weeks ago by ChrisWells

Hi Willie,

Thanks for the info. I found a copy of the deleted folder in a subfolder of umadmin (gmslis/), and changed the paths to point there, also changing jwalto → jwalton.

I ran it, and it seemed to get further – it made 2 files in /output, one which seems to have reported no errors (the .comp.leave file), and one (.rcf.leave) with this error:

"
RCF Executable : /projects/ukca-imp/chwel/xodta/bin/qxreconf
*

[Tue Sep 18 15:09:33 2018] [c4-2c1s8n0] Fatal error in MPI_Init: Other MPI error, error stack:
MPIR_Init_thread(506):
MPID_Init(192).......: channel initialization failed
MPID_Init(569).......: PMI2 init failed: 1
/projects/ukca-imp/chwel/xodta/bin/qsrecon: Error in dump reconfiguration - see OUTPUT
*

Ending script : qsrecon
Completion code : 1
Completion time : Tue Sep 18 15:09:33 GMT 2018

*

/projects/ukca-imp/chwel/xodta/bin/qsmaster: Failed in qsrecon in job xodta
"

I'm unsure what this means - do you know?

Many thanks,
Chris

comment:7 Changed 4 weeks ago by willie

Hi Chris,

Sorry, my fault: you need to switch the vn8.2_ncas branch on and recompile. It will then fail in the reconfiguration because it can't find (look in the pe_output for the message),

/projects/um1/ancil/atmos/n96/orca1/rivers_trip/sequence/etopo5/v0/qrparm.rivseq.inv

If you delete the .inv from the end, it will find that.

It may be an idea to make a copy of the ksival directory just in case it gets deleted again.

Regards,
Willie

comment:8 Changed 4 weeks ago by ChrisWells

Hi Willie,

No worries; I made the branch change, but am not sure where I need to edit that filename to remove the ".inv"? I couldn't see it referred to in the Recon pages.

Thanks for the tip on the files- copying over to my folders.

Many thanks,
Chris

comment:9 Changed 4 weeks ago by willie

Hi Chris,

It is specified in Ancillary ... -> Climatologies & potential ... -> River Routing Global Model

Willie

comment:10 Changed 4 weeks ago by ChrisWells

Hi Willie,

Thanks for pointing me to that. I made the change and ran the job, but hit a couple more missing files.

One was a jwalto → jwalton change, which I've made, but the other is more involved.

The directory /projects/ocean/hadgem3/oasis_ctl/oasis3 exists, but the subdirectory referred to in the UMUI, /um8.2nemo3.4, doesn't, so the model can't find um8.2nemo3.4/namcouple_1_EXPORTED_3HR

/projects/ocean/hadgem3/oasis_ctl/oasis3 contains lots of namcouple stuff, but nothing identical. I'm not sure what namcouple does. Could something in /projects/ocean/hadgem3/oasis_ctl/oasis3 be a replacement for the missing directory (e.g. namcouple_para_1_EXPORTED_3HR)?

Many thanks,
Chris

comment:11 Changed 3 weeks ago by ChrisWells

Hi Willie,

just wondering if you'd had a chance to look at this "namcouple" query?

Many thanks,
Chris

comment:12 Changed 3 weeks ago by ChrisWells

Hi Willie,

I might have found the file actually,

Cheers,
Chris

comment:13 Changed 3 weeks ago by ChrisWells

Hi Willie,

I found a subfolder in /projects/ocean/hadgem3/oasis_ctl/oasis3 called /um8.2, rather than /um8.2nemo3.4 as expected by the model, and this has a subfolder /namcouple_1_EXPORTED_3HR as expected. I tried this, but got the error

"No OASIS3 angles file will be used"

Do you know what this might mean?

Many thanks,
Chris

comment:14 Changed 2 weeks ago by willie

Hi Chris,

Is that an information message rather than an error? Is it stopping the run?

Willie

comment:15 Changed 2 weeks ago by ChrisWells

Hi Willie,

It stops the run. I'm not sure why it's failing though; it seems to fail after the above error, which is from the file /home/d00/chwel/output/xodta000.xodta.d18274.t143322.leave .

Cheers,
Chris

comment:16 Changed 2 weeks ago by willie

Hi Chris,

That's not the real error. You're getting

apsched: claim exceeds reservation's node-count

and later

ERROR: Expected NEMO output files are not all available.
       This may be a UM / OASIS / NEMO start-up problem.
       The ocean.output file may provide more information.

probably because of the apsched problem. Have you changed the number of processors in UM/CICE/NEMO?

Also, the um_archiving script hasn't been found; see the archive.leave file.

So there are some basic issues here. Has this ever been run?
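
Since the last message in a .leave file is not always the real cause, it can help to list every suspicious line in order. The following is a rough sketch only; the pattern list is an assumption based on the messages quoted in this ticket and is not exhaustive:

```python
import re

# Patterns taken from messages seen in this ticket (assumed, not exhaustive).
PATTERNS = [r"apsched:", r"^\s*ERROR\b", r"Fatal error"]

def find_error_lines(text, patterns=PATTERNS):
    """Return (line_number, line) pairs for lines matching any pattern."""
    compiled = [re.compile(p) for p in patterns]
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if any(c.search(line) for c in compiled):
            hits.append((number, line.strip()))
    return hits

# Sample built from the messages quoted above.
sample = (
    "No OASIS3 angles file will be used\n"
    "apsched: claim exceeds reservation's node-count\n"
    "ERROR: Expected NEMO output files are not all available.\n"
)
for number, line in find_error_lines(sample):
    print(number, line)
```

On a real run, the text would be read from the .leave file; the earliest hit is usually the one to investigate first.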

Regards
Willie

comment:17 Changed 2 weeks ago by ChrisWells

Hi Willie,

Thanks for the info. I'm trying to run this old um8.2 job from puma. The job was run on the old XC40 machine (xcml00), and I'm trying to run it on the new one on Monsoon - this job hasn't been run on there before.

I haven't edited the number of processors - is there a set number I should use, different to the old machine?

Cheers,
Chris

comment:18 Changed 11 days ago by willie

Hi Chris,

Have you considered using a more modern HadGEM3-GC3.1 N96 ORCA1 PI Control for CMIP6 suite? There is a Monsoon version u-as930 and an ARCHER version u-as037. These are UM10.7.

For reference the archiving script has been moved to /common/moci/archiving/bin/um_archiving. You'll need a hand edit to modify the SUBMIT script to use this.

Regards
Willie

comment:19 Changed 11 days ago by ChrisWells

Hi Willie,

Thanks for the archiving script info - I'll see if I can use that; I guess I need to edit the submit script after processing to link to that.

The reason I'm trying to run the older version is that I'm following up on an older project that used vn8.2 when it was current. I'm trying to run it with just the BC lifetime updated, so to compare the resulting changes properly it should be run with the same version as the original project.

Cheers,
Chris

comment:20 Changed 10 days ago by willie

  • Keywords OASIS, NEMO, CICE added

Hi Chris,

There are still a large number of missing files. I will try to set them up for you by back copying ARCHER equivalents. This might take a bit of sorting …

Regards
Willie

comment:21 Changed 10 days ago by ChrisWells

Hi Willie,

thanks so much for that and apologies for the hassle. Let me know if there's anything I can do to help.

Many thanks,
Chris

Changed 8 days ago by willie

comment:22 Changed 8 days ago by willie

Hi Chris,

I have now made some progress. This is summarised in my job xoela.

The fundamental problem is that the old UMUI does not cater for the new Monsoon2 computer. It fails to calculate the number of nodes properly, and this is the cause of the apsched failure: it tries to use more nodes than were requested in the PBS request. The old xcm had 32 cores per node, while the new Monsoon has 36. So I have rescaled the number of processors for the atmosphere and also for NEMO and CICE. Whenever the NEMO and CICE decomposition is changed, these models must be recompiled.
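
To illustrate why the cores-per-node figure matters, here is a minimal sketch of the node arithmetic. The task count of 320 is a made-up number, since the actual decomposition is not given in this ticket, and the function is not the UMUI's own calculation:

```python
def nodes_needed(total_tasks: int, cores_per_node: int) -> int:
    """Whole nodes required to place total_tasks, one task per core."""
    # Ceiling division using integers only.
    return -(-total_tasks // cores_per_node)

TOTAL_TASKS = 320  # hypothetical task count, for illustration only

old_nodes = nodes_needed(TOTAL_TASKS, 32)  # old xcm: 32 cores per node
new_nodes = nodes_needed(TOTAL_TASKS, 36)  # new Monsoon: 36 cores per node
print(old_nodes, new_nodes)
```

With these illustrative numbers the two machines give different node counts for the same job, so a reservation sized with one figure and a launch sized with the other will not agree.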

However, the old sizes are still embedded in the script OASIS_conf, which is found in the model $RUNID/bin directory. This leads to a little trick to get the thing going. Launch the job through the UMUI as usual and let it build and run. It will then fail. At this point, copy the attached OASIS_conf on top of the old one. You can then qsub umuisubmit_run from the failed (last) directory in ~/umui_runs.
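
Finding the failed (last) directory can be sketched as below. This assumes "last" means the most recently modified subdirectory, which is a guess at the convention; the directory names in the demonstration are invented:

```python
import os
import tempfile
from pathlib import Path

def latest_run_dir(root: Path) -> Path:
    """Return the most recently modified subdirectory of root."""
    subdirs = [p for p in root.iterdir() if p.is_dir()]
    if not subdirs:
        raise FileNotFoundError(f"no run directories under {root}")
    return max(subdirs, key=lambda p: p.stat().st_mtime)

# Demonstration in a throwaway directory; the names are made up.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    older = root / "run-older"
    newer = root / "run-newer"
    older.mkdir()
    newer.mkdir()
    # Set explicit modification times so the ordering is unambiguous.
    os.utime(older, (1_000_000_000, 1_000_000_000))
    os.utime(newer, (1_500_000_000, 1_500_000_000))
    picked_name = latest_run_dir(root).name

print(picked_name)
```

On the real system the root would be the ~/umui_runs directory mentioned above, and the resubmission itself would still be done by hand from the directory that is found.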

At this point it runs for about 43 atmosphere time steps and then falls over in CICE with the error

ice: ITD cleanup error in step_therm2

Other changes between xoela and xodta are as follows. Some of the aerosol clim ancils were in the wrong place and I have corrected these. The controlling namcouple file had the wrong name; I replaced it with one in the specified directory.

I am not au fait with the details of running a coupled model, so I am hoping that this error will mean more to you.

Regards
Willie


comment:23 Changed 8 days ago by willie

Sorry that's model xoela.

comment:24 Changed 8 days ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

comment:25 Changed 2 days ago by willie

  • Resolution set to answered
  • Status changed from accepted to closed

This is continued in ticket #2641
