Opened 3 years ago

Closed 3 years ago

#2138 closed help (fixed)

rose suite "stalled"

Reported by: ambrogio
Owned by: annette
Component: FCM
Keywords: cce gcom loadcomp .profile
Cc:
Platform: ARCHER
UM Version: 10.5

Description

I am starting to run the idealised UM Vn10.5 (periodic channel with baroclinic lifecycles) with Rose and I'm having an issue that I don't understand.

The job fails while running fcm_make2 and the error message is the one shown here:

WARNING - suite stalled

Do you know what could be the cause of this?

The suite ID is u-al094 and my username is ambrogio for both ARCHER and PUMA (username: ambrogiovolonte for the shared repository).

Many thanks for the attention

Best wishes

Ambrogio

Change History (9)

comment:1 Changed 3 years ago by annette

  • Component changed from UM Model to FCM
  • Keywords cce gcom loadcomp .profile added
  • Owner changed from um_support to annette
  • Status changed from new to assigned

Ambrogio,

If you look in job.err for fcm_make2 there are several errors like this:

[FAIL] ftn-1777 crayftn: ERROR MPP_TRI_SOLVE_EXEC, File = ../../../../cylc-run/u-al094/share/fcm_make/preprocess-atmos/src/um/src/atmosphere/dynamics_solver/mpp_tri_solve_exec.F90, Line = 44, Column = 5 
[FAIL]   File "/work/y07/y07/umshared/gcom/cce8.4.1/gcom6.0/archer_xc30_cce_mpp/build/include/MPL.mod" contains modules and/or submodules.  The compiler being used is older than the compiler that created this file.   The file was created with version 97 from release Unknown.

I think what is going on is that you are building with cce/8.3.7, which can't read the default GCOM library for UM 10.5 because that library was built with the newer cce/8.4.1.
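As a rough way to confirm this kind of mismatch, the compiler version a GCOM build was made with can be read off the library path in the error message and compared with the compiler in use. The following is a minimal sketch, not part of Rose, FCM or the UM: the function names `gcom_cce_version` and `compare_cce` are hypothetical, and it assumes the GCOM path contains a directory named `cce<version>`, as in the message above.

```shell
# Hypothetical helpers (not part of Rose, FCM or the UM).
# Assumption: the GCOM install path contains a "cce<version>" directory,
# as in .../umshared/gcom/cce8.4.1/gcom6.0/...

# Print the CCE version embedded in a GCOM path (prints nothing if absent).
gcom_cce_version() {
    printf '%s\n' "$1" | sed -n 's|.*/cce\([0-9][0-9.]*\)/.*|\1|p'
}

# Compare the version a library was built with against the compiler in use.
# $1 = path to a GCOM file, $2 = version of the compiler in use (e.g. 8.3.7)
compare_cce() {
    built_with=$(gcom_cce_version "$1")
    if [ -z "$built_with" ]; then
        echo "unknown: no cce version found in path"
    elif [ "$built_with" = "$2" ]; then
        echo "match: cce/$2"
    else
        echo "differs: library built with cce/$built_with, compiler is cce/$2"
    fi
}
```

Calling `compare_cce` with the MPL.mod path from the error and `8.3.7` reports that the versions differ. The version of the compiler in use has to be supplied by hand here, for example after checking the output of `module list`. A difference is only a hint: as the error text says, the problem arises when the compiler is older than the one that created the file.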

The reason you are using cce/8.3.7 is that you have this line in your .profile:

loadcomp $TARGET_MC

This was in the standard instructions for running with the UMUI, but I don't think it is needed for the UMUI anymore, and it will certainly confuse Rose versions of the model.

I think the best thing to do is remove the loadcomp line from your .profile, then do a full clean build:

rose suite-run --new

Alternatively, you could set the compiler version in your suite to override the settings in your .profile, but you would have to do this for every suite.
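The .profile edit suggested above could be scripted roughly as follows. This is only a sketch: the function name `disable_loadcomp` is hypothetical, and it assumes the `loadcomp` call sits at the start of its own line (optionally indented). It comments the line out rather than deleting it, and keeps a backup copy of the file.

```shell
# Hypothetical helper: comment out any active "loadcomp" line in a profile
# file, keeping a backup copy alongside it as <file>.bak.
# $1 = path to the profile file (e.g. "$HOME/.profile")
disable_loadcomp() {
    profile="$1"
    # Keep a backup so the change is easy to undo.
    cp "$profile" "$profile.bak" || return 1
    # Prefix lines that begin with "loadcomp" (after optional whitespace)
    # with "# "; lines that are already commented out are left untouched.
    sed 's/^\([[:space:]]*\)\(loadcomp\)/\1# \2/' "$profile.bak" > "$profile"
}
```

For example, `disable_loadcomp "$HOME/.profile"` would turn `loadcomp $TARGET_MC` into `# loadcomp $TARGET_MC`. A fresh login is then needed for the change to take effect before rebuilding.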

Annette

comment:2 Changed 3 years ago by ambrogio

Dear Annette,

Thanks for your reply. I modified the .profile file as you suggested and it now builds fine (although fcm_make2 ran for around 70 minutes; is that normal?).
Unfortunately, the job is now failing at run time with this message:


2017-04-04T12:14:00Z INFO - [atmos.20000104T0000Z] -setting execution poll timer for 300 seconds
2017-04-04T12:14:32Z INFO - [atmos.20000104T0000Z] -(current:running)> failed (polled)
2017-04-04T12:14:33Z WARNING - suite stalled
2017-04-04T12:14:33Z WARNING - Unmet prerequisites for atmos.20000105T0000Z:
2017-04-04T12:14:33Z WARNING - * atmos.20000104T0000Z succeeded
2017-04-04T12:15:10Z INFO - [atmos.20000104T0000Z] -(current:failed)> failed (polled)
2017-04-04T12:15:10Z WARNING - [atmos.20000104T0000Z] -rejecting a message received while in the failed state:
2017-04-04T12:15:10Z WARNING - [atmos.20000104T0000Z] - failed
2017-04-04T15:14:34Z WARNING - suite timed out after PT3H

Do you know what the problem could be now? Sorry, I might be asking very basic questions but this is the first job I'm running with Rose.

Many thanks for the support

Cheers
Ambrogio

comment:3 Changed 3 years ago by annette

Hi Ambrogio,

Yes, the build can be a bit slow on ARCHER sometimes; 1-2 hours for the first build is normal.

If you look on rose-bush for your suite, you can see that the atmos has failed.
http://puma.nerc.ac.uk/rose-bush/taskjobs/ambrogio/u-al094

Click on job.err; the error message is near the bottom of the file:

BUFFOUT: Write Failed: Disk quota exceeded

I have increased your /work disk quota. I think it takes a couple of hours to go through the system, so try resubmitting later.
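The same check of job.err can be done from the command line instead of through rose-bush. The sketch below assumes nothing beyond standard `tail` and `grep`; the function name `show_job_errors` and the search patterns are illustrative choices, not part of Rose or cylc.

```shell
# Hypothetical helper: show error-like lines from the end of a job.err file.
# $1 = path to job.err, $2 = number of trailing lines to scan (default 50)
show_job_errors() {
    # Only the tail is scanned, since the fatal message is usually near the
    # bottom; the match is case-insensitive.
    tail -n "${2:-50}" "$1" | grep -i -E 'error|fail|quota exceeded'
}
```

It would be called as `show_job_errors path/to/job.err`, where the location of job.err under the suite's cylc-run log directory depends on the cycle, task and submit number. Because the result comes from `grep`, the function returns a non-zero status when nothing matches.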

Annette

comment:4 Changed 3 years ago by ambrogio

Dear Annette, thanks for this. The job now runs OK.
I have just one other question, about the walltime for each cycle of my atmos run. I tried to change it by manually editing the suite.rc file, as I was getting close to exceeding it (in the first run I actually exceeded it; in the others I got "lucky"). The problem is that after I modified it, the job failed at submission, so I had to go back to the initial walltime value. Could this be a conflict between the time requested and the queue I was in?

Also, I have seen there is a 1-day course on Rose at the beginning of May. Given my situation, I'd like to register. Do you still have spaces?

Many thanks!
Ambrogio

comment:5 Changed 3 years ago by annette

Hi Ambrogio,

You should be able to edit the walltime in the suite.rc file as you described. The only thing I can think of is that you may have accidentally introduced a typo. Can you try again, and if it still fails, let me know and I will look at your suite files and logs.

There are still spaces on the conversion course. To register, please send a brief email to cms-support AT ncas.ac.uk outlining what UM work you are doing (this is just for our records).

Annette

comment:6 Changed 3 years ago by ambrogio

Dear Annette,

I changed the walltime for atmos from 00:20:00 to 01:00:00 and got "submission failed". The activity log does not say more than that.

Thanks for the attention
Ambrogio

comment:7 Changed 3 years ago by annette

Ambrogio,

Ah, this is because you are submitting to the short queue, which will not accept the longer walltime. If you change HPC_QUEUE in rose-suite.conf to "standard", that should work.
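The queue change described above could be made by hand in an editor, or scripted roughly as below. This is a sketch under assumptions: the function name `set_hpc_queue` is hypothetical, and it assumes the setting appears on its own line in the form `HPC_QUEUE='<name>'`, which may not match every suite.

```shell
# Hypothetical helper: set HPC_QUEUE in a rose-suite.conf file.
# Assumption: the setting appears on its own line as HPC_QUEUE='<name>'.
# $1 = path to rose-suite.conf, $2 = queue name (e.g. standard)
set_hpc_queue() {
    conf="$1"
    # Keep a backup so the previous queue setting can be restored.
    cp "$conf" "$conf.bak" || return 1
    # Rewrite only the HPC_QUEUE line; every other line is copied unchanged.
    sed "s/^HPC_QUEUE=.*/HPC_QUEUE='$2'/" "$conf.bak" > "$conf"
}
```

For example, `set_hpc_queue rose-suite.conf standard`, run from the suite directory, would replace the existing queue name with `standard`. The queue name is substituted directly into the `sed` expression, so it should not contain `/` characters.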

Annette

comment:8 Changed 3 years ago by ambrogio

Dear Annette,

It works fine now, thank you for all the support!

Best wishes
Ambrogio

Note: I've sent an email to cms-support… to ask for a place at the 1-day Rose course. Should I wait for a confirmation?

comment:9 Changed 3 years ago by annette

  • Resolution set to fixed
  • Status changed from assigned to closed

Ambrogio,

That's great, I will close the ticket. I have answered your course query by email.

Annette
