Opened 2 years ago

Closed 2 years ago

#2238 closed help (fixed)

Global model run on new monsoon system

Reported by: LSaffin Owned by: willie
Component: UM Model Keywords:
Cc: L.Saffin@… Platform: Monsoon2
UM Version: 7.3

Description

I'm trying to run the UM on monsoon. I've taken an existing job (xjjhm) I used on a previous version of monsoon and updated it with the changes described here (http://collab.metoffice.gov.uk/twiki/bin/view/Support/CrayUMInstall#UM7.3).

The model runs for a little while and then produces a segmentation fault in the departure point calculation. The error message in the .leave file (/home/d03/lsaffi/output/xjjhm000.xjjhm.d17215.t094012.leave) is

ATP Stack walkback for Rank 40 starting:

bi_linear_h_@…:561

ATP Stack walkback for Rank 40 done
Process died with signal 11: 'Segmentation fault'
Forcing core dumps of ranks 40, 0, 1, 53

The stash output which the model has managed for the first hour (in /projects/diamet/lsaffi/xjjhm) also has a lot of the fields as NaNs, but not all of them. I'm not sure if this is related.

I'm sure I've probably missed something simple in the update to the new system but I can't figure out what it is.

Thanks,

Leo

Change History (10)

comment:1 Changed 2 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Leo,

The user STASH variable USTASH (in UMUI Time convention and SCRIPT environment page) is now

/home/umui/PS22/Global

Regards
Willie

comment:2 Changed 2 years ago by LSaffin

Hi Willie,

I've run the model again with the new user STASH but it hasn't made a difference to the errors I'm getting.

Leo

comment:3 Changed 2 years ago by LSaffin

  • Cc L.Saffin@… added

comment:4 Changed 2 years ago by willie

Hi Leo,

Of course, those instructions are for old Monsoon, which had 32 cores per node. New Monsoon, Monsoon2, has 36 cores per node. So you need to make sure that both reconfiguration and run jobs use a multiple of 36.
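As a rough illustration of the multiple-of-36 rule, the sketch below checks whether an east-west by north-south processor decomposition fills whole nodes. It assumes the total processor count is simply the product of the two dimensions and ignores anything else a job may request (for example I/O servers or threads). The function names `node_usage` and `fills_whole_nodes` are hypothetical helpers, not part of the UM or UMUI, and the decompositions in the loop are only illustrative.

```python
# Cores per node on Monsoon2, as quoted above (old Monsoon had 32).
CORES_PER_NODE = 36


def node_usage(ew, ns, cores_per_node=CORES_PER_NODE):
    """Return (total_processors, nodes_needed, idle_cores) for an ew x ns decomposition.

    Assumption: the job uses exactly ew * ns cores and nothing else.
    """
    total = ew * ns
    nodes = -(-total // cores_per_node)  # ceiling division
    idle = nodes * cores_per_node - total
    return total, nodes, idle


def fills_whole_nodes(ew, ns, cores_per_node=CORES_PER_NODE):
    """True if ew * ns is an exact multiple of the cores per node."""
    return (ew * ns) % cores_per_node == 0


if __name__ == "__main__":
    # Illustrative decompositions only.
    for ew, ns in [(12, 12), (8, 16), (8, 12)]:
        total, nodes, idle = node_usage(ew, ns)
        status = "whole nodes" if fills_whole_nodes(ew, ns) else "partial node"
        print(f"{ew}x{ns}: {total} processors, {nodes} nodes, {idle} idle cores ({status})")
```

A check like this only confirms the arithmetic. It does not tell you whether a given decomposition will actually run.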

I found another reference to a moved file on STASH > User STASHmaster. This needs to be

~umui/userSTASH/STASH_7.3_7.5

Regards
Willie

comment:5 Changed 2 years ago by LSaffin

Hi Willie,

Thanks for the update. I've changed the stash file. I tried running the model again with the reconfiguration and run set to use 12x12 processors but I still get the same errors. I have another job (xjjhq) which has successfully run using 8x16 processors on the new monsoon machine, so I don't think that's the issue, although I will change that if I run it again.

Leo

comment:6 Changed 2 years ago by willie

Hi Leo,

It worked for me with 8x12 processors. See my job xnona.

Regards
Willie

comment:7 Changed 2 years ago by LSaffin

Hi Willie,

That's worked for me too.

Thanks,

Leo

comment:8 Changed 2 years ago by LSaffin

Hi Willie,

I was aiming to use this job to run the model for a more recent case study; however, the reconfiguration fails, which I think is due to changes in the analyses since the update to ENDGame. I have another start dump (/projects/diamet/lsaffi/20160922T0000Z_glm_t+0) but the run fails at the reading of the header when this is used. Are there any standard jobs or edits for using the newer analyses with new dynamics? Or should I start another ticket?

Thanks,

Leo

comment:9 Changed 2 years ago by willie

Hi Leo,

Yes, the old 7.3 jobs won't be able to handle the ENDGame start dumps. It might be better to move to 10.2+ and use a Rose suite such as Ben Harvey's u-ae294 (I found this by using the Rose search on PV tracers and following the trail).

We prefer to have one problem per ticket, although in reality this is not always possible.

Regards
Willie

comment:10 Changed 2 years ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed