Opened 5 years ago

Closed 5 years ago

#1446 closed help (fixed)

Run failing, no error message

Reported by: webber24 Owned by: annette
Component: UM Model Keywords:
Cc: Platform: ARCHER
UM Version: 8.4

Description

Dear CMS,

The job xkyor, which I submitted last night and again this morning, has failed to complete both times. However, when I look in pe_output on archer:/work/n02/n02/webber24/xkyor and in my .leave files, there is no mention of the words "error" or "fail". Do you have any ideas as to why it could be failing?

Best,

Chris

Change History (9)

comment:1 Changed 5 years ago by annette

Hi Chris,

The leave file indicates that it was pe 48 that crashed.

Rank 48 [Fri Jan 23 11:00:29 2015] [c5-2c1s10n3] application called MPI_Abort(comm=0xC4000002, 9) - process 48
_pmiu_daemon(SIGCHLD): [NID 04139] [c5-2c1s10n3] [Fri Jan 23 11:00:37 2015] PE RANK 48 exit signal Aborted
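
As a rough illustration of this first step, the sketch below scans the text of a .leave file for the two kinds of abort line quoted above and collects the rank numbers. The function name `find_aborting_ranks` and the regular expressions are assumptions based only on these two sample lines; other MPI or job-launcher versions may word their messages differently.

```python
import re

# Patterns inferred from the two sample lines in this ticket (assumption:
# other systems may format these messages differently).
ABORT_PATTERNS = (
    re.compile(r"^Rank (\d+) .*application called MPI_Abort"),
    re.compile(r"PE RANK (\d+) exit signal"),
)


def find_aborting_ranks(leave_text):
    """Return the sorted, de-duplicated ranks reported as aborting."""
    ranks = set()
    for line in leave_text.splitlines():
        for pattern in ABORT_PATTERNS:
            match = pattern.search(line)
            if match:
                ranks.add(int(match.group(1)))
    return sorted(ranks)


sample = """\
Rank 48 [Fri Jan 23 11:00:29 2015] [c5-2c1s10n3] application called MPI_Abort(comm=0xC4000002, 9) - process 48
_pmiu_daemon(SIGCHLD): [NID 04139] [c5-2c1s10n3] [Fri Jan 23 11:00:37 2015] PE RANK 48 exit signal Aborted
"""

print(find_aborting_ranks(sample))  # prints [48]
```

The rank number then tells you which file in pe_output to open.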

So having a look at file 48 in pe_output shows an error message:

????????????????????????????????????????????????????????????????????????????????
??!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!??
? Error in routine: cmps_all
? Error Code:     2
? Error Message: COEX: Unable to WGDOS pack to this accuracy
? Error generated from processor:    48
? This run generated 549 warnings
????????????????????????????????????????????????????????????????????????????????
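
For anyone wanting to pull these fields out automatically, here is a minimal sketch that parses a banner like the one above into a dictionary. The function name `parse_error_banner` is hypothetical, and the field labels are taken only from this one banner, so other UM versions may need different patterns.

```python
import re

# Matches banner lines such as "? Error Code:     2".
# The labels come from the banner quoted in this ticket (assumption:
# the layout may differ between UM versions).
BANNER_FIELD = re.compile(
    r"^\?\s*(Error in routine|Error Code|Error Message|"
    r"Error generated from processor):\s*(.*?)\s*$"
)


def parse_error_banner(pe_text):
    """Return a dict mapping each recognised banner label to its value."""
    fields = {}
    for line in pe_text.splitlines():
        match = BANNER_FIELD.match(line)
        if match:
            fields[match.group(1)] = match.group(2)
    return fields


sample_banner = """\
????????????????????????????????
??!!!???!!! ERROR ???!!!???!!!??
? Error in routine: cmps_all
? Error Code:     2
? Error Message: COEX: Unable to WGDOS pack to this accuracy
? Error generated from processor:    48
? This run generated 549 warnings
????????????????????????????????
"""

for label, value in parse_error_banner(sample_banner).items():
    print(f"{label}: {value}")
```

Note that the banner spells the word as "ERROR" and "Error", so a case-sensitive search for lowercase "error" would not match any of these lines.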

This can be caused by NaNs in the data due to numerical instabilities or errors with input data.

It is worth checking that your start dump looks OK, and doesn't contain NaNs (you can do this by cumf-ing the file with itself).
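
The self-comparison trick presumably works because, under IEEE 754 arithmetic, a NaN never compares equal to anything, including itself, so NaN values would show up as differences even when a file is compared with itself. This is an assumption about how the comparison behaves, not a description of cumf internals. The sketch below illustrates the principle on a plain Python list; the function name `self_mismatches` and the sample values are hypothetical.

```python
import math


def self_mismatches(values):
    """Return the indices of values that do not equal themselves (NaNs)."""
    return [i for i, v in enumerate(values) if v != v]


# Hypothetical field values: two NaNs and one large but finite number.
field = [271.3, float("nan"), 268.9, -1.0e30, float("nan")]

bad = self_mismatches(field)
print(bad)  # prints [1, 4]

# Every flagged entry is a NaN; the large finite value is not flagged.
assert all(math.isnan(field[i]) for i in bad)
```

A check of this kind only catches NaNs. Values that are finite but unphysical, such as very large negative numbers, pass the self-comparison and have to be spotted by inspecting the data.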

And look through the warnings and diagnostic messages to see if anything is going awry.

Annette

comment:2 Changed 5 years ago by annette

  • Owner changed from um_support to annette
  • Status changed from new to assigned

comment:3 Changed 5 years ago by webber24

Hi Annette,

Thanks for your help, but I've looked through all of the changes I've made since the last successful run and I can't see an error. I have checked the input files that I changed and there are no NaNs. Furthermore, I am now trying to output the Theta on PV2 field, which was giving some domain errors that have now gone (these showed up in the pe_output 48 file).

Any further ideas?

Chris

comment:4 Changed 5 years ago by webber24

I just came across what I believe could be causing the error. My ancillary files in

user single_level ancillary file & fields

all show large negative values in the .astart file in my work/n02/n02/webber24 directory.
This is contrary to what you see if you xconv the source ancillary files, which give respectable values. What could be causing this overwrite?

Many Thanks,

Chris

comment:5 Changed 5 years ago by annette

Hi Chris,

I'm looking at this now - I'll get back to you shortly.

Annette

comment:6 Changed 5 years ago by webber24

Hi Annette,

I've found the issue: my jobs now run happily when I remove one of the stash fields from the UPA usage profile. The culprit is Theta on PV2, and it seems to be an error visible in the start dumps of job xkyoo (the same job run without the Theta-PV2 field). The output for this field is erratic, with one huge outlier. I believe it was this that was causing the instabilities, and my next question was going to be whether you know a way of outputting this field stably.

Chris

comment:7 Changed 5 years ago by annette

Chris,

The only thing I can think of is to switch packing off for this stream.

As to your user single-level ancillaries, I noticed a warning in the reconfiguration that your ancillaries use a 360-day calendar rather than a 365-day one, but I don't know if that would have caused the problem.

Annette

comment:8 Changed 5 years ago by webber24

Thanks Annette,

I have a feeling I know why theta-PV2 is not working: it may be something to do with an alteration I had to make to fix a bug for nudging in vn8.4. I have been assured that the next release job, with the bug fixed, is imminent, so I guess I will have to wait for it. I am not sure what you mean by switching packing off for this stream, though. Do you mean excluding this field from the output?

Many Thanks,

Chris

comment:9 Changed 5 years ago by annette

  • Resolution set to fixed
  • Status changed from assigned to closed

Chris,

I think I misunderstood what you were asking - packing is to do with writing the fields out, rather than calculating them.

Since you now seem to have got over your crash, I will close the ticket. But do get in touch if you have further questions.

Annette
