Opened 4 years ago

Closed 3 years ago

#1771 closed help (fixed)

crash with time series requests

Reported by: ggxmy
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: ARCHER
UM Version: 8.4


Related to #1745, I increased the number of output locations from 16 to 64 and the job (teafr) crashed.

/home/n02/n02/masara/output/teafr000.teafr.d15344.t151436.leave.20151210-155942 contains messages like this:

???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: check_iostat
? Error Code:  4303
? Error Message:  Error reading namelist domain. Please check input list against code.
? Error generated from processor:     0
? This run generated   5 warnings

Rank 56 [Thu Dec 10 18:05:09 2015] [c1-2c2s5n1] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 56

(many of these lines)

atpAppSigHandler: Back-end never delivered its pid. Re-raising signal.

(many of these lines)

_pmiu_daemon(SIGCHLD): [NID 03597] [c2-2c2s3n1] [Thu Dec 10 18:09:09 2015] PE RANK 97 exit signal Aborted

(many of these lines)

These messages are not helpful to me. Do these or other messages in the file tell you anything? I also went back to the UMUI and ran "verify diagnostics" in the STASH panel, which reports:

You have exceeded the maximum number of timeseries requests
You have requested 168640 timeseries but the limit is 1500:

I suspect this to be the cause of the problem. Do you agree?

In my previous jobs (like teafm), I was requesting 42160 timeseries and the jobs ran OK. What is the current maximum number of timeseries requests?

I may be able to reduce the number of outputs somewhat, although I have to speak with my colleagues first. Is it possible to increase the maximum number of time series requests? I have already increased the total number of requests by adding a hand edit, puma:/home/ggxmy/hand_edits/climmean_field_inc_40000.ed. Is there a similar remedy for time series requests?

Thank you.

Change History (4)

comment:1 Changed 4 years ago by simon


Yes, I suspect the crash is caused by requesting more time series diagnostics than the model limit allows. Is there any reason why you're doing this directly via STASH? Could you output the full field as a time series and then extract each point as part of your analysis? That way you won't be limited to the points specified in the STASH window.
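As a rough illustration of the "extract each point as part of your analysis" idea, here is a minimal sketch in plain Python. It assumes the full field has already been read into a nested list indexed as field[time][level][row][column]; that layout, the function name extract_point_series and the (row, column) point format are all hypothetical and are not taken from the UM or any of its tools.

```python
def extract_point_series(field, points):
    """Pull per-point time series out of a full-field time series.

    field  : nested list indexed as field[t][k][j][i]
             (time, model level, grid row, grid column) -- assumed layout
    points : list of (j, i) grid indices to extract

    Returns a dict mapping each (j, i) to a list over time, where each
    entry is the list of values over all model levels at that point.
    """
    return {
        (j, i): [[level[j][i] for level in step] for step in field]
        for (j, i) in points
    }


# Tiny synthetic field: 2 times, 2 levels, 2 rows, 3 columns.
# Each value encodes its own indices so the result is easy to check.
field = [
    [[[1000 * t + 100 * k + 10 * j + i for i in range(3)]
      for j in range(2)]
     for k in range(2)]
    for t in range(2)
]

series = extract_point_series(field, [(0, 0), (1, 2)])
print(series[(1, 2)])
```

With this approach the list of points can be changed after the run without touching the STASH setup, at the cost of writing the full field.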

comment:2 Changed 4 years ago by ggxmy

Thank you Simon.

We are going to submit an ensemble of over 200 annual simulations. If we include 3-hourly 3-D diagnostics, the output for this ensemble will be very large. I estimated that 5 TB will be required for every 3-hourly 3-D diagnostic, so we can include only one or two of these and still stay within the disk space we have requested on JASMIN for this project. We will also buy a disk locally for backup, but its cost will be a limiting factor as well.


comment:3 Changed 4 years ago by simon


I've had a look at the model code and you are far exceeding the maximum number of time series requests it is currently set up for. Each model level is treated as a separate time series. Therefore, in your domain DTHET_LD you have 64*85, i.e. 5440, different time series; the model maximum is 250 per domain profile. You have a total of 31 STASH items using that domain, so that's 31*64*85 = 168640 different time series requests; the maximum allowable in the code is 1500. I suspect the model worked previously, but only wrote the first 1500 items. Increasing the number of points in the domain STASH window caused a model error when reading the actual domain namelist, as it became too long.
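The arithmetic above can be checked with a short Python sketch. The numbers (85 levels, 31 STASH items, limits of 250 per domain profile and 1500 in total) are the ones quoted in this ticket; the function name is hypothetical, and applying the same 85 levels and 31 items to the earlier 16-location job is an assumption, although it does reproduce the 42160 figure reported for it.

```python
# Figures quoted in this ticket (not read from the model code here).
LEVELS = 85            # model levels, each counted as a separate time series
STASH_ITEMS = 31       # STASH items using the domain profile
MAX_PER_DOMAIN = 250   # quoted limit per domain profile
MAX_TOTAL = 1500       # quoted overall limit


def timeseries_counts(n_points, n_levels=LEVELS, n_items=STASH_ITEMS):
    """Return (time series per domain profile, total time series requests)."""
    per_domain = n_points * n_levels
    total = per_domain * n_items
    return per_domain, total


for n_points in (16, 64):
    per_domain, total = timeseries_counts(n_points)
    print(
        f"{n_points} points: {per_domain} per domain "
        f"(limit {MAX_PER_DOMAIN}), {total} in total (limit {MAX_TOTAL})"
    )
```

Both configurations come out well above the quoted limits, which fits the suspicion that the earlier job ran but did not write every requested series.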

Do you require all the levels for all fields?

The maximum allowable number of time series could be increased in the code, but this could have other unforeseen issues down the line.


comment:4 Changed 3 years ago by simon

  • Resolution set to fixed
  • Status changed from new to closed