Opened 3 years ago

Closed 3 years ago

#1989 closed help (fixed)

Error from routine file_manager: All units are in use, cannot provide unit

Reported by: marcus
Owned by: um_support
Component: UM Model
Keywords: file_manager, logical unit
Cc:
Platform: ARCHER
UM Version: 10.4

Description

Hi, to reduce queue time I have configured suite u-ag542 to use annual cycles with a wallclock time of 24 hours, i.e. CLOCK='24:00:00' and RESUB='P1Y'. Otherwise it is an exact copy of the nudged suite u-ag268, which ran without problems.

Some time during the 5th month of the integration the model stops with the following error:

Error code: 1
Error from routine: file_manager:get_file_unit
Error message: All units are in use, cannot provide unit
Error from processor: 186
Error number: 20

A search of old tickets has not turned up a solution to this.
What can I do?

Many thanks,
Marcus

Change History (11)

comment:1 Changed 3 years ago by marcus

  • priority changed from normal to high

comment:2 Changed 3 years ago by willie

Hi Marcus,

This message is produced by the IO services file manager. I'm not sure what it means, but you may be able to get more information by repeating the run with an increased level of IO server verbosity. This can be done in namelist > Io System Settings > IO server > ios_verbosity.

A few lines above is the ios_unit_alloc_policy, currently set to one. Perhaps choosing a dynamic policy like 3 might help?

Regards
Willie

comment:3 Changed 3 years ago by marcus

Hi Willie,

Thanks for your suggestions. I have re-run u-ag542, first with increased IO server verbosity and then with ios_unit_alloc_policy set to 3 (dynamic policy), but without success in either case. The error persists, the model crashes again, and I cannot find any more helpful diagnostic output in the .out and .err files.

I have also been running u-ag541 with a 4-month periodic submission cycle. It ran two 4-month cycles without problems, but during the third cycle (starting 05/1989) it crashed with a similar error message, i.e. the IO service manager could not provide any logical units.

Could it be that a file close statement does not release the logical unit associated with the file (or, worse, that the file even stays open), so that every time the file is opened another logical unit is used up without ever being released? This could happen anywhere in the model code. I haven't changed the code myself apart from switching from free running to nudged configuration.

What would you suggest I could do next?

Regards,
Marcus

comment:4 Changed 3 years ago by willie

Hi Marcus,

If it takes 2:10 to do one month, then one year will take about 26 hours (12 × 2:10), which is more than the 24-hour wallclock limit.

It might be better to do 10-month chunks (about 21:40), so leave CLOCK at 24 hours and change RESUB to P10M.
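
The arithmetic above can be checked with a short Python sketch. It assumes a constant 2:10 per model month, as quoted in this comment; the helper name chunk_wallclock is made up for illustration and is not part of the suite.

```python
from datetime import timedelta


def chunk_wallclock(months: int, per_month: timedelta) -> timedelta:
    """Estimated wallclock time for a chunk of `months` model months."""
    return months * per_month


limit = timedelta(hours=24)                  # CLOCK='24:00:00'
per_month = timedelta(hours=2, minutes=10)   # assumed cost of one model month

for months in (12, 10):
    needed = chunk_wallclock(months, per_month)
    verdict = "fits" if needed <= limit else "exceeds the limit"
    print(f"{months} months: {needed.total_seconds() / 3600:.2f} h, {verdict}")
```

With these numbers a 12-month chunk needs 26 hours and a 10-month chunk needs about 21.67 hours, so only the latter fits into a 24-hour job.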

Willie

comment:5 Changed 3 years ago by marcus

Hi Willie,

Yes, I can understand that this would look odd. In fact the model needs between 1:50 and 2:05 hrs of wallclock time per month. Over a one-year integration the average falls below 2 hrs, so it should even out over time. I can adjust that if it doesn't work.

But at present the model crashes after a few months with the IO service manager error, so I will only find out whether 24 hrs is enough for 12 months once this is fixed.

Regards,
Marcus

comment:6 Changed 3 years ago by marcus

Hi, I am really stuck with this. I am trying to figure out how files are opened and closed with the file manager, but I am struggling to understand how this works and what's happening in the code.

A simple grep shows that get_file_unit is not used in many places in the model code. I find it being called in only 12 files:

./atmosphere/lbc_input/inbounda.F90
./atmosphere/radiation_control/lw_rad_input_mod.F90
./atmosphere/radiation_control/sw_rad_input_mod.F90
./utility/makebc/intf_ctl.F90
./control/top_level/up_bound.F90
./control/top_level/rdbasis.F90
./control/top_level/readhist.F90
./control/top_level/readlsta.F90
./control/top_level/wstlst.F90
./control/top_level/readcntl.F90
./control/misc/diagdesc.F90
./control/ancillaries/inancila.F90
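
A search like the one that produced this list could be sketched in Python as follows. This is only an illustration: the function name files_calling and the .F90 suffix filter are assumptions, and a plain substring match also picks up comments and the module that defines the routine.

```python
from pathlib import Path


def files_calling(root: str, name: str, suffix: str = ".F90") -> list[str]:
    """Return sorted paths of files under `root` whose text contains `name`."""
    hits = []
    for path in Path(root).rglob(f"*{suffix}"):
        text = path.read_text(errors="replace")
        # Fortran is case-insensitive, so compare in lower case.
        if name.lower() in text.lower():
            hits.append(str(path))
    return sorted(hits)
```

Calling files_calling(".", "get_file_unit") from the top of a source tree would list every .F90 file that mentions the routine.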

I have been trying to understand file_manager.F90, but the code has few comments.

In subroutine get_file_unit there is a logical switch which prohibits the re-use of a previously used unit number. But is this used anywhere?

comment:7 Changed 3 years ago by jeff

Hi Marcus

Routine get_file_unit is called by assign_file_unit, also in file_manager.F90. For a portio file handler unique=.TRUE., so unit numbers are not reused, which is why you eventually run out. In UM vn10.5 this restriction has been removed, so you could try making a branch that changes this to unique=.FALSE. I'm not sure why this restriction is there or whether it's even necessary.

Another way to get around the unit number problem is to change the line

INTEGER, PARAMETER :: end_unit_portio = 300

in file_manager.F90 and increase this value.
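
To illustrate why both options help, here is a toy Python model of a unit pool. It is not the UM implementation: the class UnitPool, the function opens_until_failure and the start unit of 10 are assumptions made for this sketch. Only the upper bound of 300 and the error text come from this ticket.

```python
class UnitPool:
    """Toy model of a pool of logical unit numbers (not the UM code)."""

    def __init__(self, start_unit: int, end_unit: int):
        self.free = set(range(start_unit, end_unit + 1))
        self.retired = set()  # units issued as unique, never handed out again

    def get_unit(self) -> int:
        if not self.free:
            raise RuntimeError("All units are in use, cannot provide unit")
        unit = min(self.free)
        self.free.remove(unit)
        return unit

    def release_unit(self, unit: int, unique: bool) -> None:
        # With unique=True a closed unit is retired instead of returned.
        if unique:
            self.retired.add(unit)
        else:
            self.free.add(unit)


def opens_until_failure(pool: UnitPool, unique: bool, max_opens: int) -> int:
    """Open and close one file at a time; return how many opens succeed."""
    for n in range(max_opens):
        try:
            unit = pool.get_unit()
        except RuntimeError:
            return n
        pool.release_unit(unit, unique)
    return max_opens


print(opens_until_failure(UnitPool(10, 300), unique=True, max_opens=1000))
print(opens_until_failure(UnitPool(10, 300), unique=False, max_opens=1000))
```

In this model the unique pool runs dry after 291 opens, while the reusing pool completes all 1000. Raising the upper bound only postpones the failure, whereas allowing reuse removes it.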

Jeff.

comment:8 Changed 3 years ago by marcus

Hi Jeff,

Thank you for this suggestion. I will create a new branch and see if this solves my problem.

Many thanks,
Marcus

comment:9 Changed 3 years ago by jeff

  • Status changed from new to pending

Hi Marcus

Did this help?

Jeff.

comment:10 Changed 3 years ago by marcus

Hi Jeff,

Yes, the model has now run for several six-month cycles without error. I have not managed to look at the model output yet, but the error has disappeared and the model seems to run fine with the unique=.FALSE. switch applied.

I think this ticket could be closed.

Many thanks for your help,
Marcus

comment:11 Changed 3 years ago by jeff

  • Resolution set to fixed
  • Status changed from pending to closed