Opened 5 years ago

Closed 5 years ago

#1419 closed help (fixed)

Model crash relating to .xhist / .thist files

Reported by: James
Owned by: annette
Component: UM Model
Keywords: unit 12, UKCA, file open
Cc:
Platform: ARCHER
UM Version: 7.3

Description

I'm running UM-UKCA vn7.3 (JobID: xkpib) and the model's falling over with the following error message written to the .leave file.

----

sys-2 : UNRECOVERABLE error on system request

No such file or directory

Encountered during an OPEN of unit 12
Fortran unit 12 is not connected
_pmiu_daemon(SIGCHLD): [NID 00484] [c2-0c1s9n0] [Thu Dec 11 03:35:10 2014] PE RANK 0 exit signal Aborted
[NID 00484] 2014-12-11 03:35:10 Apid 12200682: initiated application termination
diff: /work/n02/n02/jgl22/tmp/tmp.mom4.9816/xkpib.xhist: No such file or directory
qsexecute: Copying /work/n02/n02/jgl22/um/xkpib/xkpib.thist to backup thist file /work/n02/n02/jgl22/um/xkpib/xkpib.thist_keep
xkpib: Run failed

----

It seems the model's compiling OK but falling over close to the start of the run, and I'm at a bit of a loss as to what's wrong.

Luke Abraham kindly took a look at the output with me and suggested I raise a ticket. I'd really appreciate your help.

Thanks,

James

Change History (9)

comment:1 Changed 5 years ago by annette

Hi James,

Can you change the permissions on your directories please?

chmod -R g+rX /home/n02/n02/jgl22
chmod -R g+rX /work/n02/n02/jgl22

Then let us know the full path and file name of your .leave file.

Thanks,
Annette

comment:2 Changed 5 years ago by annette

Comment from James:

I've changed the permissions, and the path/name of the .leave file is as follows:

/home/n02/n02/jgl22/output/xkpib000.xkpib.d14344.t124349.leave

Many thanks - really appreciate your help,

James

comment:3 Changed 5 years ago by James

Hi,

I understand that Annette passed this on to one of her colleagues ahead of leave around mid December, and was just wondering if there'd been any progress?

Many thanks,

James

comment:4 Changed 5 years ago by annette

  • Owner changed from um_support to annette
  • Status changed from new to assigned

Hi James,

Go to Input/Output Control → Script Inserts and Modifications and add the following environment variable to the table: ATP_ENABLED, with value 1.

Then re-run and this should provide a stack trace to help hunt down where the model crashed.

Annette

comment:5 Changed 5 years ago by James

Thanks Annette,

I've added the environment variable and resubmitted the job:

The job directory on host login.archer.ac.uk is:

/home/n02/n02/jgl22/umui_runs/xkpib-013163017

The compilation output will be sent to file:

/home/n02/n02/jgl22/output/xkpib000.xkpib.d15013.t163027.comp.leave

The model output will be sent to file:

/home/n02/n02/jgl22/output/xkpib000.xkpib.d15013.t163027.leave

Really appreciate your help Annette,

James

comment:6 Changed 5 years ago by annette

Hi James,

Just for future reference, you don't need to recompile the UM to get the stack trace.

Take a look at the output yourself to see what is happening. Near the top of the leave file is the following error message:

sys-2 : UNRECOVERABLE error on system request
  No such file or directory

And below this is the call path of the routines that were executing when the model failed (and so produced this error):

ATP Stack walkback for Rank 0 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  flumemain_@flumeMain.f90:38
  um_shell_@um_shell.f90:3817
  u_model_@u_model.f90:5505
  ukca_main1_@ukca_main1-ukca_main1.f90:7279
  ukca_read_aerosol_@ukca_read_aerosol.f90:469
  _OPEN@0x100c30d
  __OPN@0x100c0bc
  _f_open@0x1009f34
  _ferr@0x1005bfa
  abort@abort.c:92
  raise@pt-raise.c:42
ATP Stack walkback for Rank 0 done
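
As a rough guide to reading a walkback like this: the frame of interest is usually the deepest one that still names a Fortran source file, since the frames below it (the ones shown only as addresses, plus abort/raise) are runtime library internals. A minimal sketch of picking that frame out automatically, assuming the frames follow the `routine@source:line` layout shown above; the function name `deepest_fortran_frame` is made up for illustration and is not part of ATP or the UM:

```python
import re

# Frame layout assumed from the walkback above: "  routine@source:line".
# Frames given only as addresses (e.g. "_OPEN@0x100c30d") do not match.
FRAME_RE = re.compile(
    r"^\s*(?P<routine>\S+?)@(?P<source>[^:\s]+):(?P<line>\d+)\s*$"
)


def deepest_fortran_frame(walkback_text):
    """Return (routine, source, line) for the last frame that names a
    .f90 file, or None if no such frame is present."""
    found = None
    for raw in walkback_text.splitlines():
        match = FRAME_RE.match(raw)
        if match and match.group("source").lower().endswith(".f90"):
            # Later frames are deeper in the call chain, so keep overwriting.
            found = (
                match.group("routine"),
                match.group("source"),
                int(match.group("line")),
            )
    return found


# The walkback from this ticket, copied verbatim.
trace = """\
ATP Stack walkback for Rank 0 starting:
  _start@start.S:113
  __libc_start_main@libc-start.c:242
  flumemain_@flumeMain.f90:38
  um_shell_@um_shell.f90:3817
  u_model_@u_model.f90:5505
  ukca_main1_@ukca_main1-ukca_main1.f90:7279
  ukca_read_aerosol_@ukca_read_aerosol.f90:469
  _OPEN@0x100c30d
  __OPN@0x100c0bc
  _f_open@0x1009f34
  _ferr@0x1005bfa
  abort@abort.c:92
  raise@pt-raise.c:42
ATP Stack walkback for Rank 0 done
"""

print(deepest_fortran_frame(trace))
# prints ('ukca_read_aerosol_', 'ukca_read_aerosol.f90', 469)
```

Here that points at line 469 of ukca_read_aerosol.f90, which is the call that leads into the failing OPEN.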

The file ukca_read_aerosol.f90 can be found in the compilation directory for the job:

~jgl22/um/xkpib/ummodel/ppsrc/UM/atmosphere/UKCA

Looking at the code, the routine is trying to open either Sulfate_SAD_SPARC_1950-2100.asc or Sulfate_SAD_SPARC_Background.asc from directory:

/work/n02/n02/luke/DATA/QESM/

This directory, however, doesn't exist on ARCHER. I have emailed Luke about this.
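
If you want to confirm this kind of problem yourself before resubmitting, you can test the hard-coded paths directly on the machine. A minimal sketch, using the directory and the two file names read from the code above; `missing_inputs` is a hypothetical helper written for this example, not something provided by the UM:

```python
from pathlib import Path


def missing_inputs(directory, filenames):
    """Return the full paths of any files in `filenames` that are not
    present as regular files under `directory`."""
    base = Path(directory)
    return [str(base / name) for name in filenames if not (base / name).is_file()]


# Directory and file names as read from ukca_read_aerosol.f90 in this ticket.
data_dir = "/work/n02/n02/luke/DATA/QESM/"
candidates = [
    "Sulfate_SAD_SPARC_1950-2100.asc",
    "Sulfate_SAD_SPARC_Background.asc",
]

for path in missing_inputs(data_dir, candidates):
    print("missing:", path)
```

Any path it prints is one the model would fail to open, giving the same "No such file or directory" error on the corresponding unit.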

Annette

comment:7 Changed 5 years ago by annette

Hi James,

Luke has replied that the files are in:

/work/n02/n02/ukca/ANCILS/QESM/

So you just need to edit the appropriate line in your branch:
https://puma.nerc.ac.uk/trac/UM/browser/UM/branches/dev/james/vn7.3_CheT2_Base/src/atmosphere/UKCA/ukca_read_aerosol.F90?rev=17394

Hopefully this makes sense.

Annette

comment:8 Changed 5 years ago by James

That's absolutely fantastic Annette, and Luke - many thanks to you both!

I've changed the relevant line in the code and will try rerunning.

Really appreciate your help,

James

comment:9 Changed 5 years ago by annette

  • Keywords 12, UKCA, file open added; xhist thist 12 removed
  • Resolution set to fixed
  • Status changed from assigned to closed