Opened 4 years ago

Closed 4 years ago

#1758 closed help (answered)

Error in dump reconfiguration

Reported by: dilshadshawki Owned by: annette
Component: UM Model Keywords: reconfiguration, dump
Cc: Platform: MONSooN
UM Version: 8.4

Description

Hello Helpdesk,

Your regular UMUI user in need of help again!

After running the job xlzyb, (UKCA 8.4 job), I get the following error in the .rcf.leave file:

/home/dshawk/output/xlzyb000.xlzyb.d15336.t153517.rcf.leave
[NID 00157] 2015-12-02 15:51:09 Exec /projects/ukca-imp/dshawk/xlzyb/bin/qxreconf failed: chdir /work/scratch/jtmp/pbs.326348
.xcm00.x8z No such file or directory
/projects/ukca-imp/dshawk/xlzyb/bin/qsrecon: Error in dump reconfiguration - see OUTPUT

The directory /work/scratch/jtmp…. does not exist! I checked OUTPUT which is lower down but it does not give any useful information:

 ==============================================================================
 =================================== OUTPUT ===================================
 ==============================================================================

 UMUI Namelist output in /projects/ukca-imp/dshawk/xlzyb/xlzyb.umui.nl
 DATAW/DATAM file listing in /projects/ukca-imp/dshawk/xlzyb/xlzyb.list
 STASH output should be in /projects/ukca-imp/dshawk/xlzyb/xlzyb.stash

I ran the reconfiguration because I had switched off some diagnostics in the STASH section of the UM, as I thought you need to reconfigure whenever an ancillary or STASH diagnostic is changed.

Please help!

Dill

Change History (13)

comment:1 Changed 4 years ago by annette

  • Owner changed from um_support to annette
  • Status changed from new to assigned

Hi Dill,

I'm not sure what is happening here. Plus I can't seem to open your UMUI job - does that work OK for you?

It looks like you have run something else since the error listed above? Are you still getting the same problem?

Annette

comment:2 Changed 4 years ago by annette

  • Status changed from assigned to pending

comment:3 Changed 4 years ago by dilshadshawki

Hi Annette,

For a few minutes I wasn't able to open the job in the UMUI either, and I was getting a strange error message on the PUMA command line. I closed everything, reopened it, and now I can open the job in the UMUI.

Yes, I ran it again after removing many diagnostics from STASH, as it was complaining that too many diagnostics were being output whenever I clicked 'Verify Diagnostics'. But I still get an error, although it seems to be a different one:

/projects/ukca-imp/dshawk/xlzyb/bin/qsrecon[123]: cd: /work/scratch/jtmp/pbs.330999.xcm00.x8z: [No such file or directory]

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: Calc_nlookups
? Error Code:    10
? Error Message: Ancillary files have not been found - Check output for details
? Error generated from processor:     0
? This run generated   1 warnings
????????????????????????????????????????????????????????????????????????????????

This is from the rcf.leave file:

/home/dshawk/output/xlzyb000.xlzyb.d15338.t153623.rcf.leave

The rest of the error looks quite similar to another error I am getting when trying to run another job, but I will open a new ticket for that as it may be a separate issue.

Rank 0 [Fri Dec  4 15:51:50 2015] [c0-0c2s5n1] application called MPI_Abort(MPI_COMM_WORLD, 9) - process 0
Application 251537 is crashing. ATP analysis proceeding...

ATP Stack walkback for Rank 0 starting:
  [empty]@0x7fffffff07ff
  rcf_ancil$rcf_ancil_mod_@rcf_ancil_mod.f90:76
  rcf_ancil_atmos$rcf_ancil_atmos_mod_@rcf_ancil_atmos_mod.f90:661
  calc_nlookups$calc_nlookups_mod_@calc_nlookups_mod.f90:240
  ereport64$ereport_mod_@ereport_mod.f90:107
  gc_abort_@gc_abort.F90:136
  mpl_abort_@mpl_abort.F90:46
  pmpi_abort@0x668a2c
  MPI_Abort@0x686ea4
  MPID_Abort@0x6af361
  abort@abort.c:92
  raise@pt-raise.c:42
ATP Stack walkback for Rank 0 done
Process died with signal 6: 'Aborted'
Forcing core dumps of ranks 0, 1
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

_pmiu_daemon(SIGCHLD): [NID 00149] [c0-0c2s5n1] [Fri Dec  4 15:52:15 2015] PE RANK 3 exit signal Killed
/projects/ukca-imp/dshawk/xlzyb/bin/qsrecon: Error in dump reconfiguration - see OUTPUT
*****************************************************************
   Ending script   :   qsrecon
   Completion code :   137
   Completion time :   Fri Dec  4 15:52:15 GMT 2015
*****************************************************************

I am very confused as the job is simply a copy of another job (xlyoa) that does work and only the STASH diagnostics have been modified.

Best wishes,
Dill

comment:4 Changed 4 years ago by annette

Hi Dill,

If you look a bit further up in the leave file, there is an ancil file it can't find:

 Ancillary File does not exist.
 File : 200/AR5_aero_2000        

I still can't open your UMUI job but looking in your umui_jobs directory I can see the following in INITFILEENV

export USRANCIL=$UKCA_EMISS300200/AR5_aero_2000

I'd say that the path to this ancillary has become corrupted - I don't think the 300200 should be there. Correct this in the UMUI and see if that fixes it.

Annette

comment:5 Changed 4 years ago by dilshadshawki

Hi Annette

Something very strange is happening here. I decided to play with the directory of the ancil file, and no matter what I change it to (unless I avoid using the ancillary altogether), it always adds '300200' to the end of the path name.

I am baffled.

The ancil is in: Atmosphere → Ancillary & input data files → Climatologies and potential climatologies → User single-level ancillary file & fields

I even changed the directory name to /spongebob, but it still adds on 300200 when I process the job and check the INITFILEENV file you mentioned above.

What could be going on here?

Dill

comment:6 Changed 4 years ago by annette

Hi Dill,

I am not sure what is going on here either. Can you try a few things?

  • Make sure you have Num Lock off on your keyboard as that can add strange characters in the UMUI.
  • Can you successfully edit that field in another job at the same UM version?
  • It may be that the job has become corrupted. Download the basis file by clicking "Export", then create a new UMUI job and upload the basis file (click "Import", close the job, then re-open).

Annette

comment:7 Changed 4 years ago by dilshadshawki

Hi Annette,

I checked the job it was originally copied from, and I could change the directory there without any problems. I did as you said and exported the basis file and used it to create a new job, but the problem is still there when I process and check the INITFILEENV file.

I don't have a Num Lock on my keyboard, at the moment I am using a Mac, although I did create the job on a windows computer in my office originally.

I think I might just copy another job that doesn't have this problem and redo the STASH diagnostics, which is the main change anyway. I was just wondering if there was another way around this?

It is very strange, do you know why jobs can become corrupted like this?

Dill

comment:8 Changed 4 years ago by annette

Hi Dill,

Try editing the basis_file to remove the weird characters. Just search for "UKCA_EMISS" and you should find it. This works for me!

If it works in the test job, you could then upload it to your original job and see if that works.
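As a concrete illustration of that edit, the stray characters could also be stripped from the command line. This is only a sketch: the scratch file name `basis_test` is made up, and the corrupted line is copied from the INITFILEENV excerpt in comment:4.

```shell
# Recreate the corrupted line reported in comment:4 in a scratch file:
printf 'export USRANCIL=$UKCA_EMISS300200/AR5_aero_2000\n' > basis_test

# Remove the stray "300200" appended to the UKCA_EMISS variable name:
sed -i 's/UKCA_EMISS300200/UKCA_EMISS/' basis_test

cat basis_test
# → export USRANCIL=$UKCA_EMISS/AR5_aero_2000
```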

I haven't seen anything like this before, so I'm not sure what has happened.

Annette

comment:9 Changed 4 years ago by dilshadshawki

Hi Annette,

Editing the basis file and removing the characters that didn't seem to belong there worked: the UKCA_ANCIL path no longer has the weird 300200 added on.

However, I am still getting the original error:

/projects/ukca-imp/dshawk/xlzyb/bin/qsrecon[123]: cd: /work/scratch/jtmp/pbs.352215.xcm00.x8z: [No such file or directory]
aprun: -N cannot exceed -n
/projects/ukca-imp/dshawk/xlzyb/bin/qsrecon: Error in dump reconfiguration - see OUTPUT

Any ideas?

Cheers,
Dill

comment:10 Changed 4 years ago by annette

Hi Dill,

This is a different problem from before, and it has appeared because of the fix we put in to stop the slow-down.

In the UMUI, if you navigate to Reconfiguration → General Reconfiguration Options you will see that the reconfiguration is set to run on 2x2 pes. The error occurs because the aprun command now specifies to use 32 MPI tasks per node, so you would need to modify this to run with only 4 MPI tasks in total.

The best thing to do, however, is increase the number of reconfiguration pes to either the same as the atmos model (12x16) or at least 32 (eg 4x8).
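To make the mismatch concrete, the arithmetic can be sketched as below. The variable names are illustrative, not from the UM scripts; the point is simply that aprun refuses to place more tasks per node (-N) than there are tasks in total (-n).

```shell
# Reconfiguration decomposition from the UMUI: 2x2 pes = 4 MPI tasks in total.
RCF_EW=2
RCF_NS=2
TASKS_PER_NODE=32          # now forced by the aprun command (see comment:10)

TOTAL=$((RCF_EW * RCF_NS))
if [ "$TASKS_PER_NODE" -gt "$TOTAL" ]; then
    # This is the situation behind "aprun: -N cannot exceed -n".
    echo "aprun would fail: -N ($TASKS_PER_NODE) exceeds -n ($TOTAL)"
fi

# A 4x8 decomposition gives 32 tasks, which satisfies the constraint:
TOTAL=$((4 * 8))
echo "total pes: $TOTAL"
```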

Annette

comment:11 Changed 4 years ago by dilshadshawki

Hi Annette,

So I changed the reconfiguration to run on 12x16 pes and it failed as it exceeded the walltime. I then ran it again, this time setting the reconfiguration to run on 4x8 pes, and it got past the reconfiguration stage but failed at the running stage:

/home/dshawk/output/xlzyb000.xlzyb.d15350.t112056.leave
Try `basename --help' for more information.
basename: missing operand
Try `basename --help' for more information.
basename: missing operand
Try `basename --help' for more information.
View application merged backtrace tree with: stat-view atpMergedBT.dot
You may need to: module load stat

basename: missing operand
Try `basename --help' for more information.
_pmiu_daemon(SIGCHLD): [NID 00142] [c0-0c2s3n2] [Wed Dec 16 11:34:36 2015] PE RANK 128 exit signal Killed
_pmiu_daemon(SIGCHLD): [NID 00141] [c0-0c2s3n1] [Wed Dec 16 11:34:36 2015] PE RANK 108 exit signal Killed
[NID 00141] 2015-12-16 11:34:36 Apid 275146: initiated application termination
basename: missing operand
Try `basename --help' for more information.
basename: missing operand
Try `basename --help' for more information.
xlzyb: Run failed

This seems to be similar to the error before. Is there anything else I could try?

Cheers,
Dill

comment:12 Changed 4 years ago by annette

Dill,

This looks like a model instability again. The basename error is a bit of a red herring (it's a harmless script error, I think). If you scroll to the bottom of the leave file you can see that there is a NaN in the scientific output:

 Minimum theta level 1 for timestep  25
                This timestep                         This run
   Min theta1     proc          position            Min theta1 timestep
      271.81     109 -1331.3deg W    -583.8deg S       224.76     7
  Largest negative delta theta1 at minimum theta1 
 This timestep =      NaNK. At min for run =    -4.31K

What I would suggest is:

  • Carefully check the start dump looks OK, as well as any ancillary files you have recently added or edited. You can cumf the file with itself to see if it contains NaNs (I have already done this for your start file and it seems OK).
  • You can write dumps at every timestep, which can help track where the error has appeared. In Atmosphere → Control → Post-processing, Dumping and Meaning switch on "Irregular dumps", select "Next", then edit the table to create dumps for a few of the time-steps right before the model crashes.
  • Reduce the time-step length.
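As a generic illustration of why comparing a file with itself exposes NaNs (cumf is the UM comparison utility; the Python below is only a stand-in with made-up field values): NaN is the one value that compares unequal to itself, so a self-comparison flags exactly the NaN points.

```python
import math

# Made-up theta values standing in for a field read from a dump;
# in practice cumf compares the UM fieldsfile with itself.
field = [271.81, 265.40, float("nan"), 270.20]

# NaN is the only value for which v != v, so a self-comparison finds it:
nan_points = [i for i, v in enumerate(field) if v != v]
assert nan_points == [i for i, v in enumerate(field) if math.isnan(v)]

print("NaNs found at points:", nan_points)
# → NaNs found at points: [2]
```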

Annette

comment:13 Changed 4 years ago by annette

  • Resolution set to answered
  • Status changed from pending to closed

Closing ticket due to lack of activity.

Annette
