Opened 7 years ago

Closed 7 years ago

#942 closed help (fixed)

Modification of the UM output

Reported by: cplanche Owned by: willie
Component: UM Model Keywords: diagnostics, UM, nesting
Cc: Platform: MONSooN
UM Version: 7.7

Description

Hello,

I have already run a nesting job (xhjn) with some diagnostics. Now I would like to modify the diagnostics, deleting and/or adding some. My new job is xhrf.
To do this, I modified the STASH window in the UMUI. However, with this new STASH my job now fails. Did I delete important diagnostics, or, since this was the first time I made such changes, did I not define the new diagnostics correctly?

Many thanks.

Regards,
Celine

Change History (14)

comment:1 follow-up: Changed 7 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Celine,

Which job specifically?

regards

Willie

comment:2 in reply to: ↑ 1 Changed 7 years ago by cplanche

Hi Willie,

I modified the STASH windows of all the forecast jobs (xhrfd, f, h and j) in the same way, except for the global-scale one.
Also, the diagnostics in the Convection section differ between jobs, since the xhrfh and xhrfj jobs do not use the convection parameterization scheme.
Thanks.

Regards,

Celine

comment:3 Changed 7 years ago by willie

Hi Celine,

Focusing on xhrfd, it appears to have completed successfully. At what point is the failure occurring?

Regards,

Willie

comment:4 Changed 7 years ago by cplanche

Hi Willie,

The job which failed is xhrff. Here is the error message visible in the *.leave file. It shows the problem comes from the large-scale rain diagnostics, but I don't know why.


Version 7.7 template, Unified Model , Non-Operational
Created by UMUI version 7.7

*
Host is c02c13n06-hf0
Creating directory /scratch/cplanc/xhrff
PATH used = /bin:/usr/bin:/critical/opt/ukmo/mass/moose-monsoon-batch-node-client/bin/ibm-cn:/usr/bin:/etc:/usr/sbin:/usr/bin/X11:/sbin:/usr/java5/jre/bin:/usr/java5/bin:/critical/opt/ukmo/supported/bin:/critical/opt/ukmo/freeware/bin:/opt/freeware/bin:/opt/ukmo/idl/ukmo/bin:/projects/um1/bin:/opt/ukmo/monsoon/bin:/projects/um1/vn7.7/ibm/utils:/projects/asci/cplanc//CSIP6_120702/xhrf//xhrff/bin:/projects/um1/vn7.4/ibm/prebuilds/lam_high_noreprod/bin:/projects/um1/vn7.7/ibm/scripts:/projects/um1/vn7.7/ibm/exec
*

Job started at : Fri Oct 26 15:55:07 GMT 2012
Run started from UMUI
Running from control files in /home/cplanc/umui_runs/xhrff-300155507

4km fcst
This job is running on machine c02c13n06-hf0,
using UM directory /projects/um1,
*

Starting script : qsexecute
Starting time : Fri Oct 26 15:55:08 GMT 2012

*

/projects/um1/vn7.7/ibm/scripts/qsexecute: Executing setup

/projects/um1/vn7.7/ibm/scripts/qssetup: Job terminated normally

/projects/um1/vn7.7/ibm/scripts/qsexecute: Executing model run

*
UM Executable : /home/pfield/xgxbf/ummodel/bin/xgxbf.exec
*

Signal received: SIGFPE - Floating-point exception

Signal generated for floating-point exception:

FP invalid operation

Instruction that generated the exception:

fsub fr05,fr02,fr03
Source Operand values:

fr02 = 6.32510197547516e-01
fr03 = NaNS

Traceback:

Offset 0x0000191c in procedure diagnostics_lsrain_
Offset 0x000099b4 in procedure microphys_ctl_
Offset 0x0000442c in procedure atmos_physics1_
Offset 0x0000ef94 in procedure atm_step_
Offset 0x0008396c in procedure u_model_
Offset 0x00002340 in procedure um_shell_
Offset 0x00000090 in procedure flumemain
---- End of call chain ----

ERROR: 0031-300 Forcing all remote tasks to exit due to exit code 1 in task 0
ERROR: 0031-250 task 1: Terminated
ERROR: 0031-250 task 3: Terminated
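
The "FP invalid operation" in the traceback is the hardware trap raised when an arithmetic instruction hits an exceptional operand (here, the fsub instruction consumed a signalling NaN in fr03). Purely as an illustration of the same IEEE trap mechanism in software (this is not UM code), numpy can be asked to raise on invalid operations:

```python
import numpy as np

# Ask numpy to raise on IEEE "invalid" operations, mimicking the
# hardware SIGFPE trap that killed the UM run. inf - inf is one of
# the operations that sets the invalid flag.
with np.errstate(invalid="raise"):
    try:
        np.subtract(np.array([np.inf]), np.array([np.inf]))
    except FloatingPointError as err:
        print("trapped invalid FP operation:", err)
```

Note that ordinary (quiet) NaN arithmetic propagates silently; it is the signalling NaN in the register dump above that triggered the trap in the model.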

Many thanks.

Regards,
Celine

comment:5 Changed 7 years ago by willie

Hi Celine,

The model crashes in the first time step: "GCR(2) failed to converge in 200 iterations". This may indicate a problem with the initial data. I would compare the start dump with itself using cumf: since a NaN never compares equal to itself, any differences reported in the summary file must be due to NaNs in the start dump.
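
The reason a dump compared with itself can reveal NaNs is that NaN is the only value for which x != x is true. A minimal Python illustration of the idea (this is not cumf itself; the sample data is made up):

```python
def find_nans(field):
    """Return indices of NaN values, using the fact that NaN != NaN."""
    return [i for i, v in enumerate(field) if v != v]

# A field containing one NaN: only the self-unequal value is flagged.
sample = [0.6325, float("nan"), 1.0]
print(find_nans(sample))  # -> [1]
```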

If you do a check setup in the UMUI, some STASH errors are reported. These ought to be investigated and corrected.

Regards

Willie

comment:6 Changed 7 years ago by cplanche

Hi Willie,

I corrected all the STASH errors reported by the check setup in the UMUI and resubmitted the job, but it still fails with the same error in the leave file.

If it is a problem with the start dump, why didn't this problem appear in the first study, which had fewer diagnostics?

Thanks.

Regards,

Celine

comment:7 Changed 7 years ago by willie

  • Platform set to MONSooN

Hi Celine,

My mistake: although it does not converge on the first time step, it does recover and carries on for 251 time steps, at which point a NaN appears in diagnostics_lsrain. To diagnose further it will be necessary to rebuild the executable (pfield's xgxbf) with the "flush print buffer if run fails" option on (it's in Section 13 > Diagnostic prints) and then repeat this particular run. We need to get a core file that can be examined in the debugger.

I have checked all the start dumps, ancillaries and LBCs for NaNs, and they are OK.

Regards,

Willie

comment:8 Changed 7 years ago by cplanche

Hello Willie,

My problem is that I tried to rebuild the executable (xhwnf) with your option, but the build failed. The error message in the *.leave file is:

ld: 0706-006 Cannot find or open library file: -l netcdf
ld:open(): No such file or directory
fcm_internal load failed (65280)
# Time taken: 1 s
=> mpxlf90_r -o xhwnb.exec /home/cplanc/xhwnb/ummodel/obj/flumemain.o /home/cplanc/xhwnb/ummodel/obj/blkdata.o -L/home/cplanc/xhwnb/ummodel/lib -L/home/cplanc/xhwnb/umbase/lib -lfcmxhwnb -lmass -lmassvp6 -qsmp=omp -L/projects/um1/gcom/gcom3.6/meto_ibm_pwr6_mpp/lib -lgcom -L/projects/um1/lib -lsig -L/usr/local/netcdf3.20090102/lib64 -lnetcdf -L/projects/um1/lib -lgrib
gmake: *** [xhwnb.exec] Error 1
# Time taken: 842 s
=> gmake -f /home/cplanc/xhwnb/ummodel/Makefile -j 4 all
gmake -f /home/cplanc/xhwnb/ummodel/Makefile -j 4 all failed (2) at /projects/um1/fcm/bin/../lib/Fcm/Build.pm line 598
cd /home/cplanc
Build failed on Wed Nov 14 17:50:20 2012.
->Make: 842 seconds
->TOTAL: 999 seconds
ATM build failed

Could this be an FCM link problem arising from when I copied pfield's xgxb experiment into my UMUI?

Many thanks.

Cheers,
Celine

comment:9 Changed 7 years ago by willie

Hi Celine,

There are some modifications required due to the recent upgrade of MONSooN - see Getting started on the MONSooN Phase 2 HPC. There's a bit about the netcdf library halfway down.

Regards,

Willie

comment:10 Changed 7 years ago by cplanche

Hi Willie,

I rebuilt new executables with the "flush print buffer if run fails" option on (xhwn expt) and repeated the job which failed (xhrff). I hope everything is now in place to examine where the problem comes from.
Many thanks.

Cheers,
Celine

comment:11 Changed 7 years ago by willie

Hi Celine,

Did you run this directly or via the nesting suite? If the latter, and you used the one in csip_template_vn7.7, then you probably need to change to:

execid=xhwn
execdir=~cplanc

There is no core file at the moment - I was hoping to get a better idea of the problem from this.

Regards,

Willie

comment:12 Changed 7 years ago by cplanche

Hello Willie,

I changed

 execid=xhwn
 execdir=/home/cplanc/EXECS

in the expt_details hand edit on PUMA and then submitted the job using the nssubmit script on MONSooN (~cplanc/um_nesting/vn7.7/).

To find the problem, I am resubmitting the job with the diagnostics I added turned off one by one. This should help us identify which diagnostic is causing the crash.
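
When there are many added diagnostics, a binary search over the added list can cut the number of reruns from N to about log2(N). A hypothetical sketch of the bookkeeping (the diagnostic names and the run_ok check below are placeholders, and it assumes a single culprit; with two interacting diagnostics, one-by-one elimination as described above is more reliable):

```python
def find_bad_diagnostic(diags, run_ok):
    """Binary-search for the first diagnostic whose inclusion makes
    run_ok(enabled_prefix) return False. Assumes the empty prefix
    succeeds and the full list fails."""
    lo, hi = 0, len(diags)
    while hi - lo > 1:
        mid = (lo + hi) // 2
        # Enable only the first `mid` added diagnostics and re-run.
        if run_ok(diags[:mid]):
            lo = mid   # that prefix is fine; culprit comes later
        else:
            hi = mid   # culprit lies within the first `mid`
    return diags[lo]

# Example: pretend the run fails whenever the graupel increment is on.
added = ["lsrain_rate", "graupel_incr", "qcf2_incr", "cloud_frac"]
bad = find_bad_diagnostic(added, lambda subset: "graupel_incr" not in subset)
print(bad)  # -> graupel_incr
```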
Many thanks.

Regards,
Celine

comment:13 Changed 7 years ago by cplanche

Hello Willie,

It seems that the diagnostics causing the crash are the graupel and qcf2 increments (section 4, items 190/191), but I still don't know why.

Regards,
Celine

comment:14 Changed 7 years ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed
Note: See TracTickets for help on using tickets.