Opened 3 years ago

Closed 3 years ago

#2106 closed help (answered)

UKCA vn7.3 Simple Trop Archive fail

Reported by: ajamieson Owned by: um_support
Component: UKCA Keywords:
Cc: Platform: <select platform>
UM Version: 7.3

Description

Trying to use the Simple Troposphere scheme at vn7.3. Job id is xmxwa.

Extract works and .leave says job runs and fails within 3 seconds. On the umui I get the following when checking setup:

umui: xmxwa: Errors and Warnings

Errors will be output in this window
List Check Error in window subindep_PostProc_Gen
Variable: FF2PP_HECTOR

→ Model Selection

→ Post Processing

→ Main Switch + General Questions

Then $DATAM/archive doesn't get created.

Output from .leave file suggests the following is the error with the job run:

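/work/n02/n02/acjam/um/xmxwa/bin/qsmaster: Failed in qsexecute in model xmxwa
*****************************************************************
   Starting script :   qsfinal
   Starting time   :   Tue Mar 14 15:18:13 GMT 2017
*****************************************************************

/work/n02/n02/acjam/um/xmxwa/bin/qsfinal: Model xmxwa - Error: No history files
*****************************************************************
   Ending script   :   qsfinal
   Completion code :   135
   Completion time :   Tue Mar 14 15:18:13 GMT 2017
*****************************************************************

/work/n02/n02/acjam/um/xmxwa/bin/qsmaster: failed in final in model xmxwa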

<<<< Information about How Many Lines of Output follow

I haven't seen this issue before, including when running the tutorials.

Any help would be great. Thanks

Change History (8)

comment:1 Changed 3 years ago by luke

Hi Andrew,

You've made quite a few changes to the standard job xfvfd. Can I first ask you to take a fresh copy of this job, make only the changes needed for it to run for you (i.e. TIC code, username, extract directories etc.) without any science changes, and then see whether it runs? This will show whether you are unable to run any job at all, or whether the problems come from the changes you have made.

Also, I would not suggest changing the chemistry scheme. While the standard trop scheme sounds appealing because it has fewer tracers and less chemistry, it is in fact very different, and its convergence settings have been tuned for a low-top model, not a high-top one as we have here. Significant work is required to ensure it produces sensible numbers. I had to increase the number of iterations and lower the chemical timestep, and even then I wasn't happy with it at all. Please keep this as TropIsop.

Also, while you've changed the start year and ancillary reference time, these won't actually change any of the forcings in the model as these are all fixed for the year 2000. However, they could be causing issues elsewhere. Can I ask why you want to make these changes in this way?

Thanks,
Luke

comment:2 Changed 3 years ago by ajamieson

Hi Luke,

I copied the job a couple of weeks ago and hadn't realised I had changed so much while exploring the UMUI options. I changed the start year because I'm attempting to run for 50 years and I'm not sure how the model interacts with the netCDF files given to it.

I have rerun a copy of xfvfd with no changes. Here is the output from the .leave file:

*********************************************************
RCF Executable : /work/n02/n02/acjam/um/xmxwb/bin/qxreconf
*********************************************************

/work/n02/n02/acjam/um/xmxwb/bin/qsexecute[415]: aprun: not found [No such file or directory]
/work/n02/n02/acjam/um/xmxwb/bin/qsexecute: Error in dump reconfiguration - see OUTPUT
*****************************************************************
   Ending script   :   qsexecute
   Completion code :   127
   Completion time :   Wed Mar 15 17:07:23 GMT 2017
*****************************************************************

/work/n02/n02/acjam/um/xmxwb/bin/qsmaster: Failed in qsexecute in model xmxwb
*****************************************************************
   Starting script :   qsfinal
   Starting time   :   Wed Mar 15 17:07:23 GMT 2017
*****************************************************************

/work/n02/n02/acjam/um/xmxwb/bin/qsfinal: Model xmxwb - Error: No history files
*****************************************************************
   Ending script   :   qsfinal
   Completion code :   135
   Completion time :   Wed Mar 15 17:07:23 GMT 2017
*****************************************************************

/work/n02/n02/acjam/um/xmxwb/bin/qsmaster: failed in final in model xmxwb

Could it be something to do with my .profile file? That is the only thing I've done differently from the UKCA Setting up page.

Thanks for your help and quick response, I really appreciate it.
Andrew

Last edited 3 years ago by ros

comment:3 Changed 3 years ago by ros

Hi Andrew,

Try adding the following line to the top of your ~/.profile on ARCHER.

. /etc/bash.bashrc > /dev/null 2>&1

Cheers,
Ros.
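
The `aprun: not found` message in comment:2 suggests that the shell running qsexecute could not find aprun on its PATH. Below is a minimal sketch for checking this from a login shell on the machine in question. The helper names `have_cmd` and `report_cmd` are hypothetical and not part of the UM scripts.

```shell
# Hypothetical helper: succeed if the named command can be found on PATH.
have_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: report whether a command is visible to the current shell.
report_cmd() {
    if have_cmd "$1"; then
        echo "$1: found"
    else
        echo "$1: not found"
    fi
}

# Check the launcher named in the error message.
report_cmd aprun
```

Running this before and after editing ~/.profile should show whether the change made aprun visible, assuming the check is run in the same kind of shell that the job uses.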

comment:4 Changed 3 years ago by ajamieson

Hi Ros,

This seems to have done it. I'll carry on making the scientific changes now.

Thanks for your help, you too Luke.
Andrew

comment:5 Changed 3 years ago by ajamieson

Hi,

So I've made my changes and I'm still getting error messages for qsexecute. When I run the release job I also get an error in qsexecute, but the two errors are different. xmxwa is my job and xmxwb is the standard release.

The following is the .leave for xmxwa. I believe this might be due to my ancillary file interfering with the reconfiguration, but I'm not sure how to check that:

_pmiu_daemon(SIGCHLD): [NID 00101] [c0-0c1s9n1] [Sun Apr  2 20:57:57 2017] PE RANK 4 exit signal Segmentation fault
[NID 00101] 2017-04-02 20:57:57 Apid 26027480: initiated application termination
/work/n02/n02/acjam/um/xmxwa/bin/qsexecute: Error in dump reconfiguration - see OUTPUT
*****************************************************************
   Ending script   :   qsexecute
   Completion code :   139
   Completion time :   Sun Apr  2 20:57:58 BST 2017
*****************************************************************

/work/n02/n02/acjam/um/xmxwa/bin/qsmaster: Failed in qsexecute in model xmxwa
*****************************************************************
   Starting script :   qsfinal
   Starting time   :   Sun Apr  2 20:57:59 BST 2017
*****************************************************************

/work/n02/n02/acjam/um/xmxwa/bin/qsfinal: Model xmxwa - Error: No history files
*****************************************************************
   Ending script   :   qsfinal
   Completion code :   135
   Completion time :   Sun Apr  2 20:57:59 BST 2017
*****************************************************************

/work/n02/n02/acjam/um/xmxwa/bin/qsmaster: failed in final in model xmxwa

This block is the .leave for xmxwb:

diff: /work/n02/n02/acjam/tmp/tmp.mom3.14140/xmxwb.xhist: No such file or directory
qsexecute: Copying /work/n02/n02/acjam/um/xmxwb/xmxwb.thist to backup thist file /work/n02/n02/acjam/um/xmxwb/xmxwb.thist_keep
xmxwb: Run failed
*****************************************************************
   Ending script   :   qsexecute
   Completion code :   137
   Completion time :   Sun Apr  2 21:02:24 BST 2017
*****************************************************************

/work/n02/n02/acjam/um/xmxwb/bin/qsmaster: Failed in qsexecute in model xmxwb
*****************************************************************
   Starting script :   qsfinal
   Starting time   :   Sun Apr  2 21:02:24 BST 2017
*****************************************************************

qsfinal: thist file copied to /work/n02/n02/acjam/um/xmxwb/xmxwb.thist.14441
/work/n02/n02/acjam/um/xmxwb/bin/qsfinal: Error in exit processing after model run
Failed in model executable

 STOP
/work/n02/n02/acjam/um/xmxwb/bin/qspickup: Normal completion
 STOP
/work/n02/n02/acjam/um/xmxwb/bin/qshistprint: Job terminated normally
/work/n02/n02/acjam/um/xmxwb/bin/qsresubmit: Error job not resubmitted because of error in qsexecute
*****************************************************************
   Ending script   :   qsfinal
   Completion code :   0
   Completion time :   Sun Apr  2 21:02:25 BST 2017
*****************************************************************

It seems to be requesting history files again, but I'm not familiar with what they are, so any help would be fantastic. For reference, job id xmxwc and xmxwc are a and b changed to reflect a potential solution given in ticket #1783.

Thanks a lot, I really appreciate all your help.
Andrew
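
The completion codes in the logs above (127, 139, 137) can be read with the usual shell conventions: 127 means a command was not found, and values above 128 normally mean the process was ended by signal number (code - 128). The sketch below decodes a code on that basis. The helper name `decode_status` is hypothetical, and the UM scripts may set some completion codes themselves, so treat the result as a hint only.

```shell
# Hypothetical helper: describe a completion code using common shell conventions.
decode_status() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "success"
    elif [ "$code" -eq 126 ]; then
        echo "command found but not executable"
    elif [ "$code" -eq 127 ]; then
        echo "command not found"
    elif [ "$code" -gt 128 ]; then
        sig=$((code - 128))
        # kill -l maps a signal number to its name; fall back if it is unknown.
        name=$(kill -l "$sig" 2>/dev/null || echo "unknown")
        echo "terminated by signal $sig ($name)"
    else
        echo "application-defined error $code"
    fi
}

# Codes taken from the .leave output in this ticket.
decode_status 127
decode_status 139
```

Read this way, 127 agrees with the `aprun: not found` message and 139 agrees with the reported segmentation fault.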

comment:6 Changed 3 years ago by ajamieson

The following block was above the error code for the standard release job (my copy of it is xmxwb).

*********************************************************
UM Executable : /work/n02/n02/acjam/um/xmxwb/bin/xmxwb.exe
*********************************************************

Rank 0 [Mon Apr  3 12:06:57 2017] [c5-2c1s6n3] Fatal error in PMPI_Allgatherv: Invalid buffer pointer, error stack:
PMPI_Allgatherv(1235): MPI_Allgatherv(sbuf=0x7ffffaae0090, scount=14, dtype=0x4c000829, rbuf=0x7ffffaae0090, rcounts=0x63f4880, displs=0x63f4840, dtype=0x4c000829, comm=0x84000004) failed

........

Rank 68 [Mon Apr  3 12:06:57 2017] [c5-2c1s7n1] Fatal error in PMPI_Allgatherv: Invalid buffer pointer, error stack:
PMPI_Allgatherv(1235): MPI_Allgatherv(sbuf=0x7ffffa803aa0, scount=16, dtype=0x4c000829, rbuf=0x7ffffa803950, rcounts=0x63f46c0, displs=0x63f4680, dtype=0x4c000829, comm=0x84000002) failed
PMPI_Allgatherv(1183): Buffers must not be aliased. Consider using MPI_IN_PLACE or setting MPICH_NO_BUFFER_ALIAS_CHECK
_pmiu_daemon(SIGCHLD): [NID 04123] [c5-2c1s6n3] [Mon Apr  3 12:06:57 2017] PE RANK 4 exit signal Aborted
_pmiu_daemon(SIGCHLD): [NID 04125] [c5-2c1s7n1] [Mon Apr  3 12:06:57 2017] PE RANK 64 exit signal Aborted
[NID 04123] 2017-04-03 12:06:57 Apid 26030339: initiated application termination
diff: /work/n02/n02/acjam/tmp/tmp.mom3.1387/xmxwb.xhist: No such file or directory
qsexecute: Copying /work/n02/n02/acjam/um/xmxwb/xmxwb.thist to backup thist file /work/n02/n02/acjam/um/xmxwb/xmxwb.thist_keep
xmxwb: Run failed
*****************************************************************
   Ending script   :   qsexecute
   Completion code :   137
   Completion time :   Mon Apr  3 12:07:01 BST 2017
*****************************************************************

I've tried a few new copies of this job using different processor arrangements to see if this is the problem (as suggested by the help button on the UMUI), but these are still queuing.

Thanks,
Andrew
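
As comment:6 shows, the message that explains a failure can sit well above the final completion banner in the .leave file. Below is a minimal sketch for pulling out the first lines that look like errors. The helper name `first_errors` and the list of patterns are assumptions chosen to match the messages quoted in this ticket, not a complete list.

```shell
# Hypothetical helper: print the first few error-like lines of a .leave file,
# each prefixed with its line number.
first_errors() {
    leave_file=$1
    max_lines=${2:-5}   # optional second argument: how many matches to show
    grep -n -i -E 'fatal error|segmentation fault|not found|no such file|error in' \
        "$leave_file" | head -n "$max_lines"
}

# Demonstration on a small sample built from lines quoted in this ticket.
sample=$(mktemp)
cat > "$sample" <<'EOF'
UM Executable : /work/n02/n02/acjam/um/xmxwb/bin/xmxwb.exe
PMPI_Allgatherv(1183): Buffers must not be aliased. Consider using MPI_IN_PLACE or setting MPICH_NO_BUFFER_ALIAS_CHECK
/work/n02/n02/acjam/um/xmxwa/bin/qsexecute: Error in dump reconfiguration - see OUTPUT
xmxwb: Run failed
EOF
first_errors "$sample"
rm -f "$sample"
```

For a real job, pass the path to the .leave file as the first argument.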

comment:7 Changed 3 years ago by grenville

Andrew

Please try adding

MPICH_NO_BUFFER_ALIAS_CHECK=1

in input/output…→script inserts and mod…

We have not seen this problem since we moved to using cce8.3.7, so this is a workaround at best.

Please start a new ticket too.

Grenville
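
The workaround above amounts to setting an environment variable before the model executable starts. The UMUI panel path is truncated in this ticket, so it is not reproduced here. As a sketch only, and assuming the variable is set in the shell that launches the model, the plain shell equivalent would be:

```shell
# Assumption: this runs in the job script before the model executable is launched.
# The variable name comes from the MPI error message quoted in comment:6.
export MPICH_NO_BUFFER_ALIAS_CHECK=1

# Confirm the setting is exported to child processes.
env | grep '^MPICH_NO_BUFFER_ALIAS_CHECK='
```

This only suppresses the aliased-buffer check; it does not change the call that triggers it, which is why it is described as a workaround.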

comment:8 Changed 3 years ago by grenville

  • Resolution set to answered
  • Status changed from new to closed

closed because ticket becoming unwieldy
