Opened 3 years ago

Closed 3 years ago

#1899 closed error (answered)

qsserver failure in archiving

Reported by: rm024650
Owned by: um_support
Component: Archiving
Keywords:
Cc:
Platform: ARCHER
UM Version: 8.5

Description

Dear CMS helpdesk

I am trying to run a vn8.5 job with automatic archiving. I think the run failed during the archiving stage, with the error message below. Both the job and the archiving worked fine before, and this is just a re-run with more diagnostics included, so I'm not sure what is causing the problem. Thank you for your help.

The .leave file is /home/n02/n02/rm024650/output/xmkxe000.xmkxe.d16172.t150919.leave

qsserver: Wed Jun 22 02:14:30 BST 2016: xmkxea.da19820501_00 ARCHIVE DUMP
/work/n02/n02/rm024650/xmkxe/bin/qshector_arch[52]: : cannot open
/work/n02/n02/rm024650/xmkxe/bin/qshector_arch[53]: : cannot open
/work/n02/n02/rm024650/xmkxe/bin/qshector_arch[54]: : cannot open

==============================================================================
=================================== ERRFLAG ==================================
==============================================================================

T qsserver failure at Wed Jun 22 02:15:03 BST 2016
qsserver: EOF on PIPE but model still executing - waiting
_pmiu_daemon(SIGCHLD): [NID 04886] [c1-3c1s5n2] [Wed Jun 22 02:15:05 2016] PE RANK 16 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 00764] [c3-0c2s15n0] [Wed Jun 22 02:15:05 2016] PE RANK 60 exit signal Segmentation fault
[NID 00764] 2016-06-22 02:15:05 Apid 22250629: initiated application termination
qsatmos: waiting for qsserver to complete on pid 10876
qscasedisp: return code after calling qshector_arch RCARC=2
/work/n02/n02/rm024650/xmkxe/bin/qsserver[453]: .[187]: : cannot open
/work/n02/n02/rm024650/xmkxe/bin/qsserver[453]: .[188]: : cannot open
/work/n02/n02/rm024650/xmkxe/bin/qsserver[453]: .[189]: : cannot open
/work/n02/n02/rm024650/xmkxe/bin/qsserver[453]: .[190]: : cannot open
/work/n02/n02/rm024650/xmkxe/bin/qsserver[453]: .[192]: : cannot open
/work/n02/n02/rm024650/xmkxe/bin/qsserver[453]: .[193]: : cannot open
/work/n02/n02/rm024650/xmkxe/bin/qsserver[453]: .[194]: : cannot open

==============================================================================
=================================== ERRFLAG ==================================
==============================================================================

T qsserver failure at Wed Jun 22 02:15:33 BST 2016
xmkxe: Run failed

Change History (3)

comment:1 Changed 3 years ago by grenville

Hi

Sorry for the delay. We do see archiving problems that result from communication errors between the different parts of ARCHER involved. These are very hard to track down, and we do not have concrete advice on how to get round them. We do have code which retries on comms failure, but that's not yet in full release. I can only suggest that you resubmit the job.

Grenville
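
For readers wondering what "retry on comms failure" could look like in general, here is a minimal sketch. It is not the CMS code mentioned above, which is not shown in this ticket. The function name `run_with_retries`, its parameters, and the stand-in command are all hypothetical. It assumes a failed archive step shows up as a non-zero return code, as with RCARC=2 in the log.

```python
import subprocess
import sys
import time


def run_with_retries(cmd, max_attempts=3, delay_seconds=0.0, sleep=time.sleep):
    """Run cmd, retrying while it exits non-zero.

    Returns (return_code, attempts_used). Hypothetical helper, not part of
    the UM scripts; a real archive wrapper may need different failure checks.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    return_code = None
    for attempt in range(1, max_attempts + 1):
        return_code = subprocess.run(cmd).returncode
        if return_code == 0:
            return 0, attempt
        if attempt < max_attempts:
            # Linear back-off: wait a little longer after each failure.
            sleep(delay_seconds * attempt)
    return return_code, max_attempts


if __name__ == "__main__":
    # Stand-in for an archive command that always fails with return code 2.
    failing = [sys.executable, "-c", "import sys; sys.exit(2)"]
    print(run_with_retries(failing, max_attempts=3, delay_seconds=0.1))
```

A retry of this kind only helps with transient failures. If every attempt fails, the last return code is passed back so the calling script can still flag the run as failed.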

comment:2 Changed 3 years ago by rm024650

Hi

No problem. Please do update the FAQ / circulate once a solution is available. Thanks. You can close this ticket now.

Mike

comment:3 Changed 3 years ago by ros

  • Resolution set to answered
  • Status changed from new to closed