Opened 4 months ago

Closed 4 months ago

#2761 closed help (fixed)

rose suite-run hanging on MONSooN

Reported by: scottan
Owned by: ros
Component: UM Model
Keywords:
Cc:
Platform: Monsoon2
UM Version: 10.9

Description

I am trying to run the UM on monsoon2, and the pop-up panel for rose suite-run is stuck loading indefinitely. I have the same issue on xcslc0 and xcslc1 servers.

Change History (15)

comment:1 Changed 4 months ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Scott,

I can't currently log in to either exvmsrose or exvmscylc, so I assume this is the cause of your problems running the suite from exvmsrose. You should have moved off these by now and be using the xcslc nodes, so I'm not worried about being unable to run from the VMs.

You should be ok submitting from xcslc0 or xcslc1 though. I've just successfully started up a suite from there. Can you let us know what the suite id is please?

Regards,
Ros.

comment:2 Changed 4 months ago by scottan

Dear Ros,

Thanks for getting back so quickly. I've long switched over to xcslc, and this problem is happening on both xcslc0 and xcslc1. The suite id is u-be592.

Many thanks,
Scott

comment:3 Changed 4 months ago by ros

Hi Scott,

OK, I see what the problem is: the suite is explicitly referencing exvmsrose as the host to run the code extraction on!

In site/MONSooN.rc please replace host = 'exvmsrose' with host = {{ROSE_ORIG_HOST}} in the [[EXTRACT_RESOURCE]] section.

I would also advise checking your other suites for this too.

Cheers,
Ros.
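
One way to check other suites for the same hard-coded host is a small scan like the one below. It is only a sketch under stated assumptions: Python 3 is available, the suite working copies sit under a single directory (`~/roses` is used here purely as an illustrative location), and the files of interest end in `.rc`. The function name `find_hardcoded_host` is hypothetical and not part of Rose or cylc.

```python
from pathlib import Path


def find_hardcoded_host(root, needle="exvmsrose", pattern="*.rc"):
    """Return (path, line number, stripped line) for each line mentioning needle.

    root    -- directory holding suite working copies (assumed layout)
    needle  -- host name to look for
    pattern -- glob for the files to inspect (assumption: suite files end in .rc)
    """
    hits = []
    for path in sorted(Path(root).rglob(pattern)):
        if not path.is_file():
            continue
        # errors="replace" keeps the scan going if a file is not valid UTF-8
        text = path.read_text(errors="replace")
        for number, line in enumerate(text.splitlines(), start=1):
            if needle in line:
                hits.append((str(path), number, line.strip()))
    return hits
```

Calling `find_hardcoded_host(Path.home() / "roses")` would list every `.rc` line that still names exvmsrose, so each one can be reviewed by hand before changing it.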

comment:4 Changed 4 months ago by scottan

Most of the steps are working now, although I get a submit-failed error for fcm_make_pp. Any ideas about that?

Thanks,
Scott

comment:5 Changed 4 months ago by scottan

Scratch that, it worked the second time I tried it.

Thanks again,
Scott

comment:6 Changed 4 months ago by scottan

Dear Ros,

I'm getting submit-failed errors again. Any advice?

Many thanks,
Scott

comment:7 Changed 4 months ago by ros

Hi Scott,

Other than to retry submitting, I don't have any further advice at the moment. It looks like another form of the mkstemp problem, which is usually fixed by resubmitting. There are a few things still being tracked down on Monsoon, so if resubmitting doesn't work, I would wait until tomorrow.

It looks like you are doing a new run, so rose suite-run --new might help. That will delete any existing cylc-run directory for this suite.

Sorry I can't give much more useful advice right now.

Cheers,
Ros.

comment:8 Changed 4 months ago by scottan

Thanks Ros, that helped: fcm_make_um and atmos_main are now working. fcm_make_pp is failing straight away, though, with the following error:

/usr/lib64/python2.6/site-packages/requests/packages/urllib3/connection.py:337: SubjectAltNameWarning: Certificate for xcslc0 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
[FAIL] mirror.target = : incorrect value in declaration
[FAIL] config-file=/working/d04/sanic/cylc-run/u-be592/work/19880901T0000Z/fcm_make_pp/fcm-make.cfg:4
[FAIL] config-file= - file:///home/d04/fcm/srv/svn/moci.xm/main/trunk/Postprocessing/fcm_make/postproc.cfg@1989:12
[FAIL] config-file= -  - file:///home/d04/fcm/srv/svn/moci.xm/main/trunk/Postprocessing/fcm_make/inc/remote.cfg@1989:6

[FAIL] fcm make -f /working/d04/sanic/cylc-run/u-be592/work/19880901T0000Z/fcm_make_pp/fcm-make.cfg -C /home/d04/sanic/cylc-run/u-be592/share/fcm_make_pp -j 4 # return-code=9
2019-02-12T15:52:16Z CRITICAL - failed/EXIT
/usr/lib64/python2.6/site-packages/requests/packages/urllib3/connection.py:337: SubjectAltNameWarning: Certificate for xcslc0 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning

comment:9 Changed 4 months ago by ros

Hi Scott,

In site/MONSooN.rc in the [[HPC]] section only please put the line

host = {{ROSE_ORIG_HOST}}

back to:

host = $(rose host-select xcs-c).

It was only the occurrence in [[EXTRACT_RESOURCE]] that you needed to change previously. This is causing the fcm_make_pp extract to fail, as part of it is now trying to run on the wrong machine.

Try resubmitting, although I suspect you may end up with a mkstemp/jtmp system error, which I and a few others are encountering at the moment.

Cheers,
Ros.
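
The point that only one section should change can be illustrated with a section-scoped rewrite. The sketch below assumes Python 3, and assumes the `host = ...` line sits somewhere beneath its `[[SECTION]]` header, possibly inside a nested `[[[remote]]]` block (that nesting is an assumption about the file layout, not something shown in this ticket). The helper name `set_host` is hypothetical.

```python
import re

# Matches a two-bracket header such as [[HPC]], but not [[[remote]]].
SECTION = re.compile(r"^\s*\[\[([^\[\]]+)\]\]\s*$")
# Captures the indentation and "host =" prefix so only the value is replaced.
HOST = re.compile(r"^(\s*host\s*=\s*).*$")


def set_host(text, section, new_host):
    """Return text with host lines rewritten inside [[section]] only."""
    current = None
    out = []
    for line in text.splitlines():
        match = SECTION.match(line)
        if match:
            current = match.group(1).strip()
        elif current == section:
            host = HOST.match(line)
            if host:
                line = host.group(1) + new_host
        out.append(line)
    return "\n".join(out) + "\n"
```

For example, `set_host(text, "HPC", "$(rose host-select xcs-c)")` would restore the [[HPC]] host while leaving the [[EXTRACT_RESOURCE]] setting untouched. Editing the file by hand achieves the same thing; the sketch only makes the scoping explicit.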

comment:10 Changed 4 months ago by scottan

Hi Ros,

Yes, I changed it from xcs-c to ROSE_ORIG_HOST because previously it was getting a system error. Using ROSE_ORIG_HOST at least allows me to compile and run atmos_main, which is better for now while I am debugging some code, but will need to be fixed long term…

Please let me know if you manage to find a better solution.

Many thanks,
Scott

comment:11 Changed 4 months ago by ros

Hi Scott,

Try replacing {{ROSE_ORIG_HOST}} with localhost, in the [[EXTRACT_RESOURCE]] section only. And make sure you revert the [[HPC]] section back to its original state. I don't know why {{ROSE_ORIG_HOST}} is no longer working.

In future, if you make changes between posting a query (or reporting a different error) and us working on it, please do tell us, so that we don't waste time tracking down a problem that isn't a problem or trying to work on a moving target. Thanks.

Regards,
Ros.
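
A quick way to confirm that site/MONSooN.rc ended up in the state described above is to read back the host set in each section and compare it with what is expected. This is a sketch only, assuming Python 3 and the same `[[SECTION]]` / `host = ...` layout as in the ticket; the names `hosts_by_section`, `mismatches` and `EXPECTED` are hypothetical, and the expected values are taken from the advice in this thread.

```python
import re

# Matches a two-bracket header such as [[HPC]], but not [[[remote]]].
SECTION = re.compile(r"^\s*\[\[([^\[\]]+)\]\]\s*$")
# Captures the value of a "host = value" line, without surrounding whitespace.
HOST = re.compile(r"^\s*host\s*=\s*(.*?)\s*$")

# Expected settings, as advised in this thread.
EXPECTED = {
    "EXTRACT_RESOURCE": "localhost",
    "HPC": "$(rose host-select xcs-c)",
}


def hosts_by_section(text):
    """Map each [[SECTION]] name to the last host value found beneath it."""
    hosts = {}
    current = None
    for line in text.splitlines():
        section = SECTION.match(line)
        if section:
            current = section.group(1).strip()
            continue
        host = HOST.match(line)
        if host and current is not None:
            hosts[current] = host.group(1)
    return hosts


def mismatches(text, expected=EXPECTED):
    """Return {section: found host} for sections that differ from expected."""
    found = hosts_by_section(text)
    return {
        name: found.get(name)
        for name, want in expected.items()
        if found.get(name) != want
    }
```

An empty dictionary from `mismatches(open("site/MONSooN.rc").read())` would suggest both sections match the advice; any entry returned names a section worth rechecking.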


comment:12 Changed 4 months ago by scottan

Hi Ros,

Thanks, that seems to be working now. Sorry, I'll make sure to let you know if I make any other changes to try and fix things in the future.

Cheers,
Scott

comment:13 Changed 4 months ago by scottan

Hi Ros,

I'm still having a problem with the postproc and rose_arch_wallclock steps - they seem to be stuck "submit-retrying". I haven't made any other changes since the last comment. Do you have any ideas?

Many thanks,
Scott

comment:14 Changed 4 months ago by ros

Hi Scott,

Your suite has run through to completion for me, so there is nothing wrong with the suite setup itself. The error message does imply a mkstemp issue, and both tasks are being submitted using qsub. A qsub fix was lost in the patching earlier this week, and the Met Office are hopefully going to reapply the fix with an emergency patch today. Once this fix has gone in, if you are still seeing this error, I can then pass it to them to investigate.

In the meantime I can only suggest stopping and restarting the suite.

Regards,
Ros.

comment:15 Changed 4 months ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed

I'm assuming this was fixed by the patch and am closing this query now. Please reopen if you are still having problems with this.

Best Regards,
Ros.
