Opened 2 years ago
Closed 2 years ago
#2761 closed help (fixed)
rose suite-run hanging on MONSooN
Reported by: scottan
Owned by: ros
Component: UM Model
Keywords:
Cc:
Platform: Monsoon2
UM Version: 10.9
Description
I am trying to run the UM on Monsoon2, and the pop-up panel for rose suite-run is stuck loading indefinitely. I have the same issue on the xcslc0 and xcslc1 servers.
Change History (15)
comment:1 Changed 2 years ago by ros
- Owner changed from um_support to ros
- Status changed from new to accepted
comment:2 Changed 2 years ago by scottan
Dear Ros,
Thanks for getting back so quickly. I've long switched over to xcslc, and this problem is happening on both xcslc0 and xcslc1. The suite id is u-be592.
Many thanks,
Scott
comment:3 Changed 2 years ago by ros
Hi Scott,
OK, I see what the problem is: the suite explicitly references exvmsrose as the host to run the code extraction on!
In site/MONSooN.rc please replace host = 'exvmsrose' with host = {{ROSE_ORIG_HOST}} in the [[EXTRACT_RESOURCE]] section.
I would also advise checking your other suites for this too.
Cheers,
Ros.
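One way to check other suites for the same hard-coded host is a recursive grep over their .rc files. The following is only a sketch: find_hardcoded_host is a hypothetical helper name, and the ~/roses location in the usage comment is an assumption about where the suite working copies live.

```shell
# Hypothetical helper: list every line in *.rc files under a directory
# that mentions the given host name.
# Usage: find_hardcoded_host <hostname> <directory>
find_hardcoded_host() {
    local hostname="$1"
    local search_dir="$2"
    # -r: recurse into subdirectories
    # -n: print the line number of each match
    # -F: treat the host name as a literal string, not a regex
    grep -rnF --include='*.rc' "$hostname" "$search_dir"
}

# Example (assumes suite working copies live under ~/roses):
# find_hardcoded_host exvmsrose ~/roses
```

Each match is printed as file:line:text, so any site/MONSooN.rc that still refers to exvmsrose shows up with the line to edit. No output means no .rc file under that directory mentions the host.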
comment:4 Changed 2 years ago by scottan
Most of the steps are working now, although I get a submit-failed error for fcm_make_pp. Any ideas about that?
Thanks,
Scott
comment:5 Changed 2 years ago by scottan
Scratch that, it worked the second time I tried it.
Thanks again,
Scott
comment:6 Changed 2 years ago by scottan
Dear Ros,
I'm getting submit-failed errors again. Any advice?
Many thanks,
Scott
comment:7 Changed 2 years ago by ros
Hi Scott,
Other than to retry submitting I don't have any further advice at the moment. It looks like another form of the mkstemp problem usually fixed by resubmitting. There are a few things still being tracked down on Monsoon so if resubmitting doesn't work, I would wait now until tomorrow.
It looks like you are doing a new run, so rose suite-run --new might help. That will delete any existing cylc-run directory for this suite.
Sorry I can't give much more useful advice right now.
Cheers,
Ros.
comment:8 Changed 2 years ago by scottan
Thanks Ros, that helped. fcm_make_um and atmos_main are now working. fcm_make_pp is failing straight away, though, with the following error:
/usr/lib64/python2.6/site-packages/requests/packages/urllib3/connection.py:337: SubjectAltNameWarning: Certificate for xcslc0 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
[FAIL] mirror.target = : incorrect value in declaration
[FAIL] config-file=/working/d04/sanic/cylc-run/u-be592/work/19880901T0000Z/fcm_make_pp/fcm-make.cfg:4
[FAIL] config-file= - file:///home/d04/fcm/srv/svn/moci.xm/main/trunk/Postprocessing/fcm_make/postproc.cfg@1989:12
[FAIL] config-file= - - file:///home/d04/fcm/srv/svn/moci.xm/main/trunk/Postprocessing/fcm_make/inc/remote.cfg@1989:6
[FAIL] fcm make -f /working/d04/sanic/cylc-run/u-be592/work/19880901T0000Z/fcm_make_pp/fcm-make.cfg -C /home/d04/sanic/cylc-run/u-be592/share/fcm_make_pp -j 4 # return-code=9
2019-02-12T15:52:16Z CRITICAL - failed/EXIT
/usr/lib64/python2.6/site-packages/requests/packages/urllib3/connection.py:337: SubjectAltNameWarning: Certificate for xcslc0 has no `subjectAltName`, falling back to check for a `commonName` for now. This feature is being removed by major browsers and deprecated by RFC 2818. (See https://github.com/shazow/urllib3/issues/497 for details.)
  SubjectAltNameWarning
comment:9 Changed 2 years ago by ros
Hi Scott,
In site/MONSooN.rc in the [[HPC]] section only please put the line
host = {{ROSE_ORIG_HOST}}
back to:
host = $(rose host-select xcs-c)
It was only the occurrence in [[EXTRACT_RESOURCE]] that you needed to change previously. The change in [[HPC]] is causing the fcm_make_pp extract to fail, as part of it is now trying to run on the wrong machine.
Try resubmitting - however I suspect you may end up with a mkstemp/jtmp system error - which I and a few others are encountering at the moment.
Cheers,
Ros.
comment:10 Changed 2 years ago by scottan
Hi Ros,
Yes, I changed it from xcs-c to ROSE_ORIG_HOST because previously it was getting a system error. Using ROSE_ORIG_HOST at least allows me to compile and run atmos_main, which is better for now while I am debugging some code, but will need to be fixed long term…
Please let me know if you manage to find a better solution.
Many thanks,
Scott
comment:11 Changed 2 years ago by ros
Hi Scott,
Try replacing {{ROSE_ORIG_HOST}} in the [[EXTRACT_RESOURCE]] section only with localhost. And make sure you revert the [[HPC]] section back to its original state. I don't know why {{ROSE_ORIG_HOST}} is no longer working.
In future, if you make changes or hit a different error between posting a query and us working on it, please do tell us so that we don't waste time tracking down a problem that no longer exists or chasing a moving target. Thanks.
Regards,
Ros.
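The end state described in comments 9 and 11 (localhost under [[EXTRACT_RESOURCE]], the rose host-select line under [[HPC]]) can be checked by listing each host setting next to the section it belongs to. This is a minimal sketch: show_section_hosts is a hypothetical helper, and it assumes each host line sits somewhere below a two-bracket section header, possibly inside a nested three-bracket subsection.

```shell
# Hypothetical helper: print each "host = ..." line in a suite .rc file,
# prefixed with the nearest preceding two-bracket section name.
# Usage: show_section_hosts <file>
show_section_hosts() {
    awk '
        # Remember the most recent two-bracket section header, e.g. [[HPC]].
        # Three-bracket subsection headers do not match this pattern.
        /^[ \t]*\[\[[^][]+\]\][ \t]*$/ { section = $1; next }
        # Print every host assignment together with that section name.
        /^[ \t]*host[ \t]*=/ {
            sub(/^[ \t]+/, "")
            print section ": " $0
        }
    ' "$1"
}

# Example (run from the suite working copy):
# show_section_hosts site/MONSooN.rc
```

With the settings from this ticket, the [[EXTRACT_RESOURCE]] line should show localhost and the [[HPC]] line should show the original rose host-select value.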
comment:12 Changed 2 years ago by scottan
Hi Ros,
Thanks, that seems to be working now. Sorry, I'll make sure to let you know if I make any other changes to try and fix things in the future.
Cheers,
Scott
comment:13 Changed 2 years ago by scottan
Hi Ros,
I'm still having a problem with the postproc and rose_arch_wallclock steps - they seem to be stuck "submit-retrying". I haven't made any other changes since the last comment. Do you have any ideas?
Many thanks,
Scott
comment:14 Changed 2 years ago by ros
Hi Scott,
Your suite has run through to completion for me, so there is nothing wrong with the suite setup itself. The error message does imply a mkstemp issue, and both tasks are being submitted using qsub. A qsub fix dropped out with the patching earlier this week, and the Met Office are hopefully going to reapply the fix with an emergency patch today. Once this fix has gone in, if you are still seeing this error I can then pass it to them to investigate.
In the meantime I can only suggest stopping and restarting the suite.
Regards,
Ros.
comment:15 Changed 2 years ago by ros
- Resolution set to fixed
- Status changed from accepted to closed
I'm assuming this was fixed by the patch and am closing this query now. Please reopen if you are still having problems with this.
Best Regards,
Ros.
Hi Scott,
I can't currently log in to either exvmsrose or exvmscylc, so I would assume this is the cause of your problems running the suite from exvmsrose. You should have moved off these and now be using the xcslc nodes, so I'm not worried about being unable to run from the VMs.
You should be ok submitting from xcslc0 or xcslc1 though. I've just successfully started up a suite from there. Can you let us know what the suite id is please?
Regards,
Ros.