Opened 5 months ago

Last modified 2 weeks ago

#2281 reopened help

rose/cylc communication error messages

Reported by: pmcguire
Owned by: um_support
Priority: normal
Component: Rose/Cylc
Keywords: jasmin, rose/cylc
Cc:
Platform: Other
UM Version:

Description

I often run suites directly on Jasmin-sci1 (instead of submitting from PUMA to run on Jasmin-sci1).
I often get error messages in stderr like the following. How can I configure my suites properly so that these communication error messages don't happen?
Thanks.

Send message: try 1 of 7 failed: Connection timeout: https://jasmin-sci1.ceda.ac.uk:43005/message/put?priority=NORMAL&message=started+at+2017-09-21T13%3A59%3A25%2B01&task_id=make_plots.1: HTTPSConnectionPool(host='jasmin-sci1.ceda.ac.uk', port=43005): Max retries exceeded with url: /message/put?priority=NORMAL&message=started+at+2017-09-21T13%3A59%3A25%2B01&task_id=make_plots.1 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x1852f50>, 'Connection to jasmin-sci1.ceda.ac.uk timed out. (connect timeout=30.0)'))

retry in 5.0 seconds, timeout is 30.0
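
The message above shows a connection attempt that is retried up to 7 times, with a 5-second pause and a 30-second connect timeout. To check whether the suite's host and port are reachable from the machine where a job runs, a small probe along the same lines may help. This is only a minimal sketch and not Cylc's own code: the function name `probe` and its parameters are hypothetical, and it tests a plain TCP connection rather than the full HTTPS request.

```python
import socket
import time


def probe(host, port, tries=7, timeout=30.0, delay=5.0,
          connect=socket.create_connection, sleep=time.sleep):
    """Attempt a TCP connection up to `tries` times.

    Returns the number of the attempt that succeeded, or None if all failed.
    `connect` and `sleep` are injectable so the function can be tested
    without touching the network (an assumption of this sketch).
    """
    for attempt in range(1, tries + 1):
        try:
            # The socket is closed again as soon as the connection succeeds.
            with connect((host, port), timeout=timeout):
                return attempt
        except OSError as exc:  # connect timeouts are a subclass of OSError
            print(f"try {attempt} of {tries} failed: {exc}")
            if attempt < tries:
                sleep(delay)
    return None
```

For example, `probe("jasmin-sci1.ceda.ac.uk", 43005)` would test the host and port taken from the message above; the port is likely to differ from one suite run to the next, so it should be read from the current error output.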

Change History (11)

comment:1 Changed 5 months ago by willie

Hi Patrick,

Is it a particular suite or all suites? Let us know the suite id.

Regards
Willie

comment:2 Changed 5 months ago by pmcguire

I am not sure if I get this error message for every suite that needs to report an error or not.
But one suite that had this problem when it was reporting errors during the debug stage was u-aq202.
Patrick

comment:3 Changed 3 months ago by pmcguire

I sometimes still have problems with this.
Patrick

comment:4 Changed 2 months ago by willie

  • Resolution set to answered
  • Status changed from new to closed

comment:5 Changed 2 months ago by pmcguire

Unfortunately, I sometimes still have problems with this. Do you have any suggestions for what I should do to figure out the cause of the problem? Can you reopen the case?
Patrick

comment:6 Changed 7 weeks ago by ros

  • Platform set to Other
  • Resolution answered deleted
  • Status changed from closed to reopened
  • UM Version <select version> deleted

Hi Patrick,

As you are submitting directly from the JASMIN VMs, this looks like an intermittent communication issue within the JASMIN domain. Unfortunately, we don't know what is causing it. Does this only happen with specific tasks (e.g. only the build step)? I would suggest contacting the CEDA helpdesk. Rose/Cylc is maintained on JASMIN by CEDA/Met Office, so I would hope that they would be able to help/investigate - perhaps they are aware of others that have experienced similar…

Regards,
Ros
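
The question above, whether the timeouts cluster on specific tasks, can be checked by tallying the "Send message ... failed" lines by host, port and task. The following is a rough sketch that assumes the stderr lines have the same shape as the one quoted in the description; the name `tally_failures` and the regular expression are illustrative and not part of Rose/Cylc.

```python
import re
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Matches lines shaped like the one quoted in the ticket description and
# captures the URL that the task tried to contact.
FAILED = re.compile(r"Send message: try \d+ of \d+ failed: .*?: (https://\S+?): ")


def tally_failures(lines):
    """Count failed message attempts per (host, port, task_id)."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match is None:
            continue
        url = urlsplit(match.group(1))
        # parse_qs decodes the query string; task_id may be absent.
        task = parse_qs(url.query).get("task_id", ["unknown"])[0]
        counts[(url.hostname, url.port, task)] += 1
    return counts
```

Feeding it the lines of the stderr files from several tasks gives one count per task, which makes any pattern (or the lack of one) easier to report to the helpdesk.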

comment:7 Changed 6 weeks ago by pmcguire

I haven't noticed a pattern for this happening on specific tasks. I will contact the CEDA Helpdesk about this. Thanks for your help!
Patrick

comment:8 Changed 3 weeks ago by willie

  • Resolution set to fixed
  • Status changed from reopened to closed

comment:9 Changed 3 weeks ago by pmcguire

This problem has not been fixed yet. I am currently working with Annette Osprey (CMS), Alan Iwi (CEDA), and Ag Stephens (CEDA?) on this. There is an ongoing Slack discussion about it in #rose-cylc-jasmin.

comment:10 Changed 2 weeks ago by pmcguire

The latest information from CEDA is that we should be using the jasmin-cylc virtual machine instead of jasmin-sci*. Then those HTTPS communication errors go away. But jasmin-cylc doesn't currently support the Rose/Cylc GUIs, and its Python setup is different from that of jasmin-sci*.

comment:11 Changed 2 weeks ago by pmcguire

  • Resolution fixed deleted
  • Status changed from closed to reopened