Opened 2 years ago

Closed 2 years ago

#2178 closed help (answered)

failure Cannot connect: https://exvmscylc.monsoon-metoffice.co.uk

Reported by: ggxmy Owned by: um_support
Component: UM Model Keywords:
Cc: Platform:
UM Version: 10.6

Description

Dear CMS,

I ran a few UM vn10.6.1 jobs (u-al5xx; xx=56, 58-61, 63, 64) earlier this month and all of them (except u-al560) seem to have failed with very similar errors. In some jobs I found errors like this:

Send message: try 7 of 7 failed: Cannot connect: https://exvmscylc.monsoon-metoffice.co.uk:43029/message/put?priority=NORMAL&message=succeeded+at+2017-05-09T10%3A54%3A53Z&task_id=atmos_main.20011101T0000Z: HTTP Error 401: digest auth failed

while in others I found errors like this:

Send message: try 7 of 7 failed: Cannot connect: https://exvmscylc.monsoon-metoffice.co.uk:43123/message/put?priority=NORMAL&message=succeeded+at+2017-05-09T08%3A29%3A55Z&task_id=atmos_main.20011001T0000Z: <urlopen error [Errno 111] Connection refused>
WARNING: MESSAGE SEND FAILED

in their job.err files.
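
For anyone triaging similar failures, the two messages can be taken apart mechanically. The following is a minimal sketch, not part of Cylc or Rose; the helper name `summarize_failed_send` is hypothetical. It decodes the URL in a `Cannot connect` line to show which task was reporting, what it was reporting, and why the send failed.

```python
from urllib.parse import urlsplit, parse_qs


def summarize_failed_send(line):
    """Split a 'Cannot connect' log line into its URL fields and error reason.

    Hypothetical helper for illustration; it only assumes the line layout
    seen in the two job.err excerpts quoted in this ticket.
    """
    marker = "Cannot connect: "
    start = line.index(marker) + len(marker)
    # The URL contains no ": " sequence (colons are either "://", ":port"
    # or percent-encoded), so the first ": " separates URL from reason.
    url, _, reason = line[start:].partition(": ")
    parts = urlsplit(url)
    query = parse_qs(parts.query)  # also decodes '+' and '%3A'
    return {
        "host": parts.hostname,
        "port": parts.port,
        "task_id": query["task_id"][0],
        "message": query["message"][0],
        "reason": reason.strip(),
    }


if __name__ == "__main__":
    example = (
        "Send message: try 7 of 7 failed: Cannot connect: "
        "https://exvmscylc.monsoon-metoffice.co.uk:43029/message/put"
        "?priority=NORMAL&message=succeeded+at+2017-05-09T10%3A54%3A53Z"
        "&task_id=atmos_main.20011101T0000Z: "
        "HTTP Error 401: digest auth failed"
    )
    for key, value in summarize_failed_send(example).items():
        print(f"{key}: {value}")
```

In both excerpts the decoded message begins with `succeeded at`, which suggests that the atmos_main task itself finished and that only its report back to the suite server was lost.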

These are the first long simulations I have tried to run since I received a MOOSE credential file and turned archiving/post-processing on. So I wonder if there is a problem on MASS (such as quota)? Or maybe the server went down at that time? All of these failures seem to have occurred on 9th May, although at different times.

I can't see why u-al560 stopped; its job.err doesn't show any error. Maybe I submitted the job before I changed the run length?

Please could you tell me how I can fix the problem and how I can resubmit these jobs for continuation?

Thanks,
Masaru

Attachments (2)

end date mismatch.jpg (105.1 KB) - added by ggxmy 2 years ago.
end date mismatch?
stopped suite.jpg (62.1 KB) - added by ggxmy 2 years ago.
stopped suite

Change History (7)

comment:1 Changed 2 years ago by ros

Hi Masaru,

The Rose & Cylc VMs were both rebooted on the 9th May. All you should need to do is restart the suites with rose suite-run --restart and they should all pick up where they left off.

Regards,
Ros.
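
With several suites to restart, the command above can be repeated per suite. The following is only a sketch under stated assumptions: it assumes each suite's working copy lives in `~/roses/<suite-id>` (not stated in this ticket) and that `rose suite-run --restart` is run from that directory. The function names are hypothetical, and by default nothing is executed; the commands are only printed.

```python
import subprocess
from pathlib import Path

# Suites listed in the ticket description as having failed;
# u-al560 is left out because it stopped without a message-send error.
SUITES = ["u-al556", "u-al558", "u-al559", "u-al561", "u-al563", "u-al564"]


def restart_commands(suites, roses_dir=None):
    """Return (working_directory, command) pairs, one per suite.

    roses_dir defaults to ~/roses, which is an assumption about where
    the working copies are kept.
    """
    base = Path(roses_dir) if roses_dir is not None else Path.home() / "roses"
    return [(base / name, ["rose", "suite-run", "--restart"]) for name in suites]


def restart_all(suites, roses_dir=None, dry_run=True):
    """Print each restart command, and run it only when dry_run is False."""
    for cwd, cmd in restart_commands(suites, roses_dir):
        print(f"cd {cwd} && {' '.join(cmd)}")
        if not dry_run:
            subprocess.run(cmd, cwd=cwd, check=True)


if __name__ == "__main__":
    restart_all(SUITES)  # dry run: prints the commands without running them
```

Calling `restart_all(SUITES, dry_run=False)` would actually run the restarts, stopping at the first one that exits with an error.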

Changed 2 years ago by ggxmy

end date mismatch?

comment:2 Changed 2 years ago by ggxmy

Hi Ros,

OK, thank you. I did that and all but one of the suites seem to be running now.

However, as the attached image shows, the processes of u-al564 are all 'held', and the bottom left corner shows "running to stop at 199812010000Z" although the actual end date should be December 2002. All the other suites show "running to stop at 200212010000Z". Could you help me fix this?

Thanks,
Masaru

comment:3 Changed 2 years ago by ros

Hi Masaru,

The only thing I can suggest is to stop the suite again and then restart it. Make sure to use rose suite-run --restart (note: NOT rose suite-restart, in case that is what you did) so that it reloads the suite definition. I've just tried submitting your suite and it has the correct end date of 20021201, so the suite is set up correctly.

Regards,
Ros.

Changed 2 years ago by ggxmy

stopped suite

comment:4 Changed 2 years ago by ggxmy

Hi Ros,

Yesterday, after that, I tried triggering the run and the suite seemed to be running. This morning, however, it had stopped. I did rose suite-run --restart as you suggested and the end date has been corrected, but the suite does not run, for no obvious reason. The cylc gui panel looks like this:

[attached image: stopped suite.jpg]

So housekeeping is waiting for something, but otherwise all processes completed successfully? In that case I don't know why the simulation does not continue. Do you know how I can restart this suite?

In /home/d03/myosh/cylc-run/u-al564/log/job/ there are the folders listed below, so the simulation had already got as far as 20011101. I'm not sure why the job tried to simulate 1999 again. Maybe I have messed something up? I wonder if these issues are related to each other.

drwxr-xr-x. 4 myosh mo_users 4096 May 24 16:36 19990101T0000Z
drwxr-xr-x. 3 myosh mo_users 4096 May 24 16:19 19990201T0000Z
drwxr-xr-x. 3 myosh mo_users 4096 May 24 16:21 19990301T0000Z
drwxr-xr-x. 3 myosh mo_users 4096 May 24 16:36 19990401T0000Z
drwxr-xr-x. 5 myosh mo_users 4096 May  8 02:30 20001101T0000Z
drwxr-xr-x. 5 myosh mo_users 4096 May  8 05:14 20001201T0000Z
drwxr-xr-x. 5 myosh mo_users 4096 May  8 07:55 20010101T0000Z
drwxr-xr-x. 5 myosh mo_users 4096 May  8 10:32 20010201T0000Z
drwxr-xr-x. 5 myosh mo_users 4096 May  8 13:17 20010301T0000Z
drwxr-xr-x. 5 myosh mo_users 4096 May  8 15:53 20010401T0000Z
drwxr-xr-x. 5 myosh mo_users 4096 May  8 18:37 20010501T0000Z
drwxr-xr-x. 5 myosh mo_users 4096 May  8 21:20 20010601T0000Z
drwxr-xr-x. 5 myosh mo_users 4096 May  9 00:02 20010701T0000Z
drwxr-xr-x. 5 myosh mo_users 4096 May  9 02:56 20010801T0000Z
drwxr-xr-x. 5 myosh mo_users 4096 May  9 05:34 20010901T0000Z
drwxr-xr-x. 5 myosh mo_users 4096 May  9 08:14 20011001T0000Z
drwxr-xr-x. 3 myosh mo_users 4096 May  9 08:12 20011101T0000Z
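
The break between the two sets of folders can also be found programmatically. The following is a small sketch, assuming monthly cycling (inferred from the listing, not confirmed in this ticket); the function names are hypothetical. It parses the cycle-point folder names and reports any jump of more than one month between consecutive cycle points.

```python
from datetime import datetime

FMT = "%Y%m%dT%H%MZ"  # layout of the cycle-point folder names in the listing


def parse_cycle(name):
    """Turn a folder name such as '19990101T0000Z' into a datetime."""
    return datetime.strptime(name, FMT)


def find_gaps(names):
    """Return (before, after) pairs where consecutive cycle points,
    once sorted, are more than one calendar month apart.

    Assumes the suite cycles monthly.
    """
    cycles = sorted(parse_cycle(n) for n in names)
    gaps = []
    for prev, cur in zip(cycles, cycles[1:]):
        months = (cur.year - prev.year) * 12 + (cur.month - prev.month)
        if months > 1:
            gaps.append((prev.strftime(FMT), cur.strftime(FMT)))
    return gaps


if __name__ == "__main__":
    # Folder names copied from the listing above.
    listing = [
        "19990101T0000Z", "19990201T0000Z", "19990301T0000Z", "19990401T0000Z",
        "20001101T0000Z", "20001201T0000Z", "20010101T0000Z", "20010201T0000Z",
        "20010301T0000Z", "20010401T0000Z", "20010501T0000Z", "20010601T0000Z",
        "20010701T0000Z", "20010801T0000Z", "20010901T0000Z", "20011001T0000Z",
        "20011101T0000Z",
    ]
    for before, after in find_gaps(listing):
        print(f"gap between {before} and {after}")
```

For the listing above the only gap is between 19990401T0000Z and 20001101T0000Z, which matches the modification dates: the 1999 folders were written on 24 May, the rest on 8-9 May.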

Regards,
Masaru

comment:5 Changed 2 years ago by ros

  • Resolution set to answered
  • Status changed from new to closed

Continued in #2181