Opened 2 years ago

Closed 2 years ago

#2182 closed help (answered)

More run failures

Reported by: apm Owned by: um_support
Component: UM Model Keywords:
Cc: Platform:

Description

My jobs both failed to survive the first restart after the maintenance shutdown on Wednesday (I can't help starting to see a pattern here!).

Looking at the log files for the latest cycles of each suite tells me this:

u-am228 gives a warning of an inconsistency in the ocean timestep (line 6973 in job.out). What do I need to delete before restarting this cycle?

I restarted u-al390 three times, and got three different failure modes:

  • The first run (01) gave a "Vertical thermo error" in the ice model. On a recommendation from the Met Office, I changed the time steps in both ocean and sea ice components.
  • The first re-run (02) appears to have halted while reading in the forcing fields, with no error message.
  • The second re-run (03) says "Error creating restart ncfile".

Is this likely to be a disk space problem? How do I find out?
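
One way to check whether disk space is the cause is sketched below. It assumes a standard Linux environment with `df` and `du`; the function name `check_space` is hypothetical, and the directory to pass in is whichever suite or work directory is being written to. On a Lustre file system a quota tool may also apply, but that depends on the site setup and is not shown here.

```shell
# check_space is a hypothetical helper, not part of any UM, Rose or CYLC tooling.
check_space() {
    dir="${1:-.}"
    # Free space on the file system that holds the directory
    df -h "$dir"
    # Total size of the directory itself
    du -sh "$dir"
}
```

For example, `check_space /work/n01/n01/alexm` would report on the work directory named later in this ticket.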

Thanks!

Alex

Change History (3)

comment:1 Changed 2 years ago by grenville

Alex

Please give us read permissions on your home and work files:

chmod -R g+rX /home/n01/n01/alexm
chmod -R g+rX /work/n01/n01/alexm

Grenville

comment:2 Changed 2 years ago by apm

Hi Grenville,

I have done that, as requested. I then discovered that I could no longer log into the espp1 node on Archer with my ssh key, and got the following message:

Permissions 0640 for '/home/n01/n01/alexm/.ssh/id_rsa' are too open. It is required that your private key files are NOT accessible by others. This private key will be ignored.
bad permissions: ignore key: /home/n01/n01/alexm/.ssh/id_rsa

I changed the permissions in my .ssh directory back to user read only and can now log in again without my password.
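
The ssh failure follows from the recursive chmod also opening up the .ssh directory. A minimal sketch of two ways round this is given below, assuming GNU or POSIX `find` and `chmod`. Both function names are hypothetical, and `id_rsa` is simply the key file named in the message above.

```shell
# Hypothetical helpers; names are illustrative only.

# Grant group read access below a directory, but leave any .ssh directory alone.
grant_group_read() {
    top="$1"
    # -prune stops find descending into .ssh, so its permissions are untouched
    find "$top" -name .ssh -prune -o -exec chmod g+rX {} +
}

# Put an already-opened .ssh directory back to owner-only access.
restore_ssh_perms() {
    ssh_dir="${1:-$HOME/.ssh}"
    chmod 700 "$ssh_dir"
    # ssh ignores private keys that are readable by group or others
    chmod 600 "$ssh_dir/id_rsa"
}
```

For example, `grant_group_read /home/n01/n01/alexm` would replace the first chmod command, and `restore_ssh_perms` with no argument would repair the default ~/.ssh after the fact.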

Anyway, I managed to locate the error in u-al390: it seems that it tried to create a restart file for the CICE sea-ice model that already existed. For some reason, this error only appeared in the job.out file in the log directory, and not in the regular CICE output, which is why I didn't find it before. I deleted the offending restart file and restarted the job, which is now waiting in the queue. I should know by tomorrow morning whether the suite is working again.
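
Since the error turned up in only one of several log files, a recursive search of the whole log directory can save time. The sketch below assumes a `grep` that supports `-r` (GNU grep does); the function name `find_errors` is hypothetical, and the log directory must be supplied by the caller because its location depends on the suite.

```shell
# find_errors is a hypothetical helper, not part of CYLC or Rose.
find_errors() {
    log_dir="$1"
    # -r: search every file below log_dir
    # -n: print the line number of each match
    # -i: match "error", "Error" and "ERROR" alike
    grep -rn -i "error" "$log_dir"
}
```

Each match is printed as file name, line number and matching text, so a message such as "Error creating restart ncfile" shows up along with the file that contains it.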

As far as u-am228 goes, I realised that CYLC had got completely confused: it had moved on by several cycles without running the model correctly, and the last cycle with viable restart dumps had disappeared from the CYLC window. Do you know how this could have happened?

I decided to cut my losses, and made a copy of u-am228 (as u-am760) and started from scratch again. This suite had only completed three years, so I hadn't lost too much.

Regards,

Alex

comment:3 Changed 2 years ago by grenville

  • Resolution set to answered
  • Status changed from new to closed