Opened 3 months ago

Closed 2 months ago

#3250 closed help (fixed)

pptransfer does not show on Gcylc

Reported by: yb19052
Owned by: ros
Component: Rose/Cylc
Keywords:
Cc:
Platform: NEXCS
UM Version: 10.7

Description

Hello CMS,

I am running a suite (u-bp881), and it worked for the last 10 model years. Now the suite does not progress: the "housekeeping" task is stuck at "waiting" on Gcylc because the "pptransfer" task for the same cycle (18600701T0000Z) does not show.

I have some questions. How do I run the "pptransfer" task on its own for that specific cycle (18600701T0000Z)? Otherwise, should I rerun the suite from the previous year or from the first year? If so, how do I do it?

I appreciate your kind cooperation.

Thanks
Kenji

Change History (18)

comment:1 Changed 3 months ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Kenji,

I don't understand your comment about the pptransfer task not showing in the cylc GUI. Looking in the log files at /home/d03/kizumi/cylc-run/u-bp881/log/job/18600701T0000Z, the pptransfer task has run and failed 6 times, so it should be showing up red in the GUI.

If you look in the /home/d03/kizumi/cylc-run/u-bp881/log/job/18600701T0000Z/pptransfer/NN/job.err file you will see that 3 files failed to transfer correctly, as indicated by the failed checksums. You will need to manually copy these files over to JASMIN and then set the status of the 18600701T0000Z pptransfer task to succeeded. Can you attach a screenshot of the cylc GUI so we can see what is shown, please?

Regards,
Ros

comment:2 Changed 3 months ago by ros

Hi Kenji,

Thanks for the screenshot.

I have no idea why the task is now not showing in the GUI when it has run.

My first suggestion is to stop the suite and then restart with rose suite-restart. If it still doesn't show the pptransfer task then we'll have to try reinserting the task.

To insert a task do:

  • In the Cylc GUI: Control → Insert Task(s)…
  • Set TASK-NAME.CYCLE-POINT=pptransfer.18600701T0000Z
  • Leave stop-point=POINT blank
  • Check the "Do not check if a cycle point is valid or not" box
  • Insert, and wait for the task to complete.
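
The value typed into the Insert Task(s) dialog has to match the suite's cycle-point format exactly. As a minimal Python sketch, assuming the format seen in this suite's log paths (eight date digits, `T`, four time digits, `Z`), a mistyped point can be caught before it is inserted. The function name `task_id` is hypothetical and is not part of Cylc.

```python
import re

# Assumed cycle-point format, taken from the log paths in this ticket,
# e.g. 18600701T0000Z (YYYYMMDD, "T", HHMM, "Z").
CYCLE_POINT_RE = re.compile(r"\d{8}T\d{4}Z")


def task_id(task_name, cycle_point):
    """Return TASK-NAME.CYCLE-POINT, rejecting a malformed cycle point."""
    if not CYCLE_POINT_RE.fullmatch(cycle_point):
        raise ValueError("unexpected cycle point format: %r" % cycle_point)
    return "%s.%s" % (task_name, cycle_point)


if __name__ == "__main__":
    # Build the string to paste into the Insert Task(s) dialog.
    print(task_id("pptransfer", "18600701T0000Z"))
```

A point with a missing digit, such as `18600701T000Z`, raises `ValueError` instead of producing an ID that the suite would not recognise.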

Hopefully one of those suggestions will work.

Cheers,
Ros.

P.S. To attach a file to a ticket there is an attach button under the ticket description further up this page.

comment:3 Changed 3 months ago by yb19052

Hello Ros,

I appreciate your response. Using the second option, I was able to insert the pptransfer task.
However, the task fails with the following error:

[WARN] file:atmospp.nl: skip missing optional source: namelist:moose_arch
[WARN] file:nemocicepp.nl: skip missing optional source: namelist:moose_arch
[WARN] [SUBPROCESS]: Command: ssh -oBatchMode=yes jasmin-xfer2.ceda.ac.uk -n cd /gws/nopw/j04/pmip4_vol2/users/kizumi/u-bp881/18600701T0000Z ; md5sum -c checksums
[SUBPROCESS]: Error = 1:

Access to this system is monitored and restricted to
authorised users. If you do not have authorisation
to use this system, you should not proceed beyond
this point and should disconnect immediately.

Unauthorised use could lead to prosecution.

(See also - http://www.stfc.ac.uk/aup)

bp881a.p818600811.pp: OK
bp881a.pd1860oct.pp: OK
bp881a.p618601011.pp: FAILED
bp881a.p818600801.pp: FAILED
bp881a.p918600711.pp: OK
nemo_bp881o_1s_18591201-18600301_scalar.nc: OK
md5sum: WARNING: 3 of 293 computed checksums did NOT match

[ERROR] Checksum verification failed.
[FAIL] Command Terminated
[FAIL] Terminating PostProc
[FAIL] transfer.py # return-code=1

I do not know whether this error comes from JASMIN, the postproc tasks, or something else.
How do I fix this issue? Should I open a new ticket for it?

I appreciate your kind cooperation.

Sincerely
Kenji

comment:4 Changed 3 months ago by ros

Hi Kenji,

See my response in comment 1 above; 3 files have failed to transfer properly. The easiest thing is to manually copy these failed files across to JASMIN (using rsync or even scp) and then trigger the failed pptransfer task to validate the checksums.
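
To work out which files need copying again, the `md5sum -c` lines in job.err can be filtered for the `FAILED` status. This is a minimal Python sketch, assuming the output has the `name: STATUS` shape shown in comment 3. The function name `failed_files` is hypothetical.

```python
def failed_files(md5sum_output):
    """Return the file names that `md5sum -c` reported as FAILED."""
    failed = []
    for line in md5sum_output.splitlines():
        # Split on the last ": " so the status is the final field.
        name, sep, status = line.rpartition(": ")
        if sep and status.strip().startswith("FAILED"):
            failed.append(name.strip())
    return failed


if __name__ == "__main__":
    # Sample lines copied from the job.err excerpt in this ticket.
    sample = "\n".join([
        "bp881a.p818600811.pp: OK",
        "bp881a.p618601011.pp: FAILED",
        "bp881a.p818600801.pp: FAILED",
        "md5sum: WARNING: 3 of 293 computed checksums did NOT match",
    ])
    for name in failed_files(sample):
        print(name)
```

The summary `WARNING` line is ignored because its final field is not a status, so only the per-file results are returned.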

Cheers,
Ros.

comment:5 Changed 3 months ago by yb19052

Hi Ros,

Before inserting the pptransfer task, I checked the three files (bp881a.p618601011.pp, bp881a.p618600801.pp, and bp881a.p818600801.pp) at "/home/d03/kizumi/cylc-run/u-bp881/share/data/History_Data" and "/gws/nopw/j04/pmip4_vol2/users/kizumi/u-bp881/18600701T0000Z". These files are already on JASMIN, so I do not know why we still get the same error messages.

Thanks
Kenji

comment:6 Changed 3 months ago by ros

Hi Kenji,

During one of the transfer attempts for this cycle the connection with JASMIN was unexpectedly terminated which can leave files partially transferred or corrupt. The checksum verification, done at the end of the pptransfer task, checks that the files have made it onto JASMIN correctly. The 3 files listed as FAILED have not transferred across properly.
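
The check itself can be sketched in Python to show why a file that exists on JASMIN can still fail: the file is compared by content hash, not by name. This is a rough illustration of what `md5sum -c checksums` does, assuming the standard `<md5>  <filename>` layout for the checksums file. The helper names `md5_of` and `mismatched` are hypothetical and are not taken from the postproc code.

```python
import hashlib
import os


def md5_of(path, chunk_size=1024 * 1024):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched(checksums_text, directory):
    """Return names whose file is missing or whose MD5 differs."""
    bad = []
    for line in checksums_text.splitlines():
        if not line.strip():
            continue
        # md5sum writes "<hash>  <name>" (or "<hash> *<name>" in binary mode).
        expected, name = line.split(None, 1)
        name = name.lstrip("*").strip()
        path = os.path.join(directory, name)
        if not os.path.isfile(path) or md5_of(path) != expected.lower():
            bad.append(name)
    return bad
```

A partially transferred file has the right name but different contents, so its digest no longer matches the one recorded before the transfer.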

Regards,
Ros.

comment:7 Changed 3 months ago by yb19052

Hi Ros

Because the 3 files are corrupt, I think I need to run all the tasks (coupled/postproc/pptransfer) at 18600701T0000Z again. When I stopped the suite with cylc stop u-bp881 --now --now and then restarted it with rose suite-restart, it did not restart from 18600701T0000Z. How do I restart the suite from 18600701T0000Z? Do I also need the data from the previous cycle as a restart dump?

Thanks
Kenji

comment:8 Changed 3 months ago by ros

Hi Kenji,

I'm not sure why you wish to rerun that cycle. The 3 files were only corrupted on JASMIN when copied over from ARCHER. The versions on ARCHER are the original, uncorrupted files, so all you need to do is manually rsync or scp them from ARCHER to JASMIN. Then retrigger the pptransfer task to verify that the checksums are now all OK.

Regards,
Ros.

P.S. rose suite-run --restart restarts a suite from where it left off.

comment:9 Changed 3 months ago by yb19052

Hi Ros,

I looked for the original files on NEXCS (/home/d03/kizumi/cylc-run/u-bp881/share/data/History_Data), but there are no files for 18600701T0000Z in that folder. That is why I thought I needed to rerun that cycle.

Do we have the original data in another folder on NEXCS?

Thanks
Kenji

comment:10 Changed 3 months ago by ros

Hi Kenji,

Yes, the data is staged in a directory in your /projects disk area, /projects/nexcs-n02/kizumi/u-bp881, which is specified in postproc → post processing → Archer archiving → archive_root_path.

Once you are happy that all your data is on JASMIN you will need to remove it from NEXCS yourself, as the system currently does not remove data from NEXCS automatically.

Regards,
Ros.

comment:11 Changed 3 months ago by yb19052

Hi Ros,

I appreciate your response. I found all necessary data there.

I tried to transfer the data from NEXCS to JASMIN using scp, but I got the error "Disk quota exceeded". In the end, I transferred the necessary data from NEXCS to JASMIN with sftp.

After the data transfer, I get the error "kizumi@xcslc1:~> X11 connection rejected because of wrong authentication." on NEXCS. How do I fix the issue?

When I typed "quota" in the terminal, I also got the error "quota: error while getting quota from master:/cm/shared for kizumi (id 41143): Connection refused".

Thanks
Kenji

comment:12 Changed 3 months ago by ros

Hi Kenji,

It's unclear from your message where the disk quota issue is. Possibly the pmip4 GWS on JASMIN? I don't have access to that GWS so can't tell you what the quota is or how much is being used. If you think the problem is here and you still have the problem you will need to contact the GWS manager for pmip4.

On NEXCS the /projects/nexcs-n02 was around 95% full when I looked. You can find this out by running the script quota.py -g nexcs-n02 lustre_multi.

Regards,
Ros.

comment:13 Changed 3 months ago by yb19052

Hi Ros,

About the data transfer to JASMIN:
I first tried to transfer the three output files to PUMA with 'scp', but I got the message "Disk quota exceeded" (not an error?) at this step. The data were then transferred from PUMA to JASMIN, but the files were incomplete. So I transferred the files to JASMIN directly with 'sftp' and did not get any messages.

Now when I try to run the rosie GUI ('rosie go') on PUMA, I get the same message, "X11 connection rejected because of wrong authentication." So the errors might come from PUMA.

Thanks
Kenji

comment:14 Changed 3 months ago by ros

Hi Kenji,

Yes, you have hit your PUMA quota. I have increased it slightly as you only had 1 GB. However, please DO NOT copy/route data through PUMA; it is not designed for this and does not have the disk space. You need to go direct from NEXCS to JASMIN using the JASMIN transfer servers (jasmin-xfer1[2].ceda.ac.uk), which you can do using rsync, scp or sftp (https://help.jasmin.ac.uk/article/3810-data-transfer-tools-rsync-scp-sftp).
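
A direct NEXCS to JASMIN copy of just the failed files could be assembled as below. This is only a sketch in Python: the JASMIN user name `jasmin_user` is a placeholder, the per-cycle subdirectory under the /projects staging area is an assumption about the layout, and the helper name `rsync_command` is hypothetical. The rsync options used (`-a`, `-v`, `--checksum`) are standard ones.

```python
XFER_HOST = "jasmin-xfer2.ceda.ac.uk"  # transfer server named in this ticket


def rsync_command(files, source_dir, user, dest_dir, host=XFER_HOST):
    """Build an rsync argument list that copies files straight to JASMIN."""
    sources = ["%s/%s" % (source_dir.rstrip("/"), name) for name in files]
    target = "%s@%s:%s/" % (user, host, dest_dir.rstrip("/"))
    # -a preserves file attributes, -v lists what is copied, and
    # --checksum compares file contents rather than size and timestamp.
    return ["rsync", "-av", "--checksum"] + sources + [target]


if __name__ == "__main__":
    cmd = rsync_command(
        ["bp881a.p618601011.pp", "bp881a.p818600801.pp"],
        # Assumed layout: one subdirectory per cycle in the staging area.
        "/projects/nexcs-n02/kizumi/u-bp881/18600701T0000Z",
        "jasmin_user",  # placeholder JASMIN login name
        "/gws/nopw/j04/pmip4_vol2/users/kizumi/u-bp881/18600701T0000Z",
    )
    # Print the command for review; run it by hand on NEXCS once checked.
    print(" ".join(cmd))
```

Printing the command rather than executing it leaves the paths open to inspection first. Nothing in it passes through PUMA.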

Hope that helps.

Regards,
Ros.

comment:15 Changed 3 months ago by yb19052

Hi Ros,

I appreciate your help. Now I can run the Cylc GUI on NEXCS.
I retriggered the 'pptransfer' task, but it ends up as "submit-failed".

job.err says that "ERROR: file not found: /home/d03/kizumi/cylc-run/u-bp881/log/job/18600701T0000Z/pptransfer/26/job.err
ERROR: command terminated by signal 1: ssh -oBatchMode=yes -oConnectTimeout=8 -oStrictHostKeyChecking=no -n xcs-c env CYLC_VERSION=7.8.3 bash --login -c "'"'exec "$0" "$@"'"'" cylc cat-log '--remote-arg='"'"'$HOME/cylc-run/u-bp881/log/job/18600701T0000Z/pptransfer/26/job.err'"'" --remote-arg=tail '--remote-arg='"'"'tail -n +1 -F %(filename)s'"'" u-bp881"

How do I fix the issue?

Thanks
Kenji

comment:16 Changed 3 months ago by ros

Hi Kenji,

This is caused by the host being set to xcs-c, which causes intermittent problems because it makes cylc log in from xcs-c to itself. This is a known problem. To fix it, in the site/meto_cray.rc file, in the [[HPC]] section, replace the line:

host = $(rose host-select {{ HOST_XC40 }})

with

host = localhost

Then run rose suite-run --reload to pick up the changes, and retrigger the pptransfer task.
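
For anyone who prefers to script the edit, the replacement can be sketched in Python as a small text transformation. This assumes the `host = ...` line sits somewhere under `[[HPC]]` (possibly in a nested `[[[...]]]` subsection). The function name `use_localhost` is hypothetical, and the real file should be checked by eye afterwards.

```python
import re

# Matches a two-bracket section header such as [[HPC]], but not [[[remote]]].
SECTION_RE = re.compile(r"\s*\[\[\s*([^\[\]]+?)\s*\]\]\s*$")
HOST_RE = re.compile(r"\s*host\s*=")


def use_localhost(rc_text):
    """Set host = localhost for host lines inside the [[HPC]] section."""
    out = []
    in_hpc = False
    for line in rc_text.splitlines():
        header = SECTION_RE.match(line)
        if header:
            # Entering a new [[...]] section: note whether it is HPC.
            in_hpc = header.group(1) == "HPC"
        if in_hpc and HOST_RE.match(line):
            indent = line[: len(line) - len(line.lstrip())]
            line = indent + "host = localhost"
        out.append(line)
    return "\n".join(out) + "\n"
```

Host settings in other sections are left untouched, and the original indentation of the replaced line is kept.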

Cheers,
Ros.

comment:17 Changed 3 months ago by yb19052

Hi Ros,
My suite, u-bp881, works well now. Thank you for your help.
Thanks
Kenji

comment:18 Changed 2 months ago by ros

  • Component changed from UM Model to Rose/Cylc
  • Resolution set to fixed
  • Status changed from accepted to closed