Opened 5 years ago

Closed 5 years ago

#1453 closed help (fixed)

monsoon to archer job

Reported by: lboljka Owned by: ros
Component: UMUI Keywords: monsoon->archer
Cc: Platform: ARCHER
UM Version: 8.6

Description

Hi

I have a question about converting a job from Monsoon to ARCHER. I have been following the instructions at http://cms.ncas.ac.uk/wiki/Faq/ConvertMonsoonJobToArcher, but the job is still not working.
I also had problems with specifying the following:

  • TIC code (maybe not necessary)
  • number of processors
  • Set required compile type and compile time limit
  • Section 96 should be set to 1A [cannot choose 1A instead of 1C]
  • Post processing: Set up as required for archiving
  • Time Convention and SCRIPT environment variables: Adjust settings for $DATAM and $DATAW as required. [this may be ok]
  • If using reconfiguration, set to run on same number of processors as model
  • Input Files: Copy over any non-standard ancillary files and start dumps. & Check paths to input files (ancils, dumps, etc) are correct.

I am attaching the errors that I receive (one in the UM interface after submission and another when opening ext.out). It cannot find the container.cfg file.

The job in question is xkwjb (owner lboljka), which is a copy of xjwoa (owner till).

Being a beginner I may have made a few mistakes along the way, but generally I have only changed what was described in the given link.

Thank you for your help.

Best regards

Lina

Attachments (9)

error_xkwjb.png (42.1 KB) - added by lboljka 5 years ago.
errors I receive
error.pdf (38.6 KB) - added by lboljka 5 years ago.
errors I receive after first suggested corrections
error_after_2ndCorrection.png (86.5 KB) - added by lboljka 5 years ago.
error_after_2ndCorrection.2.png (98.6 KB) - added by lboljka 5 years ago.
error_3rdCorrections.png (74.3 KB) - added by lboljka 5 years ago.
problems_after_reconf_run.png (44.5 KB) - added by lboljka 5 years ago.
error_11dayRun.png (97.1 KB) - added by lboljka 5 years ago.
user_block_quota_limit_reached.png (10.5 KB) - added by lboljka 5 years ago.
ghui_error_user_block_limit_reached.png (6.6 KB) - added by lboljka 5 years ago.


Change History (30)

Changed 5 years ago by lboljka

errors I receive

comment:1 Changed 5 years ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Lina,

The extract is failing because the bindings location is wrong.

Go to window FCM configuration → FCM options for UM atmos & recon and change the bindings location to be

fcm:um_br/dev/um/vn8.6_machine_cfg/src/configs/bindings

I.e. vn8.6_machine_cfg not VN8.6_machine_cfg

In window User information → General details you need to change the account code (aka TIC code) to be your ARCHER account code n02-ncas.

The number of processors you are running on should be a multiple of 24 as there are 24 cores per node on ARCHER. See window User information → Submit Method
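As a quick sanity check of the multiple-of-24 rule, here is a minimal Python sketch. It assumes the total core count is simply the product of the east-west and north-south process counts set in the UMUI (no extra I/O processes or threads); the function name `nodes_needed` is hypothetical and not part of any UM tool.

```python
CORES_PER_NODE = 24  # cores per ARCHER node, as stated above


def nodes_needed(ew_procs, ns_procs, cores_per_node=CORES_PER_NODE):
    """Return the number of whole nodes used by an EW x NS decomposition.

    Raises ValueError if the total is not a multiple of cores_per_node.
    Assumption: total processes = ew_procs * ns_procs.
    """
    total = ew_procs * ns_procs
    if total % cores_per_node != 0:
        raise ValueError(
            f"{ew_procs} x {ns_procs} = {total} is not a multiple of {cores_per_node}"
        )
    return total // cores_per_node
```

For example, `nodes_needed(12, 8)` gives 4 nodes (96 cores), whereas a 5 x 5 decomposition is rejected.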

Give that a try.
Regards,
Ros.

comment:2 Changed 5 years ago by lboljka

Hi Ros

Thank you for your answer. I have changed what you have suggested, but am now getting some permission errors.
What I have changed:

  • bindings location as suggested in your comment
  • account code to n02-ncas
  • in Job submission method I have changed number of processes for ATMOS in east-west to 48 and in north-south to 24.

Do I have to copy ancillary files or something? (how do I do it if I have to?)

I am attaching the errors I received.

Thank you.

Lina


Changed 5 years ago by lboljka

errors I receive after first suggested corrections

comment:3 Changed 5 years ago by ros

Hi Lina,

You need to make sure you can log in from PUMA to ARCHER without a prompt for password or passphrase. I suspect you haven't set up your ssh-keys to work with your ARCHER account since the training course. If this is the case you need to follow points 2, 3, 4 and 6 of the "setting up ssh-agent" instructions https://puma.nerc.ac.uk/trac/UM_TUTORIAL/wiki/Ros/sshAgent

If, when you run ssh-add, you get the error message "could not open a connection to your authentication agent", please follow the advice in the FAQ - http://cms.ncas.ac.uk/wiki/FAQ_T4_F5

Please could you also change the permissions on your ARCHER directories so that we can see them and help in the event of any further problems.

chmod -R g+rX /home/n02/n02/lboljka
chmod -R g+rX /work/n02/n02/lboljka
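For reference, the capital X in g+rX adds group execute (search) permission only to directories and to files that are already executable by someone, so plain data files do not become executable. A minimal Python sketch of the same logic, assuming a POSIX filesystem; the helper name `add_group_rx` is hypothetical and this is an illustration, not a replacement for the chmod commands above.

```python
import os
import stat


def add_group_rx(root):
    """Roughly emulate `chmod -R g+rX root` (symbolic links are skipped)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        # Each directory is visited once as dirpath; add the files it contains.
        paths = [dirpath] + [os.path.join(dirpath, name) for name in filenames]
        for path in paths:
            mode = os.lstat(path).st_mode
            if stat.S_ISLNK(mode):
                continue
            new_mode = stat.S_IMODE(mode) | stat.S_IRGRP  # g+r for everything
            already_exec = mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH)
            if stat.S_ISDIR(mode) or already_exec:
                new_mode |= stat.S_IXGRP  # the conditional "X" part
            os.chmod(path, new_mode)
```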

Regards,
Ros.

comment:4 Changed 5 years ago by lboljka

Hi Ros

I have followed the instructions (I can log in to ARCHER without a prompt for password/passphrase), but I am still getting a similar error to before (attached - see the second version) - still permission denied… Now that I think about it I may have changed the passphrase (but cannot remember - I hope I have not; then again it would probably tell me it was wrong?).

I have changed permissions on Archer as you suggested.

Thank you again.

Best Regards

Lina


Changed 5 years ago by lboljka

Changed 5 years ago by lboljka

comment:5 Changed 5 years ago by ros

Hi Lina,

This is a permission denied error from the mkdir command, which is trying to create a directory under /home/lboljka; that path does not exist on ARCHER. In UMUI window FCM configuration → FCM Extract directories you need to change Target machine root extract directory to be your home directory on ARCHER, i.e. /home/n02/n02/lboljka/um

Regards,
Ros.

comment:6 Changed 5 years ago by lboljka

Hi Ros

Now somehow I cannot log in to ARCHER. I get the error:
"ssh_exchange_identification: Connection closed by remote host"
Just before that I got another error in the job I am running, this time in the JULES component of it (attached).

Thank you in advance.

Best regards
Lina

Changed 5 years ago by lboljka

comment:7 Changed 5 years ago by ros

Hi Lina,

ARCHER is having problems at the moment. You should have received an email earlier this morning saying that there is currently a problem and users are unable to login to ARCHER. They will send round an email when service is restored. You can also keep an eye on the ARCHER website for further news and service status.

Regards,
Ros.

comment:8 Changed 5 years ago by lboljka

Hi Ros

I am very sorry, I did not receive any email regarding the ARCHER problems yesterday.
Later on yesterday I was able to submit the job (reconfiguration only). The .comp.leave file showed no problems, but the .rcf.leave file reported that a script could not be found (attached).
I tried turning off the script inserts and modifications under the input/output control and now I do not get the .rcf.leave file any more (at some point I was getting problems with the JULES component and it would never run - but maybe that was temporary). In fact only the first run produced a .rcf.leave file and all the rest did not. Is that ok?

Thank you for everything!!!!

Best regards

Lina

Changed 5 years ago by lboljka

comment:9 Changed 5 years ago by ros

Hi Lina,

You definitely need the script inserts turned off and you will also need to set GCOM collectives limit to 1 in Independent section options → misc sections 95, 96, 97 & 98. I think your job probably hung, hence no .rcf.leave file.

However, I've just realised that this is the standard antie job which we already have up and running on ARCHER. It'll probably be easier just to copy my xjvpu job. Alternatively, if you really want to get your job running, difference it in the UMUI against mine and hopefully you will be able to see all the remaining changes you need to make to get it to run.

Regards,
Ros.

comment:10 Changed 5 years ago by lboljka

Hi Ros

Thank you very much!!!! I will look into your job and see the differences. I guess it is my fault for not mentioning it earlier…
(I turned that script option off to see what happens - as the model could not find the file)

Thank you again for all the help!!!!

Best regards

Lina

comment:11 Changed 5 years ago by lboljka

Hi Ros

I would just like to thank you again for all the help. Your job has now run successfully.

Best wishes

Lina

comment:12 Changed 5 years ago by lboljka

Hi Ros

I do have one issue though. When I try to view model output files such as xkwjea.pa1988sep with xconv in my work directory, it says that it is a "byte swapped 64 bit ieee um file" and does not show anything, while the .astart file works fine. There is also no xxx.daxxxx file. The job is now xkwje. The .leave files did not seem to give any errors, only some warnings. Can I see the model output in some way?

Path: archer$ /work/n02/n02/lboljka/um/xkwje

Sorry for a new message.

Thank you again.

Lina


comment:13 Changed 5 years ago by ros

Hi Lina,

The job was only set to run for one day so that is why you don't have any xxx.daxxxx file. It is set to produce a dump file every 10 days, so if you change the Run Length (Input/output control and resources → Start date and Run length) to longer than 10 days you will get these files.
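To make the arithmetic concrete, here is a small Python sketch. It assumes dumps are written only at whole multiples of the dump interval counted from the start of the run, which is a simplification; `dumps_written` is a hypothetical helper, not a UM utility.

```python
def dumps_written(run_length_days, dump_interval_days=10):
    """Number of periodic dump files expected for a run of the given length.

    Assumption: one dump per completed dump interval, nothing in between.
    """
    if dump_interval_days <= 0:
        raise ValueError("dump interval must be positive")
    return run_length_days // dump_interval_days
```

Under that assumption a 1-day run yields no dump files, while an 11-day run yields one.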

Also if you want Climate Meaning, you will need to turn this on by selecting Defining a meaning sequence in Atmosphere → Control → Post processing, dumping & meaning → Dumping and meaning

Regards,
Ros.

comment:14 Changed 5 years ago by lboljka

Hi Ros

I have now tried running it for 11 days, but I receive an error (attached). I also got some errors for the meaning run, but maybe I should not get into that just yet (I have to make the 11-day run work first).

Thank you.

Best regards

Lina

Changed 5 years ago by lboljka

comment:15 Changed 5 years ago by ros

Hi Lina,

That's an intermittent error we have seen when the code is being copied over, and we are still working to get to the bottom of it. Usually if you try submitting again it will work.

If you are only changing the run length and other run time options you don't need to recompile the model executable or reconfiguration. So if you switch the compilations off you won't encounter the above problem.

Cheers,
Ros.

comment:16 Changed 5 years ago by lboljka

Hi Ros

When I try running the full model for 11 days (including compilation) - sometimes even a full run for 1 day - I get an error in the terminal saying that my quota/limit has been exceeded (attached). Do I have to delete files on PUMA, do I not have enough space, or is it something else?

Thank you.

Best regards

Lina

Changed 5 years ago by lboljka

Changed 5 years ago by lboljka

comment:17 Changed 5 years ago by lboljka

Hi Ros

Earlier I also tried running xkwje for 11 days with no compilation, and I did not get any additional dump files (xxx.daxxxx)…

Lina

comment:18 Changed 5 years ago by ros

Hi Lina,

Yes you have run out of disk space on PUMA. You need to delete some files. Your UM extract directories are building up and require deletion from time to time. Please delete the directories under /home/lboljka/um/um_extracts. They can safely be deleted and will be recreated as required by the UMUI.
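Before deleting anything, it can help to see which extract directories take up the most space. A minimal Python sketch, with hypothetical function names; note that it sums apparent file sizes, which may differ slightly from the block usage that the quota system counts.

```python
import os


def dir_size_bytes(path):
    """Total size in bytes of regular files under path (symlinks skipped)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            file_path = os.path.join(dirpath, name)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def extract_sizes(extracts_root):
    """Return (directory name, size in bytes) pairs, largest first."""
    sizes = []
    for entry in os.scandir(extracts_root):
        if entry.is_dir(follow_symlinks=False):
            sizes.append((entry.name, dir_size_bytes(entry.path)))
    return sorted(sizes, key=lambda item: item[1], reverse=True)
```

Calling `extract_sizes("/home/lboljka/um/um_extracts")` on PUMA would list each job's extract directory with its size, largest first.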

The reason you did not get any additional dumps when you re-ran for 11 days is because the run did not complete successfully. See the .leave file. You should always check that it says the run has completed successfully. I'll take a look and get back to you.

Cheers,
Ros.

comment:19 Changed 5 years ago by ros

Hi Lina,

I've sorted out the segmentation fault. In window FCM Configuration → FCM options for JULES & FLAKE please switch on Jeremy's um8.6_emis_ice_fix_UKMO branch by changing the 'N' to a 'Y' in the last column. You will then need to recompile the model executable before re-running.

Regards,
Ros.

comment:20 Changed 5 years ago by lboljka

Hi Ros

Thank you very much. It all seems to be working now.

Best regards

Lina

comment:21 Changed 5 years ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed