Opened 6 years ago
Closed 6 years ago
#1453 closed help (fixed)
monsoon to archer job
Reported by: | lboljka | Owned by: | ros |
---|---|---|---|
Component: | UMUI | Keywords: | monsoon->archer |
Cc: | | Platform: | ARCHER |
UM Version: | 8.6 | | |
Description
Hi
I have a question regarding the conversion of job from monsoon to archer. I have been following the instructions on http://cms.ncas.ac.uk/wiki/Faq/ConvertMonsoonJobToArcher, but the job is still not working.
I also had problems with specifying the following:
- TIC code (maybe not necessary)
- number of processors
- Set required compile type and compile time limit
- Section 96 should be set to 1A [cannot choose 1A instead of 1C]
- Post processing: Set up as required for archiving
- Time Convention and SCRIPT environment variables: Adjust settings for $DATAM and $DATAW as required. [this may be ok]
- If using reconfiguration, set to run on same number of processors as model
- Input Files: Copy over any non-standard ancillary files and start dumps. & Check paths to input files (ancils, dumps, etc) are correct.
I am attaching the errors that I receive (one in the UM interface after submission and another when opening ext.out). It is not able to find the container.cfg file.
The job in question is xkwjb (owner lboljka), which is a copy of xjwoa (from till).
Being a beginner I may have made a few mistakes along the way, but generally I have only changed what was written in the given link.
Thank you for your help.
Best regards
Lina
Attachments (9)
Change History (30)
Changed 6 years ago by lboljka
comment:1 Changed 6 years ago by ros
- Owner changed from um_support to ros
- Status changed from new to accepted
Hi Lina,
The extract is failing because the bindings location is wrong.
Go to window FCM configuration → FCM options for UM atmos & recon and change the bindings location to be
fcm:um_br/dev/um/vn8.6_machine_cfg/src/configs/bindings
I.e. vn8.6_machine_cfg not VN8.6_machine_cfg
In window User information → General details you need to change the account code (aka TIC code) to be your ARCHER account code n02-ncas.
The number of processors you are running on should be a multiple of 24 as there are 24 cores per node on ARCHER. See window User information → Submit Method
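As a quick sanity check of a decomposition against the 24-cores-per-node rule, the arithmetic can be sketched in shell. The function name `check_decomp` is hypothetical and only illustrates the calculation; it is not part of the UMUI or the UM scripts.

```shell
# Hypothetical helper: multiply the east-west and north-south process counts
# and check whether the total fills whole 24-core ARCHER nodes.
check_decomp() {
    ew=$1
    ns=$2
    total=$((ew * ns))
    if [ $((total % 24)) -eq 0 ]; then
        echo "$total processes = $((total / 24)) full nodes"
    else
        echo "$total processes is not a multiple of 24"
    fi
}

# Example decomposition: 48 east-west by 24 north-south.
check_decomp 48 24
```

With 48 x 24 the total is 1152 processes, which is exactly 48 nodes.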
Give that a try.
Regards,
Ros.
comment:2 Changed 6 years ago by lboljka
Hi Ros
Thank you for your answer. I have changed what you have suggested, but am now getting some permission errors.
What I have changed:
- bindings location as suggested in your comment
- account code to n02-ncas
- in Job submission method I have changed number of processes for ATMOS in east-west to 48 and in north-south to 24.
Do I have to copy ancillary files or something? (how do I do it if I have to?)
I am attaching the errors I received.
Thank you.
Lina
comment:3 Changed 6 years ago by ros
Hi Lina,
You need to make sure you can log in from PUMA to ARCHER without a prompt for password or passphrase. I suspect you haven't set up your ssh-keys to work with your ARCHER account since the training course. If this is the case you need to follow points 2, 3, 4 and 6 of the "setting up ssh-agent" instructions https://puma.nerc.ac.uk/trac/UM_TUTORIAL/wiki/Ros/sshAgent
If you get the error message "Could not open a connection to your authentication agent" when you run ssh-add, please follow the advice in the FAQ: http://cms.ncas.ac.uk/wiki/FAQ_T4_F5
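Before resubmitting, the state of the agent can be checked from the PUMA command line. The sketch below relies on the documented exit status of `ssh-add -l` (0 when keys are listed, 1 when the agent holds no identities, 2 when no agent can be contacted). The function name `describe_agent_status` is hypothetical, and the commented-out login test uses placeholders rather than a real host name.

```shell
# Hypothetical helper: turn the exit status of "ssh-add -l" into a message.
describe_agent_status() {
    case "$1" in
        0) echo "agent is running and holds at least one key" ;;
        1) echo "agent is running but holds no keys: run ssh-add" ;;
        2) echo "cannot contact an agent: start ssh-agent first" ;;
        *) echo "unexpected status $1" ;;
    esac
}

# Only query the agent if ssh-add is installed on this machine.
if command -v ssh-add >/dev/null 2>&1; then
    ssh-add -l >/dev/null 2>&1
    describe_agent_status $?
fi

# A non-interactive login test (placeholders, fill in your own details):
# BatchMode=yes makes ssh fail instead of prompting for a password.
# ssh -o BatchMode=yes <user>@<archer-login-host> true
```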
Please could you also change the permissions on your ARCHER directories so that we can see them and help in the event of any further problems.
chmod -R g+rX /home/n02/n02/lboljka
chmod -R g+rX /work/n02/n02/lboljka
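The capital `X` in `g+rX` gives group execute (search) permission to directories, and to files that are already executable, while plain data files only gain group read. A minimal demonstration in a throwaway directory, assuming a standard `mktemp` and `ls`, might look like this:

```shell
# Illustration only: works in a scratch directory, not on real data.
demo=$(mktemp -d)
mkdir "$demo/sub"
touch "$demo/sub/data.txt"

# Start from owner-only permissions, then apply the same mode as above.
chmod -R go-rwx "$demo"
chmod -R g+rX "$demo"

# The first ten characters of an "ls -l" line are the type and permission bits.
dir_mode=$(ls -ld "$demo/sub" | cut -c1-10)
file_mode=$(ls -ld "$demo/sub/data.txt" | cut -c1-10)
echo "directory: $dir_mode"
echo "file:      $file_mode"

rm -rf "$demo"
```

The directory ends up group-readable and searchable, while the data file is group-readable only.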
Regards,
Ros.
comment:4 Changed 6 years ago by lboljka
Hi Ros
I have followed the instructions (I can log in to ARCHER without a prompt for password/passphrase), but am still getting a similar error to before (attached, see the second version): still permission denied… Now that I think about it I may have changed the passphrase (I cannot remember, and I hope I have not; then again it would probably tell me it was wrong?).
I have changed permissions on Archer as you suggested.
Thank you again.
Best Regards
Lina
Changed 6 years ago by lboljka
Changed 6 years ago by lboljka
comment:5 Changed 6 years ago by ros
Hi Lina,
This is a permission denied error from the mkdir command trying to create a directory. /home/lboljka does not exist. In UMUI window FCM configuration → FCM Extract directories you need to change Target machine root extract directory to be your home directory on ARCHER i.e. /home/n02/n02/lboljka/um
Regards,
Ros.
comment:6 Changed 6 years ago by lboljka
Hi Ros
Now somehow I cannot log into Archer. I get the error:
"ssh_exchange_identification: Connection closed by remote host"
But just before that I got another error in the job I am running, this time in the JULES component of it (attached).
Thank you in advance.
Best regards
Lina
Changed 6 years ago by lboljka
comment:7 Changed 6 years ago by ros
Hi Lina,
ARCHER is having problems at the moment. You should have received an email earlier this morning saying that there is currently a problem and users are unable to login to ARCHER. They will send round an email when service is restored. You can also keep an eye on the ARCHER website for further news and service status.
Regards,
Ros.
comment:8 Changed 6 years ago by lboljka
Hi Ros
I am very sorry, I did not receive any email regarding Archer problems yesterday.
Later on yesterday I was able to submit the job (reconfiguration only). The .comp.leave file showed no problems; however, the .rcf.leave file contained an error saying that a script could not be found (attached).
I tried turning off the script inserts and modifications under the input/output control and now I do not get the .rcf.leave file anymore (at some point I was getting problems with Jules component and it would not ever run - but maybe it was temporary). Actually only the first run had .rcf.leave file and all the rest did not. Is that ok?
Thank you for everything!!!!
Best regards
Lina
Changed 6 years ago by lboljka
comment:9 Changed 6 years ago by ros
Hi Lina,
You definitely need the script inserts turned off and you will also need to set GCOM collectives limit to 1 in Independent section options → misc sections 95, 96, 97 & 98. I think your job probably hung, hence no .rcf.leave file.
However, I've just realised that this is the standard antie job which we already have up and running on ARCHER. It'll probably be easier just to copy my xjvpu job. Alternatively, if you really want to get your own job running, difference it against mine in the UMUI and hopefully you will be able to see all the remaining changes you need to make to get it to run.
Regards,
Ros.
comment:10 Changed 6 years ago by lboljka
Hi Ros
Thank you very much!!!! I will look into your job and see the differences. I guess it is my fault for not mentioning it earlier…
(I turned that script option off to see what happens - as the model could not find the file)
Thank you again for all the help!!!!
Best regards
Lina
comment:11 Changed 6 years ago by lboljka
Hi Ros
I would just like to thank you again for all the help. Your job has now run successfully.
Best wishes
Lina
comment:12 Changed 6 years ago by lboljka
Hi Ros
I do get one issue though. When trying to see the model output in files like xkwjea.pa1988sep with xconv on my work directory it says that it is a "byte swapped 64 bit ieee um file" and does not show anything, while the .astart file works fine; but there also is no xxx.daxxxx file. The job is now xkwje. The .leave files did not seem to give any errors except some warnings. Can I see the model output in some way?
Path: archer$ /work/n02/n02/lboljka/um/xkwje
Sorry for a new message.
Thank you again.
Lina
comment:13 Changed 6 years ago by ros
Hi Lina,
The job was only set to run for one day so that is why you don't have any xxx.daxxxx file. It is set to produce a dump file every 10 days, so if you change the Run Length (Input/output control and resources → Start date and Run length) to longer than 10 days you will get these files.
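The relationship between run length and dump files can be sketched with integer arithmetic. This is a simplification that assumes a dump is written only at whole multiples of the dump interval; the function name `dumps_expected` is hypothetical.

```shell
# Hypothetical helper: number of periodic dumps for a given run length,
# assuming one dump at every whole multiple of the dump interval (in days).
dumps_expected() {
    run_days=$1
    dump_every=$2
    echo $((run_days / dump_every))
}

echo "1-day run:  $(dumps_expected 1 10) dump(s)"
echo "11-day run: $(dumps_expected 11 10) dump(s)"
```

Under that assumption a 1-day run with a 10-day dump interval produces no periodic dump, while an 11-day run produces one.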
Also if you want Climate Meaning, you will need to turn this on by selecting Defining a meaning sequence in Atmosphere → Control → Post processing, dumping & meaning → Dumping and meaning
Regards,
Ros.
comment:14 Changed 6 years ago by lboljka
Hi Ros
I have tried running it for 11 days now, but I receive an error (attached). I also got some errors for the meaning run, but maybe I should not get into that just yet (I have to make the 11-day run work first).
Thank you.
Best regards
Lina
Changed 6 years ago by lboljka
comment:15 Changed 6 years ago by ros
Hi Lina,
That's an intermittent error we have seen when the code is being copied over, and we are still working to get to the bottom of it. Usually if you try submitting again it will work.
If you are only changing the run length and other run time options you don't need to recompile the model executable or reconfiguration. So if you switch the compilations off you won't encounter the above problem.
Cheers,
Ros.
comment:16 Changed 6 years ago by lboljka
Hi Ros
When I try running the full model for 11 days (including compilation), and sometimes even for a full 1-day run, I get an error in the terminal saying that my quota/limit has been exceeded (attached). Do I have to delete files on PUMA, do I not have enough space, or is it something else?
Thank you.
Best regards
Lina
Changed 6 years ago by lboljka
Changed 6 years ago by lboljka
comment:17 Changed 6 years ago by lboljka
Hi Ros
Earlier I have also tried running xkwje with no compilation and just 11 days long and I did not get additional dump files (xxx.daxxxx)…
Lina
comment:18 Changed 6 years ago by ros
Hi Lina,
Yes you have run out of disk space on PUMA. You need to delete some files. Your UM extract directories are building up and require deletion from time to time. Please delete the directories under /home/lboljka/um/um_extracts. They can safely be deleted and will be recreated as required by the UMUI.
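A cautious way to do this clean-up is to report the space used first and then remove only the subdirectories of the extract area. The sketch below demonstrates the idea on a scratch copy of the layout; the function name `clear_extracts` is hypothetical, and the subdirectory names are placeholders (the job IDs from this ticket are used purely for illustration).

```shell
# Hypothetical helper: report disk usage, then delete each subdirectory
# under the given extract root, leaving the root itself in place.
clear_extracts() {
    root=$1
    if [ ! -d "$root" ]; then
        echo "no such directory: $root" >&2
        return 1
    fi
    du -sh "$root"
    for d in "$root"/*/; do
        [ -d "$d" ] || continue   # skip the unexpanded pattern when empty
        rm -rf -- "$d"
    done
}

# Demonstration on a scratch directory rather than the real extract area.
scratch=$(mktemp -d)
mkdir -p "$scratch/um_extracts/xkwjb" "$scratch/um_extracts/xkwje"
clear_extracts "$scratch/um_extracts"
remaining=$(ls -A "$scratch/um_extracts" | wc -l | tr -d ' ')
echo "entries left: $remaining"
rm -rf "$scratch"
```

On PUMA the equivalent call would be `clear_extracts /home/lboljka/um/um_extracts`, after confirming that no extract is currently in progress.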
The reason you did not get any additional dumps when you re-ran for 11 days is because the run did not complete successfully. See the .leave file. You should always check that it says the run has completed successfully. I'll take a look and get back to you.
Cheers,
Ros.
comment:19 Changed 6 years ago by ros
Hi Lina,
I've sorted out the segmentation fault. In window FCM Configuration → FCM options for JULES & FLAKE please switch on Jeremy's um8.6_emis_ice_fix_UKMO branch by changing the 'N' to a 'Y' in the last column. You will then need to recompile the model executable before re-running.
Regards,
Ros.
comment:20 Changed 6 years ago by lboljka
Hi Ros
Thank you very much. It all seems to be working now.
Best regards
Lina
comment:21 Changed 6 years ago by ros
- Resolution set to fixed
- Status changed from accepted to closed