#2111 closed help (fixed)

Problems with fcm make on the XCS

Reported by: mhollaway Owned by: ros
Priority: normal Component: UKCA
Keywords: UKCA,XCS,FCM_MAKE Cc:
Platform: MONSooN UM Version: 10.6

Description

Hi,

I am currently having trouble getting a job running on the new xcs machine on MONSooN. The job id is u-ak609 and the fcm_make app fails with the following error.

[FAIL] config-file=/projects/ukca-lan/mhollo/cylc-run/u-ak609/work/19880901T0000Z/fcm_make_um/fcm-make.cfg:3
[FAIL] config-file= - svn://puma.nerc.ac.uk/um.xm_svn/main/trunk/fcm-make/meto-xc40-cce/um-atmos-safe.cfg@31236
[FAIL] svn://puma.nerc.ac.uk/um.xm_svn/main/trunk/fcm-make/meto-xc40-cce/um-atmos-safe.cfg@31236: cannot load config file
[FAIL] svn://puma.nerc.ac.uk/um.xm_svn/main/trunk/fcm-make/meto-xc40-cce/um-atmos-safe.cfg@31236: not found
[FAIL] svn: Can't connect to host 'puma.nerc.ac.uk': Connection timed out

[FAIL] fcm make -f /projects/ukca-lan/mhollo/cylc-run/u-ak609/work/19880901T0000Z/fcm_make_um/fcm-make.cfg -C /home/d01/mhollo/cylc-run/u-ak609/share/fcm_make_um -j 6 # return-code=1
Received signal ERR
cylc (scheduler - 2017-03-17T10:02:22Z): CRITICAL Task job script received signal ERR at 2017-03-17T10:02:22Z
cylc (scheduler - 2017-03-17T10:02:22Z): CRITICAL failed at 2017-03-17T10:02:22Z
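
In the output above, the line that identifies the root cause is the svn one: the config file cannot be loaded because the host cannot be reached. As a minimal sketch (the function name unreachable_hosts is hypothetical and not part of fcm, cylc or rose), a saved job log could be scanned for that message like this:

```python
import re

# Pattern for the Subversion connection failure seen in the output above.
_CONNECT_FAIL = re.compile(r"svn: Can't connect to host '([^']+)'")


def unreachable_hosts(log_text):
    """Return hosts that svn reported it could not connect to.

    Hosts are listed in order of first appearance, without duplicates.
    This is a hypothetical helper for illustration only.
    """
    hosts = []
    for match in _CONNECT_FAIL.finditer(log_text):
        host = match.group(1)
        if host not in hosts:
            hosts.append(host)
    return hosts
```

Given the log text above, this returns a list containing only puma.nerc.ac.uk, which shows the failure is a network problem on the build node rather than a problem with the suite configuration itself.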

The error appears to reference a link to PUMA but I thought that all links now came through the roses-u repository?

As mentioned, the job ran fine on the old xcm machine. I have only made the following changes in my attempts to run on the xcs.

1) In the sites/MONSooN.rc file I have changed the host from xcm to xcs and have also changed the number of nodes under the resources from 32 to 36 to reflect the changes on the new machine.

2) The old job on the xcm ran from a prebuild but this is not in place on the machine. Therefore I also removed the link to this prebuild in the above file.

Could any of these changes be linked to the error I am having? Or is this just a result of the machine switchover? I have looked through all of the known issues on the collaboration wiki and cannot see any reference to this.

Best Wishes,

Michael.

Change History (8)

comment:1 Changed 11 months ago by ros

  • Owner changed from um_support to ros
  • Status changed from new to accepted

Hi Michael,

All suites do the code extraction from the local mirror repositories rather than direct from the MOSRS repository as it is quicker.

The more recent suites do a 1-step build where they extract the source code and build it on the shared nodes rather than extract on exvmsrose and then do the build on the shared nodes. Your suite is configured for a 1-step build and there is currently a configuration issue with the shared nodes on XCS-C where they can't see PUMA. The Met Office are working on it. Unfortunately it is impossible to extract from the MOSRS from the shared nodes as it is not possible to cache your password there.
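
Whether a given node can see the mirror host can be checked independently of the suite. The following is only a sketch, assuming the mirror is served over the svn:// protocol on its default port 3690; the function name can_reach is hypothetical, and the check only tests that a TCP connection opens, not that a checkout would succeed.

```python
import socket

SVN_PORT = 3690  # default port for the svn:// protocol


def can_reach(host, port=SVN_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port opens within timeout.

    Hypothetical helper for illustration; it does not speak the svn
    protocol, it only checks that the port accepts a connection.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and name lookup failures.
        return False
```

Calling can_reach("puma.nerc.ac.uk") from a shared node and from exvmsrose would show whether the two differ in what they can see, which is the difference that matters for a 1-step build.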

I'll update you when I have further news.

Regards,
Ros.

comment:2 Changed 11 months ago by mhollaway

Hi Ros,

Thanks for the update. I suspected it may have had something to do with teething troubles on the switchover but just wanted to double check I had not made any silly mistakes from my end.

I'll watch this space for an update.

Cheers

Michael.

comment:3 Changed 11 months ago by ros

Hi Michael,

I believe the magic switch has been found. :-) My standalone subversion test works now. Could you try submitting your suite again please?

Cheers,
Ros.

comment:4 Changed 11 months ago by mhollaway

Hi Ros,

Thanks for the update, looks like the magic switch did the trick. My suite has now compiled successfully with no errors and is queueing to reconfigure :-)

I had an issue at first but realised I had forgotten to update the home directory my working copy was sitting in. Hopefully I have not missed updating any other links in the reconfiguration and atmos_main apps.

Thanks again for your help on this.

Cheers,

Michael.

comment:5 Changed 11 months ago by mhollaway

Hi Ros,

I don't know if this is related to the above issue, but I have found a problem with the reconfiguration step. It appears to be related to the issue raised in ticket #2096. As it happens, the suite I am running is a copy of the one Luke mentioned in that ticket.

My suite is reporting the same error on the XCS as Luke found on Archer. My job also originally ran fine on the xcm.

I have tried the fix listed in ticket #2096 but the reconfiguration step still seems to be having problems. Could there be something else I could have missed or could the issue be related to the new machine not being able to see the repositories?

Cheers,

Michael.

comment:6 Changed 11 months ago by ros

Hi Michael,

The fix listed in #2096 only applies to ARCHER as Monsoon2 can see the repositories ok.

Can you please point me to the log file with the error message in? I've looked at /home/d01/mhollo/cylc-run/u-ak609/log/job/19880901T0000Z/recon/01 and it says succeeded.

Cheers,
Ros.

comment:7 Changed 11 months ago by mhollaway

Hi Ros,

Apologies, I think I may have deleted the file with the original error with a rose suite clean. As it happens, the ARCHER fix seems to have worked for this job and it ran successfully for a 3 month test run (I guess this is the log file you have seen). I'm guessing that when I submitted the original job there may have just been an issue with the mirrors at that time, and following the fix for ARCHER bypassed the issue?

Everything seems to be running fine now. Now for the fun of making sure my code is up to scratch :-)

Cheers

Michael.

comment:8 Changed 11 months ago by ros

  • Resolution set to fixed
  • Status changed from accepted to closed

Hi Michael,

Thanks for letting us know. What you say is, I suspect, correct, and hopefully you shouldn't need the workaround anymore. Good luck with the coding. I'll close this ticket now.

Regards,
Ros.
