Opened 10 years ago

Closed 9 years ago

#544 closed error (fixed)

Error in submission box

Reported by: a.elvidge Owned by: um_support
Component: UM Model Keywords:
Cc: Platform:
UM Version: 7.1

Description (last modified by ros)

Hi,

Since Lois topped up the n02-ncas budget (ticket #543), I am unable to run my job. In the submission box I get the following error:

FCM_MAIN: Calling Extract ...
Base extract: failed
FCM_MAIN: Extract failed
Tidying up directories ...
FCM_MAIN stopped with return code 255

I haven't had this before. The job was working before the n02 time budget ran out, and I'm sure I haven't edited it at all since it was working. Any ideas?

Cheers, Andy

Thanks, Andy

Change History (5)

comment:1 Changed 10 years ago by ros

  • Description modified (diff)

Hi Andy,

If you look in your Extract output file /home/a.elvidge/um/um_extracts/xenoc/umbase/ext.out on PUMA you will see a message at the bottom which says:

Error: Timed out, host not responding

We get this message every now and then when the connection to HECToR is slow. Part of the job submission phase carries out a file comparison to work out which files need updating on HECToR. When the connection is slow this phase can fall over. I've been getting this problem on and off this afternoon too and am currently investigating. In the meantime I have made a modification to the FCM scripts to lengthen the timeout threshold. Please try submitting again and hopefully it will work this time.
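If you want to check for this without opening the file by hand, something like the following Python sketch could be run on PUMA. It only looks for the last line beginning with "Error:" in the extract output and tests whether it contains the timeout text quoted above. The function names are made up for illustration and are not part of FCM or the UM scripts.

```python
from pathlib import Path
import sys

# Text quoted from ext.out in this ticket.
TIMEOUT_MARKER = "Timed out, host not responding"


def last_error_line(text):
    """Return the last line starting with 'Error:', or None if there is none."""
    for line in reversed(text.splitlines()):
        stripped = line.strip()
        if stripped.startswith("Error:"):
            return stripped
    return None


def is_extract_timeout(text):
    """True if the final 'Error:' line reports the host timeout."""
    err = last_error_line(text)
    return err is not None and TIMEOUT_MARKER in err


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the path to ext.out as the first argument.
    contents = Path(sys.argv[1]).read_text(errors="replace")
    print("timeout" if is_extract_timeout(contents) else "other or no error")
```

If it reports a timeout, resubmitting once the connection has recovered is the first thing to try.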

If you don't mind your job doing a clean compile from scratch then you could remove the $DATADIR/$RUNID/umbase and /ummodel directories on HECToR as this will cause the submission to skip the file comparison step and simply copy over all required files from PUMA to HECToR.
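As a rough sketch of that clean-up step, the Python below removes the two build directories on HECToR. It assumes that "/ummodel" above means $DATADIR/$RUNID/ummodel, and that DATADIR and RUNID are set in your environment; the function name and the dry_run switch are illustrative only. It defaults to a dry run so nothing is deleted until you have checked the paths.

```python
import os
import shutil
from pathlib import Path


def clean_build_dirs(datadir, runid, dry_run=True):
    """Remove umbase and ummodel under datadir/runid; return the paths found."""
    found = []
    for name in ("umbase", "ummodel"):  # assumed directory names, see note above
        target = Path(datadir) / runid / name
        if target.is_dir():
            found.append(target)
            if not dry_run:
                shutil.rmtree(target)
    return found


if __name__ == "__main__" and "DATADIR" in os.environ and "RUNID" in os.environ:
    # Dry run: list what would be removed without deleting anything.
    for path in clean_build_dirs(os.environ["DATADIR"], os.environ["RUNID"]):
        print("would remove", path)
```

Calling clean_build_dirs(..., dry_run=False) actually deletes the directories, after which the next submission will do a full copy and a clean compile.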

Regards,
Ros.

comment:2 Changed 10 years ago by a.elvidge

Thanks Ros,

Unfortunately I've tried it 3 or 4 more times, and still no success.

Andy

comment:3 Changed 10 years ago by a.elvidge

It worked!

Cheers, Andy

comment:4 Changed 10 years ago by lois

Great, Ros said it had improved about 4.30pm

Ros is monitoring this to see if we can get to the cause with HECToR so let her know if it fails again.

We have had this problem intermittently and have never had a satisfactory explanation - annoying!

Lois

comment:5 Changed 9 years ago by ros

  • Resolution set to fixed
  • Status changed from new to closed

Following investigations and suggestions from HECToR, the following advice was issued to the PUMA mailing list in February and is copied here for completeness.

The performance of the /work filesystem on HECToR is causing some people to experience problems submitting compilation jobs to HECToR. The symptoms are very slow job submission from PUMA (at worst it may time out) and/or very slow compilation times. If you are experiencing these problems, we suggest you try compiling the job on the /home filesystem, which is not affected by filesystem load in the same way as Lustre is.

To compile on /home you will need to make the following change in the UMUI window "FCM Configuration —> FCM Extract and Build Directories":

Change "Target machine root extract directory (UM_ROUTDIR)" to point to a location in your /home directory.

Your job will then compile in /home. The location of the executable, scripts and run output files will remain unchanged; they will be placed in your specified /work directories as usual.
