Opened 4 years ago

Closed 4 years ago

#1764 closed help (answered)

basename: missing operand, ambiguous error message (urgent)

Reported by: dan2012
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: MONSooN
UM Version: 8.4

Description

Dear CMS,

I have been trying to re-submit a simulation that previously ran for 1 month and archived to moose successfully today; however, it now fails with an ambiguous error:

xlyoj000.xlyoj.d15339.t013040.leave
basename: missing operand
Try `basename --help' for more information.

This continued until the job ran out of time.

I have not changed anything in the setup except the run length, compared with my job which ran successfully and archived to moose (XLYOA).

I do not expect anyone to pick this up until Monday. However, if there is a quick solution, I would greatly appreciate it if you could let me know, as I am working against a Monday deadline to provide results for the first year of the simulation.

Many thanks in advance for your time
Daniel
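
(Editorial note on the error message.) GNU `basename` prints "missing operand" when it is invoked with no arguments at all. In shell scripts this commonly happens when an unquoted variable expands to an empty string. The ticket does not identify which script call was responsible, so the Python sketch below only models that general mechanism. The helper name `basename_argv` and the example path are hypothetical, and `shlex` is used as a stand-in for shell word splitting rather than as a faithful shell emulator.

```python
import shlex


def basename_argv(value, quoted):
    """Model the argument list a POSIX-like shell would build for
    `basename $value` (unquoted) or `basename "$value"` (quoted).

    Assumption: shlex.split approximates shell word splitting well
    enough for plain paths and empty strings. This is an illustration,
    not the UM scripts themselves.
    """
    arg = shlex.quote(value) if quoted else value
    return shlex.split(f"basename {arg}")


# A non-empty value yields exactly one operand.
print(basename_argv("/some/dir/file.pp", quoted=False))

# An empty, unquoted value vanishes: basename is left with no operand,
# which is the situation that produces "missing operand".
print(basename_argv("", quoted=False))

# Quoting keeps the empty string as an (empty) operand.
print(basename_argv("", quoted=True))
```

If the cause is of this kind, the useful diagnostic step is to find which variable was empty at the point of the call, since the message itself names no file or script.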

Change History (9)

comment:1 Changed 4 years ago by dan2012

I noticed that in the project group name in the archiving panel (Model Selection → Post Processing → Main Switch + General Questions) I have "ukca-meto" set, instead of "ukca-ox".

I am testing with this changed now. However, if this is the problem, then I am at a loss as to how the original job (xlyoa) ran successfully.

Best,
Daniel

comment:2 Changed 4 years ago by dan2012

This change did not resolve the crash.
Best
Dan

comment:3 Changed 4 years ago by dan2012

However, reducing the resubmit window in the UMUI from 20 days to 10 days has resolved the issue.
I do not understand why the job is running at half the speed it achieved when I last ran it.
At this speed it will unfortunately be very difficult to have results in time.
Best,
Daniel

comment:4 Changed 4 years ago by dan2012

Hi CMS,
To conclude, the job currently running (XLYOO) is running at half the speed of the job I ran a few weeks ago (XLYOA, 1 month archived on moose). There are some minor changes between these jobs; however, I have also run a test with no changes and the performance degradation still exists.
The good news is that the jobs are running and the crash is solved.

If you could let me know asap whether any settings currently need changing in the UMUI to resolve this speed issue, so that I can restart the remainder of the simulation, I would be very grateful.

Best,
Daniel

comment:5 Changed 4 years ago by grenville

Daniel

We'd need to reproduce the slowdown to have a chance of understanding why it is happening. You could switch on subroutine timers in future; that would show you how individual routines are running.

Grenville

comment:6 Changed 4 years ago by ros

  • Status changed from new to pending

Hi Daniel,

We have put a fix into the UMUI which should solve the slow running of your jobs. Please try running your job again and let us know how you get on.

If you are not recompiling your model executable you will need to go into the UMUI window Compilation and Run options → UM Scripts Build and switch on "Enable build of UM scripts" to pick up the script changes. Save, Process and Submit as usual. This only needs to be done once and can be switched off for subsequent submissions. To pick up the new changes, please also make sure that you are not specifying a revision number for the branch fcm:um-br/pkg/Config/vn8.4_ncas.

Regards,
Ros.

(Helpdesk note: See also #1766)

Last edited 4 years ago by ros

comment:7 Changed 4 years ago by dan2012

Hi Ros,
Thanks for working on this.

I have not had time to test thoroughly as I am quite pressed with work before the holidays start, but I thought you should know that I did start one test, which crashed during compilation:
xlyox000.xlyox.d15352.t100547.rcf.leave

Best regards,
Dan

comment:8 Changed 4 years ago by annette

Hi Dan,

This is due to the fix we applied for the slowdown. The key message in the leave file is:

aprun: -N cannot exceed -n

This occurs because the reconfiguration in your job is set to use 4 MPI tasks in total (aprun's -n), but the UMUI scripts now pass 32 MPI tasks per node (aprun's -N), and aprun requires that -N does not exceed -n.

For now the best thing to do is set your job to use 32 MPI tasks for the reconfiguration. Go to Reconfiguration → General reconfiguration options and set the number of processors EW to be 4, and the processors NS to be 8.
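
As a rough illustration of the constraint above, the following Python sketch checks whether a given EW × NS decomposition satisfies "-N cannot exceed -n". The function name `reconfiguration_layout_ok` is hypothetical, and the 2 × 2 layout is only one assumed way of arriving at the 4 tasks mentioned; the ticket does not state the original EW and NS values.

```python
def reconfiguration_layout_ok(procs_ew, procs_ns, tasks_per_node=32):
    """Return True if aprun's "-N cannot exceed -n" rule is satisfied.

    procs_ew * procs_ns is the total number of MPI tasks (aprun -n).
    tasks_per_node is the value the scripts pass as aprun -N; 32 is
    the figure quoted in this ticket.
    """
    total_tasks = procs_ew * procs_ns
    return tasks_per_node <= total_tasks


# Assumed original layout: 2 x 2 = 4 tasks, fewer than 32 per node.
print(reconfiguration_layout_ok(2, 2))

# Suggested layout: 4 x 8 = 32 tasks, which matches 32 per node.
print(reconfiguration_layout_ok(4, 8))
```

Any decomposition whose product is at least 32 would pass this particular check; 4 × 8 is simply the one recommended here.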

See also ticket:1758#comment:10.

Annette

comment:9 Changed 4 years ago by ros

  • Resolution set to answered
  • Status changed from pending to closed

Closing due to lack of activity.
