Opened 3 years ago

Closed 3 years ago

#2004 closed help (fixed)

Reconfiguration Randomly Failing

Reported by: s1374103
Owned by: um_support
Component: UM Model
Keywords: reconfiguration
Cc:
Platform: MONSooN
UM Version: 8.4

Description

Dear Helpdesk,

The job I am developing is behaving quite erratically. Sometimes it runs perfectly, sometimes it fails at reconfiguration.

[NID 00105] 2016-10-21 10:17:29 Exec /projects/ukca-ed/jakel/xmyvw/bin/qxreconf failed: chdir /work/scratch/jtmp/pbs.1067712.xcm00.x8z No such file or directory
/projects/ukca-ed/jakel/xmyvw/bin/qsrecon: Error in dump reconfiguration - see OUTPUT

If it fails at reconfiguration I re-submit the job and after a couple of attempts it usually runs. However, sometimes it does not run. In that case I copy to a new job and repeat the process. Should I be worried about this?

Base job - xlsjc - vn8.4 RJ4.0 CheST+GLOMAP-mode (release version)

My job - xmyvv - nudging, changes to chemistry scheme/emissions, GLOMAP-mode MS4 configuration.

Regards,

Jamie

Change History (6)

comment:1 in reply to: ↑ description Changed 3 years ago by s1374103

  • Platform set to MONSooN

comment:2 Changed 3 years ago by willie

Hi Jamie,
In fact, it consistently fails with a message like,

Exec /projects/umadmin/wmcgin/xmyze/bin/qxreconf failed: chdir /work/scratch/jtmp/pbs.1074788.xcm00.x8z No such file or directory

However, it does create a umui_submit_rcf script in the ~/umui_runs/xmyvv… directory which does work - you just qsub this. So there is no need for job copying.
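As a rough sketch of that manual step, the helper below locates the newest run directory for a job ID and prints the path to its umui_submit_rcf script. The function name latest_rcf_script is hypothetical, and the sketch assumes run directories are named <jobid>-<number> under ~/umui_runs on the machine where it is run.

```shell
# Hypothetical helper (not part of the UMUI): print the path to the
# umui_submit_rcf script in the newest run directory for a given job ID.
# Assumption: run directories are named <jobid>-<number> under ~/umui_runs.
latest_rcf_script() {
    job_id=$1
    runs_dir=${2:-"$HOME/umui_runs"}

    # ls -td lists matching directories newest first; keep only the first.
    # (Parsing ls output is acceptable here because these names contain
    # no whitespace or newlines.)
    latest=$(ls -td "$runs_dir/$job_id"-*/ 2>/dev/null | head -n 1)
    [ -n "$latest" ] || return 1

    script="${latest%/}/umui_submit_rcf"
    [ -f "$script" ] || return 1

    printf '%s\n' "$script"
}
```

With this defined, the resubmission would be something like qsub "$(latest_rcf_script xmyvv)", run on the machine that holds the run directory.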

We think this is a problem with MONSooN and we're still investigating.

Regards,
Willie

comment:3 Changed 3 years ago by willie

Hi Jamie,
If you switch off "Use different version of the UM code …" in the FCM options page it should work straight through.

Regards
Willie

comment:4 Changed 3 years ago by s1374103

Hi Willie,

When you say it creates a umui_submit_rcf script that I should submit, do I just type 'qsub ~umui_runs/xmyvv' on Puma at the command line? Is this equivalent to submitting through the UMUI?

Also, for a little while after you replied to my message the model was submitting fine. Now, when I submit through the UMUI, I get the following message:

Initialising SUBMIT...
Writing remote commands file...
Calling MAIN_SCR - local...
(This may take several minutes.)

MAIN_SCR: Calling Extract ...
Extracting UMATMOS base repository...
UMATMOS base repository extract is OK
Extracting JULES base repository...
JULES base repository extract is OK
created umscripts sub-directory.
Extracting UMSCRIPTS including any branches...
UMSCRIPTS extract is OK
created umatmos sub-directory.
Extracting UMATMOS including any branches...
UMATMOS extract is OK
created umrecon sub-directory.
Extracting UMRECON including any branches...
UMRECON extract is OK
MAIN_SCR: Extract OK
MAIN_SCR: Submit OK
Logging in to remote machines lander.monsoon-metoffice.co.uk and xcml00...


key_read: uudecode b2:06:df:3e:f5:e4:c9:5d:4d:1f:17:4d:89:1c:90:72  AAAAB3NzaC1yc2EAAAABIwAAAIEArb08RIqZgsa02Lj9pGCxwOOZ2NRRQrKKL/foZF47IkDtgepcyNIy9H4YJkry+grlGoimoMf6qab/ToRpXfzrcTqdI8yygOLxPctI8moOGI5SO4yq+LQ94fk8MlHe69sdmBNdCoIrlRcZo9BJlOr91ibqKR+NlyVC72l+QryJ7Zk=
 failed
key_read: uudecode b2:06:df:3e:f5:e4:c9:5d:4d:1f:17:4d:89:1c:90:72  AAAAB3NzaC1yc2EAAAABIwAAAIEArb08RIqZgsa02Lj9pGCxwOOZ2NRRQrKKL/foZF47IkDtgepcyNIy9H4YJkry+grlGoimoMf6qab/ToRpXfzrcTqdI8yygOLxPctI8moOGI5SO4yq+LQ94fk8MlHe69sdmBNdCoIrlRcZo9BJlOr91ibqKR+NlyVC72l+QryJ7Zk=
 failed
key_read: uudecode b2:06:df:3e:f5:e4:c9:5d:4d:1f:17:4d:89:1c:90:72  AAAAB3NzaC1yc2EAAAABIwAAAIEArb08RIqZgsa02Lj9pGCxwOOZ2NRRQrKKL/foZF47IkDtgepcyNIy9H4YJkry+grlGoimoMf6qab/ToRpXfzrcTqdI8yygOLxPctI8moOGI5SO4yq+LQ94fk8MlHe69sdmBNdCoIrlRcZo9BJlOr91ibqKR+NlyVC72l+QryJ7Zk=
 failed

REMCOMMS                                      100% 9567     9.3KB/s   00:00    
Creating directory...
Copying job files...

Renaming SUBMIT...
Changing SUBMIT permissions...
Running SUBMIT script...

Your job directory on host xcml00 is: /home/jakel/umui_runs/xmzmd-307095248

/home/jakel/umui_runs/xmzmd-307095248/SUBMIT[28]: .: /home/jakel/.profile: cannot open [No such file or directory]

Copying files to directory /projects/ukca-ed/jakel/xmzmd/baserepos/UMATMOS using rsync...
See /projects/ukca-ed/jakel/xmzmd/baserepos/UMATMOS/ext.out for output



Copying files to directory /projects/ukca-ed/jakel/xmzmd/baserepos/JULES using rsync...
See /projects/ukca-ed/jakel/xmzmd/baserepos/JULES/ext.out for output



Copying files to directory /projects/ukca-ed/jakel/xmzmd/umscripts using rsync...
See /projects/ukca-ed/jakel/xmzmd/umscripts/ext.out for output



Copying files to directory /projects/ukca-ed/jakel/xmzmd/umatmos using rsync...
See /projects/ukca-ed/jakel/xmzmd/umatmos/ext.out for output



Copying files to directory /projects/ukca-ed/jakel/xmzmd/umrecon using rsync...
See /projects/ukca-ed/jakel/xmzmd/umrecon/ext.out for output



Connection to xcml00 closed.
Connection to lander.monsoon-metoffice.co.uk closed.

Tidying local directories...
Job submission completed

Do you understand this?

Regards,

Jamie

comment:5 Changed 3 years ago by willie

Hi Jamie,

You don't need to qsub manually if you switch off "Use different version of the UM code …" in the FCM options page.

Is xmzmd actually failing? I couldn't find the leave file.

Willie

comment:6 Changed 3 years ago by willie

  • Resolution set to fixed
  • Status changed from new to closed