Opened 7 months ago

Closed 5 months ago

#3394 closed help (answered)

submission failed with spin up runs with dump file

Reported by: yaogao
Owned by: jules_support
Component: JULES
Keywords: submission failed
Cc:
Platform:
UM Version:

Description

Hi Patrick,

I have run JULES vn5.8 on SLURM and got outputs.

However, because the soil carbon was not yet spun up after the first run (8000 years), I want to do another spin-up starting from the dump file of the first run. When I tried this, I got the job error "submission failed". I tried to check what is wrong, but I cannot figure it out. I have attached the job.err message below. I think you also have permission to read my files now. The suite ID is "u-bx728". Thanks for your help!

Traceback (most recent call last):
  File "/apps/contrib/metomi/cylc-7.8.1/bin/cylc-cat-log", line 439, in <module>
    main()
  File "/apps/contrib/metomi/cylc-7.8.1/bin/cylc-cat-log", line 435, in main
    tmpfile_edit(out, options.geditor)
  File "/apps/contrib/metomi/cylc-7.8.1/bin/cylc-cat-log", line 265, in tmpfile_edit
    modtime1 = os.stat(tmpfile).st_mtime
TypeError: coercing to Unicode: need string or buffer, int found

Change History (9)

comment:1 Changed 7 months ago by pmcguire

Hi Yao
Sometimes it just helps to try again, when you get a submission failed error.
Make sure your Xwindows is working, by starting an Xclock. And maybe you can make sure that you are logged in to MOSRS properly, if the suite needs that. You can also look at your log files in the ~/cylc-run/u-bx728/log directory. They may give different error messages than what you report here.
Patrick

comment:2 Changed 7 months ago by yaogao

Hi Patrick,

I have tried many times, but the submission always fails. Xclock works, and MOSRS is connected. The build session worked, but submission failed for the spin-up session. The other error message, from job.activity, is shown below, but I could not figure out what it means (e.g. which file is missing). The job file with the job settings looks right to me.

Traceback (most recent call last):
  File "/apps/contrib/metomi/cylc-7.8.1/bin/cylc-cat-log", line 439, in <module>
    main()
  File "/apps/contrib/metomi/cylc-7.8.1/bin/cylc-cat-log", line 435, in main
    tmpfile_edit(out, options.geditor)
  File "/apps/contrib/metomi/cylc-7.8.1/bin/cylc-cat-log", line 268, in tmpfile_edit
    proc = Popen(cmd, stderr=PIPE)
  File "/usr/lib64/python2.7/subprocess.py", line 711, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1327, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
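
For readers who meet the same traceback: an "OSError: [Errno 2]" raised from inside Popen usually means that the program being launched could not be found, rather than that a data file is missing. The sketch below reproduces this behaviour in Python 3 using a deliberately nonexistent, hypothetical command name. It is only an illustration and does not reproduce the cylc-cat-log code.

```python
import errno
import subprocess


def try_launch(cmd):
    """Try to start cmd; return None on success, or the errno if it cannot start."""
    try:
        proc = subprocess.Popen(cmd, stderr=subprocess.PIPE)
    except OSError as exc:  # in Python 3, FileNotFoundError is a subclass of OSError
        return exc.errno
    proc.communicate()  # wait for the child and drain its stderr
    return None


# "no-such-editor-xyz" is a hypothetical name, assumed not to exist on PATH.
result = try_launch(["no-such-editor-xyz", "somefile.txt"])
print(result == errno.ENOENT)
```

If this is what happened here, the "missing file" would be the viewer or editor command that the log tool tried to start, which is an assumption that the traceback alone cannot confirm.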

comment:3 Changed 7 months ago by pmcguire

Hi Yao:
When I try to run my copy of your suite in ~pmcguire/roses/u-bx728ygao, I get similar errors.

I note that when I run your suite, files are written out to the cylc* subdirectories of my TMPDIR, and they contain error messages about the TMPDIR that are similar to (but longer than) what you see above.

echo $TMPDIR

/home/users/pmcguire/tmpdir
Do you have TMPDIR defined?
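
A quick way to answer that question is sketched below: a small check of whether TMPDIR is set, points at an existing directory, and is writable. The function name check_tmpdir and its status strings are made up for this illustration and are not part of cylc or Rose.

```python
import os
import tempfile


def check_tmpdir(env=None):
    """Return a short status string describing the TMPDIR setting in env."""
    env = os.environ if env is None else env
    path = env.get("TMPDIR")
    if not path:
        return "unset"
    if not os.path.isdir(path):
        return "missing"
    if not os.access(path, os.W_OK):
        return "not writable"
    return "ok"


# Status for the current environment.
print(check_tmpdir())
# Status for an explicit mapping, here the system default temporary directory.
print(check_tmpdir({"TMPDIR": tempfile.gettempdir()}))
```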

I also note that you are using 1 processor on the queue par-multi. Maybe that is inefficient. Maybe it would be better to use the queue short-serial or short-serial-4hr? There might be long waiting times in the short-serial queue right now but short-serial-4hr might be better. The par-multi queue is meant for multi-core jobs.

Patrick

comment:4 Changed 6 months ago by yaogao

Hi Patrick,

I found that the problem is actually that JASMIN will be undergoing benchmarking work this Saturday from 4pm to 8pm, and again next Saturday. My job's reservation period (24 hours) ran into this benchmarking period, and that is why the submission failed. After I made the reservation period much shorter, the job could run. An email about the JASMIN benchmarking work was sent at 2pm yesterday. Such an experience! I should learn from it. Nevertheless, the JASMIN team did not send the email early enough either: I had already tried to submit the job on Thursday evening (at that time my reservation period was 48 hours), and the submission failed.

Thank you very much for your help!
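
The reasoning above can be sketched as a simple interval check: a job that starts at submission time and may run for its full requested wall time collides with the benchmarking window if the two intervals overlap. The dates below are hypothetical (a Thursday evening submission and a Saturday 16:00 to 20:00 window), and the function names are invented for this example.

```python
from datetime import datetime, timedelta


def overlaps_window(start, wall_time, window_start, window_end):
    """True if a job running from start for wall_time would touch the window."""
    end = start + wall_time
    return start < window_end and end > window_start


def max_wall_time(start, window_start):
    """Longest wall time that still finishes before the window opens."""
    return max(window_start - start, timedelta(0))


# Hypothetical dates: Thursday 18:00 submission, Saturday 16:00-20:00 window.
submit = datetime(2021, 1, 7, 18, 0)
window = (datetime(2021, 1, 9, 16, 0), datetime(2021, 1, 9, 20, 0))

print(overlaps_window(submit, timedelta(hours=48), *window))
print(overlaps_window(submit, timedelta(hours=24), *window))
print(max_wall_time(submit, window[0]))
```

With these assumed times, a 48-hour request reaches into the window while a 24-hour request does not, which matches the pattern of shortening the requested time until the job is accepted.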
Yao


comment:5 Changed 6 months ago by pmcguire

Hi Yao
I am glad it's working now.

Can you also consider using the short-serial queue instead of the par-multi queue? You're not doing parallel processing as far as I can see. Do you think this would work?

Patrick

comment:6 Changed 6 months ago by yaogao

Hi Patrick,

Thanks! I am using '--qos = short-serial' and '--partition = par-multi'. I will change par-multi to par-single, but I am already using the 'short-serial' queue, aren't I?

Best regards,
Yao

comment:7 Changed 6 months ago by pmcguire

Hi Yao:
You had changed from --qos=long to --qos=short-serial since I last looked at your suite.

If I do: sacctmgr show qos, it suggests that short-serial doesn't exist as a qos. It is short or short-4hr instead.

The flag --partition is the one I was exclusively using before. I had not used --qos before. If both are specified, then --partition normally overrides the queue setting or the partition specified in the partition settings for --qos. See: https://slurm.schedmd.com/qos.html
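
The check described above, comparing a requested QOS name with the names the scheduler actually knows, can be sketched as follows. The sample text stands in for a one-name-per-line listing of QOS names (for example, captured from sacctmgr with name-only output; the exact flags for that are an assumption to verify against the sacctmgr man page). The names in it are only those mentioned in this ticket, and both function names are hypothetical.

```python
def parse_qos_names(listing):
    """Extract QOS names from text containing one name per line."""
    return {line.strip() for line in listing.splitlines() if line.strip()}


def check_qos(requested, available):
    """Return None if requested is a known QOS, else a hint naming similar ones."""
    if requested in available:
        return None
    similar = sorted(
        name for name in available
        if requested.startswith(name) or name.startswith(requested)
    )
    return "unknown qos %r; similar: %s" % (requested, ", ".join(similar) or "none")


# Assumed sample listing, using only QOS names mentioned in this ticket.
sample = "long\nshort\nshort-4hr\n"
known = parse_qos_names(sample)

print(check_qos("short-4hr", known))
print(check_qos("short-serial", known))
```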

Patrick

Last edited 6 months ago by pmcguire

comment:8 Changed 6 months ago by yaogao

Hi Patrick,

Thanks a lot! This is quite clear to me. I will try to adjust those parts and see what happens.

Yao

comment:9 Changed 5 months ago by grenville

  • Resolution set to answered
  • Status changed from new to closed