Opened 3 months ago

Closed 3 weeks ago

#3425 closed help (answered)

fcm_make submit-failed; job.err file not found

Reported by: shyland Owned by: jules_support
Component: JULES Keywords: JULES, JASMIN
Cc: Platform: JASMIN
UM Version:

Description

Hi,

I'm working through the tutorial for setting up Rose / Cylc in order to run JULES on CEDA JASMIN (https://research.reading.ac.uk/landsurfaceprocesses/software-examples/tutorial-rose-cylc-jules-on-jasmin/) and have gotten to the point of running the suite u-al752. However, after a few attempts, I keep getting the same problem.

The gcylc window shows that the fcm_make and JULES parts succeed but make_plots keeps getting the status of 'submit-failed'. I went to look at ~/cylc-run/u-al752/log/job/1/fcm_make/01/job.err and this error appears: ERROR: file not found: /home/users/shyland/cylc-run/u-al752/log/job/1/make_plots/01/job.err.

When looking at ~/cylc-run/u-al752/log/job/1/make_plots/01/job-activity.log, this error is mentioned: [STDERR] sbatch: error: Batch job submission failed: Requested time limit is invalid (missing or exceeds some limit) [(('event-mail', 'submission failed'), 1) ret_code] 0. I tried rerunning the suite with a wall time of 4 hours (the maximum time available for the test queue with SLURM on JASMIN) in ~/roses/u-al752/site/suite.rc.CEDA_JASMIN; I changed the time directive under [[JULES_CEDA_JASMIN]]. This didn't help. I then tried adding the line --time = 04:00:00 under the [[JASMIN_LOTUS]] directives section, but this didn't help either. All attempts have resulted in the same errors as previously mentioned.
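When a job is reported as submit-failed, the reason is usually on the [STDERR] lines of job-activity.log. A minimal sketch for pulling those lines out, assuming the log looks like the excerpt quoted above; the function name stderr_lines is hypothetical and not part of Rose or Cylc:

```python
def stderr_lines(log_text):
    """Return the message part of every line that carries the [STDERR] tag."""
    tag = "[STDERR]"
    messages = []
    for line in log_text.splitlines():
        if tag in line:
            # Keep only what follows the tag, without surrounding spaces.
            messages.append(line.split(tag, 1)[1].strip())
    return messages


# Sample text modelled on the excerpt quoted in this ticket.
sample = (
    "[STDERR] sbatch: error: Batch job submission failed: "
    "Requested time limit is invalid (missing or exceeds some limit)\n"
    "[(('event-mail', 'submission failed'), 1) ret_code] 0\n"
)

for message in stderr_lines(sample):
    print(message)
```

To use it on a real log, read the file first, for example with open(path).read(), and pass the text to stderr_lines.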

The netCDF files in my jules_output file seem okay; I’m just struggling to make the plots. Any help on this matter would be greatly appreciated - many thanks in advance!!

Sian

Change History (20)

comment:1 Changed 3 months ago by pmcguire

Hi Sian
If the fcm_make app succeeds, then it is possible that the ~shyland/cylc-run/u-al752/log/job/1/fcm_make/01/job.err does not exist.

If the make_plots app fails to get submitted, then the file ~shyland/cylc-run/u-al752/log/job/1/make_plots/01/job.err will not exist.

The make_plots does appear to have failed to be submitted, based upon your ~shyland/cylc-run/u-al752/log/job/1/make_plots/01/job-activity.log message. I can't view your set-up right now. Can you give me permissions to read your roses and cylc-run directories? You will probably need to give me permissions to read your home directory as well, but if you have anything private or confidential in your home directory, then you will probably want to make that not readable.

Since you have it mostly working, you might try using the short-serial queue instead of the test queue, if that option is available to you.
You might need to wait in the short-serial queue for a while. It's possible that the plotting can be done in under 4 hours, in which case the short-serial-4hr queue might have a shorter wait in line.
Patrick

comment:2 Changed 3 months ago by shyland

Hi,

My apologies, I meant to put ~/cylc-run/u-al752/log/job/1/make_plots/01/job.err when looking for the reason why the make_plots app failed, but the failure to submit explains why that file doesn't exist.

I believe I have just given you permission to read my directories; they currently look like this:

[shyland@cylc1 ~]$ ls -ltrd ~
drwxr-xr-- 28 shyland users 0 Nov 19 11:14 /home/users/shyland

Okay, brilliant, I'll give the short-serial-4hr queue a go!!

Many thanks,
Sian

comment:3 Changed 3 months ago by pmcguire

Hi Sian
Thank you for changing the permissions.

I looked at:
~shyland/cylc-run/u-al752/log/job/1/make_plots/01/job

This is the script that was submitted to SLURM for your most recent run for this app of this suite.
It still has 8 hours of wall-clock request time.

I see in your settings file:
~shyland/cylc-run/u-al752/site/suite.rc.CEDA_JASMIN
that you still have 8 hours requested in the [[PLOTTING_CEDA_JASMIN]] section.
As you know, for the SLURM test queue, only 4 hours is allowed.
You had changed it to 4 hours in the [[JULES_CEDA_JASMIN]] section, but the plotting is separate from the JULES runs.

If you just want to rerun the plotting without rerunning the JULES, you can:

1) make the changes to the settings in ~shyland/cylc-run/u-al752/site/suite.rc.CEDA_JASMIN, and then
2) do a rose suite-run --reload at the command line, and then
3) use the Cylc GUI and retrigger the make_plots app.
4) a short time after that, you can check your new ~shyland/cylc-run/u-al752/log/job/1/make_plots/01/job file and it should have the revised settings in there.
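The check in step 4 could be sketched as follows. This assumes the generated job file carries its SLURM directives as "#SBATCH --time=HH:MM:SS" lines (please confirm against your own job file) and uses the 4-hour test-queue limit mentioned above. The function names and the sample job text, including the partition name, are made up for illustration:

```python
import re

# Assumption: the generated job script carries SLURM directives as
# "#SBATCH --time=HH:MM:SS" lines; check your own job file to confirm.
TIME_DIRECTIVE = re.compile(r"^#SBATCH\s+--time\s*=\s*(\d+):(\d{2}):(\d{2})\s*$")


def requested_seconds(job_text):
    """Return the wall time requested in a job script, in seconds, or None."""
    for line in job_text.splitlines():
        match = TIME_DIRECTIVE.match(line.strip())
        if match:
            hours, minutes, seconds = (int(group) for group in match.groups())
            return hours * 3600 + minutes * 60 + seconds
    return None


def fits_limit(job_text, limit_hours):
    """Return True if a wall time is present and within limit_hours."""
    seconds = requested_seconds(job_text)
    return seconds is not None and seconds <= limit_hours * 3600


# Hypothetical job-script fragments, before and after the settings change.
old_job = "#!/bin/bash\n#SBATCH --partition=test\n#SBATCH --time=08:00:00\n"
new_job = "#!/bin/bash\n#SBATCH --partition=test\n#SBATCH --time=04:00:00\n"

print(fits_limit(old_job, 4))  # 8 hours requested against a 4-hour limit
print(fits_limit(new_job, 4))  # 4 hours requested against a 4-hour limit
```

If the first call still describes your job file after the reload and retrigger, the edit was probably made in a section that the make_plots app does not inherit from.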

Does this help?
Patrick

comment:4 Changed 3 months ago by shyland

Hi,

I tried what you suggested and the make_plots job now gets submitted, however I am met with this critical error:

cpu-bind=MASK - host210, task  0  0 [22438]: mask 0x8 set
2020-11-21T15:24:49Z CRITICAL - failed/EXIT

This error has occurred a couple of times now. Related to this problem, I was just wondering if there is some online documentation that could help explain what these errors actually mean and how to troubleshoot them?

Many thanks,
Sian

comment:5 Changed 3 months ago by pmcguire

Hi Sian
I don't know what that critical error means. The first line is pretty standard, from what I recall, and I don't think it means anything particularly important. The second line says that it failed.

From the time stamps of your job.err file and your job file (in ~shyland/cylc-run/u-al752/log/job/1/make_plots/01), it appears that this error happened 16 minutes after submission. This is not a horribly long time to wait to try again.

So some options:
1) try again without making any changes.
2) try again, but put some print statements in your ~shyland/roses/u-al752/bin/make_plots.py program, to verify that it is actually getting into that program and starting to run. I am not sure whether a rose suite-run --reload executed at the command line after the changes will pick them up (prior to retriggering the make_plots app in the cylc GUI). You can check whether it did by looking at ~shyland/cylc-run/u-al752/bin/make_plots.py after the run starts.
3) try again, but in the short-serial queue or the short-serial-4hr queue instead of the test queue. You might have to wait hours or days in those queues right now. But maybe the 4 hours you asked for in the test queue is not long enough to finish the plotting, or maybe there is something wrong with the test queue.
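For option 2, a minimal sketch of a print statement that should work under both Python 2 and Python 3 and that flushes its output, so the marker reaches job.out even if the script stops shortly afterwards. The helper name checkpoint and the marker labels are hypothetical; they are not part of the original make_plots.py:

```python
import sys


def checkpoint(label):
    """Write a marker line to standard output and flush it immediately."""
    # Flushing makes the marker reach job.out even if the script dies soon after.
    sys.stdout.write("%% " + label + " %%\n")
    sys.stdout.flush()


checkpoint("VERIFY")  # hypothetical marker placed at the top of the script
# ... the existing plotting code would follow here ...
checkpoint("VERIFY DEF")  # hypothetical second marker further down
```

If neither marker shows up in job.out, the job most likely failed before the Python program started.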

Does this help?
Patrick

comment:6 Changed 3 months ago by pmcguire

Hi Sian:
Is it working better now?
Patrick

comment:7 Changed 3 months ago by shyland

Hi,

When trying the first two of your suggestions, I was still getting the same critical error as I previously mentioned. So I tried submitting the fcm_make job to the short-serial-4hr queue and again, got the same critical error. So I shut the suite down and attempted to run it all again in the short-serial queue, however the JULES job has now failed with 64 of the 79 tasks having been successful. The bottom of the long error states:

*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[host615.jc.rl.ac.uk:38271] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!
[FAIL] rose-jules-run <<'__STDIN__'
[FAIL] 
[FAIL] '__STDIN__' # return-code=1
2020-11-30T19:26:55Z CRITICAL - failed/EXIT

Many thanks,
Sian

comment:8 Changed 3 months ago by pmcguire

Hi Sian:
In your error log file for us_atq:
~shyland/cylc-run/u-al752/log/job/1/*us_atq*/01/job.err
I see:

Please verify that both the operating system and the processor support Intel® X87, CMOV, MMX, FXSAVE, SSE, SSE2, SSE3, SSSE3, SSE4_1, SSE4_2 and POPCNT instructions.

This suggests that this run was run on a non-Intel processor, whereas it probably was compiled on an Intel processor. JASMIN has a heterogeneous architecture, and it helps to compile and run on the same type of processor.

You can select the processor type when you run JULES in
~/roses/u-al752/site/suite.rc.CEDA_JASMIN
with the constraint directive:

    [[JASMIN_LOTUS]]
        inherit = None, JASMIN

        [[[directives]]]
            --partition = short-serial
            --constraint = "ivybridge128G"

See the list of processor types at:
https://help.jasmin.ac.uk/article/4932-lotus-cluster-specification
The different queues have differing proportions of processor types.
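To see whether a given node supports the instruction sets named in that message, one option is to compare them with the flags line of /proc/cpuinfo on the node. A minimal sketch, assuming the usual Linux flag names (x87 appears as fpu, FXSAVE as fxsr, SSE3 as pni); the function name missing_flags is hypothetical. It checks the flags only, not the processor vendor, so a clean result does not guarantee that an Intel-compiled executable will run:

```python
# Assumption: Linux /proc/cpuinfo flag names for the instruction sets listed
# in the error message (x87 -> fpu, FXSAVE -> fxsr, SSE3 -> pni).
REQUIRED_FLAGS = {
    "fpu", "cmov", "mmx", "fxsr", "sse", "sse2",
    "pni", "ssse3", "sse4_1", "sse4_2", "popcnt",
}


def missing_flags(cpuinfo_text, required=REQUIRED_FLAGS):
    """Return the required CPU flags absent from a /proc/cpuinfo dump."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            present = set(line.split(":", 1)[1].split())
            return sorted(required - present)
    # No flags line found: report everything as missing.
    return sorted(required)


# Made-up sample; on a real node, pass open("/proc/cpuinfo").read() instead.
sample = "model name : example CPU\nflags : fpu cmov mmx fxsr sse sse2 pni ssse3\n"
print(missing_flags(sample))
```

Running this inside the job itself, rather than on the login server, reports on the node the job actually landed on.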
Patrick

comment:9 Changed 3 months ago by shyland

Hi,

I added the constraint directive and submitted the suite, which ran overnight in the short-serial queue, and once again the fcm_make job failed with the critical error:

cpu-bind=MASK - host092, task  0  0 [12558]: mask 0x40 set
2020-12-02T11:02:00Z CRITICAL - failed/EXIT

-Sian

comment:10 Changed 3 months ago by pmcguire

Hi Sian:
It looks like your make_plots failed and not your fcm_make. Or maybe I am missing something?

I copied your ~shyland/cylc-run/u-al752/log/job/1/make_plots/02/job script that is generated by your suite
to ~pmcguire/test/shyland/job.
In that script, I changed --partition=short-serial to --partition=par-multi, since the par-multi queue/partition seemed to be working well today, without a long wait, and I wasn't sure how long the wait would be in short-serial.

I tried running this job script for make_plots. I had to change shyland to pmcguire in various places, and I had to make a copy of your cylc-run/u-al752 directory, and change owners to me, so that I could create files in that copy of the directory. I also had to put quotes around your print-statement arguments (print '%% VERIFY %%' and print '%% VERIFY DEF %%'). But with these changes, it seems to start running. I don't know yet if it will finish running.

You can see the log files in:

/home/users/pmcguire/cylc-run/u-al752shyland/log/job/1/make_plots/02/*

wherein you can see that the Python script at least starts to run.

This is using the results of the jules runs which you already ran.

I am running this script from the cylc1.jasmin server with sbatch job.

Does this help?
Patrick

comment:11 Changed 3 months ago by shyland

Hi,

Yes, sorry, my mistake!! I meant to say that the make_plots failed.

I shall check later today to see if the job is successful and report back.

Thank you very much,
Sian

comment:12 Changed 3 months ago by pmcguire

Hi Sian:
There are already some plots in the plots directory! So it is successful, in that it is making PDF plots.
Patrick

comment:13 Changed 3 months ago by shyland

Hi,

Amazing - I just found your plots directory. Thank you!!

For future reference, is there a particular reason as to why my make_plots job wasn't successful?

Many thanks,
Sian

comment:14 Changed 3 months ago by pmcguire

Hi Sian:
I don't know why your make_plots job wasn't successful. I did make some changes to your script to get my copy of your script to run. Maybe you can make the same changes, and see if it works?
Patrick

comment:15 Changed 3 months ago by shyland

Hi,

I went to look at your adapted script in ~pmcguire/shyland/job in order to make the same changes to my script, but I cannot seem to find it. Or did you mean I was to try making the changes you described, e.g. change to --partition=par-multi, put quotes around my print-statement arguments, etc?

Many thanks,
Sian

comment:16 Changed 3 months ago by pmcguire

Hi Sian:
Yes, that is one (good) way to do it.
Patrick

comment:17 Changed 3 months ago by shyland

Hi,

So I made the changes you described, i.e. change to --partition=par-multi and put quotes around my print-statement arguments, and my make_plots job is still failing. I looked at my job.out file and the print statements aren't there, so it doesn't seem to be actually getting into that program and starting to run. I compared my job script to yours in ~pmcguire/test/shyland/job and they seem to be the same (with the exception of our different output directories).

I've looked at the PDF plots from your successful make_plots job; however, I'm working through the tutorials in order to get a better understanding of how to set up my own JULES runs, so the specifics of the PDF output aren't too important to me right now. (I looked into the PDFs in much more detail back in May before the SLURM transition.) I'd much rather know how to submit a successful job and why my job isn't working, before moving on to the global JULES run tutorial and potentially running into similar issues.

Many thanks,
Sian

comment:18 Changed 3 months ago by pmcguire

Hi Sian:
Maybe all of your jules jobs for this run of your u-al752 suite didn't finish? And that's why your make_plots didn't run?

What happens when you do a rose sgc on cylc1.jasmin for this suite? Do you see any remaining jules jobs that haven't finished in the GUI?

When I do these commands:

grep   succeeded ~shyland/cylc-run/u-al752/log/job/1/j*/0*/job.out  |wc
     79     316    9172
ls -ltr ~shyland/cylc-run/u-al752/log/job/1/j*/0*/job.out |wc
     82     738   10341

this suggests that 79 jules jobs succeeded, whereas 82 of them attempted to be run.
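The same comparison can be sketched in Python, which also lists the job.out files that lack the word "succeeded". It counts files rather than matching lines, so the totals can differ slightly from the grep output above; the function name job_summary is hypothetical:

```python
import glob
import os


def job_summary(pattern):
    """Return (all job.out paths, paths without 'succeeded') for a glob pattern."""
    paths = sorted(glob.glob(os.path.expanduser(pattern)))
    unfinished = []
    for path in paths:
        with open(path) as handle:
            if "succeeded" not in handle.read():
                unfinished.append(path)
    return paths, unfinished


if __name__ == "__main__":
    # Same pattern as the shell commands above.
    pattern = "~shyland/cylc-run/u-al752/log/job/1/j*/0*/job.out"
    paths, unfinished = job_summary(pattern)
    print("%d job.out files, %d without 'succeeded'" % (len(paths), len(unfinished)))
    for path in unfinished:
        print(path)
```

The paths it prints are the submissions to look at first, for example by opening the job.err file next to each one.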

I also see in ~shyland/cylc-run/u-al752/log/suite/log that you have a WARNING - suite stalled message. I am not sure what to make of that.

One option is to try starting over with rose suite-run --new. But that might take a couple of hours extra at least.

Patrick

comment:19 Changed 3 months ago by shyland

Hi,

When I do rose sgc, there are no remaining JULES jobs. Three of the jobs initially failed but I triggered them this morning and they have since succeeded.

Okay, I shall try rose suite-run --new now.

Thanks,
Sian

comment:20 Changed 3 weeks ago by ros

  • Resolution set to answered
  • Status changed from new to closed