Opened 3 weeks ago

Last modified 27 hours ago

#3003 new help

Archiving job looks stuck

Reported by: amenon Owned by: um_support
Component: UM Model Keywords:
Cc: Platform: Monsoon2
UM Version: 10.9

Description

Dear CMS team,

I am re-running the ensemble suite on Monsoon. The suite id is u-bb030. I am trying to archive the first 7 ensemble outputs first (I am holding the jobs for the next 3 due to disk space issues). The archive job for ensemble 0 succeeded, but it keeps failing for the next 6 ensembles (em1 to em6). However, from Jasmin I can see that MASS already has the output from all 7 ensembles. So my question is: did the archive job actually succeed for all 7 ensembles, with cylc just not setting the status to succeeded? The last streams (PZ streams) in all the ensembles look correct when I check them in MASS. However, I would like to make sure that the archiving is done correctly (i.e. that it has archived all the output for em1 to em6) before manually setting the archive job status to 'succeeded' in the gcylc window. Is there a moo command that will help me check the size of these output files in MASS, so that I can make sure it has archived all the data? Or is there any other way?

Thanks,
Arathy

Change History (9)

comment:1 Changed 3 weeks ago by willie

Hi Arathy,

If you do moo ls -l moose:/devfc/u-bb030, you get

C stuart.webster             50.37 GBP    4614773295648 2019-08-28 13:29:12 GMT moose:/devfc/u-bb030/field.pp

This shows that you have already got roughly 4.6TB of data in the archive. You can list the details with moo ls -l moose:/devfc/u-bb030/field.pp. It might be a good idea to make an experimental copy of u-bb030 so that you don't overwrite what you already have.
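
For a rough total of what is actually in MASS, something along these lines should work - a sketch only, assuming each line of the listing carries the size in bytes in the fifth whitespace-separated field, as in the collection line above:

    moo ls -l moose:/devfc/u-bb030/field.pp | \
        awk '{ bytes += $5 } END { printf "%.2f TB in MASS\n", bytes / 1e12 }'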

This ticket follows on from #2990. Have you created 4.6TB of data since the comments there?

Willie

comment:2 Changed 3 weeks ago by amenon

Hi Willie,

du -hs in /projects/cascade/arame/cylc-run/u-bb030/share/cycle/20160703T0000Z/INCOMPASS/km4p4/ra1t_inc4p4/um shows that 7 TB of data has been created by the 7 ensembles. So it looks like not all of the data has been archived. I have given the archive jobs the maximum time limit of 3 hours. So shall I keep triggering the archive jobs, one at a time, until they succeed? Stu said that the umpp command takes a long time to complete.

One suggestion Stu gave was to move some of the outputs from cylc-run into a different folder and archive only a few files in one go, then delete those outputs from the cylc-run folder, move the next few files into the folder and trigger the archive job again. But currently MASS shows all the files in the archive folder, so I can't figure out which files are only partially archived. So shall I stick to the method I described in the paragraph above (i.e. triggering the archive jobs until they succeed)?

Btw, when I started re-running the suite following #2990 I didn't have any output in MASS. This 4.6 TB was created after #2990.

Thanks,
Arathy

comment:3 Changed 3 weeks ago by willie

Hi Arathy,

If you look at the archive error file for em2 (em1 has gone)

/home/d04/arame/cylc-run/u-bb030/log/job/20160703T0000Z/INCOMPASS_km4p4_ra1t_inc4p4_archive_em2/NN/job.err

at the bottom it says

=>> PBS: job killed: walltime 10806 exceeded limit 10800

so your archive tasks are taking more than the 3 hours allocated. I have timed how long it takes just to convert all 384 files in em1 using umpp: 2 hours 45 minutes. The archiving will take a similar amount of time, so you are up at five and a half hours. This is beyond the three-hour limit allocated and also beyond the time limit of the "umshared" queue.

So I think you will need to modify the suite. In the monsoon site suite-adds.rc file change

    [[ARCHIVE]]
        init-script = module load moose-client-wrapper
        [[[directives]]]
            -l walltime = 03:00:00

to

    [[ARCHIVE]]
        init-script = module load moose-client-wrapper
        [[[directives]]]
            -q = normal
            execution time limit = PT12H

This will give it plenty of time.

Willie

comment:4 Changed 2 weeks ago by amenon

Hi Willie,

The above change didn't work. rose suite-run --reload gave the error that the normal queue does not exist. Also, when I gave the wallclock limit in the above format the archive job could not be submitted; it kept giving a submit-failed error. So I changed the above in suite-adds.rc to

    [[ARCHIVE]]
        init-script = module load moose-client-wrapper
        [[[directives]]]
            -q = {{PARALLELQ}}
            -l walltime = 4:00:00

As the wallclock limit for the parallel queue is 4 hours, I am still not able to complete the archive task in the given time. Could you please have a look at how to assign this job to the normal queue?

Thanks,
Arathy

comment:5 Changed 12 days ago by willie

Hi Arathy,

You need to make the changes to site/monsoon-cray-xc40/suite-adds.rc. PARALLELQ is just an alias for normal, but it looks odd to submit a serial task to a parallel queue, so I would have put -q = normal just to make it clear. The queue limits can be found by running

qstat -q

so the normal queue can handle durations of up to 24 hours. I recommend

    execution time limit = PT12H

This is more than you need, but you can fine tune it later once you have it working.
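
One thing worth noting: in Cylc 7, "execution time limit" is a task runtime setting rather than a batch directive, so it belongs at the [[ARCHIVE]] level rather than under [[[directives]]]. A sketch (not tested against this suite):

    [[ARCHIVE]]
        init-script = module load moose-client-wrapper
        execution time limit = PT12H
        [[[directives]]]
            -q = normal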

Willie

comment:6 Changed 12 days ago by amenon

Hi Willie,

I had already added those lines, as per your earlier suggestion, to site/monsoon-cray-xc40/suite-adds.rc. That is when it failed to submit the archive job, and job-activity.log said that the normal queue does not exist (I now guess there might have been some issue with the normal queue on Monsoon last week). Hence I changed it to PARALLELQ and gave it a try.

I retried it again now with

    [[ARCHIVE]]
        init-script = module load moose-client-wrapper
        [[[directives]]]
            -q = normal
            execution time limit = PT12H

At first, the submission failed with the following message in the job-activity.log:

[jobs-submit out] 2019-09-09T11:07:00Z|20160703T0000Z/INCOMPASS_km4p4_ra1t_inc4p4_archive_em1/43|1|None
2019-09-09T11:07:00Z [STDERR] qsub: directive error: execution time limit=PT12H
[(('event-mail', 'submission failed'), 43) ret_code] 0

Then I replaced the line 'execution time limit = PT12H' with '-l walltime = 12:00:00'. This format then gave the following error in job-activity.log:

2019-09-09T11:10:00Z [STDERR] qsub: error: [PBSSitePolicy] project 'cascade' is limited to a walltime of 04:00:00 for 'normal' queue

Then I changed '-l walltime = 12:00:00' to '-l walltime = 04:00:00', and the job has now been submitted and is running. But I am not sure if it will finish in 4 hours.

Arathy

comment:7 Changed 9 days ago by willie

Hi Arathy,

I'm still looking at this. Did you get the install_engl_startdata task to work? I'm getting strange failures.

Willie

comment:8 Changed 9 days ago by amenon

Thanks Willie. Yes, I got install_engl_startdata to work.

comment:9 Changed 27 hours ago by willie

Hi Arathy,

I am on leave for a week. There are some problems with MASS at the moment which the Met Office are looking at. It may be quicker for you to take a fresh copy of the ensemble nesting suite once these problems are fixed - it already has the archiving code switched on (it's different from Stu's suggestion). Add your changes and rerun. You need to have the aggressive housekeeping and the dm_arch flags set together so that it deletes as it goes. But keep an eye on Yammer for Monsoon/MASS status updates.
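
As a rough illustration only - the actual option names live in the fresh suite's rose-suite.conf and may differ, dm_arch being the one named above and HOUSEKEEPING a hypothetical stand-in for the aggressive housekeeping switch - the idea is to enable both together so outputs are deleted once they have been archived:

    # rose-suite.conf (illustrative sketch only; check the suite's own settings)
    [jinja2:suite.rc]
    dm_arch=true
    HOUSEKEEPING='aggressive'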

Willie
