#2677 closed help (fixed)

Archive job is stuck for a cycle

Reported by: amenon
Owned by: um_support
Component: UM Model
Keywords: archive job stuck
Cc:
Platform: ARCHER
UM Version: 10.9

Description

Dear CMS,

Sorry for creating so many tickets in the past 2-3 weeks. I am running three suites at the moment and hence encountering several issues.

The suite u-bc870 on ARCHER has been trying to archive the output of a cycle to the RDF since yesterday morning. On each retry the archive job appears to run fine and then fails with a time-exceeded error. I tried increasing the time limit from 20 minutes to one hour. I also tried increasing the job memory from 1GB to 2GB in suite-adds.rc at:

-l select=1:ncpus=1:mem=2GB

However, when I check the archived files on the RDF, I see that each time the archive job runs, the outputs of that cycle are touched within the first 10 minutes, but the file sizes stay the same as after the first run of the archive job for that cycle. After about 10 minutes the archive job is still running but no longer writes any output to the RDF. It gets stuck archiving the file 20160708T0000Z_INCOMPASS_km4p4_RA1T_qcons_pt000.pp and does not progress beyond it. Does that mean I am running out of space on the RDF? I thought there was no quota limit on the RDF. Please let me know.

Regards,
Arathy

Change History (11)

comment:1 Changed 11 months ago by grenville

Arathy

RDF quota is not the problem. Have you used the archive app before? Is there a successful case you can compare settings with?

Grenville

comment:2 Changed 11 months ago by amenon

Hi Grenville,

Yes, I have always used the archive app. Suite u-ai540 on ARCHER has archived its outputs to the RDF. The current suite u-bc870 also archived its outputs to the RDF successfully for the preceding cycles, 20160701 to 20160707. It is currently stuck on cycle 20160708.

Regards,
Arathy

comment:3 Changed 11 months ago by grenville

Its wallclock time is 30 min; try making it bigger.

comment:4 Changed 11 months ago by amenon

Yesterday I also tried a 1 hour wallclock time. I could try again with more, say 2 hours. But when I check /nerc/n02/n0/amenon/u-bc870/field.pp while the archive job is running, no output appears to be written after the first 10 minutes. I will try again with 2 hours and let you know.
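
The symptom described here (files being touched while their sizes stay the same) can be checked by comparing two snapshots of the destination directory taken a few minutes apart. The following is only a minimal Python sketch under that assumption; the function names `snapshot` and `touched_without_growth` are hypothetical and are not part of the archive app.

```python
from pathlib import Path

def snapshot(directory):
    """Map each regular file name to its (size, mtime_ns)."""
    result = {}
    for path in Path(directory).iterdir():
        if path.is_file():
            st = path.stat()
            result[path.name] = (st.st_size, st.st_mtime_ns)
    return result

def touched_without_growth(before, after):
    """Return names whose modification time changed while the size did not."""
    names = []
    for name, (size, mtime) in after.items():
        if name in before:
            old_size, old_mtime = before[name]
            if mtime != old_mtime and size == old_size:
                names.append(name)
    return sorted(names)
```

Taking one snapshot shortly after the archive job starts and another ten minutes later, then passing both to `touched_without_growth`, would list the files that were rewritten in place without gaining any data.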

comment:5 Changed 11 months ago by grenville

I can see that didn't work - please change the permissions of the tmp directories under
/home/n02/n02/amenon/cylc-run/u-bc870/work/20160708T0000Z/INCOMPASS_km4p4_RA1T_qcons_archive/

so that we can read them.

comment:6 Changed 11 months ago by amenon

Hi Grenville,

Changed the permissions of that directory. Also changed the permissions of the directory /home/n02/n02/amenon/cylc-run/u-bc870/work/20160707T0000Z/INCOMPASS_km4p4_RA1T_qcons_archive/ in case you want to look at a cycle that succeeded.

Arathy

comment:7 Changed 11 months ago by grenville

I finally see what the problem is:

-rw-r--r-- 1 amenon n02 2130028408832 Nov 14 12:19 umnsaa_pt006

This file is about 2TB, which is why it's taking so long to rsync.

We don't know why the file is corrupt.
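
A size check over the model output directory is one way to spot a file like this before the archive task tries to transfer it. The following is a minimal Python sketch, not something the archive app provides; the function names `find_oversized` and `to_terabytes` and the 100 GB default threshold are assumptions chosen for illustration.

```python
from pathlib import Path

def find_oversized(directory, limit_bytes=100 * 10**9):
    """Return (name, size) pairs for regular files larger than limit_bytes.

    The 100 GB default is an arbitrary illustrative threshold.
    """
    hits = []
    for path in sorted(Path(directory).iterdir()):
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                hits.append((path.name, size))
    return hits

def to_terabytes(size_bytes):
    """Convert a byte count to decimal terabytes (10**12 bytes)."""
    return size_bytes / 10**12

# Size of umnsaa_pt006 as reported in the listing above.
print(f"{to_terabytes(2130028408832):.2f} TB")  # prints 2.13 TB
```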

I suggest that

(i) you rename that file so the archive app ignores it and retrigger the archive task - worry about how to fix the file later

or

(ii) you rerun the model for that cycle (delete umnsaa_pt006 first)

or

(iii) replace the file with /work/n02/n02/grenvill/ARATHY.pp (suitably renamed). This is the PP version of the fields file (from ff2pp); check that it looks OK first, though.
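
Option (i) could be sketched as below. This is a hedged Python illustration only: the helper name `set_aside` and the `.corrupt` suffix are hypothetical, and it assumes the archive app picks up files by name, so whether a renamed file is actually skipped depends on the suite's archive settings.

```python
from pathlib import Path

def set_aside(path, suffix=".corrupt"):
    """Rename a file by appending suffix and return the new path.

    Refuses to overwrite an existing file with the new name.
    """
    src = Path(path)
    dst = src.with_name(src.name + suffix)
    if dst.exists():
        raise FileExistsError(f"{dst} already exists")
    src.rename(dst)
    return dst
```

Calling `set_aside("umnsaa_pt006")` from the directory that holds the file would leave it in place as `umnsaa_pt006.corrupt`, so it can still be inspected or repaired later.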

Grenville

comment:8 Changed 11 months ago by grenville

Arathy

Jeff has fixed up umnsaa_pt006 - see /work/n02/n02/jwc/tmp/umnsaa_pt006.

However, please check all the umnsaa_pt files; there appear to be problems with the last few fields in each file.

Grenville

comment:9 Changed 11 months ago by amenon

Great! Thanks Grenville, Jeff. I will go through all the pt files.

comment:10 Changed 11 months ago by amenon

Hi,

I got past this problem using the corrected output that Jeff created. I copied it to the archive folder and set the archive task to 'succeeded'. The suite then moved on to the next cycle and continued. Thanks a lot.

Arathy

comment:11 Changed 10 months ago by willie

  • Resolution set to fixed
  • Status changed from new to closed