Opened 2 months ago

Closed 4 weeks ago

#3229 closed help (answered)

CMOR conversion - help! lost files

Reported by: mvguarino Owned by: um_support
Component: UKESM Keywords:
Cc: Platform:
UM Version:

Description

Hello,

I need some help understanding what is going on with my simulation outputs now that I am trying to CMORize them (http://cms.ncas.ac.uk/wiki/CDDS). I am at the Ocean data preparation stage, and something must have gone horribly wrong, as I seem to have lost files in the original simulation directory where the model outputs were stored…

While I am trying not to panic, I would very much appreciate some help to recover those files. Here is what has happened:

I have followed the instructions and run a copy of the u-bn582 suite; mine is u-bs716.
It seemed to run fine until I hit a problem:

the suite stopped suddenly while a few (5/6) tasks were running. I later figured out that the problem was that some of the files are stored in temporary directories called cdds_holding in /home/users/marino/cylc-run/u-bs716/gws/etc/etc/task_number.
If many tasks are running at the same time, this causes my JASMIN quota to go over the limit of 100GB and all jobs get stopped.

In order to restart, I had to remove the files from these temporary cdds_holding directories and restart the suite.

Now… when I checked the output files for those stopped tasks, I realised that not all the model outputs were there (see for example /gws/nopw/j04/pmip4_vol1/users/vittoria/u-ba937-cdds/18520601T0000Z). When I then checked the original corresponding directory, /gws/nopw/j04/pmip4_vol1/users/vittoria/u-ba937/18520601T0000Z, all files there had also disappeared except for one single NEMO output file!

What happened? And is there any way I can recover those files? It all happened between yesterday and today.

Thanks,

Vittoria

Change History (28)

comment:1 Changed 2 months ago by mvguarino

Update: I managed to recover some of the files using the available backups of my home directory, but sadly not all.

Assuming I have lost those files for good, and since CMOR conversion is something that must be done, would it be possible to design a safer approach to it?
Also, how can I safely restart u-bn582 if it fails, without the risk of losing data?

Thanks,
Vittoria


comment:2 Changed 2 months ago by grenville

Vittoria

It looks like you have clobbered your data - regrettably there is no way back (unless you have it on Elastic Tape). How many months are lost?

I am a little confused as to why your home space is filling up. The holding directory should have gone under /home/users/marino/cylc-run/u-bs716/work/… which is a link to — ah now I see, you set

rose-suite.conf:root-dir{work}=*=gws/nopw/j04/pmip4_vol1/users/vittoria

but should have set

rose-suite.conf:root-dir{work}=*=/gws/nopw/j04/pmip4_vol1/users/vittoria

one "/" has caused a big problem!!

Grenville


comment:3 Changed 2 months ago by grenville

I'm not sure how to prevent typos — we will review the procedure to see if we can catch something like this in the future
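
One way such a check might look, as a minimal sketch only: scan the text of rose-suite.conf for root-dir settings and flag any path that does not start with "/" (or with "$", for an environment variable). The function name relative_root_dirs and the line pattern are assumptions made for illustration; this is not part of Rose or CDDS.

```python
import re

# Matches lines such as: root-dir{work}=*=/some/path
# (the exact pattern is an assumption based on the setting quoted in this ticket)
ROOT_DIR_LINE = re.compile(r"^\s*(root-dir(?:\{[^}]*\})?)\s*=\s*(.*)$")


def relative_root_dirs(conf_text):
    """Return (setting, path) pairs whose path is not absolute."""
    problems = []
    for line in conf_text.splitlines():
        match = ROOT_DIR_LINE.match(line)
        if not match:
            continue
        setting, value = match.groups()
        # The value has the form HOST_PATTERN=PATH, e.g. "*=/gws/..."
        _host, _sep, path = value.partition("=")
        path = path.strip()
        # Accept absolute paths and paths that begin with an environment variable
        if path and not path.startswith(("/", "$")):
            problems.append((setting, path))
    return problems


if __name__ == "__main__":
    sample = "root-dir{work}=*=gws/nopw/j04/pmip4_vol1/users/vittoria\n"
    for setting, path in relative_root_dirs(sample):
        print("Relative path in %s: %s" % (setting, path))
```

Running something like this on the suite configuration before starting the suite would flag the setting with the missing leading "/".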

Grenville

comment:4 Changed 2 months ago by mvguarino

Hi Grenville,

I have lost about 2 years of the run. As they were the first 2 years, I am re-running them now.

So I can fix the typo, but… hmm… I am a little scared to try it again now. Is there a way to CMORize a smaller number of files first, to check that everything is behaving as it should?

Vittoria

comment:5 Changed 2 months ago by grenville

Hi Vittoria

I can only suggest that you copy a few cycles to a new directory and point the suite to it.
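
A minimal sketch of that suggestion, assuming each cycle is a subdirectory named by its date (such as 18520601T0000Z) so that sorting by name gives chronological order. The function name copy_first_cycles and the paths in the usage note are hypothetical. The originals are only read, never moved or deleted.

```python
import shutil
from pathlib import Path


def copy_first_cycles(source_dir, test_dir, count):
    """Copy the first `count` cycle directories from source_dir into test_dir.

    Returns the names of the copied cycles. The source is left untouched.
    """
    source = Path(source_dir)
    target = Path(test_dir)
    target.mkdir(parents=True, exist_ok=True)
    # Cycle directory names sort chronologically (assumption: ISO-like names)
    cycles = sorted(p for p in source.iterdir() if p.is_dir())[:count]
    for cycle in cycles:
        # copytree raises an error if the destination already exists,
        # so an earlier test copy is never silently overwritten
        shutil.copytree(cycle, target / cycle.name)
    return [cycle.name for cycle in cycles]
```

For example, copy_first_cycles("/path/to/u-xxNNN", "/path/to/u-xxNNN-test", 3) would give a three-cycle test directory to point the suite at.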

Grenville

comment:6 Changed 2 months ago by mvguarino

well that makes sense :)

thanks,

Vittoria

comment:7 Changed 2 months ago by grenville

Double check in your cylc-run directory that the links point to real group workspace.
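
A rough sketch of how that check could be scripted: list the symbolic links in the suite's run directory, resolve each one, and report whether the target exists and sits under the expected workspace prefix. The function name check_links and the default prefix "/gws/" are assumptions for illustration.

```python
import os
from pathlib import Path


def check_links(run_dir, expected_prefix="/gws/"):
    """Map each symlink name in run_dir to (resolved_target, looks_ok).

    looks_ok is True only if the resolved target is an existing directory
    whose path starts with expected_prefix.
    """
    results = {}
    for entry in sorted(Path(run_dir).iterdir()):
        if entry.is_symlink():
            target = os.path.realpath(entry)
            ok = target.startswith(expected_prefix) and os.path.isdir(target)
            results[entry.name] = (target, ok)
    return results


if __name__ == "__main__":
    # Hypothetical usage: the suite name is taken from this ticket
    run_dir = os.path.expanduser("~/cylc-run/u-bs716")
    if os.path.isdir(run_dir):
        for name, (target, ok) in check_links(run_dir).items():
            print("%-10s -> %s %s" % (name, target, "OK" if ok else "CHECK THIS"))
```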

comment:8 Changed 2 months ago by grenville

Never delete anything in cdds-holding - please contact us if you suspect a problem.

comment:9 Changed 2 months ago by mvguarino

I will, thanks.
Waiting now to have all the data before trying again.

Meanwhile, as I have made a new copy of u-bn582 to start from scratch, I have realized that the path is wrongly set in the suite (u-bn582, owner josephabram), as it is:
gws/nopw/j04/rdf_migrate_vol2/my_username

Could you double check?

Vittoria

comment:10 Changed 2 months ago by grenville

Please see:

http://cms.ncas.ac.uk/wiki/CDDS/halo

and note "This must be set to a location on the same group workspace as HALO_START_DIR"

comment:11 Changed 2 months ago by mvguarino

Hi,

yes, indeed on the webpage the path is correct (the leading / is there), but when one makes a copy of u-bn582, the path of root-dir{work} (to be customised by the user) in rose-suite.conf is missing the first /.
What I meant is that perhaps changing and checking in a version of u-bn582 with the right path would minimise the risk of others having the same problem that I had (I think I modified the path without noticing the missing /).

Vittoria

comment:12 Changed 2 months ago by mvguarino

Hello,

So I am running the suite u-bs776 (to convert ocean outputs) on a test case: u-bq809.
This simulation consists of about 3 years of data; see /gws/nopw/j04/pmip4_vol1/users/vittoria/u-bq809.

The conversion process seems to have worked fine: I now have a new directory (u-bq809-cdds) with all UM and CICE outputs plus the converted NEMO outputs. However, only the first 12 months got processed. The suite stopped after that for no apparent reason, without processing the rest of the data.

Would issuing the simple rose suite-run --restart command fix it?
How do I make sure that all the data will be processed once I run the suite on a 200-year-long simulation?

Thank you very much,

Vittoria

P.S. my JASMIN username is marino, if needed

comment:13 Changed 2 months ago by grenville

Vittoria

rose suite-run --restart couldn't do any harm (assuming you have not changed the suite)

Please try it.

Grenville

comment:14 Changed 2 months ago by mvguarino

Hi Grenville,

I was checking jobs by 'bjobs' and not through the GUI (JASMIN is being a bit weird these days and I keep on having a "cannot open display" error, at times, that goes away on its own). Now that I can finally open the GUI again I can actually see that one task (n8) got stuck: it says it is running but is not (bjobs: No unfinished job found).
Indeed the month of August is missing!

Shall I restart the single task?

Thanks,

Vittoria

comment:15 Changed 2 months ago by grenville

Vittoria

rose suite-restart is a safer option - it will not pick up any suite changes, but please don't do anything just yet.

Grenville

comment:16 Changed 2 months ago by grenville

Vittoria

I'm going to put the failure down to a JASMIN issue - please rose suite-restart and retrigger the failed task.

I'm sorry that you are experiencing these problems.

Grenville

comment:17 Changed 2 months ago by mvguarino

Hi Grenville,

Okay, I have restarted the task and that worked: data got transferred from 8/cdds-holding to u-bq809-cdds, and more jobs are being submitted now.
I'll keep an eye out for ghost jobs in the future and restart them if needed.

Vittoria

comment:18 Changed 2 months ago by mvguarino

Hi Grenville,

I am continuing testing the CMOR-isation process on u-bq809.

I am now at the "CDDS process" stage. I get the following error:

[marino@jasmin-cylc cmor_test_ubq809]$ source cdds_workflow_for_user.sh
####################### Starting CDDS ##################
Main directory for work: CDDS_DIR:
/gws/nopw/j04/pmip4_vol1/users/vittoria/cmor_test_ubq809
Request file:
bq809.json
CDDS processing directory
/gws/nopw/j04/pmip4_vol1/users/vittoria/cmor_test_ubq809/cdds_proc
CDDS data directory
/gws/nopw/j04/pmip4_vol1/users/vittoria/cmor_test_ubq809/cdds_data
Data request directory
/gws/smf/j04/cmip6_prep/cdds-env/etc-from-mohc/data_requests/CMIP6
Sourcing CDDS software environment...
Sourced CDDS software environment file:
/gws/smf/j04/cmip6_prep/cdds-env/setup_cdds_env_cdds132.sh
parse error: Expected another key-value pair at line 19, column 1
parse error: Expected another key-value pair at line 19, column 1
parse error: Expected another key-value pair at line 19, column 1
parse error: Expected another key-value pair at line 19, column 1
parse error: Expected another key-value pair at line 19, column 1
parse error: Expected another key-value pair at line 19, column 1
Exporting temp dir TMPDIR=/gws/nopw/j04/pmip4_vol1/users/vittoria/cmor_test_ubq809/tmp ...
####################### Verifying CDDS components ##################
/gws/smf/j04/cmip6_prep/jaspy_base/jaspy/miniconda_envs/jaspy2.7/m2-4.6.14/envs/cdds-env-r20200204/bin/create_cdds_directory_structure
create_cdds_directory_structure 1.3.2
/gws/smf/j04/cmip6_prep/jaspy_base/jaspy/miniconda_envs/jaspy2.7/m2-4.6.14/envs/cdds-env-r20200204/bin/prepare_generate_variable_list
prepare_generate_variable_list 1.3.2
/gws/smf/j04/cmip6_prep/jaspy_base/jaspy/miniconda_envs/jaspy2.7/m2-4.6.14/envs/cdds-env-r20200204/bin/generate_user_config_files
generate_user_config_files 1.3.2
/gws/smf/j04/cmip6_prep/jaspy_base/jaspy/miniconda_envs/jaspy2.7/m2-4.6.14/envs/cdds-env-r20200204/bin/cdds_convert
cdds_convert 1.3.2
/gws/smf/j04/cmip6_prep/jaspy_base/jaspy/miniconda_envs/jaspy2.7/m2-4.6.14/envs/cdds-env-r20200204/bin/mip_convert
mip_convert 1.3.2
/apps/contrib/metomi/bin/rose
####################### Running CDDS pipeline stages ##################
Creating CDDS directory structure...
Using CDDS Prepare version 1.3.2
Expecting property name: line 19 column 1 (char 1297)
Traceback (most recent call last):
  File "/gws/smf/j04/cmip6_prep/cdds-env/r7785_trunk_cdds_v132/cdds_prepare/cdds_prepare/command_line.py", line 46, in main_create_cdds_directory_structure
    create_cdds_directory_structure(args)
  File "/gws/smf/j04/cmip6_prep/cdds-env/r7785_trunk_cdds_v132/cdds_prepare/cdds_prepare/directory_structure.py", line 26, in create_cdds_directory_structure
    request = read_request(arguments.request, REQUIRED_KEYS_FOR_PROC_DIRECTORY)
  File "/gws/smf/j04/cmip6_prep/cdds-env/r7785_trunk_cdds_v132/hadsdk/hadsdk/request.py", line 43, in read_request
    items = read_json(request_path)
  File "/gws/smf/j04/cmip6_prep/cdds-env/r7785_trunk_cdds_v132/hadsdk/hadsdk/common.py", line 188, in read_json
    data = json.load(file_handle)
  File "/gws/smf/j04/cmip6_prep/jaspy_base/jaspy/miniconda_envs/jaspy2.7/m2-4.6.14/envs/cdds-env-r20200204/lib/python2.7/json/__init__.py", line 291, in load
    **kw)
  File "/gws/smf/j04/cmip6_prep/jaspy_base/jaspy/miniconda_envs/jaspy2.7/m2-4.6.14/envs/cdds-env-r20200204/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/gws/smf/j04/cmip6_prep/jaspy_base/jaspy/miniconda_envs/jaspy2.7/m2-4.6.14/envs/cdds-env-r20200204/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/gws/smf/j04/cmip6_prep/jaspy_base/jaspy/miniconda_envs/jaspy2.7/m2-4.6.14/envs/cdds-env-r20200204/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting property name: line 19 column 1 (char 1297)
Traceback (most recent call last):
Found Python Traceback in /gws/nopw/j04/pmip4_vol1/users/vittoria/cmor_test_ubq809/*create_cdds_directory_structure*.log*. Stopping. 

The key problem seems to be:

 /gws/smf/j04/cmip6_prep/cdds-env/setup_cdds_env_cdds132.sh 
parse error: Expected another key-value pair at line 19, column 1  

I have tried to use cdds_workflow_for_user_v132.sh and cdds_workflow_for_user_v121.sh but I get the same error.
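
Reading the traceback above, the ValueError is raised by json.load inside read_request, which suggests the syntax error is at line 19 of the request file (bq809.json) rather than in the setup script. A stray trailing comma before a closing brace is one common cause of "Expecting property name", though that is a guess. Below is a minimal sketch, in Python 3, for locating such an error on its own; the function name find_json_error is hypothetical and not part of CDDS.

```python
import json
import sys


def find_json_error(text):
    """Return (line, column, message) for the first JSON syntax error, or None."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        return err.lineno, err.colno, err.msg
    return None


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical script name): python check_json.py bq809.json
    with open(sys.argv[1]) as handle:
        problem = find_json_error(handle.read())
    if problem is None:
        print("No JSON syntax errors found")
    else:
        print("Line %d, column %d: %s" % problem)
```

The reported position is where the parser gave up, so the actual mistake may be on the line just before it.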

Thanks,

Vittoria

comment:19 Changed 2 months ago by grenville

Vittoria

Please raise cdds tickets on git at https://github.com/cedadev/jasmin-cdds/
I have asked Valeriu to allow you access. I assume you have a git account?

Grenville

comment:20 Changed 2 months ago by grenville

Vittoria

We need to know your github username

Grenville

comment:21 Changed 2 months ago by mvguarino

I have one; my GitHub account is MariaVike (https://github.com/MariaVike).

Let me know what else is needed, I cannot open the link you provided earlier (404 error message).

Vittoria

comment:22 Changed 2 months ago by grenville

Thanks - I'll pass this to V. You will need to be logged in to git to view the site.

comment:23 Changed 2 months ago by grenville

Vittoria

You should have access to https://github.com/cedadev/jasmin-cdds/ now — please raise the issue there.

Grenville

comment:24 Changed 2 months ago by mvguarino

Hi Grenville,

Okay, I have done it. Hopefully in the right way; I have never used GitHub this way before.

Thanks,

Vittoria

comment:25 Changed 2 months ago by grenville

Vittoria

I can't see your issue - please go here

https://github.com/cedadev/jasmin-cdds/issues

and click on New Issue - add your query there.

Grenville

comment:26 Changed 2 months ago by mvguarino

Oh I see, hopefully now it is in the right place.
many thanks,

Vittoria

comment:27 Changed 2 months ago by grenville

yep

comment:28 Changed 4 weeks ago by grenville

  • Resolution set to answered
  • Status changed from new to closed