Opened 4 years ago

Closed 4 years ago

#1619 closed help (fixed)

job id xlqid

Reported by: avanni Owned by: um_support
Component: UM Model Keywords: IBM
Cc: Platform: MONSooN
UM Version: 8.5

Description

Hello,

I have been resubmitting this job for a few days now, trying to figure out why it won't run any more. I managed to get it to run for one month and then it no longer wants to run. I can't see anything obvious in the .leave files, but they report a floating point exception error. Is it maybe something to do with it trying to take a monthly mean and there being too many headers?

Any ideas why this might be happening?

thanks,

Annelize

Change History (45)

comment:1 Changed 4 years ago by willie

  • Keywords IBM added
  • Platform set to MONSooN
  • UM Version changed from <select version> to 8.5

Hi Annelize,

Do you still need help with this? You seem to be still working on this job.

regards

Willie

comment:2 Changed 4 years ago by avanni

Hi Willie,

I am trying different things, such as reinitialising with a new job id, but this does not seem to be working and I have no real clue why. If you could look at the .leave files and maybe see where my floating point exception (or whatever the real problem may be) is coming from, that would be much appreciated.

Thanks,

Annelize

comment:3 Changed 4 years ago by willie

Hi Annelize,

I think the last good run was ~avanni/output/xlqid015.xlqid.d15212.t211105.leave, which finished at time step 4464. This is exactly 31 days, and in this leave file it cannot find xlqida.da20091230_00. You then tried to force it past this point and got the FP invalid error. So I think this might be due to climate meaning and the difference between 30-day and 31-day months. Does that ring any bells?
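For reference, a quick check of that arithmetic (just a sketch, assuming the step count starts from zero on 1 December):

  # 4464 steps over exactly 31 days => 144 steps per day, i.e. a 10-minute timestep
  echo $(( 4464 / 31 ))       # 144
  echo $(( 24 * 60 / 144 ))   # 10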

I think you could also switch off the top and bottom scripts on the Script Mods page.

Regards

Willie

comment:4 Changed 4 years ago by avanni

Hi Willie,

That seems like it could be the problem, but what I do not understand is why this only occurs at this resolution. I have two other jobs that are set up in exactly the same way (apart from their resolution) and they do not halt at this point.

How can I prevent this from happening again?

Also, I have the .astart file for this run now. Can I use that to reinitialise the model without reconfiguring etc.?

Thanks,

Annelize

comment:5 Changed 4 years ago by willie

Hi Annelize,

Looking again at the leave files, I can see that the failing CRUN, ~avanni/output/xlqid016.xlqid.d15212.t213758.leave, had the FP invalid error after only two time steps. This occurred in the dynamics advection code, departure_point_eta_mod.F90. Sometimes there is more information in the pe_output files (in …fort1074 in this case), but these have been overwritten.

It would be advisable to CRUN from time step 4465 if possible and then look in the pe_output and any 'core' files that are produced to trace the error.
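Once that CRUN has failed, something along these lines is a reasonable first look (the paths are illustrative - adjust them to your run directory):

  cd /projects/glomodel/avanni/xlqid/pe_output
  # newest processor output files first, then search them for the failure message
  ls -lt xlqid.fort6.pe* | head
  grep -l -i -E "invalid|floating point|nan" xlqid.fort6.pe*
  # any core files should appear in the job directory, e.g. under coredir.*
  ls -d ../coredir.* 2>/dev/null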

Regards,

Willie

comment:6 Changed 4 years ago by avanni

Hi Willie,

I have tried to CRUN the xlqid run from the file xlqida.da20100101_00 and am still getting the floating point error. I have also tried starting a new job with this as the start dump. There doesn't seem to be anything in the file projects/glomodel/avanni/xlqid/pe_output/xlqid.fort6.pe0 that indicates any failure.

I am still not sure why this is happening, or why it did not happen with my other lower-resolution runs.

Thanks,

Annelize

comment:7 Changed 4 years ago by willie

Hi Annelize,

The last run in the CRUN chain is now /home/avanni/output/xlqid015.xlqid.d15221.t105425.leave, which deals with time steps 4321 - 4466, a slightly different range from the previous CRUN sequence. This has the FP invalid error, but there are two subsequent NRUNs(?) and I think these have overwritten the pe_output.

We need to repeat the CRUN 4321 - 4466, which will fail, and then look at the output. So

  • remove the coredir.31 directory
  • reinstate the history file
  cp history_archive/temp_hist004 xlqid.xhist
  • reinstate the start dump xlqida.da20091231_00
  • reinstate the PP files xlqida.p?20091229
  • do a CRUN - this needs to be the CRUN setup after the first NRUN (xlqid000.xlqid.d15220.t101506.leave)

It should then set itself up from the history file, do one CRUN and fail at the end. Then we can look at the pe_output and examine the core file.
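A rough sketch of those steps in the shell, in case it helps (the directory and archive locations are illustrative; the file names are the ones listed above):

  cd /projects/glomodel/avanni/xlqid              # job data directory (illustrative)
  rm -rf coredir.31                               # remove the old core directory
  cp history_archive/temp_hist004 xlqid.xhist     # reinstate the history file
  # reinstate the start dump and PP files from wherever you archived them, e.g.
  #   cp <archive>/xlqida.da20091231_00 .
  #   cp <archive>/xlqida.p?20091229 .
  # then resubmit from the UMUI as a CRUN (the CRUN set up after the first NRUN)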

Regards,

Willie

comment:8 Changed 4 years ago by avanni

Hi Willie,

I have tried this, though I am not sure whether I did it correctly.

I couldn't retrieve the PP files in the original format, so the model recreated them and now they are just blank.
I cannot see anything in the pe_output file, but then I am also not sure what I am looking for.

Thanks,

Annelize

comment:9 Changed 4 years ago by willie

Hi Annelize,

We need to determine the cause of the error. I think the simplest thing to do is a clean restart. Return the job to its original configuration and start a new run sequence. This will either succeed and you'll be on your way, or it will fail at the usual place. In the latter case, do not modify anything until I can look at the data.

So,

  • add the following
    ulimit -c unlimited
    
    to your .profile (if you're using the Korn shell) or .bashrc if Bash - see the sketch after this list
  • remove the directory /projects/glomodel/avanni/xlqid
  • return the job xlqid to its original config and then do an NRUN
  • when this completes, do a CRUN and leave it
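A minimal sketch of the ulimit step (assuming a Bash login shell):

  echo 'ulimit -c unlimited' >> ~/.bashrc   # or ~/.profile for the Korn shell
  . ~/.bashrc
  ulimit -c                                 # should now report "unlimited"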

Let me know when it succeeds/fails, but don't modify further until I've had a look.

My plan is to look at the last thing in the pe_output to find out what it was doing and to check any core files produced to get further details.

Regards

Willie

comment:10 Changed 4 years ago by avanni

Hi Willie,

I have re-run the job and it has crashed at the same point again.
However, I mistakenly did not add the ulimit -c unlimited to my .bashrc.
Do I need to run it again?

Thanks,

Annelize

comment:11 Changed 4 years ago by willie

Hi Annelize,

The FP invalid error is due to taking the square root of a negative number in departure_point_eta_mod.F90. The last thing in the pe_output is

********************************************************************************

 EG_SISL_Resetcon: calculate reference profile
NUDGING_MAIN: Entering routine 
 Leaving NUDGING_MAIN


********************************************************************************

 EG_SISL_Resetcon: calculate reference profile
NUDGING_MAIN: Entering routine 
 Leaving NUDGING_MAIN


********************************************************************************

 EG_SISL_Resetcon: calculate reference profile

so it would appear that the error does not occur in nudging. Do you have the old .leave files somewhere? I'd like to check whether it is always a sqrt problem.
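If you do still have them, something like this would show quickly whether every failure reports the same thing (just a sketch; the output directory is the one used above):

  grep -l -i -E "invalid|sqrt|floating point" ~/output/xlqid*.leave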

I am assuming that your N216 job xlqia worked. The main differences apart from resolution are

  • processors increased from 8x32 to 32x40
  • memory limit changed from 1.5 to 1.6 GB
  • IO services used where none were used before

The worry is that it is a memory problem and not a sqrt problem at all.

Regards

Willie

comment:12 Changed 4 years ago by avanni

Hi Willie,

I removed the .leave files when I was starting fresh because there were just too many. Sorry!

Yes, the N216 worked and so did an identical job at N96 resolution.

Is there anything I can do to test if it is a memory problem?

Regards,

Annelize

comment:13 Changed 4 years ago by willie

Hi Annelize,

Not to worry. I think we should do another clean restart, but increase the memory limit to 1.8 GB in the Job Submission panel and click loadleveller. We won't get a core file, because none is produced for floating point invalid errors (it is still a good idea to put the ulimit in), but we will get a repeat run with a bit more memory, so we can see if the error is consistent.

Regards

Willie

comment:14 Changed 4 years ago by avanni

Hi Willie,

I restarted it again with 1.8GB and changed the ulimit in my .bashrc. The job failed again at the same timestep with the square root issue.

Any ideas?

Thanks,

Annelize

comment:15 Changed 4 years ago by willie

Hi Annelize,

Do you know the provenance of xlqid? I am interested in the original Met Office job, if you know it, and in any changes you made, especially new branches. I think I will take a copy of your job, run it with some debugging on, and maybe experiment with the processor configuration to absolutely rule out memory problems. So please leave xlqid and its supporting data (start dumps and other input files) alone for now while I work on it.

Regards

Willie

comment:16 Changed 4 years ago by avanni

Hi Willie,

This job is copied from the original job anbbw, which was ported under the job id xjaba.
I have run this setup successfully before for a different year (job id xllgb). That job did cut out after three months and I had to restart it, but it ran successfully once restarted.

Thanks a lot for looking into this.

Regards,

Annelize

comment:17 Changed 4 years ago by willie

Hi Annelize,

I notice from check setup that some of your STASH means are not correctly defined, e.g. TMPMN00. This specifies a meaning period of one dump period, which you have defined elsewhere to be daily. However, you have specified a sampling period of 24 hours, so there will only be one point going into the average. I am not sure what you're intending here, but you could change the sampling period to 1 time step.

Could you take a copy of xlqid and make these changes and run it?

Regards

Willie

comment:18 Changed 4 years ago by avanni

Hi Willie,

Those time profiles were set up as part of the standard job and I have not changed them. I do not use any of the profiles that give a warning in check setup in my STASH selection, so I presume they are not the cause. These warnings also appear for the two other resolution runs and cause no problem there.
Perhaps I should remove these profiles altogether?

Thanks,

Annelize

comment:19 Changed 4 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to assigned

Hi Annelize,

Leave them for now while we try to solve the current problem; long term they should be removed.

I tried to run your job over the weekend, but it queued for the entire weekend - it seems I don't have enough fair-share priority. So, could you please make a copy of xlqid and make the following changes:

  • Compile and Run → tick the debug optimization button
  • Post Processing → Main → switch off post processing
  • Atmosphere → STASH → tick the "Deactivate diagnostics" button

Then do the NRUN/CRUNs as normal. I want to see if the problem remains after this.

Regards

Willie

comment:20 Changed 4 years ago by avanni

Hi Willie,

I have done the above, but the run failed when I tried to perform the CRUN after the NRUN.
From the .leave file, I think this might be because I don't have the STASH turned on, so the model cannot get hold of the diagnostics required for nudging?

Annelize

comment:21 Changed 4 years ago by willie

Hi Annelize,

OK, switch nudging off too - we're not doing science, just trying to pin down the source of the problem.

Regards

Willie

comment:22 Changed 4 years ago by willie

  • Owner changed from willie to um_support

Hi Annelize,

If the test gets past time step 4464 then that indicates that some combination of nudging and STASH meaning is at fault; if it takes the square root of a negative number in the dynamics advection code at the same place then the problem is more fundamental, as I now believe it to be.

I am on annual leave now, and will open this ticket to the rest of the team.

I have noticed that the ancestor jobs xjaba and xkkgb are vastly different from one another: one has ten branches and the other only four. So I am wondering if you have all the branches that you need for this job.

Regards,

Willie

comment:23 Changed 4 years ago by avanni

Hi,

So I tried to rerun this job with all of the STASH turned off, in debugging mode, and with nudging turned off. It crashed at the same timestep again, but I cannot see an error about the square root of a negative number. Also, it has attempted to produce a monthly mean file (xlqie.pm*.pp) even though I have turned off the STASH, which is what I had originally thought might have been the problem.

I am not sure what is going on here.

Thanks,

Annelize

comment:24 Changed 4 years ago by grenville

Annelize

Your job has climate meaning switched on - I doubt that is the problem.

This has been a difficult one - now we are butting up against the end of the IBM; I doubt that we will have time on the IBM to resolve this.

Have you tried running on the new machine?

Grenville

comment:25 Changed 4 years ago by avanni

Hi Grenville,

I have just tried to submit the job to the new machine under the name 'xlqif', but I got an error in the reconfiguration - something to do with the number of nodes requested by the reconfiguration, though I am not sure that I am interpreting the error correctly.

Thanks,

Annelize

comment:26 Changed 4 years ago by grenville

Annelize

That's a bug we'd not seen. We'll fix it, but I ran the reconfiguration - the reconfigured start file is

/projects/umadmin/gmslis/xlqjc/xlqjc.astart

You can use this file to run your model.

Grenville

comment:27 Changed 4 years ago by avanni

Hi Grenville,

I have tried to run my job from this .astart file but I am getting this in the .leave file:

ATP Stack walkback for Rank 628 starting:

_start@…:113
libc_start_main@…:242
um_main_@…:19
um_shell_@…:1132
u_model_4a_@…:1944
atm_step_4a_@…:6471
atmos_physics1_@…:2854
ni_gwd_ctl$ni_gwd_ctl_mod_@…:978
DEALLOCATE@0x22fd141
free@0x2c9b3a4

ATP Stack walkback for Rank 628 done
Process died with signal 11: 'Segmentation fault'

I am not sure why I am getting a Segmentation fault. I have never initialised a run from a .astart file without running the full reconfiguration. Perhaps I have done something wrong here?

Thanks,

Annelize

comment:28 Changed 4 years ago by grenville

Annelize

You haven't done anything wrong — the Cray has found a problem with the model which the IBM appears to have let through.

I'm trying to see what's happening - it's looking for

/nerc/ukca/analyses/era-in/ecm-e40_1deg-model-levs_2009120100_all.nc

Where is this now?

Grenville

comment:29 Changed 4 years ago by avanni

Hi Grenville,

With the migration, the ukca folder containing this is now projects/ukca-admin/analyses/era-in/, so I am not sure that this is still the overarching issue.

I have changed the directory and it should start running.

Annelize

comment:30 Changed 4 years ago by avanni

… just to append to that.

I changed the directory to the correct one and I am still getting an error. How did you know that it was looking for that file? I can't see any hint of it in the .leave file.

Thanks,

Annelize

comment:31 Changed 4 years ago by grenville

The problem is in gwd - I switched that off and the model runs OK.

I'd like to run with gwd on and nudging off - we have not had a problem with gwd at 8.5, but we've not run with nudging.

Grenville

comment:32 Changed 4 years ago by grenville

Annelize

This job runs if you switch off the gravity-wave-drag diagnostics (section 6). We're working on what's causing the problem with gwd diagnostics, so if you don't need them, please try running again.

Grenville

comment:33 Changed 4 years ago by avanni

Hi Grenville,

Unfortunately, the very thing that I am looking at is the response of the gravity wave drag to changes in resolution. Do you know if it is the spectral or the orographic drag that is causing the problem?

Thanks,

Annelize

comment:34 Changed 4 years ago by grenville

Don't know yet; it's the x-component of the surface ss0 stress that is causing the problem - do you know where that comes from?

Grenville

comment:35 Changed 4 years ago by avanni

Hi Grenville,

That is from the orographic gravity wave component. If that is the only diagnostic causing this then I can turn it off because it can be determined from other diagnostics.

I will try that and let you know if it runs.

Thanks,

Annelize

comment:36 Changed 4 years ago by grenville

Annelize

Please try - comments in the code indicate that stash 6 235 is the same as (one level of) 6 201.

Grenville

comment:37 Changed 4 years ago by avanni

Hi Grenville,

I removed STASH 6235 and the model ran. However, it has again stopped at the 1st of January, as it did on the IBM. There is no clear error in the .leave file.
The job id is xlqif.

Thanks,

Annelize

comment:38 Changed 4 years ago by grenville

Annelize

I think the problem is explained in the leave file, where you'll see lots of entries like

Wrong calendar setting in Ancillary File 5

Model run is set up for 365 day calendar.
Ancillary File is for 360 day calendar.

Your model fails at the new year of 2010, but it's confused about where the new year begins because of the calendar mismatch.
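A quick way to see how widespread the mismatch is (a sketch; adjust the path to wherever your Cray .leave files go):

  # count the warnings and see which ancillary files they refer to
  grep -c "Wrong calendar setting" ~/output/xlqif*.leave
  grep -h "Wrong calendar setting in Ancillary File" ~/output/xlqif*.leave | sort | uniq -c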

You need to decide whether you want a 360-day or a 365-day calendar and set up the model accordingly. It looks like the ancillary files are 360-day, but the model is 365-day. It's probably easiest to change the model: navigate to model selection → input/output control … → general config and check "Use 360 day calendar".

You'll need to ensure that all ancillary files are consistent.

Grenville

comment:39 Changed 4 years ago by avanni

Hi Grenville,

I have a modification using the branch fcm:um/branches/dev/jwalton/vn8.5_ignore_calendar_UKMO/src which makes the model ignore the ancillary date, because I am nudging towards the ERA-Interim data, which is on the 365-day calendar. The model then updates from the 360-day calendar, even though it is set to be 365-day.
This should apparently not be a problem when I am only running for two months; Mohit Dalvi suggested that I do this. I have used this setup in all of my model runs and they have worked.

If the model were free-running then I could change it to 360-day, but unfortunately it won't work with the nudging.

Annelize

comment:40 Changed 4 years ago by avanni

Hi Grenville,

I have noticed a difference between my resolutions that I think might be relevant.

In Model selection → Atmosphere → Scientific Parameters … → Spec of trace gases, some of the values are not specified for the year 2011 onwards. Maybe the model needs these to update the year 2010 (perhaps there is interpolation from one year to the next?).

A shot in the dark, but it's worth a try?

Annelize

comment:41 Changed 4 years ago by grenville

Annelize

Please don't run this from 1 Dec 2009 yet — it'll take ages to reach the failure and your 81 nodes will lock out other attempts to debug the problem. You need to keep some start files so that you can rerun shorter jobs.

I doubt the 2011 files are an issue.

Grenville

comment:42 Changed 4 years ago by grenville

Annelize

You were right! My doubt was unfounded. It's the trace gases causing the problem. I "Excluded" methane absorption etc. and the model runs OK.

Grenville

comment:43 Changed 4 years ago by avanni

That's brilliant! Should I just add the years and values for methane etc. from the other resolution runs into this run and CRUN from where it crashed?

Thanks,

Annelize

comment:44 Changed 4 years ago by grenville

Yes I think that should be OK.

Grenville

comment:45 Changed 4 years ago by avanni

  • Resolution set to fixed
  • Status changed from assigned to closed

Hey,

Just to let you know, the model has completed successfully.

We can close the ticket now.

Thanks a lot for your help,

Annelize
