Opened 7 years ago

Closed 6 years ago

#1048 closed help (fixed)

Job crashes with segmentation fault.

Reported by: Emma_Turner
Owned by: willie
Component: UM Model
Keywords: segmentation fault, totalview
Cc:
Platform: HECToR
UM Version: 6.6.3

Description

Hello,

I have got to the stage where the model starts to run (two pbh output files are produced, the first with data and the second without) and then fails with segmentation faults. The relevant part of the .leave file is shown below.

_pmiu_daemon(SIGCHLD): [NID 00242] [c2-0c2s6n0] [Thu Apr 4 15:48:22 2013] PE RANK 48 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 00243] [c2-0c2s6n1] [Thu Apr 4 15:48:22 2013] PE RANK 80 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 00204] [c2-0c2s6n2] [Thu Apr 4 15:48:22 2013] PE RANK 112 exit signal Segmentation fault
_pmiu_daemon(SIGCHLD): [NID 00205] [c2-0c2s6n3] [Thu Apr 4 15:48:22 2013] PE RANK 144 exit signal Segmentation fault
[NID 00205] 2013-04-04 15:48:22 Apid 4179295: initiated application termination
diff: /work/d43/d43-geos-climate/eturner/tmp/tmp.hector-xe6-14.12255/xihtb.xhist: No such file or directory
qsexecute: Copying /work/d43/d43-geos-climate/eturner/um/xihtb/W/xihtb.thist to backup thist file /work/d43/d43-geos-climate/eturner/um/xihtb/W/xihtb.thist_keep
xihtb: Run failed

We are trying to diagnose the cause of this failure. We have been trying to use totalview as recommended by HECToR but aren't having much luck. What would you recommend for debugging and how would we go about it? We have clicked maximum error message options in the umui. I have attached our .leave file.

Thanks!
Emma

Change History (17)

comment:1 Changed 7 years ago by Emma_Turner

The .leave file is proving difficult to attach.

comment:2 Changed 7 years ago by ros

Hi Emma,

Just tell us the name of the .leave file on HECToR and we'll take a look at it.

Cheers,
Ros.

comment:3 Changed 7 years ago by Emma_Turner

Hi Ros,

The name is xihtb000.xihtb.d13094.t151154.leave and it is in directory /home/d43/d43-geos-climate/eturner/um/umui_out

Many thanks for looking at it.
Emma

comment:4 Changed 7 years ago by willie

Hi Emma,
Could you give us read permission please:

chmod -R g+rX /home/d43/d43-geos-climate/eturner

Thanks,

Willie

comment:5 Changed 7 years ago by ros

Hi Emma,

We are unable to see any of the disk space under the /d43 group on HECToR. If you transfer the .leave file to PUMA we will be able to take a look at it, and it may then be obvious what is wrong. However, we still won't be able to access any of the model output directories should we need to see what else is going on. I believe Grenville gave you an account under n02. If you could use your /home/n02/n02/eturner and /work/n02/n02/eturner directories, we will be able to help much more easily if you have any problems.

Regards,
Ros.

comment:6 Changed 7 years ago by Emma_Turner

Hi,

Hopefully you now have read permissions in d43-geos-climate/eturner. Let me know if not. There is a problem with using the n02 directories as my home: I don't think I have any space allocated there to deposit model files, because I don't have an official NCAS account and am part of a different project. We had problems this last week with my default home directory being different from the d43-geos-climate one, and after trying to work around it we had to ask HECToR to change $HOME from their end. I don't want to change this again, as I fear it will bring up a new set of similar problems.

Thanks
Emma

comment:7 Changed 7 years ago by ros

Hi Emma,

As I said above we are unable to see anything below /home/d43 so changing permissions on your /home/d43/d43-geos-climate/eturner will not help.

Please copy your .leave file to PUMA and hopefully we will be able to see what is going wrong.

Cheers,
Ros.

comment:8 Changed 7 years ago by Emma_Turner

Hi Ros,

Sorry, I didn't know about the d43 permissions. I have copied the file xihtb000.xihtb.d13094.t151154.leave into /home/Emma_Turner on PUMA.

Thanks
Emma

comment:9 Changed 7 years ago by willie

  • Owner changed from um_support to willie
  • Status changed from new to accepted

Hi Emma,

Your leave file has some NaNs at line 1211676. Since this occurs before the first time step, it suggests that one or more of the input files are corrupt. The model then runs for another 55 time steps and then has a segmentation fault which generates the core file. Looking at the core file with the debugger is not going to be helpful, since the NaNs were present at the start.
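For a leave file of this size, the first NaN can be located with a short script rather than by eye. The following is only a sketch in Python: the function names are made up for illustration, and it assumes NaNs are printed as the token "NaN" (in any letter case) in the model output.

```python
import re
from typing import Iterable, Optional, Tuple

# Matches "NaN" as a standalone token, case-insensitively, so that words
# such as "nanometre" are not reported. Assumption: the model prints NaNs
# as the literal token "NaN".
NAN_TOKEN = re.compile(r"(?<![A-Za-z])nan(?![A-Za-z])", re.IGNORECASE)


def first_nan_line(lines: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (1-based line number, line text) of the first line with a NaN."""
    for number, line in enumerate(lines, start=1):
        if NAN_TOKEN.search(line):
            return number, line.rstrip("\n")
    return None


def scan_leave_file(path: str) -> Optional[Tuple[int, str]]:
    """Scan a .leave file on disk, reading it line by line."""
    # errors="replace" because a crashed run can leave non-text bytes in the log.
    with open(path, "r", errors="replace") as handle:
        return first_nan_line(handle)
```

Calling `scan_leave_file("xihtb000.xihtb.d13094.t151154.leave")` returns the line number and text of the first match, or `None` if there is none. Comparing that line number with where the first time step starts shows whether the NaNs were already present in the input.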

The message "WARNING inbalance in water budget" may be significant - it occurs very early on and is corrupted in a number of places.

To check input files for NaNs, compare them with themselves using cumf:

  cumf -dOUT ~ file1 file1

This puts the results in the home directory. You only need to look at the summary file. There should be no differences reported for a good file.
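If there are many input files, the self-comparison can be scripted. The sketch below is in Python and simply repeats the cumf command shown above for each file; the helper names are hypothetical, it assumes `cumf` is on your PATH, and it makes no assumption about what cumf's exit status means, so the summary files still need to be read.

```python
import os
import subprocess
from typing import List, Sequence


def cumf_self_compare_command(um_file: str, out_dir: str = "~") -> List[str]:
    """Build the argument list for 'cumf -dOUT <out_dir> <file> <file>'."""
    # "~" is expanded here because no shell is involved when the command runs.
    return ["cumf", "-dOUT", os.path.expanduser(out_dir), um_file, um_file]


def run_self_compares(um_files: Sequence[str]) -> None:
    """Run the self-comparison for each file in turn."""
    for um_file in um_files:
        command = cumf_self_compare_command(um_file)
        print("Running:", " ".join(command))
        # check=False: the exit status is not interpreted here; look at the
        # summary file that cumf reports for each run instead.
        subprocess.run(command, check=False)
```

Pass `run_self_compares` the list of UM-format input files used by the job; plain-text inputs should be left out.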

NaNs can also be generated by incorrect code, so that might also be a place to look.

Regards,

Willie


comment:10 Changed 7 years ago by Emma_Turner

Hi Willie,

Thanks. I am systematically going through the ancillary files with this method and finding no errors, except under 'Model Selection > Atmosphere > Ancillary and input data files > Climatologies & Potential climatologies > Natural climate forcing', where I am getting the following errors:

Volcanic forcing file

cumf -dOUT ~ /work/n02/n02/hum/hg6.6.3/HG2AMIP_ancils/volcts_sato02e.dat /work/n02/n02/hum/hg6.6.3/HG2AMIP_ancils/volcts_sato02e.dat
/work/n02/n02/hum/hg6.6.3/cce/utils/cumf: line 165: 16833: Memory fault
Problem with CUMF program
cat: /home/d43/d43-geos-climate/eturner/cumf_temp.16832: No such file or directory
Summary in:                       ,/home/d43/d43-geos-climate/eturner/cumf_summ.16832
Full output in                    ,/home/d43/d43-geos-climate/eturner/cumf_full.16832
Difference maps (if available) in:,/home/d43/d43-geos-climate/eturner/cumf_diff.16832
rm: cannot remove `/home/d43/d43-geos-climate/eturner/cumf_temp.16832': No such file or directory

Solar forcing file

cumf -dOUT ~ /work/n02/n02/hum/hg6.6.3/HG2AMIP_ancils/scvary_l09a.dat /work/n02/n02/hum/hg6.6.3/HG2AMIP_ancils/scvary_l09a.dat
/work/n02/n02/hum/hg6.6.3/cce/utils/cumf: line 165: 8776: Memory fault
Problem with CUMF program
cat: /home/d43/d43-geos-climate/eturner/cumf_temp.8775: No such file or directory
Summary in:                       ,/home/d43/d43-geos-climate/eturner/cumf_summ.8775
Full output in                    ,/home/d43/d43-geos-climate/eturner/cumf_full.8775
Difference maps (if available) in:,/home/d43/d43-geos-climate/eturner/cumf_diff.8775
rm: cannot remove `/home/d43/d43-geos-climate/eturner/cumf_temp.8775': No such file or directory

Do you know what might be causing this error?

Thanks
Emma

comment:11 Changed 7 years ago by willie

Hi Emma,

The cumf tool compares UM files. The forcing files are just ASCII text: you can see this by running the `file' command on them. They won't contain any NaNs.
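A rough way to separate plain-text inputs from UM-format files before running cumf is sketched below in Python. It is far cruder than the `file' command: it only tests whether the first few kilobytes are printable ASCII. The function names and the sample size are choices made for this illustration.

```python
# Bytes accepted as text: tab, line feed, carriage return and printable ASCII.
TEXT_BYTES = frozenset({9, 10, 13} | set(range(32, 127)))


def is_ascii_text(sample: bytes) -> bool:
    """Return True if the sample is non-empty and contains only text bytes."""
    return bool(sample) and all(byte in TEXT_BYTES for byte in sample)


def file_looks_like_ascii(path: str, sample_size: int = 4096) -> bool:
    """Apply the check to the first sample_size bytes of a file."""
    with open(path, "rb") as handle:
        return is_ascii_text(handle.read(sample_size))
```

Files for which `file_looks_like_ascii` returns True, such as the volcanic and solar forcing files, can be skipped when running cumf.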

Regards,

Willie

comment:12 Changed 7 years ago by Emma_Turner

Hi Willie,

I have systematically been through the ancillary files and found no NaNs using the cumf tool. When you say incorrect code do you mean the actual code within the UM? I haven't applied any extra coding to it myself. I'm not quite sure how to diagnose the problem, would it help if I got world read access to my d43 directory?

Thanks
Emma

comment:13 Changed 7 years ago by willie

Hi Emma,

Did the job xihta run successfully? Assuming that it did, there are three main areas where xihtb differs:

  • hand edits
  • Ancillary volcanic forcing
  • STASH

You can do a Job > Difference in the UMUI to see the details. It may be helpful to examine these areas one by one.

Regards,

Willie

comment:14 Changed 6 years ago by Emma_Turner

Hi Willie,

xihta did not run successfully. We just made the copy so we could make some changes from our original, and the errors are the same. However, xihta was copied from another job that did run successfully, although that job was run in the n02 environment instead of d43. A difference of xihtb with that job shows that, apart from user details and home spaces, the only differences were STASH codes. I tried removing the extra diagnostics that I had put in (I had added lots of cloud-related ones, and thought a corrupt one might be playing havoc with the water budget), but the error was the same.

Regards,
Emma

comment:15 Changed 6 years ago by willie

Hi Emma,

It looks like the hand edit triffid_never.sh is not working properly. See the file ~Emma_Turner/umui_jobs/xihtb/EXT_SCRIPT_LOG for details. I think if you remove the "/tmp/" everywhere in this file it should work: you have no access to /tmp on PUMA.

Many versions of the UM display the results of the hand edits in a window when you "process", but in 6.6.3 you have to check the log file.
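One way to make that change on a private copy of the hand edit is sketched below in Python. It reads "remove the /tmp/" literally, deleting that substring wherever it occurs so that the files are written to the current directory instead; the function names and file paths are hypothetical.

```python
from pathlib import Path
from typing import List, Tuple


def strip_tmp(script_text: str) -> Tuple[str, List[int]]:
    """Delete every "/tmp/" substring.

    Returns the edited text and the 1-based numbers of the lines that changed.
    """
    new_lines = []
    changed = []
    for number, line in enumerate(script_text.splitlines(keepends=True), start=1):
        stripped = line.replace("/tmp/", "")
        if stripped != line:
            changed.append(number)
        new_lines.append(stripped)
    return "".join(new_lines), changed


def write_edited_copy(source: str, destination: str) -> List[int]:
    """Write an edited copy of the script, leaving the original untouched."""
    new_text, changed = strip_tmp(Path(source).read_text())
    Path(destination).write_text(new_text)
    return changed
```

The returned line numbers make it easy to review each change before pointing the job at the edited copy and checking EXT_SCRIPT_LOG again after processing.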

Regards

Willie

comment:16 Changed 6 years ago by Emma_Turner

Hi Willie,

I copied the triffid_never.sh file to my own PUMA homespace and took out all the /tmp/ references. The model has now run successfully. Thank you for helping me with this.

Kind regards
Emma

comment:17 Changed 6 years ago by willie

  • Resolution set to fixed
  • Status changed from accepted to closed