Opened 5 years ago

Closed 4 years ago

#1556 closed error (answered)

'SIGFPE - Floating-point exception' and 'Due to memory limitation eager limit is reduced...'

Reported by: avanni
Owned by: um_support
Component: UM Model
Keywords:
Cc:
Platform: MONSooN
UM Version: 8.5

Description

Hello,

I am trying to run a relatively high-resolution job (N512) on MONSooN (job ID xlaee).

However, I am getting the following error:

MPCI_MSG: ATTENTION: Due to memory limitation eager limit is reduced to 16384.
Usage: basename string [suffix]
qsserver: Waiting for command 2
Filtering initial dump data. n_filt= 8

Signal received: SIGFPE - Floating-point exception

Signal generated for floating-point exception:

FP overflow

Instruction that generated the exception:

fmul fr00,fr00,fr29

I am not sure what this means or how to go about fixing this.

Is it because the resolution is so high?

Change History (12)

comment:1 Changed 5 years ago by grenville

Annelize

The problem is a floating-point exception - the leave file says roughly where:

Traceback:

Offset 0x00002a1c in procedure calc_div_ep_flux_mod_NMOD_calc_div_ep_flux_, near line 486 in file /projects/glomodel/avanni/xlaee/umatmos/ppsrc/UM/atmosphere/climate_diagnostic/calc_div_ep_flux.f90

There is a division near line 486 (zvpthp(y,z) / dthdz(y,z)).

Did this job work before you got it?

Grenville
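The division pointed out in comment 1 would overflow if dthdz(y,z) were very close to zero, which is one plausible reading of the "FP overflow" in the leave file. The sketch below illustrates a guarded division of that kind. It is written in Python rather than the model's Fortran, it is not a fix that was applied in this ticket, and the names guarded_divide, divide_field, MISSING and the threshold eps are all hypothetical; only the array names zvpthp and dthdz come from the comment above.

```python
import math

# Hypothetical placeholder for "no valid data"; not taken from the UM source.
MISSING = float("nan")


def guarded_divide(numerator, denominator, eps=1.0e-12):
    """Return numerator / denominator, or MISSING when |denominator| < eps.

    eps is an illustrative threshold, not a value used by the model.
    """
    if abs(denominator) < eps:
        return MISSING
    return numerator / denominator


def divide_field(zvpthp, dthdz, eps=1.0e-12):
    """Apply guarded_divide element-wise to two equally shaped 2-D lists."""
    return [
        [guarded_divide(n, d, eps) for n, d in zip(num_row, den_row)]
        for num_row, den_row in zip(zvpthp, dthdz)
    ]


if __name__ == "__main__":
    # Made-up sample values, including near-zero and zero denominators.
    sample_zvpthp = [[2.0, 3.0], [1.0e300, 4.0]]
    sample_dthdz = [[0.5, 1.5], [1.0e-300, 0.0]]
    print(divide_field(sample_zvpthp, sample_dthdz))
```

Points where the denominator falls below the threshold are flagged as missing instead of producing a huge or infinite quotient, so the rest of the field can still be computed and inspected.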

comment:2 Changed 5 years ago by avanni

Hi Grenville,

The job has been adapted from an N216 resolution run, which did work perfectly.

Might it be because I am starting from an N216 resolution start dump?

Annelize

comment:3 Changed 5 years ago by avanni

Just to add to this: I tried it with an N512 resolution start dump and got an error saying 'INITTIME: Atmosphere basis time mismatch', which makes sense since the start dates are different. However, I am struggling to find a start dump in the right month for that resolution.

comment:4 Changed 5 years ago by grenville

Annelize

I don't think there is any problem with the start dump - your model ran OK for a while (36 timesteps). Please try running the original job with diagnostics switched off (deactivate the diagnostics in the STASH window).

Grenville

comment:5 Changed 5 years ago by avanni

  • Resolution set to fixed
  • Status changed from new to closed

I have fixed this by using a different N512 job as an example and it runs fine now.

Thanks!

Annelize

comment:6 Changed 5 years ago by avanni

  • Resolution fixed deleted
  • Status changed from closed to reopened

comment:7 Changed 5 years ago by avanni

I thought it had been fixed, since it ran for three months, but I am now getting this error in the job named xlaef:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: io:buffin
? Error Code: 24
? Error Message: Error in buffin errorCode=3.00 len= 256/ 256
? Error generated from processor: 0
? This run generated 5 warnings
????????????????????????????????????????????????????????????????????????????????

comment:8 Changed 5 years ago by grenville

Annelize

Please tell us the job ID.

Grenville

comment:9 Changed 5 years ago by avanni

Hi Grenville,

The job id is xlaef.

I am now getting the following error:

Could not load program xlaef.exe:
Symbol resolution failed for xlaef.exe because:

Symbol 1 (number /usr/lib/libc.a[shr_64.o]) is not exported from dependent

and it goes on like this...

Thanks,

Annelize

comment:10 Changed 5 years ago by grenville

Annelize

The previous problem:

????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: io:buffin
? Error Code: 24
? Error Message: Error in buffin errorCode=3.00 len= 256/ 256
? Error generated from processor: 0
? This run generated 5 warnings
????????????????????????????????????????????????????????????????????????????????

was apparently the result of the model not being able to find a file - see the leave file:

OPEN: File /projects/glomodel/avanni/xlaef/xlaefa.da19971219_00 to be Opened on Unit 21 does not Exist
OPEN: WARNING: FILE NOT FOUND
OPEN: Ignored Request to Open File /projects/glomodel/avanni/xlaef/xlaefa.da19971219_00 for Reading

We have seen the second error (the symbol resolution failure) on MONSooN before, but are not sure why it happens - it has gone away after a rebuild. Please try rebuilding the model.

Grenville
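The missing-file diagnosis in comment 10 can be checked before resubmitting by confirming that the dump the model is about to read is actually on disk. The sketch below is only an illustration: the helper names parse_dump_name and missing_dumps are hypothetical, and the file name pattern is inferred from the single name quoted in the leave file (xlaefa.da19971219_00), read here as a five-character job ID, the letter "a", and a date and hour stamp. That reading is an assumption and other naming schemes are not handled.

```python
import os
import re
from datetime import datetime

# Pattern inferred from one example name (xlaefa.da19971219_00); assumption only.
DUMP_NAME = re.compile(
    r"^(?P<jobid>[a-z0-9]{5})a\.da(?P<date>\d{8})_(?P<hour>\d{2})$"
)


def parse_dump_name(filename):
    """Split a dump file name into (job ID, validity datetime)."""
    match = DUMP_NAME.match(filename)
    if match is None:
        raise ValueError(f"unrecognised dump file name: {filename}")
    date = match.group("date")
    stamp = datetime(
        int(date[:4]), int(date[4:6]), int(date[6:8]), int(match.group("hour"))
    )
    return match.group("jobid"), stamp


def missing_dumps(data_dir, filenames):
    """Return the names in filenames that are not regular files in data_dir."""
    return [
        name
        for name in filenames
        if not os.path.isfile(os.path.join(data_dir, name))
    ]


if __name__ == "__main__":
    # Directory and file name as quoted in the leave file above.
    data_dir = "/projects/glomodel/avanni/xlaef"
    wanted = ["xlaefa.da19971219_00"]
    print(parse_dump_name(wanted[0]))
    print("missing:", missing_dumps(data_dir, wanted))
```

If the expected dump is reported as missing, the earlier run segment probably did not write it, or it was moved or deleted, and that is worth resolving before looking at the buffin error itself.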

comment:11 Changed 4 years ago by grenville

Annelize

What's the status of this? Did you find out why the model failed to find the file?

Grenville

comment:12 Changed 4 years ago by grenville

  • Resolution set to answered
  • Status changed from reopened to closed

Ticket closed - lack of activity
