Opened 6 years ago
Closed 6 years ago
#1556 closed error (answered)
'SIGFPE - Floating-point exception' and 'Due to memory limitation eager limit is reduced...'
| Reported by: | avanni | Owned by: | um_support |
|---|---|---|---|
| Component: | UM Model | Keywords: | |
| Cc: | | Platform: | MONSooN |
| UM Version: | 8.5 | | |
Description
Hello,
I am trying to run a relatively high resolution job (N512) on MONSooN (job ID xlaee).
However, I am getting the following error:
MPCI_MSG: ATTENTION: Due to memory limitation eager limit is reduced to 16384.
Usage: basename string [suffix]
qsserver: Waiting for command 2
Filtering initial dump data. n_filt= 8
Signal received: SIGFPE - Floating-point exception
Signal generated for floating-point exception:
FP overflow
Instruction that generated the exception:
fmul fr00,fr00,fr29
I am not sure what this means or how to go about fixing this.
Is it because the resolution is so high?
Change History (12)
comment:1 Changed 6 years ago by grenville
comment:2 Changed 6 years ago by avanni
Hi Grenville,
The job has been adapted from an N216 resolution run, which did work perfectly.
Might it be because I am starting from an N216 resolution start dump?
Annelize
comment:3 Changed 6 years ago by avanni
Just to add to this: I tried it with an N512 resolution start dump and got an error saying 'INITTIME: Atmosphere basis time mismatch', which makes sense since the start dates are different. However, I am struggling to find a start dump in the right month for that resolution.
comment:4 Changed 6 years ago by grenville
Annelize
I don't think there is any problem with the start dump - your model ran OK for a while (36 timesteps). Please try running the original job with diagnostics switched off (deactivate diagnostics in the STASH window).
Grenville
comment:5 Changed 6 years ago by avanni
- Resolution set to fixed
- Status changed from new to closed
I have fixed this by using a different N512 job as an example and it runs fine now.
Thanks!
Annelize
comment:6 Changed 6 years ago by avanni
- Resolution fixed deleted
- Status changed from closed to reopened
comment:7 Changed 6 years ago by avanni
I thought it had been fixed, since it ran for three months, but I am now getting this error in the job named xlaef:
????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: io:buffin
? Error Code: 24
? Error Message: Error in buffin errorCode=3.00 len= 256/ 256
? Error generated from processor: 0
? This run generated 5 warnings
????????????????????????????????????????????????????????????????????????????????
comment:8 Changed 6 years ago by grenville
Annelize
Please tell us the job ID.
Grenville
comment:9 Changed 6 years ago by avanni
Hi Grenville,
The job id is xlaef.
I am now getting the following error:
Could not load program xlaef.exe:
Symbol resolution failed for xlaef.exe because:
Symbol 1 (number /usr/lib/libc.a[shr_64.o]) is not exported from dependent
and it goes on like this….
Thanks,
Annelize
comment:10 Changed 6 years ago by grenville
Annelize
The previous problem:
????????????????????????????????????????????????????????????????????????????????
???!!!???!!!???!!!???!!!???!!!???!!! ERROR ???!!!???!!!???!!!???!!!???!!!???!!!?
? Error in routine: io:buffin
? Error Code: 24
? Error Message: Error in buffin errorCode=3.00 len= 256/ 256
? Error generated from processor: 0
? This run generated 5 warnings
????????????????????????????????????????????????????????????????????????????????
was apparently the result of the model not being able to find a file - see the leave file:
OPEN: File /projects/glomodel/avanni/xlaef/xlaefa.da19971219_00 to be Opened on Unit 21 does not Exist
OPEN: WARNING: FILE NOT FOUND
OPEN: Ignored Request to Open File /projects/glomodel/avanni/xlaef/xlaefa.da19971219_00 for Reading
We have seen the second error (the symbol resolution failure when loading xlaef.exe) on MONSooN before, but are not sure why it happens - it has gone away after a rebuild. Please try rebuilding the model.
Grenville
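For anyone chasing the same buffin failure, a quick way to see which files the model could not open is to scan the leave file for the "does not Exist" lines quoted above. The following is only a minimal sketch: the function name `missing_files` and the regular expression are assumptions based on the three log lines in this comment, not part of the UM or its tooling, and other UM versions may word the message differently.

```python
import re

# Pattern modelled on the leave-file line quoted in this ticket, e.g.
#   OPEN: File <path> to be Opened on Unit 21 does not Exist
# (assumption: other UM versions may format this message differently).
MISSING_FILE = re.compile(
    r"OPEN:\s+File\s+(?P<path>\S+)\s+to be Opened on Unit\s+(?P<unit>\d+)\s+does not Exist"
)


def missing_files(leave_text):
    """Return a list of (path, unit) pairs for files reported as missing.

    Duplicate reports of the same file and unit are listed only once,
    in the order they first appear.
    """
    found = []
    for line in leave_text.splitlines():
        match = MISSING_FILE.search(line)
        if match:
            entry = (match.group("path"), int(match.group("unit")))
            if entry not in found:
                found.append(entry)
    return found
```

Calling `missing_files(open(leave_path).read())` on a leave file (where `leave_path` is whatever path your job writes to) gives the list of dumps to look for before resubmitting.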
comment:11 Changed 6 years ago by grenville
Annelize
What's the status of this? Did you find out why the model failed to find the file?
Grenville
comment:12 Changed 6 years ago by grenville
- Resolution set to answered
- Status changed from reopened to closed
Ticket closed - lack of activity
Annelize
The problem is a floating point exception - the leave file says roughly where:
there is a division near line 486 (zvpthp(y,z) / dthdz(y,z))
Did this job work before you got it?
Grenville
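To illustrate why a division such as zvpthp(y,z) / dthdz(y,z) can end in a floating-point overflow, here is a small Python sketch (not UM code) that flags grid points where the denominator is close to zero or the quotient is no longer finite. The function name `find_unsafe_divisions`, the threshold `tiny`, and the nested-list layout are all assumptions made for illustration; whether dthdz was actually near zero in this job is not known from the ticket.

```python
import math


def find_unsafe_divisions(numerator, denominator, tiny=1e-12):
    """Return (row, col, reason) for points where numerator/denominator is risky.

    Both arguments are 2-D nested lists of floats with the same shape.
    `tiny` is a hypothetical threshold below which the denominator is
    treated as effectively zero.
    """
    flagged = []
    for j, (num_row, den_row) in enumerate(zip(numerator, denominator)):
        for k, (num, den) in enumerate(zip(num_row, den_row)):
            if not (math.isfinite(num) and math.isfinite(den)):
                flagged.append((j, k, "non-finite input"))
            elif abs(den) < tiny:
                flagged.append((j, k, "denominator near zero"))
            elif not math.isfinite(num / den):
                # Python returns inf here rather than raising, unlike a
                # model built with floating-point trapping enabled.
                flagged.append((j, k, "quotient overflows"))
    return flagged
```

In a model compiled with floating-point trapping, the first point of this kind stops the run with SIGFPE, so a check like this on the fields feeding the division can help narrow down which inputs have gone bad.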