Opened 9 years ago

Closed 9 years ago

#687 closed help (fixed)

lib-4205 : UNRECOVERABLE library error The program was unable to request more memory space.

Reported by: swr07dmm Owned by: um_support
Component: UM Model Keywords: HECToR
Cc: Platform:
UM Version: 6.6.3

Description

Hi, I am getting this error when reconfiguring. The job ID is xgklb; it is a copy of job xgkla with a few modifications, and xgkla works fine. Do you know what might be wrong?

Change History (8)

comment:1 Changed 9 years ago by ros

  • Keywords HECToR added
  • UM Version changed from <select version> to 6.6.3

Hi Dann,

Can you change the permissions on your HECToR /home and /work directories so that we can see them please?

chmod -R g+rx /home/n02/n02/dmitch

and similarly for /work

Regards,
Ros.

comment:2 Changed 9 years ago by swr07dmm

Hi Ros, I've changed the permissions now.

thanks,
Dann

comment:3 follow-up: Changed 9 years ago by grenville

Dann

One way to increase the memory available to the program is to run the reconfiguration on, say, 12 cores/node rather than the default of 24. Try setting Sub-Model Independent→Job submission, reso..→Use non-default number of cores per node to 12. I'd strongly advise you to run this as a reconfiguration-only job until you have a valid start file. Running with 12 cores/node is twice as expensive as running with 24 cores/node, but that shouldn't be a problem for the reconfiguration.
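The trade-off can be sketched with a little arithmetic. This is only an illustration: it assumes charging is per node and that a node's memory is shared evenly between the tasks placed on it. The `mem_per_node_gb` value and the function name `node_cost` are placeholders, not confirmed HECToR figures or tools.

```python
import math


def node_cost(total_tasks, tasks_per_node, cores_per_node=24, mem_per_node_gb=32.0):
    """Return (nodes needed, memory per task in GB) for a given placement.

    cores_per_node=24 is the default quoted in this ticket.
    mem_per_node_gb is a placeholder value, not a confirmed machine figure.
    """
    if not 0 < tasks_per_node <= cores_per_node:
        raise ValueError("tasks_per_node must be between 1 and cores_per_node")
    nodes = math.ceil(total_tasks / tasks_per_node)
    return nodes, mem_per_node_gb / tasks_per_node


# A hypothetical 24-task reconfiguration, fully packed and half packed.
full_nodes, full_mem = node_cost(24, 24)
half_nodes, half_mem = node_cost(24, 12)
print("24 tasks/node:", full_nodes, "node(s),", round(full_mem, 2), "GB per task")
print("12 tasks/node:", half_nodes, "node(s),", round(half_mem, 2), "GB per task")
```

Under those assumptions, halving the tasks per node doubles both the memory each task can use and the number of nodes charged.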

Grenville

comment:4 in reply to: ↑ 3 Changed 9 years ago by swr07dmm

Replying to grenville:

Hi Grenville, I've tried that but with no luck; I still get the same error message.

thanks,
Dann

comment:5 Changed 9 years ago by grenville

Dann

The problem is that the start files for xgklb have the wrong endianness. You can see this by opening the files in xconv and looking at the lower left panel:

file /work/n02/n02/dmitch/RESTART/xfwnla.da83c10 is a 64 bit ieee um file

and

file /work/n02/n02/hum/hg6.6.3/HG2CCL60_ancils/akgiea.dai5c10 is a byte swapped 64 bit ieee um file

You can use /work/n02/n02/hum/vn6.1/utils/bigend-1.1/bigend to swap the endianness:

/work/n02/n02/hum/vn6.1/utils/bigend-1.1/bigend: usage: -3264 file1 file2
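For anyone unsure what "byte swapped" means here, the following is a minimal Python sketch of the idea. It is not the bigend implementation. The helper names `swap_words` and `guess_byte_order` and the sample header values are made up for illustration, and the byte-order guess is only a heuristic that assumes the file begins with a small integer word.

```python
import struct


def swap_words(data, word_size=8):
    """Reverse the byte order inside each fixed-size word (8 bytes = 64 bit)."""
    if len(data) % word_size:
        raise ValueError("data length is not a multiple of the word size")
    return b"".join(
        data[i:i + word_size][::-1] for i in range(0, len(data), word_size)
    )


def guess_byte_order(first_word):
    """Heuristic: the byte order that yields the smaller magnitude for the
    first 64-bit integer is taken to be the file's byte order."""
    big = abs(struct.unpack(">q", first_word)[0])
    little = abs(struct.unpack("<q", first_word)[0])
    if big == little:
        return "ambiguous"
    return "big-endian" if big < little else "little-endian"


# Three made-up 64-bit header words written big-endian, then byte swapped.
sample = struct.pack(">3q", 20, 1, 3)
swapped = swap_words(sample)
print(guess_byte_order(sample[:8]), guess_byte_order(swapped[:8]))
```

Swapping twice returns the original bytes, so a quick sanity check on a converted file is that converting it back reproduces the input.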

Grenville

comment:6 Changed 9 years ago by swr07dmm

Hi Grenville,

the problem is that the ocean restart dump can't be byteswapped. I'm not sure why, but if you open the original in xconv (i.e. xconv /work/n02/n02/dmitch/RESTART/xfwnlo.da83c10) it reports errors reading extra data. That has not been a problem when I have run this job on monsoon, though. If I then byteswap the file and open the result (i.e. xconv /work/n02/n02/dmitch/RESTART/xfwnlo.da83c10bs), xconv can't even open it.

I've also tried creating the .astart and .ostart files on monsoon and then skipping the reconfiguration step on HECToR, but that doesn't seem to work either.

many thanks,
Dann

comment:7 Changed 9 years ago by grenville

Dann

xconv has problems with ocean files sometimes. Please type

export MALLOC_CHECK_=0

just before running it on an ocean dump.
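If you would rather not change your login shell's environment, the same setting can be applied to a single xconv run. This is a rough sketch, assuming xconv is on your PATH; `xconv_env` and `run_xconv` are illustrative helper names, not part of xconv or the UM. With glibc, MALLOC_CHECK_=0 tells malloc to ignore detected heap errors rather than abort.

```python
import os
import subprocess


def xconv_env(base=None):
    """Return a copy of the environment with MALLOC_CHECK_ set to 0."""
    env = dict(os.environ if base is None else base)
    env["MALLOC_CHECK_"] = "0"
    return env


def run_xconv(path, executable="xconv"):
    """Launch xconv on one file with the modified environment.

    'executable' is assumed to be findable on PATH; adjust it if your
    installation lives elsewhere.
    """
    return subprocess.run([executable, path], env=xconv_env(), check=False)


# Example call (not executed here):
# run_xconv("/work/n02/n02/dmitch/RESTART/xfwnlo.da83c10")
```

The setting only affects the child process, so other programs started from the same shell keep the default malloc behaviour.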

I'm not sure what you mean when you say the file can't be byteswapped. Did the byteswap program fail?

Grenville

comment:8 Changed 9 years ago by grenville

  • Resolution set to fixed
  • Status changed from new to closed