Opened 12 years ago

Closed 12 years ago

#92 closed help (fixed)

Segmentation fault in call to GCG_RVECSUMR

Reported by: rjel@… Owned by: um_support
Component: UM Model Keywords:
Cc: Platform:
UM Version:

Description

Hi,
In polar1 there is a call to the above subroutine which seems to crash the model.
The job is called xcpwe (a 432*325 global run)
and the scripts and output from the compilation and run are stored in
/hpcx/home/n02/n02/richie/umui_runs

the mod I have written for print statements is in
/hpcx/home/n02/n02/richie/um_mods
and is called
debug_um.mod

there is also a hand edit in ~richie called
insert_for_cntlatm
because the model fails to read CNTRLATM otherwise.

In addition to the experiment shown I have submitted a few jobs with different processor configurations i.e. 4*8, 8*4, 4*4 and finally 8*16.
The final job gets further than my original attempt but still hits a segmentation fault later on. I am looking into this problem now.
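
For reference, the local subdomain sizes implied by these decompositions can be estimated with a short sketch. It assumes that "4*8" means 4 processors east-west by 8 north-south, and that the 432 columns and 325 rows are split as evenly as possible; neither assumption is confirmed for how the UM actually partitions the grid, and the function names are hypothetical.

```python
def split_evenly(n_points, n_procs):
    """Split n_points across n_procs as evenly as possible (assumed scheme)."""
    base, extra = divmod(n_points, n_procs)
    # The first `extra` processors each take one additional point.
    return [base + 1 if i < extra else base for i in range(n_procs)]


def local_sizes(nx, ny, procs_ew, procs_ns):
    """Return the largest (columns, rows) held by any single processor."""
    cols = split_evenly(nx, procs_ew)
    rows = split_evenly(ny, procs_ns)
    return max(cols), max(rows)


# Decompositions tried in this ticket, read as east-west x north-south.
for ew, ns in [(4, 8), (8, 4), (4, 4), (8, 16)]:
    max_cols, max_rows = local_sizes(432, 325, ew, ns)
    print(f"{ew}x{ns}: {ew * ns} processors, "
          f"largest subdomain {max_cols} x {max_rows}")
```

Because 325 rows do not divide evenly by 4, 8 or 16, some processors carry one more row than others under this scheme.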

My question is: is there an optimal configuration of processors for a 432*325 horizontal grid, or is there a mod that overcomes this segmentation fault?

Thanks in advance for your help,
Richard
rjel@…

Change History (11)

comment:1 Changed 12 years ago by lois

To run the global model at this resolution (432 x 325) you may need to be running on the main HPCx system (rather than the development system) and you may need to run it on 128 or 256 processors. To enable you to try this I will put you in the subgroup n02-bjob and you will have to rerun your model with the tic-code set in the UMUI to be n02-bjob rather than n02-ncas. The queueing system on the main HPCx system is different and so your turnaround may be affected.

To run at this resolution you may also need to change the STACK size you set in the UMUI. I don't have access permission on all your files in /hpcx/home/n02/n02/richie/umui_runs
but from the UMUI you have the stack size set to 200 Mbytes, so you may need to increase this to say 600 Mbytes. It is difficult to give precise answers as to the optimum value as it does depend on what you are running.

The other problem you may meet is that at this resolution the messages that you are sending and receiving are too large for the current buffer size of the gcom library, which controls all the UM communication and includes GCG_RVECSUMR. We have libraries with larger buffer sizes, so you could recompile your code using one of these bigger-buffered libraries. You will have to create a compile override file on HPCx with the line
@load LCOM_LIBS=-lgcom_mpi_buffered_bigggerbuff -lgrib
and then include this compile override file in the UMUI. I see that you already have such a compile override but I don't have permission to see the file and which library you are actually using.

Let me know what happens.

Lois

comment:2 Changed 12 years ago by umdoc

Hi Lois,
I am continuing to run into problems and thought I would update you, in case you can help out again.
I tried running the job on bjob and using the compile override you suggested.

The result was:

mpxlf90_r blkdata.o umshell1.o libum1.a -q64
-lgcom_mpi_buffered_bigbuff -lgrib -L,
-L/hpcx/home/n02/n02/umx/gcom/um5.5/lib -L/hpcx/home/n02/n02/umx/lib -o /hpcx/devt/n02/n02-ncas/richie/um/dataw/phd_debug.exec

ld: 0711-317 ERROR: Undefined symbol: .alog_v
ld: 0711-317 ERROR: Undefined symbol: .oneover_v
ld: 0711-317 ERROR: Undefined symbol: .powr_v
ld: 0711-317 ERROR: Undefined symbol: .rtor_v
ld: 0711-317 ERROR: Undefined symbol: .exp_v
ld: 0711-317 ERROR: Undefined symbol: .sqrt_v
ld: 0711-345 Use the -bloadmap or -bnoquiet option to obtain more information.
make: 1254-004 The error code from the last command is 8.

The output file is called:
xcpwg000.xcpwg.d07303.t095324.comp.leave
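
The undefined symbols can be pulled out of a comp.leave file like this one with a short script. This is only a minimal sketch: the regular expression is written against the `ld: 0711-317` lines quoted above, the function name is hypothetical, and the sample text is a shortened copy of that output.

```python
import re

# Matches the AIX linker lines quoted in this comment.
UNDEFINED = re.compile(r"ld: 0711-317 ERROR: Undefined symbol: (\S+)")


def undefined_symbols(log_text):
    """Return the undefined symbols in log_text, in order, without repeats."""
    seen = []
    for match in UNDEFINED.finditer(log_text):
        symbol = match.group(1)
        if symbol not in seen:
            seen.append(symbol)
    return seen


sample = """\
ld: 0711-317 ERROR: Undefined symbol: .alog_v
ld: 0711-317 ERROR: Undefined symbol: .sqrt_v
ld: 0711-345 Use the -bloadmap or -bnoquiet option to obtain more information.
"""

for symbol in undefined_symbols(sample):
    print(symbol)
```

To run it on a real file, read the file contents into a string and pass that to `undefined_symbols`.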

I then reverted to my old big-buffer script, which I have called old_big_buffer, and ran on 128 processors. This ran for 2 and a bit time steps, producing this output file:
xcpwg000.xcpwg.d07303.t100222.leave

I tried moving to 256 processors and this stopped in timestep 2, with this output file:
xcpwg000.xcpwg.d07303.t163354.leave

I am now going to combine the override commands by adding -lgrib to my original big buffer script and I will add back in my print statements to check that it is crashing in the usual place.

The override scripts are in:
$HOME/um_mods
the output files are in
$HOME/output_dir/umui_out
and the umui submission scripts are in
$HOME/umui_runs

I hope I have got the permissions right.
Thank you for all your efforts,
Richard

comment:3 Changed 12 years ago by lois

Sorry Richard, I gave you the compile override options for UM vn6.1 instead of UM vn4.5. HPCx is down this afternoon, so when it returns you will need to look at the compile_vars file for UM vn4.5, which will have the necessary load options for the libraries. These will include the vect library that provides the undefined symbols.

We will have another look at this problem when HPCx returns from routine maintenance.

Lois

comment:4 Changed 12 years ago by umdoc

Hi Lois,
I tried to submit a job this morning using the libraries identified in the compile_vars file and was told that n02-bjob was no longer a valid account number for richie. I wonder if you could re-add me to the account.
Thank you,
Richard

comment:5 Changed 12 years ago by lois

Hello Richard, I have checked and you are still in the n02-bjob group but there was a temporary lack of resources for this sub-project. We are due to start our next tranche of NERC allocation soon so it is difficult to manage the last bits of allocation and balance them between all the sub projects, so n02-bjob ran out of allocation.

Please try again and hopefully your problem will be solved. Let me know what happens.

Thanks
Lois

comment:6 Changed 12 years ago by umdoc

Hello Lois,
I tried running the job again and it is still crashing. The output files are in
~/outputdir/umui_out/
the job name is
xcpwg
xcpwg000.xcpwg.d07311.t110643.leave
xcpwg000.xcpwg.d07311.t110643.comp.leave

As far as I can see the compilation is picking up the gcom libs.

-lgcom_mpi_buffered_bigbuff

is used in the make/link command in the comp.leave file.
Does this imply I cannot run the um at the 432*325 resolution?

Thank you for your persistence,
Richard

comment:7 Changed 12 years ago by lois

Sorry for the delay in replying Richard.
The answer is probably that no one has run UM vn4.5 global at this resolution. I remember trying, and failing, to do so on Turing, so that was a few years ago! Simon Wilson says that PRECIS (which is UM vn4.5) is effectively at this resolution, although a regional version of course. To get this running they needed to retune the model, changing diffusion parameters and other parametrisation parameters.
Do you need to run the global run at this resolution?
Do you need to use UM version 4.5, as you can certainly run UM version 6.1 at this resolution?
So apologies, more questions than answers.

Lois

comment:8 Changed 12 years ago by umdoc

Hello Lois,
Don't worry about any delay, I realise your plate is probably more than full.

I am running this configuration because it was the configuration I ran on Newton
for my PhD. I have been asked to do more runs in an ensemble before resubmitting
my thesis in January.

This run is only for the LBCs so I could possibly run it at 96*73 (I am reconfiguring the start dump at the moment to try this).

Then I will need to run mesoscale limited-area runs (I used 229*132, at 0.11 deg
resolution before). It looks like this will be tough too. I can maybe cut down this
domain a bit but I need to keep the area fairly large.

I am not sure how much effort it would take to get the start dump and ancillaries for 6.1. Can it run with VN4.5 ancillaries? I have some soil moisture and vegetation ancillaries I need to use in the ensembles.

If this is the path of least resistance I guess I will have to switch versions.

Thank you again for your time,
Richard

comment:9 Changed 12 years ago by lois

No, you will not be able to use UM vn4.5 ancillaries with vn6.1, so I would stick to vn4.5 for your resubmission. Why not try N48 global runs for creating your LBCs? I don't know the effect of low-resolution LBCs on your limited-area runs. You can run UM vn4.5 at N96 or N144; it is just N216 for which we don't have running example jobs.
I also don't know the effect of having some ensemble members from Newton and some from HPCx, but there is not much we can do about this.
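
The resolution names above can be related to grid dimensions with a short sketch. It assumes the convention that an N-numbered global grid has 2N columns and 1.5N + 1 rows, which is consistent with the 96*73 and 432*325 grids mentioned in this ticket; the function name is hypothetical.

```python
def grid_dims(n):
    """Return (columns, rows) for a global grid named N<n>.

    Assumes 2N columns and 1.5N + 1 rows.
    """
    return 2 * n, (3 * n) // 2 + 1


for n in (48, 96, 144, 216):
    cols, rows = grid_dims(n)
    print(f"N{n}: {cols} x {rows}")
```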

Lois

comment:10 Changed 12 years ago by umdoc

Hello Lois,
I shall give the different resolutions a go then. I have a job for N48 but not for N96 or N144. Could you point me in the direction of some example jobs on PUMA, as a cursory glance at the NCAS UM site seems to show only N48?
Thank you,
Richard.

comment:11 Changed 12 years ago by lois

  • Resolution set to fixed
  • Status changed from new to closed

On PUMA I can find only N96 jobs that were prepared for Newton, not HPCx; these are in the experiment xbut. For N144 there are Nick Klingaman's jobs in the experiment xbxc.

I hope it all works. Can I close this query now? You can still submit further queries if you are still running into problems.

Thanks
Lois
