Opened 12 years ago

Closed 12 years ago

#119 closed help (fixed)

slow turnround for HadCM3 on HPCx

Reported by: jonathan Owned by: lois
Component: UM Model Keywords:
Cc: Platform:
UM Version: 4.5

Description

Dear helpdesk

I am trying to run a HadCM3 job for the first time on HPCx. In 12 h on 16 PEs it can run 5 years. My job (for the first 5 years) has been queued in parn16_12 since the 13th (six days). I had the impression that turnround was better than this for HadCM3 and even HadGEM1. Am I using the wrong queue? It's not good to have to wait a week for each 5-year job! Perhaps I should submit it in very small bits in the 1-h queue instead?

Thanks for your help

Jonathan

Change History (5)

comment:1 Changed 12 years ago by lois

  • Owner changed from um_support to lois
  • Status changed from new to assigned

The queues are particularly long on the HPCx development system at the moment.

For runs that can run efficiently on 32 processors, which is the case for HadCM3, you could move to the main HPCx system and run your job on any of the following queues:

  • 32 processor 48 hours
  • 32 processor 12 hours
  • 32 processor 6 hours
  • 32 processor 3 hours
  • 32 processor 1 hour

To run on the HPCx main system you will need to be in the n02-bjob group (I can add you to the group) and you will need to change the tic code in the UMUI from n02-ncas to n02-bjob.
You can continue to use $DEVTDIR for output provided you have the sticky bit set on all levels of the directories you are using. Check using ls -al; the permissions should look like this: drwsr-s---
If you use the HPCx main system, your priority is determined by all users of HPCx (EPSRC, NERC, etc.) and by how many large jobs are in the queue, so the priority system differs from that of the NERC development system.
I will have to monitor the CPU allocation for n02-bjob (a task I do anyway), as usage of the main system is charged, whereas the development system is 'free at the point of usage'.
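
The directory-permission check mentioned above can be tedious to do by hand with ls -al when the output tree has many levels. The following is only a minimal sketch, not part of the HPCx or UM tooling: it assumes that the two 's' characters in drwsr-s--- (the setuid and setgid bits) are what matters, and the function names has_s_bits and check_tree are hypothetical.

```python
import os
import stat


def has_s_bits(mode):
    """True if both the setuid and setgid bits are set in a st_mode value.

    Assumption: these two bits correspond to the 's' characters in the
    drwsr-s--- pattern quoted in the reply.
    """
    return bool(mode & stat.S_ISUID) and bool(mode & stat.S_ISGID)


def check_tree(top):
    """Return (path, mode string, ok) for top and every directory below it."""
    results = []
    for dirpath, _dirnames, _filenames in os.walk(top):
        mode = os.stat(dirpath).st_mode
        # stat.filemode renders the mode the same way ls -l does.
        results.append((dirpath, stat.filemode(mode), has_s_bits(mode)))
    return results
```

Calling check_tree(os.environ["DEVTDIR"]) and printing the entries whose last element is False would list the directories that still need attention.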

An alternative solution would be to move to HECToR, where UM vn4.5 is installed. My only reluctance in offering this to everyone is that there is still a problem with disk space management that Cray and HECToR are urgently working on. It doesn't stop you using the system, but it could be an administrative nightmare when disk allocations are finally imposed. Hopefully, after Easter, we will have news of when a full service can be implemented.

Let me know which solution you would like to try to improve throughput.

Lois

comment:2 Changed 12 years ago by jonathan

Dear Lois

Thanks for your helpful reply. I think I would like to use the HPCx main system for the moment, please, as I don't want to cause you administrative nightmares on HECToR. I am not planning at present to run a lot of HadCM3; 10 years initially, and I might do a century or two later. I do apparently have the sticky bit set on my DEVTDIR (I also have global read-permission). I probably will leave my current 10-year job in the development queue because I will be on holiday next week and I hope it might manage to complete anyway before I return!

Best wishes

Jonathan

comment:3 Changed 12 years ago by lois

Hello Jonathan,

You should now be in the n02-bjob group, so after Easter please use the HPCx main system and let me know if this improves your throughput. Other users may wish to try this solution.

Lois

comment:4 Changed 12 years ago by jonathan

HPCx appears to be disfavouring me in parn16_12 (I left my job there, as discussed), as jobs submitted later by other people are running while mine still isn't (after a week):

l1f401.316317.8          carslaw     3/11 14:30 R  50  parn16_12    l6f410     
l1f401.316353.6          carslaw     3/11 14:58 R  50  parn16_12    l2f410     
l1f401.317276.5          eldvs       3/13 08:23 R  50  parn16_12    l7f409     
l1f402.318402.5          eldvs       3/13 08:23 R  50  parn16_12    l6f409     
l1f402.319626.0          emoliv      3/16 06:52 R  50  parn16_12    l5f409     
l1f401.318606.0          elsd        3/16 17:28 R  50  parn16_12    l8f409     
l1f401.318625.0          empaul      3/16 18:18 R  50  parn16_12    l3f410     
l1f402.319751.0          empaul      3/16 18:18 R  50  parn16_12    l7f410     
l1f401.318865.0          eldvs       3/17 08:58 R  50  parn16_12    l1f410     
l1f401.318874.0          carslaw     3/17 09:34 I  50  parn16_12               
l1f402.320051.0          aosprey     3/17 10:49 I  50  parn16_12               
l1f401.319000.0          earjme      3/17 12:13 I  50  parn16_12               
l1f402.320126.0          earjme      3/17 12:13 I  50  parn16_12               
l1f401.319001.0          earjme      3/17 12:13 I  50  parn16_12               
l1f402.320873.0          earmgf08    3/18 15:21 I  50  parn16_12               
l1f401.319880.0          earjme      3/18 20:34 I  50  parn16_12               
l1f402.318594.0          gregory     3/13 13:11 S  50  parn16_12     

I wonder what status "S" means; only my job has it.
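
A listing like the one above can be tallied by status, and checked for waiting jobs that were submitted before jobs that are already running, with a short script. This is only a sketch: the column meanings (job id, owner, submit date, submit time, status, priority, queue, host) are assumed from the layout of the listing, the function names are hypothetical, and all dates are assumed to fall in the same year. The sample rows are copied from the listing.

```python
from collections import Counter

# Three rows copied from the listing above.
SAMPLE = """\
l1f401.318865.0          eldvs       3/17 08:58 R  50  parn16_12    l1f410
l1f401.318874.0          carslaw     3/17 09:34 I  50  parn16_12
l1f402.318594.0          gregory     3/13 13:11 S  50  parn16_12
"""


def parse_listing(text):
    """Split each row into named fields (column meanings are assumed)."""
    jobs = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 7:
            continue  # skip blank or incomplete rows
        month, day = (int(part) for part in fields[2].split("/"))
        hour, minute = (int(part) for part in fields[3].split(":"))
        jobs.append({
            "id": fields[0],
            "owner": fields[1],
            "submitted": (month, day, hour, minute),
            "status": fields[4],
            "queue": fields[6],
        })
    return jobs


def status_counts(jobs):
    """Count jobs per status code."""
    return Counter(job["status"] for job in jobs)


def overtaken(jobs):
    """Jobs not running that were submitted before the latest running job."""
    running = [job["submitted"] for job in jobs if job["status"] == "R"]
    if not running:
        return []
    latest = max(running)
    return [job for job in jobs
            if job["status"] != "R" and job["submitted"] < latest]
```

For the three sample rows, overtaken(parse_listing(SAMPLE)) contains only the job owned by gregory, which was submitted on 3/13 but is not running.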

Jonathan

comment:5 Changed 12 years ago by lois

  • Resolution set to fixed
  • Status changed from assigned to closed