Scheduler maintenance – update – 9/10/2012 | Advancing Research Computing

Adaptive Computing have spent three 11-hour days helping me install and configure the new scheduling system on the Aquila HPC facility. During these days we made a lot of progress and resolved several unforeseen issues. However, some work remains. It mainly involves systematically testing different queuing scenarios to ensure that the configuration is stable and schedules jobs correctly.

Below is a summary of the work which has been accomplished over the last three days.

Moab v7.1.1 and Torque v4.1.1 have been installed on a dedicated server
the Aquila headnodes have been transformed into login/submit nodes
Torque v4.1.1 has been updated on all of the nodes, which have been pointed to the new Torque server
Open MPI v1.6.1 has been recompiled to support the new version of Torque
a testing environment has been created to test the Moab configuration
the Moab configuration workflow is largely complete

Most of the Moab configuration file follows the scheduler outline I sent out in a previous email, although a few of the more trivial elements still need to be implemented. Most of the remaining work involves rigorously and systematically testing the various elements of the Moab configuration file, along with some tuning. This will cover:

job preemption for development and course users
jobs which require license features
priority ordering
fairshare priority weighting
access control hacking
correct allocation of resources for types of jobs

This testing will be performed over the next few working days. During this time I will have direct access to the support team and senior consultants at Adaptive Computing, who will help with any complications that arise. We anticipate that this work on the queue scheduler will be complete on Wednesday 10th October. Over the next few days I will also be updating the HPC wiki pages with notes on how to use the new queuing system.

During this work a few of the nodes have started to report issues, and four nodes are currently non-operational. I'll investigate these at a later date.

We thank you for your patience and are sorry for any inconvenience this extended maintenance and upgrade has caused.

Posted in: Advancing Research Computing, High Performance Computing (HPC)

Aquila, High Performance Computing, HPC, maintenance, moab, scheduler
