Scheduler maintenance -- update - 9/10/2012

Posted in: HPC

Adaptive Computing have spent three 11-hour days assisting me with the installation and configuration of the new scheduling system on the Aquila HPC facility. During these days we have made a lot of progress and resolved several unforeseen issues. However, there is some additional work remaining, which largely involves systematically testing different queuing scenarios to ensure that the configuration is stable and is scheduling jobs correctly.

Below is a summary of the work which has been accomplished over the last three days.

  • Moab v7.1.1 and Torque v4.1.1 have been installed on a dedicated server
  • the Aquila headnodes have been converted into login/submit nodes
  • Torque v4.1.1 has been updated on all of the nodes, which now point to the new Torque server
  • OpenMPI v1.6.1 has been recompiled to support the new version of Torque (see the example after this list)
  • a testing environment has been created to test the Moab configuration
  • the Moab configuration workflow is largely complete
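For those interested in the details, rebuilding OpenMPI against a new Torque installation is normally a matter of pointing its configure script at the Torque libraries (the TM interface). The commands below are a minimal sketch only; the install prefix and Torque path are assumptions and will differ on Aquila.

    # Rebuild OpenMPI 1.6.1 with Torque (TM) support.
    # Paths are illustrative, not the actual Aquila locations.
    ./configure --prefix=/opt/openmpi/1.6.1 --with-tm=/opt/torque/4.1.1
    make
    make install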

The majority of the Moab configuration file reflects the outline of the scheduler I sent out in a previous email; we still have some of the more trivial elements to implement. Most of the remaining work involves rigorously and systematically testing the various elements of the Moab config file, along with some tuning; this will cover:

  • job preemption for development and course users (see the configuration sketch after this list)
  • jobs which require license features
  • priority ordering
  • fairshare priority weighting
  • access control checks
  • correct allocation of resources for the different job types
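To give a flavour of what this testing involves, the fragment below sketches the kind of Moab configuration entries that drive preemption and fairshare priority weighting. The parameter names are standard Moab options, but the QOS names and values here are illustrative assumptions rather than the actual Aquila settings.

    # Illustrative moab.cfg fragment; values and QOS names are assumptions.
    PREEMPTPOLICY        REQUEUE            # how preempted jobs are handled
    QOSCFG[development]  QFLAGS=PREEMPTOR   # development/course jobs may preempt
    QOSCFG[standard]     QFLAGS=PREEMPTEE   # ordinary jobs may be preempted

    FSPOLICY             DEDICATEDPS        # fairshare measured in dedicated proc-seconds
    FSDEPTH              7                  # number of fairshare windows kept
    FSINTERVAL           24:00:00           # length of each window
    FSDECAY              0.80               # weighting decay for older windows

    QUEUETIMEWEIGHT      10                 # components of the priority ordering
    FSWEIGHT             100
    FSUSERWEIGHT         10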

This testing will be performed over the next few working days, and during this time I will have direct access to the support team and senior consultants from Adaptive Computing to assist me with any complications which arise. We anticipate that this work on the queue scheduler will be complete on Wednesday 10th October. Over the next few days I will be updating the HPC wiki pages with notes on how to use the new queuing system.
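Until those wiki pages are ready, the script below is a minimal sketch of a Torque job submission of the sort the system handles; the queue name, resource requests and program name are illustrative assumptions, not the final Aquila settings.

    #!/bin/bash
    # Minimal example Torque job script; queue and resources are illustrative.
    #PBS -N example_job
    #PBS -q standard
    #PBS -l nodes=1:ppn=8
    #PBS -l walltime=01:00:00

    cd $PBS_O_WORKDIR
    mpirun ./my_program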

During this work a couple of the nodes have started to report issues. We currently have four nodes in a non-operational state. I'll be investigating these at a later date.

We thank you for your patience and are sorry for any inconvenience this extended maintenance and upgrade has caused.
