Most University of Bath IT planned maintenance takes place on Tuesday mornings between 7am and 9am. This blog will let you know what, if any, IT maintenance will be taking place each Tuesday morning.
Tuesday 9 June 2015 at risk* period - we will not be carrying out maintenance and upgrades, so services will run as normal.
IT services status
Find our blog quickly using go.bath.ac.uk/it-status
To avoid confusion we have turned comments off on this blog, as we want you to get help with any IT queries as quickly as possible. If you are having any problems with your IT, contact us online, over the phone or face to face.
- Online: use our IT Help Form
- Over the phone: 01225 383434 or internally at extension 3535 - (Monday to Friday: 9am to 5pm)
- Face to face: visit us at the service desk on Level 2 of the Library (Monday to Friday: 9am (Wednesdays 10am) to 5pm)
*Our 'at risk' period is between 7am and 9am on Tuesdays when we carry out scheduled maintenance, modifications and testing. This work is essential to maintain and develop the services that we provide. Thank you for your patience during the maintenance period.
Aquila is down for maintenance due to a master node failure and issues with fail-over, which are consequently affecting job execution.
Service will resume by Friday (Nov 21) morning.
The Aquila HPC cluster will be unavailable from 16:30 on Monday 9th June until the morning of 10th June, while essential maintenance is carried out on the /home area.
The storage server which provides the /home area for the Aquila HPC has been experiencing some technical issues, and as a result the /home area will suffer degraded performance for the time being.
To rectify the issues the storage server experienced earlier, the /home area will be unavailable tomorrow morning, 19th Sept 2013, from 8am. To clarify, this will only affect the storage servers which provide the /home area. The Aquila HPC facility will still be available during this period; however, you may experience issues when trying to log in or access files/data under /home.
The scheduler has been paused and will remain paused until the work has been completed, so no new jobs will begin running on the compute nodes; however, you can still submit workload to the scheduler.
Because of the interruption to the /home area, currently running jobs that relied upon it may have lost data. If you have a job running, I urge you to check its outputs to make sure that the job's I/O resumed cleanly when the /home filesystem was restored.
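As a rough illustration of the kind of check meant here (the job name, ID, and checkpoint lines below are made up; the only real convention assumed is Torque's `<jobname>.o<jobid>` output file naming):

```shell
# Hypothetical example -- Torque writes a job's stdout to <jobname>.o<jobid>.
# We simulate such a file here; on Aquila you would inspect your real one.
printf 'checkpoint 1 written\ncheckpoint 2 written\n' > myjob.o12345

# Inspect the end of the output for an untruncated final record
tail -n 2 myjob.o12345

# Confirm the last expected checkpoint made it to disk after /home returned
grep -q 'checkpoint 2 written' myjob.o12345 && echo "output looks complete"
```

If the last line of the output is truncated mid-record, or an expected checkpoint never appeared, the job probably lost data during the interruption and should be resubmitted.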
Adaptive Computing have spent three 11-hour days assisting me with the installation and configuration of the new scheduling system on the Aquila HPC facility. During these days we have made a lot of progress and resolved several unforeseen issues. However there is some additional work remaining which largely involves systematically testing different queuing scenarios to ensure that the configuration is stable and is correctly scheduling the jobs.
Below is a summary of the work which has been accomplished over the last three days.
- Moab v7.1.1 and Torque v4.1.1 have been installed on a dedicated server
- the Aquila headnodes have been transformed into login/submit nodes
- Torque v4.1.1 has been updated on all of the nodes, which have been pointed to the new Torque server
- openMPI v1.6.1 has been recompiled to support the new version of Torque
- a testing environment has been created to test the Moab configuration
- the workflow of the Moab configuration is largely complete
The majority of the Moab configuration file reflects the outline of the scheduler I sent out in a previous email; we still have some of the more trivial elements to implement. Most of the remaining work involves rigorously and systematically testing the various elements of the Moab config file, plus some tuning; this will cover:
- job preemption for development and course users
- jobs which require license features
- priority ordering
- fairshare priority weighting
- access control hacking
- correct allocation of resources for types of jobs
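For readers unfamiliar with Moab, a sketch of how some of these elements appear in a moab.cfg may help. This is purely illustrative: the directive names are standard Moab configuration syntax, but the QoS names, weights, and license resource are assumptions, not Aquila's actual settings.

```
# Hypothetical moab.cfg fragment -- values are illustrative, not Aquila's.

# Priority ordering and fairshare priority weighting
FSPOLICY           DEDICATEDPS     # charge fairshare by dedicated proc-seconds
FSDEPTH            7               # number of fairshare windows retained
FSINTERVAL         24:00:00        # length of each fairshare window
FSWEIGHT           100             # weight of fairshare in job priority
QUEUETIMEWEIGHT    10              # weight of time spent queued

# Preemption for development and course users (assumed QoS names)
PREEMPTPOLICY      REQUEUE
QOSCFG[dev]        QFLAGS=PREEMPTOR
QOSCFG[course]     QFLAGS=PREEMPTOR
QOSCFG[batch]      QFLAGS=PREEMPTEE

# Jobs which require license features (assumed generic resource name)
NODECFG[GLOBAL]    GRES=matlab:4
```

Each of these directives interacts with the others (for example, fairshare usage feeds into the same priority calculation as queue time), which is why the testing has to cover the queuing scenarios systematically rather than one directive at a time.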
This testing will be performed over the next few working days, and during this time I will have direct access to the support team and senior consultants from Adaptive Computing to assist me with any complications which arise. We anticipate that the work on the queue scheduler will be complete on Wednesday 10th October. Over the next few days I will be updating the HPC wiki pages with notes on how to use the new queuing system.
During this work a couple of the nodes have started to report issues. We currently have four nodes in a non-operational state. I'll be investigating these at a later date.
We thank you for your patience and are sorry for any inconvenience this extended maintenance and upgrade has caused.