During the Inter-Semester break, on Thursday 2 February between 09:00 and 13:00, we will be placing the cluster into maintenance mode whilst we perform failover tests between the pair of master nodes and the BeeGFS node couplets.
These tests will ensure that pairs of master nodes and BeeGFS node couplets are in good working order should an unexpected system issue occur that triggers a system failover.
While the system failovers are being tested, all users will be able to access data from /home and /beegfs, but you may notice a momentary freeze while the storage areas are transferred between the failover pairs. All compute nodes will be placed into a scheduler reservation to prevent any workloads from running while these tests are carried out.
Sorry for the short notice of this announcement. I hope this will not cause too much disruption for anyone.
The maintenance work will begin on Monday 8th August and is expected to take up to a week to complete. During this maintenance window there will be no access to the Balena system and all queued jobs will need to be cleared from the scheduler.
The majority of this work will be performed by ClusterVision. We are anticipating needing a full week to give ClusterVision and ourselves enough time to complete these maintenance tasks. We shall open up access once all disruptive tasks have been completed.
Below is a list of some of the maintenance work which will be taking place:
- Upgrading the SLURM scheduler, security patching and enabling new features
- Testing SLURM's node power management
- Enabling global file locking on the BeeGFS scratch partition
- ClusterVision will also be configuring new system monitoring tools
On Monday 11th April, the Balena HPC cluster will be unavailable due to maintenance work. Part of this work will include ClusterVision performing a full health check to ensure Balena is running at optimal performance. During this time, all jobs will be held in the queue and, for safety, you will not be able to access the cluster via SSH.
The Balena cluster will be unavailable from 07:00 on Monday 11th April and will be released back into service later that day.
On Tuesday 26th January a mandatory fire suppression test will be carried out in the same room as the Balena HPC system. We need to treat this as a potential risk of power outage to the data centre, and we will therefore place the entire cluster into maintenance mode during this period. This means that the Balena cluster will not be available for service on the 26th January.
We will require all users to log out of Balena before 07:00 on the 26th January. You will not need to dequeue any workloads already in the cluster, as a maintenance reservation has been scheduled to prevent any workloads from running during this period.
We will be taking this opportunity to perform a couple of disruptive tasks, such as a full headnode failover, to ensure the cluster is in full working order.
I shall send around a reminder about this next week. Sorry for any inconvenience this may cause.
Maintenance and upgrades will be taking place during our 'at risk'* period on Tuesday 22 September 2015, 7am-9am.
The network will be undergoing maintenance. You may experience minor disruptions if you are using the Internet and any of the services below during this time. If you do, please try refreshing the page or returning after 9am. Services affected are:
- The University Website
- TeamBath website
- Authentication servers
- Registration online
- Skype for Business
Additional work is also taking place during this time:
- UniDesk is being upgraded to the latest software and will be unavailable (7am-9am)
- Ansys is undergoing a license server upgrade and will be unavailable (7am-9am)
- Visa is undergoing a license server upgrade and battery replacement (7am-9am). It will affect the following:
- Spartan Student
- The Balena HPC service will be undergoing maintenance and will be unavailable (7am-9am)
*Our 'at risk' period is between 7am and 9am on Tuesdays when we carry out scheduled maintenance, modifications and testing. This work is essential to maintain and develop the services that we provide. Thank you for your patience during the maintenance period.
Use go.bath.ac.uk/it-status to find out the current status of IT services at the University of Bath
The Balena HPC service is now ready for use after the BeeGFS parallel file system upgrade - new features available after this upgrade include informational quota and quota enforcement.
We have successfully completed cluster wide pre-production tests to ensure that the system is stable for production use.
From 27th July the BeeGFS storage on the Balena HPC cluster will be undergoing an upgrade. We expect Balena to be unavailable for, at most, the entire week while ClusterVision perform the upgrade. During this period there will be limited access to the cluster: only the /home file system will be available, and you will not be able to run workloads.
As a reminder, the BeeGFS file system is a non-archival file system and is not backed up. If there is any essential data you would like preserved, please make a copy of it before the 27th July.
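One way to make such a copy is sketched below. This is only an illustration: the `backup_dir` helper and the example paths are hypothetical and not part of the Balena setup, so substitute your own directories and check that the destination has enough free space.

```shell
# Sketch: copy a directory tree to another location before the upgrade.
# backup_dir is a hypothetical helper; the paths shown further down are
# placeholders, not confirmed Balena paths.

backup_dir() {
    src=$1
    dest=$2
    if [ ! -d "$src" ]; then
        echo "source directory not found: $src" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    # -a preserves permissions and timestamps; the trailing "/." copies the
    # contents of $src (including dotfiles) into $dest.
    cp -a "$src/." "$dest/"
}

# Example call (hypothetical paths):
# backup_dir "/beegfs/path/to/your/results" "$HOME/results-backup"
```

For large data sets, a tool such as `rsync` or a transfer to your own machine may be a better fit than a plain copy.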
To present a better categorisation of the software packages available on Balena and improve the user experience, the software packages available through the module command have been re-organised into the following categories:
- Compilers and Languages
Users can now use the module avail command to see the organised list of software.
We have also introduced a module named untested which, when loaded, will list all the new software packages currently undergoing user testing. Once approved for production, the software will be moved into one of the categories listed above.
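As a rough sketch, the commands described above could be run as follows. The `show_untested` wrapper is a hypothetical name used only for illustration, and the guard assumes the `module` command is defined in your login shell on the cluster; the output you see will depend on the software currently installed.

```shell
# Sketch: browse the categorised module list, then reveal packages under test.
# show_untested is a hypothetical wrapper, not a command provided on Balena.

show_untested() {
    # The module command is normally only defined in a cluster login shell.
    if ! type module >/dev/null 2>&1; then
        echo "module command not available in this shell" >&2
        return 1
    fi
    module avail            # list the software, grouped by category
    module load untested    # per the announcement, exposes packages under user testing
    module avail            # list again, now including the untested packages
}
```

On the cluster you can equally type the three `module` commands directly at the prompt; the wrapper only adds a check that the command exists.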