During the inter-semester break, on Thursday 2 February between 09:00 and 13:00, we will be placing the cluster into maintenance mode whilst we perform failover tests between the pair of master nodes and between the BeeGFS node couplets.
These tests will confirm that the master node pair and the BeeGFS node couplets are in good working order should an unexpected system issue trigger a failover.
While the system failovers are being tested, all users will be able to access data from /home and /beegfs, but you may notice a momentary freeze while the storage areas are transferred between the failover pairs. All compute nodes will be placed into a scheduler reservation to prevent any workloads from running while these tests are carried out.
Sorry for the short notice of this announcement. I hope this will not cause too much disruption for anyone.
The maintenance work will begin on Monday 8th August and is expected to take up to a week to complete. During this maintenance window there will be no access to the Balena system and all queued jobs will need to be cleared from the scheduler.
The majority of this work will be performed by ClusterVision. We anticipate needing the full week to give ClusterVision and ourselves enough time to complete these maintenance tasks. We shall open up access once all disruptive tasks have been completed.
Below is a list of some of the maintenance work which will be taking place:
- Upgrading the SLURM scheduler, security patching and enabling new features
- Testing SLURM's node power management
- Enabling global file locking on the BeeGFS scratch partition
- Configuring new system monitoring tools (to be carried out by ClusterVision)
The Balena HPC service is now ready for use after the BeeGFS parallel file system upgrade. New features available after this upgrade include informational quota and quota enforcement.
We have successfully completed cluster-wide pre-production tests to ensure that the system is stable for production use.
From 27th July the BeeGFS storage on the Balena HPC cluster will be undergoing an upgrade. We are expecting that Balena will be unavailable for, at most, the entire week while ClusterVision perform the upgrade. During this period there will be limited access to the cluster: only the /home file system will be available, and you will not be able to run workloads on the cluster.
As a reminder, the BeeGFS file system is a non-archival file system and is not backed up. If there is any essential data you would like preserved, please make a copy of it before 27th July.