During Inter-Semester break, on Thursday 2 February between 09:00 and 13:00, we will be placing the cluster into maintenance mode whilst we perform failover tests between the pair of master nodes and BeeGFS node couplets.
These tests will ensure that pairs of master nodes and BeeGFS node couplets are in good working order should an unexpected system issue occur that triggers a system failover.
While the system failovers are being tested, all users will be able to access data from /home and /beegfs, but you may notice a momentary freeze while the storage areas are transferred between the failover pairs. All compute nodes will be placed into a scheduler reservation to prevent any workloads from running while these tests are carried out.
Sorry for the short notice of this announcement. I hope this will not cause too much disruption for anyone.
On Monday 11th April, the Balena HPC cluster will be unavailable due to maintenance work. Part of this work will include ClusterVision performing a full health check to ensure Balena is running at optimal performance. During this time, all jobs will be held in the queue and, for safety, you will not be able to access the cluster via SSH.
The Balena cluster will be unavailable from 07:00 on Monday 11th April and will be released back into service later that day.
On Tuesday 26th January a mandatory fire suppression test will be carried out in the same room as the Balena HPC system. We need to treat this as a potential risk of power outage to the data centre and therefore we will need to place the entire cluster into maintenance mode during this period. This means that the Balena cluster will not be available for service on the 26th January.
All users will be required to log out of Balena before 07:00 on the 26th January. You will not need to dequeue any workloads already in the cluster; a maintenance reservation has been scheduled to prevent any workloads from running during this period.
We will be taking this opportunity to perform a couple of disruptive tasks, such as a full headnode failover, to ensure the cluster is in full working order.
I shall send around a reminder about this next week. Sorry for any inconvenience this may cause.
Maintenance and upgrades will take place during our at risk* period on Tuesday 22 September 2015, 7am-9am.
The network will be undergoing maintenance. You may experience minor disruptions if you are using the Internet and any of the services below during this time. If you do, please try refreshing the page or returning after 9am. Services affected are:
- The University Website
- TeamBath website
- Authentication servers
- Registration online
- Skype for Business
Additional work is also taking place during this time:
- UniDesk is being upgraded to the latest software and will be unavailable (7am-9am)
- Ansys is undergoing a license server upgrade and will be unavailable (7am-9am)
- Visa is undergoing a license server upgrade and battery replacement (7am-9am). It will affect the following:
- Spartan Student
- The Balena HPC service will be undergoing maintenance and will be unavailable (7am-9am)
*Our 'at risk' period is between 7am and 9am on Tuesdays when we carry out scheduled maintenance, modifications and testing. This work is essential to maintain and develop the services that we provide. Thank you for your patience during the maintenance period.
Use go.bath.ac.uk/it-status to find out the current status of IT services at the University of Bath.
The Balena HPC service is now ready for use after the BeeGFS parallel file system upgrade - new features available after this upgrade include informational quota and quota enforcement.
We have successfully completed cluster wide pre-production tests to ensure that the system is stable for production use.
From 27th July the BeeGFS storage on the Balena HPC cluster will be undergoing an upgrade. We are expecting that Balena will be unavailable for, at most, the entire week while ClusterVision perform the upgrade. During this period there will be limited access to the cluster and only the /home file system will be available. You will not be able to run workloads on the cluster.
As a reminder, the BeeGFS file system is a non-archival file system and is not backed up. If there is any essential data you would like preserved, please make a copy of it before the 27th July.
On the 30th June the storage server providing /apps and /data lost power, which resulted in the Aquila system becoming unresponsive. This issue was fixed by reseating the power supply unit on the storage node, after which the storage node booted cleanly. While running a Linpack test job to confirm system functionality, two of the power units tripped due to a power overload. These power units are now about 8 years old and are probably a bit past their prime. The load on the overall system has been reduced to prevent the breakers tripping again, and 24 nodes have been powered off.
Aquila is operating with 76 CPU nodes and 2 GPU nodes. A retest of the HPL job over 76 nodes went through smoothly.
On restoring the cluster, we discovered a further issue: a memory fault on one of the storage nodes providing the parallel FhGFS system. We have diagnosed this as a problem with one of the memory slots. Reordering the memory modules in the DIMM slots has allowed the system to come back up and has restored the FhGFS storage service on Aquila.
Aquila is available and stable as of 5.30pm, 02 July 2015. Aquila is running at reduced capacity with 76 standard nodes and the two GPU nodes.
The original problem on Tuesday, 30 June 2015, was with a power supply on one of the storage servers serving /apps and /data. This issue was fixed by reseating the power supply.
Yesterday there were issues with some of the power devices: two of the power supplies are no longer capable of serving a whole bank of nodes. On delving through the logs, it was discovered that the modest load on the cluster over the last few months has approached the overload limits on the devices on numerous occasions. Running the HPL test yesterday therefore tripped the breakers on the power strips. To work round this, the number of nodes on these devices has been reduced and, as a precaution, the number of nodes on the other power devices has been reduced as well.
You may continue to use Aquila and resubmit workloads.
The Aquila cluster is still unavailable while problems are investigated by the HPC team.
Essential maintenance work will be performed on Aquila's home area storage on 7th July 2015, between 7-9am. While this work is being carried out Aquila will not be available or accessible.
All users are required to log out before 8pm on Monday 6th July in preparation for the work being carried out on Tuesday morning; any users still logged in will be logged out.
Access to Aquila will be reopened shortly after 9am on Tuesday.