During Inter-Semester break, on Thursday 2 February between 09:00 and 13:00, we will be placing the cluster into maintenance mode whilst we perform failover tests between the pair of master nodes and BeeGFS node couplets.
These tests will ensure that pairs of master nodes and BeeGFS node couplets are in good working order should an unexpected system issue occur that triggers a system failover.
While the system failovers are being tested, all users will be able to access data from /home and /beegfs, but you may notice a momentary freeze while the storage areas are transferred between the failover pairs. All compute nodes will be placed into a scheduler reservation to prevent any workloads from running while these tests are carried out.
Sorry for the short notice of this announcement; I hope this will not cause too much disruption for anyone.
The maintenance work will begin on Monday 8th August and is expected to take up to a week to complete. During this maintenance window there will be no access to the Balena system and all queued jobs will need to be cleared from the scheduler.
The majority of this work will be performed by ClusterVision. We are anticipating needing a full week to give ClusterVision and ourselves enough time to complete these maintenance tasks. We shall open up access once all disruptive tasks have been completed.
Below is a list of some of the maintenance work which will be taking place:
- Upgrading the SLURM scheduler, security patching and enabling new features
- Testing SLURM's node power management
- Enabling global file locking on the BeeGFS scratch partition
- ClusterVision will also be configuring new system monitoring tools
On Monday 11th April, the Balena HPC cluster will be unavailable due to maintenance work. Part of this work will include ClusterVision performing a full health check to ensure Balena is running at optimal performance. During this time, all jobs will be held in the queue and, for safety, you will not be able to access the cluster via SSH.
The Balena cluster will be unavailable from 07:00 on Monday 11th April and will be released back into service later that day.
On Tuesday 26th January a mandatory fire suppression test will be carried out in the same room as the Balena HPC system. We need to treat this as a potential risk of power outage to the data centre and therefore we will need to place the entire cluster into maintenance mode during this period. This means that the Balena cluster will not be available for service on the 26th January.
We will require all users to log out of Balena before 07:00 on the 26th January. You will not need to dequeue any workloads already in the cluster; a maintenance reservation has been scheduled to prevent any workloads from running during this period.
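As a rough illustration of why queued workloads can be left in place: a scheduler such as SLURM will generally not start a job whose requested time limit would run into a maintenance reservation. The sketch below models that check in Python. The function name, the year and the example time limits are hypothetical and are not taken from the Balena configuration.

```python
from datetime import datetime, timedelta

# Assumed start of the maintenance reservation: 07:00 on 26th January.
# The year is an assumption made purely for illustration.
RESERVATION_START = datetime(2016, 1, 26, 7, 0)


def can_start_before_reservation(now, time_limit,
                                 reservation_start=RESERVATION_START):
    """Return True if a job started at `now` with the requested wall-time
    `time_limit` would finish no later than the reservation start."""
    return now + time_limit <= reservation_start


# A 6-hour job considered at 22:00 the evening before fits in the gap.
print(can_start_before_reservation(datetime(2016, 1, 25, 22, 0),
                                   timedelta(hours=6)))
# A 12-hour job considered at the same time would overlap the reservation,
# so in this model it simply stays queued until the reservation ends.
print(can_start_before_reservation(datetime(2016, 1, 25, 22, 0),
                                   timedelta(hours=12)))
```

In this simplified model a job that does not fit before the reservation is not cancelled; it just waits in the queue, which is why no dequeuing is needed.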
We will be taking this opportunity to perform a couple of disruptive tasks, such as a full headnode failover, to ensure the cluster is in full working order.
I shall send around a reminder about this next week. Sorry for any inconvenience this may cause.
To update everyone on the condition of the /data storage area: I have some good news!
We received the new power supply from ClusterVision for the storage server. After fitting the new supply we discovered that the mainboard had also suffered a power issue. However, with a bit of tinkering we have been able to use a second server to host the PCI RAID card, and by using the new power supply in the storage server to power the disk array (see images below), we have successfully brought back the /data storage area.
We are currently copying this data area (/data) and all other data areas on Aquila (/home and /fhgfs/data) over to a more stable storage system. Once all of the data has been transferred I will provide the details of the new location and will then turn off the Aquila head nodes and remaining storage arrays.
We are looking at keeping the Aquila data on this new system until 31st January 2016; after this date we will be wiping all Aquila storage areas.
If you require access to data on the Aquila system belonging to students or researchers in your group who have left the University, please let me know so that we can arrange access.
The compute nodes on the Aquila HPC system will be powered down on 30th September 2015; the process will start at midday. As well as powering down the compute nodes, we will also be disabling the scheduler system and turning off the InfiniBand switch.
With regard to the storage of files and data on Aquila, there is currently 15 TB of used space in the /home area and 13 TB used under the /fhgfs area. We would like to remind everyone to clean up files and data which are no longer needed on Aquila, and to transfer off Aquila any data you would like to preserve.
We are still waiting to receive the replacement power supply from ClusterVision, which will allow us to restore the /data storage area. Once we have been able to attempt the recovery of this storage area we will then make plans to turn off all the Aquila storage areas.
The Balena HPC service is now ready for use after the BeeGFS parallel file system upgrade - new features available after this upgrade include informational quota and quota enforcement.
We have successfully completed cluster-wide pre-production tests to ensure that the system is stable for production use.
From 27th July the BeeGFS storage on the Balena HPC cluster will be undergoing an upgrade. We are expecting that Balena will be unavailable for, at most, the entire week while ClusterVision perform the upgrade. During this period there will be limited access to the cluster and only the /home file system will be available. You will also not be able to run workloads on the cluster.
As a reminder, the BeeGFS file system is a non-archival file system and is not backed up. If there is any essential data you would like preserved, please make a copy of it before the 27th July.
On the 30th June the storage server providing /apps and /data lost power, which resulted in the Aquila system becoming unresponsive. This issue was fixed by reseating the power supply unit on the storage node, after which the storage node booted cleanly. While running a Linpack test job to confirm system functionality, two of the power units tripped due to a power overload. These power units are now about 8 years old and are probably a bit past their prime. The load on the overall system has been reduced to prevent the breakers tripping again, and 24 nodes have been powered off.
Aquila is now operating with 76 CPU nodes and 2 GPU nodes. A retest of the HPL job over 76 nodes went through smoothly.
On restoring the cluster, we discovered a further issue, this time a memory issue on one of the storage nodes providing the parallel FhGFS system. We have diagnosed this as a fault with one of the DIMM slots. Reordering the memory modules in the DIMM slots has allowed the system to come back up and has restored the FhGFS storage service on Aquila.
Essential maintenance work will be performed on Aquila's home area storage on 7th July 2015, between 7-9am. While this work is being carried out Aquila will not be available or accessible.
All users are required to log out before 8pm on Monday 6th July in preparation for the work being carried out on Tuesday morning; any users still logged in will be logged out.
Access to Aquila will be reopened shortly after 9am on Tuesday.