IT services status

3rd July 2015

Aquila system outage 30th June - 2nd July 2015

Posted in: HPC

APC, Aquila, FhGFS, High Performance Computing, HPC, maintenance, memory fault, outage, power overload, storage, unplanned

On 30th June, the storage server providing /apps and /data lost power, which left the Aquila system unresponsive. We fixed this by reseating the power supply unit on the storage node, after which the node booted cleanly. While we were running a Linpack test job to confirm system functionality, two of the power units tripped due to a power overload. These power units are now about 8 years old and are probably past their prime. To prevent the breakers tripping again, we have reduced the load on the overall system by powering off 24 nodes.

Aquila is now operating with 76 CPU nodes and 2 GPU nodes. A retest of the HPL job across the 76 nodes completed without problems.

On restoring the cluster, we discovered a further issue: a memory fault on one of the storage nodes providing the parallel FhGFS file system. We diagnosed this as a fault in one of the DIMM slots. Reordering the memory modules in the DIMM slots allowed the node to come back up, and the FhGFS storage service on Aquila has now been restored.

Team HPC

Steven Chapman
27th January 2017

Balena maintenance - 2nd February 2017

During Inter-Semester break, on Thursday 2 February between 09:00 and 13:00, we will be placing the cluster into maintenance mode whilst we perform failover tests between the pair of master nodes and BeeGFS node couplets. These tests will ensure that...
Steven Chapman
25th July 2016

Balena Maintenance - 8th to 12th August 2016

The maintenance work will begin on Monday 8th August and is expected to take up to a week to complete. During this maintenance window there will be no access to the Balena system and all queued jobs will need to...
Steven Chapman
30th March 2016

Balena HPC maintenance 11th April 2016

On Monday 11th April, the Balena HPC cluster will be unavailable due to maintenance work. Part of this work will include ClusterVision performing a full health check to ensure Balena is running at optimal performance. During this time, all jobs...
