HPC Aquila is available

Posted in: IT disruption

Aquila is available and stable as of 5.30pm on 02 July 2015. It is running at reduced capacity, with 76 standard nodes and the two GPU nodes.

Problem summary

The original problem, on Tuesday 30 June 2015, was with a power supply on one of the storage servers serving /apps and /data. This was fixed by reseating the power supply.

Yesterday there were issues with some of the power devices: two of the power supplies are no longer capable of serving a whole bank of nodes. The logs show that, even under the modest load on the cluster over the last few months, these devices have approached their overload limits on numerous occasions. As a result, the HPL test run yesterday tripped the breakers on the power strips. To work around this, the number of nodes on these devices has been reduced and, as a precaution, the number of nodes on the other power devices has been reduced as well.

You may continue to use Aquila and resubmit workloads.
