Mid-life upgrade for the University's HPC service

Posted in: High Performance Computing (HPC), Research

Earlier this year the University allocated funding, awarded by HEFCE, towards supporting teaching activities on the HPC service. Since then, the team have been busy working with our suppliers to design a mid-life enhancement to the Balena HPC service. The new hardware will consist of:

  1. 17 Intel Skylake nodes, with Xeon Gold 6126 processors
  2. 7 new NVIDIA P100 GPU cards
  3. 2 new NVMe SSD cards
  4. an additional 380TB of storage for BeeGFS scratch
  5. a dedicated, backed-up BeeGFS parallel filesystem for undergraduate and Masters students

Installation and commissioning of the new hardware will be carried out later this month, with the Skylake nodes expected to be generally available in October.

 

Further details of the new hardware

1) Intel Skylake nodes
We will be introducing 17 new nodes based on the latest Intel Skylake Xeon Gold 6126 processor. 16 nodes (384 cores) will be available via the batch service and 1 node via the ITD service for compiling software and development. Each node will have two 2.6GHz 12-core sockets with 8GB of RAM per core, for a total of 24 cores and 192GB of RAM per node. The Skylake processors have 6 memory channels, 50% more than our current IvyBridge nodes, and also support AVX-512 instructions (512-bit SIMD). The nodes will be connected to the existing TrueScale InfiniBand fabric (40Gbit/s), which allows MPI communication between Skylake nodes and access to the shared BeeGFS storage areas.
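To take advantage of AVX-512 on the new nodes, codes will generally need to be recompiled with the appropriate target flags. The sketch below shows the standard GCC and Intel compiler flags for the Skylake-AVX512 target; the `module load` line and module names are illustrative only, not Balena-specific.

```shell
# Illustrative recompile for the Skylake nodes (module names are assumptions):
module load gcc

# GCC: -march=skylake-avx512 enables AVX-512 code generation for these CPUs.
gcc -O3 -march=skylake-avx512 -o mycode mycode.c

# Intel compilers use -xCORE-AVX512 for the same target:
# icc -O3 -xCORE-AVX512 -o mycode mycode.c
```

Note that binaries built with these flags will not run on the older IvyBridge nodes, which lack AVX-512 support, so it may be worth keeping separate builds for each node type.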

2) NVIDIA P100 GPU cards
Seven new NVIDIA P100 16GB PCIe Gen3 GPU accelerator cards will be installed in the IvyBridge nodes. To accommodate the new P100 cards we will be shuffling around the nodes which currently have single K20x GPU cards. The final configuration will have a mixture of 1 GPU/node (P100 & K20x), 2 GPUs/node (K20x only) and 4 GPUs/node (P100 & K20x); we will be keeping the four ASUS nodes which currently have 4 K20x GPUs each. Slurm will be updated to understand the different cards, so you will be able to select the type of card in your job submission files.
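Slurm's standard way of exposing different card models is typed GRES, where a job requests `gpu:<type>:<count>`. A submission might look like the sketch below; the type names (`p100`, `k20x`) and the partition name are assumptions, so check the Balena documentation once the new cards are in service.

```shell
#!/bin/bash
# Hypothetical Slurm submission sketch for selecting a GPU type.
# GRES type names and the partition name are illustrative, not confirmed.
#SBATCH --job-name=gpu-test
#SBATCH --partition=batch-acc    # illustrative partition name
#SBATCH --gres=gpu:p100:1        # request one P100 card specifically
#SBATCH --time=01:00:00

srun ./my_gpu_application
```

With the same syntax, `--gres=gpu:k20x:2` would request two K20x cards on one node, and a plain `--gres=gpu:1` would take whichever card type is free first.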

3) NVMe SSD cards
To help accelerate I/O-intensive workloads we have purchased two Intel Solid-State Drive DC P3600 Series NVMe SSDs, each with 2TB of storage. They are optimised for small (4KB) file operations and can be used where reproducible I/O performance is needed. They will be fitted into two of the IvyBridge nodes (one each) and presented as a local filesystem on those nodes. This is new and experimental for Balena, so we are keen to work with groups or individuals to explore the benefits. Applications which might see an immediate benefit include Gaussian and those using deep-learning tools.
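Because the NVMe storage is local to the node rather than shared, the usual pattern is to stage data in at the start of the job and copy results back at the end. A sketch of such a job script is below; the mount point `/nvme` and the `--constraint` feature tag are assumptions for illustration, not the actual Balena configuration.

```shell
#!/bin/bash
# Hypothetical job sketch for the NVMe-equipped nodes.
# The /nvme mount point and the "nvme" feature tag are assumptions.
#SBATCH --job-name=nvme-io
#SBATCH --constraint=nvme        # illustrative tag to land on an NVMe node

SCRATCH=/nvme/$SLURM_JOB_ID      # assumed local NVMe mount point
mkdir -p "$SCRATCH"

cp -r "$HOME/inputs" "$SCRATCH/"   # stage input data onto the fast local disk
cd "$SCRATCH"
./my_io_heavy_application inputs/

cp -r results "$HOME/"             # copy results back before the job ends
rm -rf "$SCRATCH"                  # clean up the local disk for the next user
```

The copy-back step matters: anything left on the local disk is not on the shared BeeGFS areas and may be cleared when the job finishes.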

4) BeeGFS scratch expansion
The non-archival BeeGFS scratch area will be expanded with about 380TB of additional storage on a new pair of BeeGFS storage nodes, giving a total of about 0.6PB usable on scratch. Note that you will not see a huge improvement in the performance of individual read/write operations, but there will be an overall aggregate performance improvement across the BeeGFS scratch solution; we expect to be able to drive this at around 18GB/s for certain I/O patterns. We will also be increasing the metadata capacity across the existing storage systems to help the system cope with the large number of files we are seeing.

5) BeeGFS for Undergraduates and Masters students
We will be introducing a second BeeGFS parallel filesystem providing about 50TB of resilient, backed-up and performant data storage. This will be available to undergraduate and Masters-level students, and for sharing large, read-mostly data sets.

 

It's going to be an exciting few months while the new equipment is being installed, and we'll keep you updated with progress.

Team HPC
