NVIDIA Selene: World’s Seventh Fastest Supercomputer, Built in 3 weeks
Understanding What Makes NVIDIA Selene So Unique
Assembling a supercomputer typically takes months, if not years. This summer, however, NVIDIA broke the record by building one in just three and a half weeks, in the middle of a pandemic. Named Selene, this supercomputer currently sits at No. 7 on the overall TOP500 at 27.5 petaflops on the Linpack benchmark (for comparison, India's fastest supercomputer, Pratyush, delivers a meager 3.7 petaflops) and at No. 2 on the latest Green500 list. Unlike the CPU-based designs found in the vast majority of TOP500 supercomputers, Selene has an architecture based on GPU accelerators (NVIDIA's DGX SuperPOD), which makes it 6.8x more energy-efficient than the average TOP500 system. According to the NVIDIA blog, at 20.5 gigaflops/watt, Selene is within a fraction of a point of the top spot on the Green500 list, which is claimed by a much smaller system ranked No. 394 by performance. Furthermore, it is the only top-100 system to crack the 20 gigaflops/watt barrier.
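A Green500 efficiency figure is just performance divided by power, so the two numbers quoted above imply Selene's approximate power draw. A minimal sketch of that back-of-the-envelope calculation (the resulting megawatt figure is an estimate derived from the text, not an official NVIDIA number):

```python
# Figures quoted in the text above.
LINPACK_PFLOPS = 27.5           # Rmax on the Linpack benchmark
EFFICIENCY_GFLOPS_PER_W = 20.5  # reported Green500 efficiency

def implied_power_mw(pflops: float, gflops_per_watt: float) -> float:
    """Power draw (in MW) implied by a Linpack score and an efficiency figure."""
    gigaflops = pflops * 1e6            # 1 petaflop = 1e6 gigaflops
    watts = gigaflops / gflops_per_watt
    return watts / 1e6                  # watts -> megawatts

print(round(implied_power_mw(LINPACK_PFLOPS, EFFICIENCY_GFLOPS_PER_W), 2))  # ~1.34 MW
```

Roughly 1.34 MW for 27.5 petaflops illustrates why the GPU-dense design dominates the efficiency rankings.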
NVIDIA deployed Selene to tackle problems such as protein docking and quantum chemistry, which are crucial to understanding the coronavirus and finding a potential cure for COVID-19. It is currently used by the Argonne National Laboratory to research ways to stop the COVID-19 coronavirus. Meanwhile, the University of Florida plans to use the design to build the fastest AI computer in academia. Located in Santa Clara, California, Selene is composed of 280 NVIDIA DGX A100 systems, each integrating eight NVIDIA A100 Tensor Core GPUs, interconnected by 494 NVIDIA Mellanox Quantum QM8790 HDR 200Gb/s InfiniBand smart switches. It is capable of delivering more than 1 exaflop of AI performance, making it the fastest industrial system in the USA. It is backed by seven petabytes of DDN A3I scalable storage. Each A100 accelerator is equipped with 6,912 CUDA cores, 40 GB of dedicated HBM2 memory, and 432 tensor cores specialized for artificial intelligence, inference, and deep-learning tasks. Selene's four storage tiers span from 100 terabyte/second memory links to 100 Gb/s storage pools.
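The per-system and per-GPU counts above multiply out to the machine's aggregate GPU resources. A short sketch of that arithmetic, using only the component counts quoted in the text:

```python
# Component counts quoted above: 280 DGX A100 systems, 8 A100 GPUs each.
SYSTEMS = 280
GPUS_PER_SYSTEM = 8
CUDA_CORES_PER_GPU = 6912
HBM2_GB_PER_GPU = 40

total_gpus = SYSTEMS * GPUS_PER_SYSTEM               # 2,240 A100 GPUs
total_cuda_cores = total_gpus * CUDA_CORES_PER_GPU   # ~15.5 million CUDA cores
total_hbm2_tb = total_gpus * HBM2_GB_PER_GPU / 1024  # aggregate HBM2, in TiB

print(total_gpus, total_cuda_cores, round(total_hbm2_tb, 1))
```

That works out to 2,240 GPUs with roughly 87.5 TiB of aggregate HBM2 memory across the cluster.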
Learning from History
Apart from achieving this incredible feat amid a pandemic, NVIDIA says it drew on lessons from its previous attempts at building supercomputers. Its engineers were motivated by two ideas: first, to design something powerful enough to train the AI models their colleagues were building for autonomous vehicles, and second, to make the system serve the general needs of any deep-learning researcher. This led to the birth of the SATURNV cluster in 2016, which was based on the NVIDIA Pascal GPU. Later, in June 2019, they built Circe, at present the world's 23rd fastest supercomputer. Circe was built from 96 V100-based NVIDIA DGX-2 systems, grouped into massive clusters called DGX PODs, culminating in the DGX SuperPOD. Circe's network is based on scalable modules of 20 nodes connected by relatively simple "thin switches" that can be laid down cookie-cutter style, turned on, and tested before another is added.
Working with cables of standard lengths tied up with Velcro, and racks that could be labeled and mapped, showed the engineers the value of a simple, balanced design that lets a supercomputer scale as required. The flexibility of the design also means that researchers have much more freedom to explore new directions in AI and high-performance computing. This homogeneity enabled the quick assembly of Selene: each DGX POD was moved into the right spot, wired, tested, and its physical connectivity optimized for the most prevalent software and application needs. As mentioned earlier, Selene uses Mellanox InfiniBand switches to reduce the number of cables required while simultaneously increasing bandwidth. Also, to monitor Selene, NVIDIA bought Trip, a telepresence robot from Double Robotics powered by an NVIDIA Jetson TX2. Trip lets the remote team virtually observe Selene via its camera and microphone. The team also built a bot for Slack that sends them notifications when hardware is misbehaving, a status LED changes, or a cable has come loose.
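The article does not describe how NVIDIA's Slack bot is implemented; a common way to build this kind of alerting is Slack's incoming-webhook API, which accepts a JSON `{"text": ...}` payload. A hypothetical sketch (the node name, event strings, and webhook URL are all illustrative, not NVIDIA's):

```python
# Hypothetical hardware-alert bot in the style described above, using a
# Slack incoming webhook. The URL below is a placeholder.
import json
from urllib import request

WEBHOOK_URL = "https://hooks.slack.com/services/EXAMPLE"

def format_alert(node: str, event: str) -> str:
    """Build the message text posted to the ops channel for one event."""
    return f":warning: {node}: {event}"

def post_alert(node: str, event: str) -> None:
    """POST the alert to the Slack incoming webhook (makes a network call)."""
    payload = json.dumps({"text": format_alert(node, event)}).encode()
    req = request.Request(WEBHOOK_URL, data=payload,
                          headers={"Content-Type": "application/json"})
    request.urlopen(req)  # fire-and-forget; real code would retry on failure

print(format_alert("dgx-042", "status LED turned amber"))
```

In practice such a bot would be driven by the cluster's telemetry (IPMI sensors, link-state monitors), calling `post_alert` whenever a reading crosses a threshold.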
In addition to these, Selene is cooled on a per-SuperPOD basis. Each SuperPOD resides in one big air-conditioned warehouse, raised off the ground, with fans underneath pushing cool air up into the pods. The NVIDIA team only needed to install the flooring and seal up the SuperPODs to control the airflow.
One can have a look at Selene here.