These days, we’re getting a lot of interest from our clients about composable disaggregated infrastructure (CDI), including what the most critical elements are for CDI-based clusters.
Successful deployments are more likely when clients understand why their design team focuses on certain areas more than others, and how design decisions can affect the end-user experience. With that in mind, we wanted to outline some key elements of CDI-based clusters.
At its simplest, CDI is a software-defined method of disaggregating compute, storage, and networking resources into shared resource pools. These disaggregated resources are connected by an NVMe-over-fabric (NVMe-oF) solution so that you can dynamically provision hardware and optimize resource utilization. Because it decouples applications and workloads from the underlying hardware, it allows you to redeploy resources to new workloads wherever they’re needed.
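For context, attaching a remote NVMe namespace over a fabric is typically done with the standard `nvme-cli` tooling. The sketch below is illustrative only: the transport, IP address, and NQN are placeholders, and a real CDI stack would drive this orchestration through its own management software rather than by hand.

```
# Discover NVMe-oF subsystems exposed by a target (address is a placeholder)
nvme discover -t rdma -a 192.168.1.100 -s 4420

# Connect to one subsystem; the NQN below is a placeholder
nvme connect -t rdma -n nqn.2023-01.example:nvme-pool-01 -a 192.168.1.100 -s 4420

# The remote namespace now appears as a local block device
nvme list
```

Once connected, the remote storage behaves like a local NVMe drive, which is what lets CDI reassign pooled storage to whichever server needs it.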
In this way, the CDI design provides the flexibility of the cloud and the value of virtualization but the performance of bare metal. CDI offers the ability to run diverse workloads on a cluster while still optimizing for each workload, but there are two key components to consider for an optimized CDI cluster.
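To make the composition model concrete, here is a deliberately simplified toy sketch (not any vendor's API) of the core idea: a "server" is carved out of shared resource pools on demand, and its resources return to the pools when the workload finishes.

```python
# Toy model of composable infrastructure (illustrative only, not a vendor API):
# servers are composed from, and released back to, shared resource pools.
from dataclasses import dataclass


@dataclass
class ResourcePools:
    cpus: int = 64
    gpus: int = 8
    nvme_tb: int = 100


@dataclass
class ComposedServer:
    name: str
    cpus: int
    gpus: int
    nvme_tb: int


class Composer:
    def __init__(self, pools: ResourcePools):
        self.pools = pools
        self.servers: dict[str, ComposedServer] = {}

    def compose(self, name: str, cpus: int, gpus: int, nvme_tb: int) -> ComposedServer:
        # Allocate only if the shared pools can satisfy the whole request.
        if cpus > self.pools.cpus or gpus > self.pools.gpus or nvme_tb > self.pools.nvme_tb:
            raise ValueError("insufficient pooled resources")
        self.pools.cpus -= cpus
        self.pools.gpus -= gpus
        self.pools.nvme_tb -= nvme_tb
        server = ComposedServer(name, cpus, gpus, nvme_tb)
        self.servers[name] = server
        return server

    def decompose(self, name: str) -> None:
        # Return the server's resources to the shared pools for reuse.
        server = self.servers.pop(name)
        self.pools.cpus += server.cpus
        self.pools.gpus += server.gpus
        self.pools.nvme_tb += server.nvme_tb


composer = Composer(ResourcePools())
composer.compose("ai-train-01", cpus=32, gpus=8, nvme_tb=20)
print(composer.pools.gpus)   # 0: all pooled GPUs are in use
composer.decompose("ai-train-01")
print(composer.pools.gpus)   # 8: GPUs returned for the next workload
```

Real orchestration software also handles fabric attachment, firmware, and failure domains, but the allocate-from-pool / release-to-pool cycle above is the essence of what "composable" means.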
The software-defined nature of CDI means the software the cluster runs on must be best-in-class. Beyond that, however, you need to look into the specific areas of focus for the software and what it brings to the cluster.
The two software providers we believe meet the rigors of CDI-based clusters are Liqid and GigaIO. Each has its own fans, often because of small differences in their areas of focus. Below is a quick overview of each, but you should work with your cluster design partner to dive more deeply into how the choice of CDI software aligns with your particular use case:
Liqid Command Center™ is a powerful resource-orchestration software platform that dynamically composes physical servers on demand from pools of bare-metal resources. This flexibility is paired with meaningful improvements in performance, optimization, and efficiency.
GigaIO FabreX is an enterprise-class, open-standard solution that enables complete disaggregation and composition of all resources in the rack. FabreX allows you to use your preferred vendor and model for servers, GPUs, FPGAs, storage, and any other PCIe resource in your rack. In addition to composing resources to servers, FabreX can compose servers over PCIe. It enables true server-to-server communication across PCIe and makes cluster-scale compute possible, with direct memory access by an individual server to the system memory of every other server in the cluster fabric.
The right high-performance, low-latency networking is the second critical element to an optimized CDI cluster. That’s because the networking technology of a CDI cluster is a fixed resource with a fixed effect on performance, as opposed to other resources that can be disaggregated. You can disaggregate compute (Intel, AMD, FPGAs), data storage (NVMe, SSD, Intel Optane, etc.), GPU accelerators (NVIDIA GPUs), and more however you see fit, but the networking underneath all those components stays the same.
An optimal network strategy is essential to a CDI deployment, ensuring consistent performance no matter how you deploy resources to accommodate your workflows. Depending on the use case, we use NVIDIA HDR InfiniBand or NVIDIA Spectrum Ethernet switches: InfiniBand is ideal for large-scale or performance-critical clusters, while Ethernet is an ideal choice for smaller ones. This way, as you expand over time, the underlying network is built to support future needs across the lifecycle of the system.
One of the reasons CDI is generating so much buzz is that CDI is a compelling option to meet demanding and complex workflows, such as HPC and AI, that require massive levels of costly resources.
The optimal design for a CDI cluster is one that effectively manages on-premises data center assets while delivering the flexibility typically provided by the cloud. Achieving that design, however, requires significant engineering expertise and time, which is why starting from a CDI-based reference architecture is a great idea.
That’s why Silicon Mechanics has created the Miranda CDI Cluster reference architecture as the ideal starting place for clients who want to take advantage of CDI. The Miranda CDI Cluster is a Linux-based reference architecture that provides a strong foundation for building disaggregated environments.
Get a comprehensive understanding of CDI clusters like the Miranda Cluster and what they can do for your organization by downloading the insideHPC white paper on CDI.
Silicon Mechanics, Inc. is one of the world’s largest private providers of high-performance computing (HPC), artificial intelligence (AI), and enterprise storage solutions. Since 2001, Silicon Mechanics’ clients have relied on its custom-tailored open-source systems and professional services expertise to overcome the world’s most complex computing challenges. With thousands of clients across the aerospace and defense, education/research, financial services, government, life sciences/healthcare, and oil and gas sectors, Silicon Mechanics solutions always come with “Expert Included”℠.
Our engineers are not only experts in traditional HPC and AI technologies; we also routinely build complex rack-scale solutions with today's newest innovations so that we can design and build the best solution for your unique needs.
Talk to an engineer and see how we can help solve your computing challenges today.