Today’s IT organizations must maximize resource utilization to deliver computing capabilities when and where the business needs them. As a result, many have built multi-purpose clusters, which can compromise performance for individual workloads.
Even worse from an ROI perspective, in many instances, once resources are no longer required for a particular project, they cannot be redeployed to another workload with precision and efficiency. Composable disaggregated infrastructure (CDI) can hold the key to solving this optimization problem, while also providing bare metal performance.
At its core, CDI is the concept of using a set of disaggregated resources connected by an NVMe over Fabrics (NVMe-oF) solution so that you can dynamically provision hardware, regardless of scale. This infrastructure design provides the flexibility of the cloud and the value of virtualization with the performance of bare metal. Because it decouples applications and workloads from the underlying hardware, CDI makes it possible to run diverse workloads on one cluster while still optimizing for each workload, and even to support multi-tenant environments.
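The idea of composing machines from shared pools and returning devices when a project ends can be illustrated with a toy model. This is a minimal sketch only: the ResourcePool class, its method names, and the device counts are hypothetical and do not represent the Liqid or GigaIO software interfaces.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ResourcePool:
    """Free devices in the rack, counted by type (hypothetical model)."""
    free: Counter = field(default_factory=Counter)

    def compose(self, name, request):
        """Reserve devices for a machine, or raise if the pool cannot satisfy it."""
        shortfall = {
            kind: count - self.free[kind]
            for kind, count in request.items()
            if self.free[kind] < count
        }
        if shortfall:
            raise ValueError(f"not enough free devices for {name}: {shortfall}")
        self.free.subtract(request)
        return {"name": name, "devices": dict(request)}

    def release(self, machine):
        """Return a machine's devices to the pool so another workload can use them."""
        self.free.update(machine["devices"])
        machine["devices"] = {}


# Assumed inventory for illustration: 8 GPUs, 16 NVMe drives, 2 FPGAs.
pool = ResourcePool(Counter(gpu=8, nvme=16, fpga=2))

# Compose a GPU-heavy machine for a training project.
training = pool.compose("training-node", {"gpu": 6, "nvme": 4})
print(pool.free["gpu"])  # 2

# When the project ends, the same devices are redeployed to a different shape.
pool.release(training)
inference = pool.compose("inference-node", {"gpu": 2, "nvme": 2, "fpga": 1})
print(dict(pool.free))  # {'gpu': 6, 'nvme': 14, 'fpga': 1}
```

In a real deployment the reservation step would be carried out by the fabric management software rather than by bookkeeping like this, but the accounting is the same: devices belong to the pool, not to a fixed server.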
Software providers often used in CDI-based clusters include Liqid and GigaIO. Liqid Command Center™ is a powerful management software platform that dynamically composes physical servers on demand from pools of bare-metal resources. GigaIO FabreX is an enterprise-class, open-standard solution that enables complete disaggregation and composition of all resources in the rack.
The disaggregated resources in CDI allow you to dynamically provision clusters using best-fit hardware without the performance penalty you would incur in a cloud-based environment. For HPC and AI, the value of CDI comes from the flexibility to match the underlying hardware to different workloads and environments. This improves cost-effectiveness and scalability compared to cloud service providers, which in turn improves ROI.
For AI and HPC workloads, performance is still the top priority, and on-premises hardware provides better performance while retaining the ability to burst to the cloud as needed. A well-designed cluster built with commercial off-the-shelf (COTS) hardware elements and connected with PCIe, Ethernet, and InfiniBand can increase the utilization, flexibility, and effective use of valuable data center assets. Organizations that implement CDI realize a 2x to 4x increase in data center resource utilization, on average.
Beyond optimizing resource allocation, CDI also provides several additional benefits for your dynamically configured system:
A wide variety of technology areas can benefit from CDI. These include:
For deep learning, it is best to keep clusters on-premises because on-premises computing can be more cost-effective than cloud-based computing when highly utilized. It’s also advisable to keep primary storage close to on-premises compute resources to maximize network bandwidth while limiting latency.
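The claim that on-premises computing wins when highly utilized can be made concrete with a rough break-even calculation. The sketch below is a simplification under stated assumptions: on-premises cost is treated as fixed regardless of use, cloud cost is purely pay-per-hour, and every dollar figure is a made-up placeholder rather than real pricing.

```python
HOURS_PER_YEAR = 365 * 24


def breakeven_utilization(capex, annual_opex, lifetime_years, cloud_rate_per_hour):
    """Return the fraction of hours in use above which on-premises is cheaper.

    Assumes on-premises cost is fixed (purchase plus yearly operating cost)
    and that equivalent cloud capacity is billed only for the hours used.
    """
    on_prem_total = capex + annual_opex * lifetime_years
    cloud_cost_at_full_use = cloud_rate_per_hour * HOURS_PER_YEAR * lifetime_years
    return on_prem_total / cloud_cost_at_full_use


# Placeholder inputs: $150,000 purchase, $20,000/year to operate,
# 3-year lifetime, $25/hour for comparable cloud capacity.
threshold = breakeven_utilization(150_000, 20_000, 3, 25.0)
print(f"On-premises is cheaper above {threshold:.0%} utilization")
```

With these placeholder inputs the threshold comes out near one third of available hours. The point is the shape of the trade-off, not the specific number: the higher the utilization CDI helps you reach, the further past break-even the cluster operates.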
There are two critical factors in deploying a successful CDI-based cluster. The first is a design that properly integrates leading-edge CDI software.
As mentioned above, two software platforms often used in CDI clusters are Liqid Command Center and GigaIO FabreX. Silicon Mechanics has worked with both technologies and uses them in its CDI-based clusters.
Liqid Command Center is fabric management software for bare-metal machine orchestration. Command Center provides:
Policy-based automation and dynamic provisioning of resources
Advanced cluster, machine, and device statistics and monitoring
Scalable architecture supporting high availability (HA)
Multiple control methods, including GUI and RESTful API
GigaIO FabreX is an open-standard solution that allows you to use your preferred vendor and model for servers, GPUs, FPGAs, storage, and any other PCIe resource in your rack. In addition to composing resources to servers, FabreX can compose servers over PCIe. FabreX enables true server-to-server communication across PCIe and makes cluster-scale compute possible, with direct memory access by an individual server to the system memory of all other servers in the cluster fabric.
High-performance, low-latency networking, such as InfiniBand from NVIDIA Networking, is the second critical factor. It’s possible to disaggregate just about everything: compute (Intel, AMD, FPGAs), data storage (NVMe, SSD, Intel Optane, etc.), and GPU accelerators (NVIDIA GPUs). You can rearrange these components however you see fit, but the network underneath them stays the same. Think of networking as a fixed resource with a fixed effect on performance, unlike the other resources, which are disaggregated.
It is important to plan an optimal network strategy for a CDI deployment. InfiniBand is ideal for large-scale or high-performance deployments, while Ethernet is a strong choice for smaller clusters. Choosing a network with headroom means that, as you expand over time, the underlying fabric can support whatever comes up in the lifecycle of the system.
Today, many organizations run demanding and complex workflows, such as HPC and AI, that require massive levels of costly resources. This drives IT departments to find flexible and agile solutions that effectively manage the on-premises data center while delivering the flexibility typically provided by the cloud. CDI is quickly emerging as a compelling option to meet the demands for deploying applications that incorporate advanced technologies.
Silicon Mechanics is an engineering firm providing custom, best-in-class solutions for HPC/AI, storage, and networking, based on open standards. The Silicon Mechanics Miranda CDI Cluster is a Linux-based reference architecture that provides a strong foundation for building disaggregated environments.
Get a comprehensive understanding of CDI clusters and what they can do for your organization by downloading the Inside HPC white paper on CDI.
Silicon Mechanics, Inc. is one of the world’s largest private providers of high-performance computing (HPC), artificial intelligence (AI), and enterprise storage solutions. Since 2001, Silicon Mechanics’ clients have relied on its custom-tailored open-source systems and professional services expertise to overcome the world’s most complex computing challenges. With thousands of clients across the aerospace and defense, education/research, financial services, government, life sciences/healthcare, and oil and gas sectors, Silicon Mechanics solutions always come with “Expert Included” SM.