
Getting to Results Faster: Why Good Engineers Supplement Strong AI Clusters with Strong Software Stacks

By Curtis Elgin, Engineer, Silicon Mechanics
November 28, 2021

Machine learning (ML) has touched nearly every aspect of daily life, from online customer support to search engine result filtering. ML has moved so far into the mainstream that it is now often referred to simply as artificial intelligence (AI), even though that label oversimplifies what ML is. Properly supported, advanced ML projects can drive some of tomorrow’s most transformative technologies, such as self-driving cars, big data analytics, voice and facial recognition, and augmented reality. However, as this technology and the hardware and software tools that enable it progress, there is a growing expectation that a “better” cluster is one that not only performs better but also delivers results faster.

Supporting Machine Learning and Deep Learning Workloads

As ML and deep learning (DL) models continue to grow in both scale and complexity, they demand solutions with extensive computing power, high-speed and high-capacity storage, and low-latency, high-bandwidth interconnects. Modern AI hardware technologies can provide plenty of performance. However, these systems require a large investment of time, expertise, and funding.

That’s why organizations partner with expert solution designers and consultants with years of experience deploying reliable, high-performance AI environments. These solutions can require investments in the millions but aren’t always ‘ready-to-run’ when they arrive on site.

To train and deploy AI models, users rely on various AI frameworks and development tools, each supporting specific types of AI. Sourcing and integrating these software tools is an extra step in the procurement process that takes time, resources, and expertise to execute properly.

That’s why top AI solution providers take the extra step to include pre-configured, pre-tested software stacks, such as the Silicon Mechanics AI Stack or Silicon Mechanics Scientific Computing Stack. End users save time by avoiding the effort required to set up their own stack. There is also the additional value of having the provider's engineering team be intimately familiar with the applications required for different workloads. The more we know about what you'll be doing, the more we can optimize the cluster's design to support your particular situation.

The Equivalent of a LAMP Stack

The open-source LAMP stack (Linux, Apache, MySQL, and PHP/Perl/Python) has had a huge impact on the growth of software development, which in turn has led to some amazing AI applications and use cases.

The benefits of a pre-installed, pre-tested software stack mentioned above are potentially so strong that these cluster software stacks may become as ubiquitous as the LAMP stack has become in software development. The major difference is that LAMP is well defined, while AI and big data stacks are still emerging as more organizations get involved in these sectors and adoption of big data workloads continues.

Today, each engineering team looks at what types of clients and partners it has and then determines what sort of software it can effectively source, test, and integrate. In our case, the team here at Silicon Mechanics created this stack for our customers:

  • Ubuntu, an open-source Linux distribution, commonly used for AI and HPC systems
  • TensorFlow, an open-source software library focused on developing deep neural networks
  • PyTorch, an open-source ML library for natural language processing and computer vision applications
  • Keras, an open-source software library that provides a Python interface for artificial neural networks. Keras has historically supported several backends, including TensorFlow, Microsoft Cognitive Toolkit, Theano, and PlaidML
  • cuDNN, a GPU-accelerated library for deep neural networks. cuDNN provides implementations for forward and backward convolution, pooling, normalization, activation layers, and other standard routines.
  • NVIDIA CUDA, a parallel computing platform and API that allows software to use NVIDIA GPUs for general purpose processing, a key component of enabling AI.
  • NVIDIA HPC SDK, a comprehensive software development kit for GPU-accelerating HPC modeling and simulation applications. It includes C, C++, and Fortran compilers, libraries, and analysis tools.
  • R, a language and environment for statistical computing and graphics that enables data manipulation, calculation and graphical display
  • And more…
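One practical benefit of a pre-tested stack is that its presence can be checked quickly after delivery. The following is a minimal sketch of such a check, not a Silicon Mechanics tool. The import names (`tensorflow`, `torch`, `keras`) and executable names (`nvcc`, `nvc`, `Rscript`) are the ones these projects commonly ship, but the mapping to the list above is our assumption, and a real installation may use different names or paths.

```python
import importlib.util
import shutil

# Assumed mapping from stack component to Python import name.
PYTHON_PACKAGES = {
    "TensorFlow": "tensorflow",
    "PyTorch": "torch",
    "Keras": "keras",
}

# Assumed mapping from stack component to a command-line tool it provides.
COMMAND_LINE_TOOLS = {
    "NVIDIA CUDA (nvcc compiler)": "nvcc",
    "NVIDIA HPC SDK (nvc compiler)": "nvc",
    "R (Rscript)": "Rscript",
}


def check_stack(packages=PYTHON_PACKAGES, tools=COMMAND_LINE_TOOLS):
    """Return {component label: True/False} without importing anything heavy."""
    report = {}
    for label, module_name in packages.items():
        # find_spec locates a top-level module without executing it.
        report[label] = importlib.util.find_spec(module_name) is not None
    for label, executable in tools.items():
        # shutil.which searches the current PATH for the executable.
        report[label] = shutil.which(executable) is not None
    return report


if __name__ == "__main__":
    for label, present in check_stack().items():
        print(f"{label:32s} {'found' if present else 'missing'}")
```

A check like this only confirms that components are installed; it does not verify that versions are compatible with each other, which is the part a pre-tested stack is meant to take care of.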

Integrating Hardware and Software

Beyond the software stack itself, another way we've found to boost the performance and speed of clusters is to ensure the hardware is optimized for the type of workloads it will be running. As noted above, engineers who know your workload can optimize the cluster much better for your specific needs. We even went so far as to apply the pre-sourced, pre-integrated, pre-tested concept to the cluster itself, so we don’t have to start from scratch with our designs each time we work with a client.

Instead, we’ve designed a specific reference architecture for AI environments, the Silicon Mechanics Atlas AI Cluster. Using best-of-breed technology (including NVIDIA A100 GPUs for industry-leading GPU performance) in white box servers, the Linux-based Atlas AI Cluster provides performance, reliability, and scalability for AI along with the fast start of an integrated, tested software stack specific to AI.

The Silicon Mechanics Atlas AI Cluster also features low total cost of ownership compared to traditional supercomputers.

To maximize the ROI of your AI platform, we use a building block approach, where computing, storage, and networking components are configured in standardized, reliable sizes that can be scaled incrementally to meet specific performance needs. This lets us push the boundaries of AI clusters and optimize for AI models across a wide variety of use cases, such as natural language processing, predictive analytics, cybersecurity, business intelligence, virtual assistants, and robotics, to name a few.
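The arithmetic behind incremental scaling can be sketched in a few lines. The block sizes below (8 GPUs, 100 TB, and 40 network ports per block) are hypothetical values chosen for illustration, not Atlas AI Cluster specifications; the point is only that each component type is rounded up to a whole number of standardized blocks.

```python
import math

# Hypothetical capacity of one standardized block of each component type.
BLOCK_CAPACITY = {
    "gpus": 8,
    "storage_tb": 100,
    "network_ports": 40,
}


def blocks_needed(requirements, capacity=BLOCK_CAPACITY):
    """Round each requirement up to a whole number of standardized blocks."""
    return {
        name: math.ceil(amount / capacity[name])
        for name, amount in requirements.items()
    }


# Example: a workload that needs 20 GPUs, 250 TB of storage, and 40 ports.
plan = blocks_needed({"gpus": 20, "storage_tb": 250, "network_ports": 40})
print(plan)  # {'gpus': 3, 'storage_tb': 3, 'network_ports': 1}
```

Because each component type scales on its own, a storage-heavy workload can add storage blocks without paying for GPUs it will not use.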

Moving Forward

Organizations looking to leverage ML and DL must find smarter ways to optimize for different AI models. As open software and hardware experts, we pride ourselves on working directly with you to understand your technical and business requirements, and pair you with our best-fit computing solutions for your AI needs. We encourage you to learn more about our Atlas AI cluster and our AI software stack, to see why it is the right platform for your AI deployment.

Learn more about key infrastructure areas to focus on for AI, ML, and more by reading this white paper.


About Silicon Mechanics

Silicon Mechanics, Inc. is one of the world’s largest private providers of high-performance computing (HPC), artificial intelligence (AI), and enterprise storage solutions. Since 2001, Silicon Mechanics’ clients have relied on its custom-tailored open-source systems and professional services expertise to overcome the world’s most complex computing challenges. With thousands of clients across the aerospace and defense, education/research, financial services, government, life sciences/healthcare, and oil and gas sectors, Silicon Mechanics solutions always come with “Expert Included” SM.
