How to Increase ROI in an Enterprise HPC Environment

High-Performance Computing (HPC) offers the enormous computing power needed for workloads involving data analytics, AI, IIoT, 3D imaging, and more. But such power comes at a significant cost. Often, even when the HPC system does the job well, the ROI does not justify the investment. Here are six ways for enterprises to optimize the ROI of their HPC environment.

1. Perform Regular Health Checks

An HPC system is a network of thousands of servers. It requires regular maintenance and system checks to keep running. Many enterprises do not take basic maintenance seriously and suffer degraded performance as a result.

  • Check the software and firmware versions, and run updates if needed. Check interdependencies when running updates. Updates may introduce incompatibilities that percolate through the environment. This may degrade performance or cause downtime.
  • Maintain the liquid cooling system proactively. Failure of this critical component overheats the system and can damage the CPUs.
  • Streamline troubleshooting. Deploy bundles to channel user comments into a ticketing system. Configure the bundle to get the required insights. Focus on insights such as the jobs users have running, the applications loaded into their environment, and so on.

Conduct comprehensive health checks once a quarter to pre-empt surprises and unplanned downtime.
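The version check in the first bullet can be partly automated. The sketch below is a minimal illustration, not a vendor tool: it assumes an inventory of per-node component versions has already been collected (the node names, component names, and version strings are all hypothetical) and flags nodes that differ from the most common version in the cluster.

```python
from collections import Counter


def find_version_drift(inventory, component):
    """Return (baseline, drifted) for one component.

    baseline is the most common version across nodes; drifted maps each
    node whose version differs from the baseline to the version it runs.
    """
    versions = {node: comps.get(component) for node, comps in inventory.items()}
    baseline, _ = Counter(versions.values()).most_common(1)[0]
    drifted = {node: v for node, v in versions.items() if v != baseline}
    return baseline, drifted


# Hypothetical inventory; in practice this would come from whatever
# collection tooling the site already uses.
inventory = {
    "node001": {"bios": "2.4.1", "nic_firmware": "16.35"},
    "node002": {"bios": "2.4.1", "nic_firmware": "16.35"},
    "node003": {"bios": "2.3.0", "nic_firmware": "16.35"},
    "node004": {"bios": "2.4.1", "nic_firmware": "16.32"},
}

for component in ("bios", "nic_firmware"):
    baseline, drifted = find_version_drift(inventory, component)
    print(f"{component}: baseline {baseline}, drifted nodes {drifted}")
```

Treating the most common version as the baseline is only one possible choice. A site that pins approved versions could compare against that list instead.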

2. Upgrade the Stack Periodically

Many system admins are reluctant to tinker with an HPC setup that serves their needs. They feel the redundancy built into the system makes optimization unnecessary. But it is important to update and optimize HPC systems regularly, even when things work fine. The system may work fine for present needs but fall short for future applications. These systems perform quadrillions of calculations per second, and things can go wrong with the slightest glitch. Over time, system administrators make different configuration selections, changing the setup. Also, replacement components may not be an exact match for the originals.

Perform these tasks to keep the HPC system updated and in prime working condition:

  • Review the processing speed periodically. Invest in additional nodes and accelerators to improve speed, without relying only on system cache. High-power HPC systems carry large amounts of cache, which improves the computing speed.
  • Keep track of new versions of hardware. Invest in new hardware if the investment pays for itself through savings.
  • Review storage space periodically. Tap into the cloud for scalability and other benefits. Computational demands are often cyclical. Catering to peak demand may leave storage and computing assets underutilized most of the time. Deploy tools that can provision resources from the cloud on demand, with easy scale-up and scale-down.
  • Allow deploying multiple versions of applications catering to the specific needs of different user groups.

Perform small, incremental upgrades continually to avoid the extended downtime that major upgrades require.
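The point about cyclical demand can be made concrete with a little arithmetic. The sketch below is an illustration under stated assumptions: the monthly demand figures (in nodes) are invented, and the two functions are hypothetical helpers. It compares how well owned capacity is used when sizing for the peak versus sizing for typical demand and sending the overflow to the cloud.

```python
def utilization(demand, capacity):
    """Average fraction of owned capacity that is actually used."""
    used = sum(min(d, capacity) for d in demand)
    return used / (capacity * len(demand))


def burst_demand(demand, capacity):
    """Total demand above owned capacity, i.e. what would run in the cloud."""
    return sum(max(d - capacity, 0) for d in demand)


# Hypothetical demand in nodes per month, with two seasonal peaks.
monthly_demand = [400, 420, 380, 900, 410, 390, 400, 950, 420, 400, 380, 410]

peak_capacity = max(monthly_demand)  # own enough nodes for the busiest month
typical_capacity = 420               # assumed: own enough for a normal month

print(f"Sized for peak:    {utilization(monthly_demand, peak_capacity):.0%} utilized")
print(f"Sized for typical: {utilization(monthly_demand, typical_capacity):.0%} utilized")
print(f"Node-months sent to the cloud: {burst_demand(monthly_demand, typical_capacity)}")
```

Whether bursting is cheaper depends on the cloud price for the overflow versus the cost of owning idle nodes, which this sketch does not model.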

3. Review Configuration Periodically

Review the configuration periodically. New workloads may require different configurations. Waiting for new requirements before changing the configuration may cause significant delays. Optimization is time-consuming. Any delay may cause missed opportunities.

  • Examine the use-cases of the HPC systems and consider possible cases that will come up soon. Consider if the existing configuration is optimal for such needs. Review industry best practices as benchmarks. Make changes as needed.
  • Allow diverging configurations to enable customization for specific uses. But keep control over the settings.
  • Pay special attention to the memory configuration. HPC systems rely on parallel processing and are I/O intensive. The integrity of the system depends on information flowing in and out of memory quickly. The memory configuration affects application performance.

4. Focus on Energy Efficiency

HPC systems are energy-intensive. These machines can consume as much as 30 kW per rack on a normal day. Savings on energy can make a big difference to the ROI of HPC systems.

  • Make sure the cooling system works optimally, always. The high density of HPC machines makes cooling systems critical to sustaining performance.
  • Replace fans with pumps, which are more energy-efficient and emit less noise.
  • Explore configurations that save on power conversions and losses. The U.S. Department of Energy’s National Renewable Energy Laboratory, for instance, supplies electricity to the racks directly, rather than using the typical 208 V step down.
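To see how much energy savings can matter, a rough annual cost estimate helps. The sketch below uses the 30 kW per rack figure from above as a continuous load, which makes it an upper bound. Everything else is an assumption for illustration: the rack count, the electricity price, and the two power usage effectiveness (PUE) values, where PUE is total facility energy divided by IT equipment energy.

```python
HOURS_PER_YEAR = 24 * 365


def annual_energy_cost(racks, kw_per_rack, pue, price_per_kwh):
    """Estimate yearly electricity cost for a cluster running at constant load."""
    it_load_kw = racks * kw_per_rack   # power drawn by the racks themselves
    facility_kw = it_load_kw * pue     # adds cooling and conversion overhead
    return facility_kw * HOURS_PER_YEAR * price_per_kwh


# Assumed values: 10 racks, $0.10 per kWh, PUE improving from 1.6 to 1.2.
baseline = annual_energy_cost(racks=10, kw_per_rack=30, pue=1.6, price_per_kwh=0.10)
improved = annual_energy_cost(racks=10, kw_per_rack=30, pue=1.2, price_per_kwh=0.10)

print(f"Baseline cost: ${baseline:,.0f} per year")
print(f"Improved cost: ${improved:,.0f} per year")
print(f"Savings:       ${baseline - improved:,.0f} per year")
```

Under these assumptions, the overhead reduction alone is worth roughly a quarter of the baseline bill. Real figures depend on actual load profiles and local tariffs.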

5. Keep Critical Spares on Hand

Enterprises invest in HPC for immediate responses. Any downtime is unacceptable and may cause huge revenue loss, besides eroding the competitiveness of the business.

Computing systems can go down without warning. Keep critical spares on hand for instant replacements. Procuring parts when the machine goes down will extend the downtime. With the COVID-19 pandemic disrupting the supply chain, the Just-in-Time (JIT) inventory approach is no longer a virtue. Getting the required parts may take longer than normal.

Some of the mission-critical parts for HPC include:

  • Power supplies. A malfunctioning power supply can trigger instant shutdowns.
  • Compute blades. System admins may redistribute work from malfunctioning nodes to other nodes for a few hours. But without fast replacement, performance suffers. Likewise, if the admin servers that host the cluster manager software go down, the functioning of the entire system suffers.
  • Network switching. Proper functioning of the HPC depends on seamless data flows between data storage and server. Lack of proper connection between the compute blades and data storage degrades performance.

6. Forge Working Relationships

The ROI of an HPC system depends on timely maintenance. This requires the support and cooperation of the entire ecosystem, not just the in-house IT team.

  • Have established lines of supply: HPC systems incorporate parts from several vendors. Strike a working relationship with each and establish a smooth process for ordering parts. Make sure the vendors source their products from different places. Multiple vendors that all depend on a single master source are a recipe for disaster.
  • Outsource maintenance tasks: Tie up with third-party vendors for maintenance, either in full or in part. HPC maintenance is a complex affair, fraught with many uncertainties. It often requires full-time attention from IT personnel. Such internal expertise may not always be available. With a reliable external expert on hand, the enterprise can save on resources for system maintenance and optimization.

As Artificial Intelligence, neural networks, and other intelligent systems gain popularity, the importance of HPC will increase exponentially. The HPC market will grow at 6.13% CAGR between 2020 and 2025. But only enterprises that optimize and maintain their HPC stack well can generate positive ROI from their investments.
