
Data orchestration coming to Datadobi unstructured data management

Blocks & Files interviewed Denise Natali, the new VP for Americas sales at Datadobi, over email about the firm’s views on unstructured data management. Datadobi provides a StorageMAP facility to locate, identify and manage unstructured data, building on its core migration technology. 

We wanted to gain a better understanding of Datadobi’s position on data tiering, orchestration, supplying a product vs a service, and Gen AI needs. We outlined the context in a basic diagram we sent over:

Datadobi reciprocated with its own view of the unstructured data management market:

It also sent a note explaining the terms:

That provides the background context for Natali’s answers to our questions.

Blocks & Files: What is Datadobi’s strategy for growing sales in this market?

Denise Natali

Denise Natali: Priority number one for Datadobi is to stay close to its customers and partners with whom we have such close and trusted relationships. We spend the majority of our time with customers working to understand their needs – we are dedicated to understanding their unique “why?” Only then do we help guide them towards the next steps in the unstructured data management journey – and share how StorageMAP can help them solve specific unstructured data management challenges and achieve desired outcomes.

In addition, we are dedicated to not only staying on top of, but in front of, evolving market demands/trends. So in addition to the time we spend with our customers and partners, there are folks in our organization that spend a tremendous amount of time with industry experts (such as leading industry analysts) and by reading and keeping up to date with business and technology journals (such as yours).

Drilling down a bit … StorageMAP is a versatile solution – and the market needs to know this. But, at the same time, we need to be able to demonstrate our strengths in particular areas – our ability to help our customers and partners overcome challenges and achieve their goals. For instance, one such area of focus is hybrid cloud data services. From a sales standpoint, we need to be able to prove that StorageMAP is hands-down the most robust and comprehensive solution here – a truly vendor-neutral solution, with unmatched unstructured data insight, organization, and mobility capabilities and the only solution capable of scaling to the requirements of large enterprises.

And last but certainly not least, it is critical to our sales process that we demonstrate that StorageMAP enables its customers to maintain data ownership and control – addressing customer concerns about metadata security and compliance, whether the data is managed on-premises, remotely, or in the cloud.

Blocks & Files: Does Datadobi have a data orchestration strategy?

Denise Natali: For sure. Our goal is to always remain at the forefront of unstructured data management technology. Customers and partners want increased automation and policy-driven data management capabilities – so data orchestration is an integral part of our near-term roadmap.

As you likely know, today customers are seeking solutions with data orchestration capabilities for a number of reasons – such as improved data management, enhanced data quality and consistency, increased operational efficiency, better decision making, scalability, compliance and security, and cost savings. 

Blocks & Files: Do customers want unstructured data management as a service or as a software product they buy?

Denise Natali: As with any other solution, enterprises are looking at a number of options for delivering unstructured data management. But before we get into that, it is important  to note that unstructured data management as a market is still nascent. Right now, we find that our conversations with customers are really focused on their specific and immediate needs and not necessarily a well-defined need either. We take a consultative approach to help them to explore what it is they are trying to achieve, and help them to make that a success. As our CEO likes to say, we deliver an outcome not a software product. That is what our second-to-none reputation is built upon.

Ultimately, enterprises will want an unstructured data management solution to play well with their existing infrastructure. The last thing most of them are looking for is yet another standalone point software [product] that they have to manage. Integration with their ecosystems, whether managed by them or an external party will be key.

But what is essential to all our enterprise customers, whichever path they choose, is that they maintain ownership and control over their data and metadata to make sure it remains secure and compliant with their internal policies and regulatory requirements.

At the end of the day, maintaining ownership and control is exactly why our customers prefer an on-prem solution. Many aaS offerings create security headaches for customers as they lose control of their data. At the behest of our customers, we are exploring various consumption models which are different than aaS models – the choices are not just “purchase/subscribe” vs “aaS”.

Blocks & Files: Does an unstructured data manager have to support all data silos, both on-premises and in the public cloud?

Denise Natali: Absolutely. An unstructured data manager must support all data silos – both on-premises and in the cloud. Here’s why: 

1. Hybrid Cloud Strategies – Many organizations adopt hybrid cloud strategies, maintaining a mix of on-premises and cloud-based data storage to optimize performance, cost, and security. An effective unstructured data manager must seamlessly manage data across these diverse environments.

2. Data Mobility – As organizations grow and evolve, the need to constantly move data between on-premises systems and various cloud platforms increases. Supporting all data silos makes sure that the data owners and consumers (whether people or applications such as Gen AI) can easily and quickly get the data where they need it. Liberating data from the confines of specific hardware is key to a successful hybrid-cloud strategy.  

3. Unified Management – To streamline operations and reduce complexity, organizations prefer a single pane of glass for managing their unstructured data. A unified data manager that supports all silos provides centralized control and visibility, enhancing operational efficiency. This doesn’t mean that an unstructured data manager (likely in IT) will need to do all the work, but a single point of coordination between the data custodians, data owners, and data consumers, will be vital.

4. Data Lifecycle Management – Different regulatory requirements may apply to data stored on-premises versus in the cloud. A comprehensive data manager can help enforce consistent compliance and other policies across all storage locations through the implementation and monitoring of policies created by the data owners.

5. Optimized Storage Utilization – Organizations can optimize storage costs and performance by strategically placing unstructured data based on usage patterns and access requirements. Supporting all data silos allows for intelligent data tiering and lifecycle management.

6. Scalability and Flexibility – Businesses need the flexibility to scale their storage solutions as needed. An unstructured data manager that supports both on-premises and cloud environments can easily adapt to changing storage demands as they evolve.

Blocks & Files: How does an unstructured data manager support customers adopting Gen AI technology, for training and for inferencing?

Denise Natali: The ideal unstructured data manager can support customers adopting Generative AI technology for training and inferencing through several key functions. It is important to note here that when Gen AI applications (and many other kinds of applications) claim to process “unstructured data” what they mean is they can deal with small amounts of unstructured data. Ten terabytes of unstructured data is likely considered to be fairly large for most applications.

What StorageMAP can help with is identifying the right 10TB from the multiple petabytes and billions of files that enterprises have at their disposal, making sure that only high-quality and pertinent data is used for AI model training. 

As a side note, applications that perform other tasks – such as looking for PII data within a file – have the same limitation. StorageMAP does not replace these applications, but can make them far more effective by helping to get the right data to them rather than the current approach of “best guess.”

****

By adding data orchestration functionality, Datadobi will be competing with Arcitecta and Hammerspace, and will be able to upsell into its existing customer base.

Solidigm shipping PCIe 5 datacenter SSDs

Solidigm has launched a pair of datacenter SSDs using the PCIe gen 5 interface.

Greg Matson

The SK hynix subsidiary’s D7 PS1010 and PS1030 follow on from its earlier PCIe gen 4 D7-P5520 and P5620 which were built from 144-layer 3D NAND in TLC format. The new drives use 176-layer 3D NAND, still in TLC format, and are much faster, thanks in part to their PCIe 5 bus, twice the speed of the PCIe 4 bus.

Greg Matson, Solidigm’s SVP of Strategic Planning and Marketing, said in a statement: “The Solidigm D7-PS1010 and D7-PS1030 SSDs were meticulously engineered to meet the increasingly demanding IO requirements across a range of workloads such as general-purpose servers, OLTP, server-based storage, decision support systems and AI/ML. 

“In a world where every watt counts, these drives are PCIe 5.0 done right, not only delivering industry-leading four-corner performance, but also up to 70 percent better energy efficiency compared to similar drives by other manufacturers.”

Here’s a performance comparison table for the new drives and the older P5520 and P5620:

We can see that the stated random read IOPS have almost tripled while the random write IOPS have nearly doubled (PS1010) or more than doubled (PS1030). Sequential read bandwidth has slightly more than doubled, and write bandwidth has more than doubled too. Latency has improved as well, with read latency falling by 20 percent and write latency by 28 percent.

The capacity ranges are pretty similar to the older drives, with the read-intensive, 1 drive write per day (DWPD), PS1010 having 1.92, 3.84, 7.68, and 15.36TB variants. The mixed-use, 3 DWPD PS1030 ranges from 1.6 to 3.2, 6.4 and on to 12.8TB. They are packed in either 2.5-inch U.2 cases or the newer E3.S 15mm enclosure.

Their main characteristics are: 

Solidigm chart.

What Solidigm calls standard endurance is the lower endurance model of the two. It says the PS1010 is a mixed-use and mainstream drive with the PS1030 being mixed-use and write-centric with both having performance heavily skewed in favor of reads over writes.

Solidigm’s product brief has a detailed PS1010 performance comparison against stated values for competing suppliers Kioxia, Micron and Samsung: 

Solidigm chart.

Solidigm suggests these new SSDs can be used for HPC, general purpose servers, OLAP and cloud computing services. It makes a big thing about them being suited for AI pipeline work, functioning as an NVMe data cache drive in cloud-located GPU servers, and also as an all-flash tier front-ending a disk-based object storage tier. For on-prem use it has the GPU server using them as an NVMe cache drive in front of an all-QLC SSD object tier.

D7-PS1010 PS1030 case.

Ace Stryker, Director of Market Development at Solidigm, claimed in a statement: “As AI workloads continue to surge, storage performance becomes critical. The Solidigm D7-PS1010 and D7-PS1030 are a game-changer for AI-driven enterprises, capable of outperforming competitors at critical stages of the AI pipeline.”

Energy efficiency is claimed to be 70 percent better than Samsung’s PM1743.

Get a product briefing doc for the two new SSDs here.

Private equity house EQT buys majority stake in Acronis

Acronis is now owned by private equity after EQT bought a majority stake.

The Singapore-based backup software business is massive. It protects over 750,000 businesses across 150 countries, through more than 20,000 service providers offering Acronis Cyber Protect services. EQT is Europe’s largest private equity business. The cost of the stake was not revealed. Acronis founders, management, and existing investors will remain significant minority shareholders.

Ezequiel Steiner

Acronis CEO Ezequiel Steiner put out a statement: “We are thrilled that EQT shares our vision for growth and supports our strategic expansion. With EQT as strong partner, we will continue Acronis’ expansion strategy and continue to deliver the very best service to Acronis’ partners and customers.”

Johannes Reichel, Partner and Co-Head of Technology within EQT’s Private Equity advisory team, said: “Acronis is a strongly positioned cybersecurity and data protection software platform with a clear value proposition to Managed Service Providers. EQT has followed the company’s journey for many years and continues to be impressed by its performance and innovative strength. We are very excited to partner with Acronis, the management team and existing investors on its next phase of growth.”

The roots of Acronis go back to 1997 when SWsoft was founded by Russian entrepreneur Serguei Beloussov (Serg Bell) as a privately held server automation and virtualization company. Web hosting and OS partition virtualization business Parallels was started up in 1999 by Ilya Zubarev and Serguei Beloussov. It developed virtualization technology for the Mac, enabling Windows to run alongside macOS in a virtual machine.

Beloussov, Zubarev, and Stanislav Protasov then co-founded backup and disaster recovery provider Acronis in 2003 as a Parallels spin-off. Its TrueImage product dates from then and is sold by OEMs as a PC backup, recovery, migration and DR facility. 

SWsoft bought Parallels in 2004. This had such a strong brand image that SWsoft changed its name to Parallels in 2008. Corel acquired Parallels in 2018. Virtuozzo, the sole remnant of SWsoft, is owned by a group including Serg Bell.

Serguei Beloussov

Acronis made its headquarters in Singapore in 2008 and then moved to Schaffhausen in Switzerland in 2014 to improve its effectiveness as a global business. Serg Bell was CEO and board chairman from 2013 to 2021. He is now Chief Research Officer and an executive board member.

Ezequiel Steiner became Acronis CEO in October 2023, taking over from Patrick Pulvermueller, who was CEO from 2021. Pulvermueller remained a board member and became a CEO advisor.

Acronis developed additional security and cloud-based offerings with, for example, automation facilities for MSPs.

BlackRock and others invested $250 million in Acronis in 2022, valuing the company at $3.5 billion. A 51 percent majority ownership stake at that valuation would cost $1.785 billion. We suspect Acronis is valued at more than $3.5 billion in 2025. According to Reuters’ sources the valuation could be $4 billion.

Fellow backup and security company Veeam, also founded by Russian entrepreneurs, was bought for around $5 billion by private equity in 2020.

Serg Bell told us: “Today’s announcement is great progress. It has always been important for Stanislav and myself – the founders –  to find a partner that aligns perfectly with Acronis’s culture and vision. A partner that is committed to accelerating the deployment of advanced, state of the art cyber protection and operations solutions across the world, while maintaining the highest standards of quality and partner service. With the amount and intensity of cyber threats constantly growing, we are confident that Acronis is uniquely placed to be the best platform for Service Providers to profitably protect and operate their customers’ information technology infrastructure 

“As we celebrate this significant milestone for Acronis, we are also looking forward to devoting more time to advancing the fields of science, research and education with the team at Constructor Group which I founded in 2019. We are changing the delivery and accessibility of best science, research and education through Constructor Tech Platform. The Constructor Tech team is leveraging the tidal wave of generative AI and Metaverse to enable scientists, researchers, teachers, students and academic administrators to accelerate the technological breakthroughs that will help solve the world’s most pressing challenges. We are on the way to creating a world-renowned center of excellence for research and innovation at Constructor University graduating founders, CEOs and the C-suite leaders of tomorrow and Constructor Capital is funding and growing its deeptech, software and ed/science tech portfolio”

The EQT Acronis transaction is pending customary regulatory approvals and is anticipated to close in the first or second quarter of 2025.

Backblaze drive type profiling predicts replacement and migration

Cloud storage provider Backblaze is developing drive type profiling models to optimize its drive replacement and migration strategies.

The work is detailed in the latest edition of its quarterly disk drive annual failure rate (AFR) statistics blog. Author and principal cloud storage storyteller Andy Klein says: “One of the truisms in our business is that different drive models fail at different rates. Our goal is to develop a failure profile for a given drive model over time.”

He started by “plotting the current lifetime AFR for the 14 drive models that have an average age of 60 months or less,” drawing a chart of drive average age versus cumulative AFR and dividing it into four quadrants: 

We can instantly see that the two left-hand quadrants contain most of the drives, and the top right quadrant – for older drives with higher cumulative AFRs – has only two drive models in it.

Klein characterizes the quadrants as:

  • 1. Older drives doing well – with those further to the right having higher AFRs;
  • 2. Drives with AFRs above 1.5 percent and around 2 percent – “What is important is that AFR does not increase significantly over time”;
  • 3. The empty quadrant – it would be populated if any of Backblaze’s drives exhibited a bathtub-curve failure pattern, with failures in their early days, a reliable mid-period, and renewed failures as they age;
  • 4. Younger drives – with low failure rates.
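For readers who want to poke at the numbers themselves, here is a minimal sketch of how such a quadrant view could be computed from Backblaze’s published drive-stats data. The column names, the sample figures, and the 1.5 percent AFR split are assumptions for illustration, not Backblaze’s actual pipeline.

# Sketch: classify drive models into age/AFR quadrants (assumed schema, figures and thresholds).
import pandas as pd

# Assumed per-model aggregates: lifetime drive-days and failures, plus average drive age in months.
models = pd.DataFrame({
    "model":          ["HGST 4TB", "Seagate 4TB", "Seagate 8TB", "Toshiba 14TB"],
    "drive_days":     [25_000_000, 80_000_000, 30_000_000, 12_000_000],   # illustrative numbers
    "failures":       [350, 6_000, 700, 120],
    "avg_age_months": [90, 100, 70, 30],
})

# Backblaze's lifetime annualized failure rate: failures per drive-year, as a percentage.
models["afr_pct"] = models["failures"] / models["drive_days"] * 365 * 100

AGE_SPLIT_MONTHS = 60   # x-axis split used in the article
AFR_SPLIT_PCT = 1.5     # assumed y-axis split between "good" and "worrying" drives

def quadrant(row):
    old = row["avg_age_months"] > AGE_SPLIT_MONTHS
    high_afr = row["afr_pct"] > AFR_SPLIT_PCT
    if old and not high_afr:
        return "1: older drives doing well"
    if old and high_afr:
        return "2: older drives with higher AFR"
    if not old and high_afr:
        return "3: young drives failing early (empty in practice)"
    return "4: younger drives with low failure rates"

models["quadrant"] = models.apply(quadrant, axis=1)
print(models[["model", "afr_pct", "avg_age_months", "quadrant"]])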

Next Klein drew a similar chart for drives older than 60 months: 

Now there is a more even distribution of drives across the four quadrants. He says: “As before, Quadrant I contains good drives, Quadrants II and III are drives we need to worry about, and Quadrant IV models look good so far.” The 4TB Seagate drive (ST4000DM000) in quadrant 2 looks “first in line for the CVT migration process.” CVT stands for Backblaze’s internal Cluster, Vault, Tome migration process. [See bootnote.]

Klein next looked at the change in failure rates for these drives over time in a so-called snake chart:

This chart starts at the 24-month age point and it shows that “the drive models sort themselves out into either Quadrant I or II once their average age passes 60 months,” except for the black line – Seagate’s ST4000DM000 4TB model.

Five drive models are in quadrant 1. The two 4TB HGST drives (brown and purple lines) as well as the 6TB Seagate (red line) have nearly vertical lines, “indicating their failure rates have been consistent over time, especially after 60 months of service.”

Two drives exhibit increasing failure rates with age – 8TB Seagate (blue line) and the 8TB HGST (gray line) – but both are now levelling out.

Four drives are in quadrant 2. Three of them – the 8TB Seagate (yellow line), the 10TB Seagate (green line), and the 12TB HGST (teal line) – show accelerated failure rates over time. Klein writes: “All three models will be closely watched and replaced if this trend continues.”

The 4TB Seagate drive (ST4000DM000 and black line) “is aggressively being migrated and is being replaced by 16TB and larger drives via the CVT process.”

Looking at all these curves, Klein believes the 8TB Seagate (ST8000DM002) is behaving normally: it started out with a 1 percent AFR up to the 60-month point and then, as expected, its AFR increased towards 1.5 percent.

He says the  two 4TB HGST drive models (brown and purple lines) have “failure rates … well below any published AFR by any drive manufacturer. While that’s great for us, their annualized failure rates over time are sadly not normal.”

Klein believes that using Gen AI large language models (LLMs) to predict drive failure rates is a no-go area for now. Training a model on one drive type’s failure profile doesn’t mean the model can predict another drive type’s failure profile. He observes: “One look at the snake chart above visualizes the issue as the failure profile for each drive model is different, sometimes radically different.”

Backblaze’s drive set data is freely available here. He points out anyone can use it but: “All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, 3) you may sell derivative works based on the data, but 4) you can not sell this data to anyone; it is free.”

Bootnote

A Tome is a logical collection of 20 drives, “with each drive being in one of the 20 storage servers in a given Vault.” A storage server could possess 60 HDDs and hence 60 unique tomes in the vault. A Cluster is a logical collection of Vaults, which can have any combination of vault sizes.
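To make the hierarchy concrete, here is a small sketch of the Cluster/Vault/Tome relationships described above; the class layout is an assumption for illustration, not Backblaze’s software.

# Sketch of the Cluster > Vault > Tome hierarchy described in the bootnote (illustrative only).
from dataclasses import dataclass, field

DRIVES_PER_SERVER = 60   # a storage server with 60 HDDs
SERVERS_PER_VAULT = 20   # a tome takes one drive from each of the vault's 20 servers

@dataclass
class Tome:
    index: int
    drives: list = field(default_factory=list)   # one (server_id, drive_slot) pair per server

@dataclass
class Vault:
    tomes: list = field(default_factory=list)

def build_vault() -> Vault:
    vault = Vault()
    # Drive slot N on every server in the vault forms tome N, so 60 slots -> 60 tomes.
    for slot in range(DRIVES_PER_SERVER):
        tome = Tome(index=slot, drives=[(server, slot) for server in range(SERVERS_PER_VAULT)])
        vault.tomes.append(tome)
    return vault

vault = build_vault()
print(len(vault.tomes), "tomes, each spanning", len(vault.tomes[0].drives), "drives")
# -> 60 tomes, each spanning 20 drives; a Cluster is simply a collection of such vaults.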

VergeIO claims fast VM migration from VMware

Virtualized datacenter supplier VergeIO reckons it can migrate hundreds of VMware virtual machines in seconds, providing a quicker and better tested exit for Broadcom VMware migrants.

Since Broadcom bought VMware, changed its business terms and conditions, and forced reselling partners to re-apply to be resellers, VMware customers have been considering options – such as moving their virtual machine (VM) applications to other environments. Several suppliers of alternative hypervisor systems, including VergeIO, have been positioning themselves as VM migration targets.

Jason Yaeger.

VergeIO SVP of Engineering Jason Yaeger put out a statement saying: “At VergeIO, we understand the critical importance of minimizing operational disruption during migration. Our ability to migrate over 100 VMs in less than five seconds showcases our commitment to delivering seamless and efficient solutions for our customers.”

VergeIO claims this “significantly outpaces other alternative solutions, which average 3 to 7 minutes per VM, translating to a lengthy 5 to 12 hours for migrating 100 VMs.”

CMO George Crump tells us: “Recently, I watched a webinar by Veeam demonstrating the migration of a VMware VM to another hypervisor (Hyper-V). It took about 1 minute and 45 seconds to migrate one VM. When asked how long it would take to migrate 100 VMs, the SE estimated it would take nearly three hours, raising concerns about the server’s ability to handle that many migrations in a single pass.”

VergeIO’s VergeOS software provides a virtual datacenter built from hyperconverged infrastructure servers. It emphasizes that storage and networking are included directly in VergeOS and do not run as VMs.

VMs are migrated from VMware to VergeOS with the IOmigrate feature, which uses VMware’s change block tracking technology to make a real-time copy of virtual machines. VergeIO claims it allows IT admins to log into the VMware ecosystem, view available VMs, select the ones they want to migrate, and within moments – near real time – have those VMs operating under VergeOS.

VergeIO IOmigrate diagram.

According to Broadcom, VMware’s Changed Block Tracking (CBT) is a VMkernel feature that keeps track of the storage blocks of virtual machines as they change over time. The VMkernel keeps track of block changes on virtual machines, which enhances the backup process for applications that have been developed to take advantage of VMware’s vStorage APIs.

A backup or other appliance uses the VMware API calls to request that a snapshot is created. VMware then takes the snapshot and presents it back so that a backup can be made. 

Changed Block Tracking (CBT) is a feature that identifies blocks of data that have changed or are in use. It enables incremental backups to identify changes since the previous backup, writing only changed or in-use blocks, which helps reduce the amount of data transferred and the time taken.
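As a rough illustration of the changed-block idea – a generic sketch, not VMware’s actual CBT API or VergeIO’s code – an incremental copy driven by a list of changed extents looks something like this:

# Generic sketch of changed-block-driven incremental copying (illustrative only).
from typing import BinaryIO, Iterable, Tuple

BLOCK_SIZE = 1 << 20  # assume 1 MiB transfer granularity

def incremental_copy(source: BinaryIO, target: BinaryIO,
                     changed_extents: Iterable[Tuple[int, int]]) -> int:
    """Copy only the (offset, length) extents reported as changed since the last sync."""
    copied = 0
    for offset, length in changed_extents:
        source.seek(offset)
        target.seek(offset)
        remaining = length
        while remaining > 0:
            chunk = source.read(min(BLOCK_SIZE, remaining))
            if not chunk:
                break
            target.write(chunk)
            remaining -= len(chunk)
            copied += len(chunk)
    return copied

# Usage: the hypervisor (via something like CBT) supplies changed_extents for the period
# since the previous snapshot; only those regions are transferred, not the whole disk.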

VergeIO claims the speed and convenience of its VM transfer technology minimizes downtime and enables the quick creation of test environments, “facilitating thorough compatibility and performance testing.” Its “migration processes ensure optimal use of existing hardware and resources,” and it claims its “optimized infrastructure enhances overall system performance post-migration.” It also claims boot times 3x faster than VMware’s.

The company has already “successfully converted customers with thousands of VMware VMs to VergeOS.”

It will host a live demo of its VMware VM migration technology on August 15 at 1:00 pm ET. Register here.

Mastering the AI terrain: Why optimal storage is essential for competitive edge

COMMISSIONED: Enterprises adopting AI to stay competitive must tailor AI models to their needs. This means defining use cases, choosing workflows, investing in the right infrastructure, and partnering for success.

Amidst today’s intense market competition, enterprises seek to leverage AI to gain a strategic advantage. Developing proprietary AI models enables companies to tailor solutions to their unique needs, ensuring optimal performance and differentiation. Starting a project to develop AI models involves navigating a complex landscape of technological challenges and requires careful planning, problem-solving skills, and a strategic approach to AI integration.

In AI development, defining a clear use case is the initial critical step, followed by selecting an AI workflow that ensures efficiency and effectiveness, with tools that are simple, integrated, customizable, scalable, and secure. Performance sizing is key, involving benchmarking and optimizing AI models for speed and accuracy, while balancing other performance metrics. The infrastructure to support AI is extensive, requiring robust data storage, compute resources, data processing, machine learning frameworks, and MLOps platforms. And with investments in AI predicted to reach nearly $200 billion by 2025 according to Goldman Sachs reports, the economic potential is significant and necessitates substantial capital investment. Not to mention, the specialized knowledge required for AI projects often necessitates enlisting external expertise.

Each of these challenges must be carefully considered and addressed to ensure the successful development and deployment of AI models. The following step-by-step approach can help organizations address these challenges.

Step 1: Define your use case

Deploying a Generative AI (GenAI) system successfully involves a series of strategic steps, the first and most crucial being defining a clear use case. This foundational step is about understanding the specific needs and objectives of the business, which will guide the selection of the appropriate GenAI workflow. It’s essential to consider the parts of the organization that will be impacted, identify the end-users, and locate where the necessary data is stored.

Aligning GenAI’s capabilities with business goals, whether it’s generating marketing content, providing digital assistance on a website, creating synthetic data or images, or facilitating natural language code development, helps to ensure that the technology is applied in a way that adds value and drives innovation. The success of GenAI deployment hinges on this alignment, resulting in technology that serves as a powerful tool to enhance business processes, engage customers, and foster growth.

Step 2: Choose your AI workflow

Choosing the right AI workflow is crucial for the success of any AI-driven project. Starting with a clear understanding of the objective and the specific use case will guide selection of the appropriate workflow pattern.

Pre-trained models offer a quick start, as they are ready-made solutions that work out-of-the-box for a variety of tasks. Model augmentation, such as retrieval augmented generation (RAG), involves adding new knowledge to an existing model, allowing it to make informed decisions based on additional data. Fine-tuning is a more in-depth process, where the model’s existing knowledge is refined to improve its performance on specific tasks. Finally, model training from scratch is the most comprehensive approach, involving the creation of a new neural network tailored to the unique requirements of the task at hand. This step-by-step escalation in AI workflow complexity, while requiring additional time and effort to complete, allows for a tailored approach that aligns with the project’s goals and technical needs.

Step 3: Size performance requirements

When planning for AI deployment, sizing performance requirements is critical. The type of model you choose, whether it is a language model like GPT-4 or an image-based model like DALL-E or Stable Diffusion, influences your compute and storage needs. Language models, while having a high number of parameters, are more compact, which means they require less storage space but more computational power to process a large number of parameters.

On the other hand, image-based models may have fewer parameters but require more storage due to the larger size of the model itself. This distinction is important because it affects how you architect your system’s infrastructure. For instance, a system designed for language models should prioritize processing power, while one for image-based models should focus on storage capabilities. Compute and storage requirements will vary depending on a model’s architecture and the task it is designed to perform so this needs to be factored into how you architect your entire AI project. Understanding these nuances can lead to more efficient resource allocation and a smoother AI workflow.
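As a back-of-the-envelope illustration of the storage side of this sizing exercise (the parameter counts, two-byte precision, and 4x training multiplier below are assumptions for illustration, not Dell guidance):

# Rough sizing sketch: raw model footprint = parameters x bytes per parameter (illustrative only).
def model_size_gb(parameters: float, bytes_per_param: float = 2.0) -> float:
    """Approximate size of the weights alone, in GB (FP16/BF16 = 2 bytes per parameter)."""
    return parameters * bytes_per_param / 1e9

for name, params in [("7B-parameter language model", 7e9),
                     ("70B-parameter language model", 70e9),
                     ("~1B-parameter image model", 1e9)]:
    weights = model_size_gb(params)
    # Training typically needs several times the weight footprint for optimizer state,
    # gradients and activations; 4x here is a rough illustrative multiplier, not a rule.
    print(f"{name}: ~{weights:.0f} GB weights, ~{weights * 4:.0f} GB+ during training")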

Common storage solutions for AI models include many options, each with unique benefits and best use cases. Local file storage is often used for smaller, individual projects due to its simplicity and ease of access. Network-attached storage provides more robust solutions for larger datasets, offering better performance and scalability. Distributed file systems (DFS) are ideal for large datasets that require high availability and fault tolerance, as they distribute the data across multiple machines. Object storage is another choice, especially for cloud-native applications, due to its scalability and performance with substantial amounts of unstructured data. It is important to consider the specific needs of your AI model, such as the size of the model and the number of parameters, to choose the most suitable storage solution.

Step 4: Right size your infrastructure investments

Right-sizing infrastructure investments is a critical step in developing efficient AI systems. It involves selecting the appropriate hardware that aligns with the computational demands of the AI models. For instance, smaller AI models may be able to run on optimized laptops such as Dell Precision workstations, while more complex algorithms require powerful setups, such as those with multiple GPUs like Dell’s XE9640 and XE9680 servers. Dell PowerScale offers a versatile storage solution that caters to various needs, from all-flash arrays designed for high performance to tiered storage that balances cost and scalability.

The main advantages of PowerScale for GenAI applications include its scalability, which allows starting with a small and economical setup that can grow exponentially across different environments. It also offers universal data access which allows data to be ingested, read, and written through multiple protocols. Additionally, PowerScale supports GPUDirect, allowing for high-speed and efficient data access, crucial for intensive tasks like AI training. With high-performance Ethernet and NFS over RDMA, it provides for rapid data collection and preprocessing. Lastly, its multicloud deployment capability is essential for running AI workloads in various settings, whether on-premises, at the edge, or in the cloud, providing flexibility and efficiency in AI infrastructure.

Step 5: Engage Dell resources for help

Engaging Dell resources can significantly streamline the process of integrating advanced technologies into your business operations. With step-by-step guidance, your teams can concentrate on strategic growth and innovation rather than the intricacies of implementation. Dell’s Validated Designs and Reference Architectures provide a solid foundation for building efficient IT solutions and assurance that your infrastructure is optimized for performance and reliability. Additionally, we work with our Dell partners to offer specialized AI Workshops which are designed to bring your team up to speed on the latest in AI developments and applications. For a more tailored approach, Dell Professional Services for GenAI offer expertise in deploying generative AI, helping you to quickly establish a robust AI platform and align high-value use cases to drive tangible business value.

In order to be successful with AI model implementation, you need clear guidance on defining use cases, ensuring that your AI initiatives are aligned with strategic business goals. Our Dell AI solutions are designed for efficiency and effectiveness, featuring tools that are not only simple and integrated but also customizable and scalable to meet the evolving demands of AI projects. Performance sizing resources and best practices available through Dell are streamlined with our advanced benchmarking and optimization capabilities, enhancing the speed and accuracy of AI models. The infrastructure required for AI is robust and extensive, and our solutions encompass high-performance data storage, powerful compute resources, and sophisticated data processing capabilities. Recognizing the need for specialized knowledge, we connect you with industry experts to bridge any gaps in expertise, ensuring that your AI projects are not only successful, but also cutting-edge.

To learn more about how Dell storage can support your AI journey visit us online at www.dell.com/powerscale.

Contributed by Dell Technologies.

Samsung’s mobile memory gets thinner

Samsung has crafted ultra-thin LPDDR5X memory chips built to support mobile devices with AI workloads.

LPDDR5X is low power double data rate DRAM, and Samsung has made it using 12nm-class process technology with a 0.65mm package height, likened to fingernail thickness. It is producing 12GB and 16GB capacity versions of the chip, with 4 DRAM die layers, each layer having two LPDDR DRAMs. It is nine percent (0.06mm) thinner than the previous-generation product, and also has 21.2 percent more heat resistance, says Samsung.

It says it has optimized printed circuit board (PCB) and epoxy molding compound (EMC) techniques and used back-lapping to make the chip package thinner. The EMC material protects semiconductor circuits from external risks such as heat, impacts, and moisture.

Samsung reckons its memory chip will be good for AI-enhanced mobile devices. YongCheol Bae, EVP for Memory Planning, claimed: “Samsung’s LPDDR5X DRAM sets a new standard for high-performance on-device AI solutions, offering not only superior LPDDR performance but also advanced thermal management in an ultra-compact package.”

Samsung’s diagram showing airflow benefits of its 12GB LPDDR5X chip.

High-capacity memory chips contribute to the heat load inside a mobile device, and better airflow can improve thermal management. That said, there are no fans inside mobile devices, so airflow is passive and convective, and not very effective as a cooling mechanism. Mobile devices generally rely on heat spreading from the warmest components, like the processor and wireless chips, conducting heat to the sealed case. The case functions as a heat sink, and natural convection from it expels the unwanted heat.

Samsung’s thinner LPDDR5X package could help enable thinner mobile devices and that, in some users’ minds, could trump the thermal management capability.

The company aims to supply its new chip to both mobile processor makers and mobile device manufacturers. Samsung does, of course, make its own Exynos mobile processors and Galaxy smartphones, so it will also be using the new chip itself.

The product roadmap includes 6-layer 24GB and 8-layer 32GB chips. They will probably be thicker.

Bootnote

Back-lapping involves grinding a chip’s surface using abrasive slurry, a mixture of abrasive material (e.g. silicon carbide and boron carbide) suspended in water or oil, instead of fixed-abrasive wheels for a less stressful thinning process.

Pliops and Kalray merger is off

The merger talks between DPU accelerator card builder Pliops and massively parallel DPU supplier Kalray have ended.

Israel-based Pliops is adding an AI feature set to its XDP (Extreme Data Processor) card. The product provides key:value store technology which is designed to offload and accelerate low-level storage stack processing from a host x86 server for applications like RocksDB and RAID.

Ido Bukspan

Pliops CEO Ido Bukspan said in a statement: “You are likely aware that we explored the possibility of joining forces with Kalray in a merger. Kalray’s impressive technology and mature products in systems and advanced computing would have greatly enhanced our GenAI system-level offerings.

“However, economic conditions – especially financial market trends – did not favor a capital-intensive merger that would align with our targets. Moving forward, the management teams of both companies will continue discussions to explore areas of cooperation and leverage synergies.”

What now? Bukspan claimed Pliops has a “bright future” because it can enable “organizations to achieve unprecedented levels of performance and efficiency in their AI-driven operations.”

The company is developing an XDP LightningAI product, “a Key-Value distributed smart storage node for the GPU Compute Domain.” It claims this can provide an up to 50 percent TCO reduction for large language model (LLM) inference workloads. It can also “maximize performance per power budget and enhance VectorDB,” Pliops said.

In an X post, Pliops claimed: “Our Extreme Data Processor uses GPU key-value I/O to cut power usage & emissions.”

The hardware is not ready yet, however. Bukspan added: “Pliops LightningAI has gained significant traction with major customers and partners eagerly awaiting our demo based on our first silicon, which will be available soon.”

Pliops AI Inferencing acceleration graphic

There will be a demo of XDP LightningAI at this week’s FMS 2024 event in Santa Clara, California.

Bukspan is still on the lookout for more partners, saying: “Partnerships that bring value to our customers and support our mission to enhance GenAI application efficiencies are integral to our business model. Kalray remains a valued partner, joining our other collaborators to deliver essential solutions that will yield amazing results in the coming months.”

Bringing Lucidity to AWS, Azure and GCP block storage

Startup Lucidity claims its Autoscaler can automatically scale public cloud block storage up or down dynamically – depending on workload – without downtime, saving up to 70 percent of cost and with virtually no DevOps involvement.

Vatsal Rastogi (left) and Nitin Singh Bhadauria (right).

Lucidity was founded in 2021 in Bangalore, India, by CEO Nitin Singh Bhadauria and Vatsal Rastogi who drives the technical vision and manages engineering and development. Rastogi was previously  a software developer involved in creating Microsoft Azure. 

It has taken in $5.6 million in angel and VC pre-seed and seed funding. The investors were responding to a proposal that Lucidity develop auto-scaling software to manage the operations involved in scaling up and shrinking public cloud persistent block storage. Its autoscaler would place agents in each virtual machine running in AWS, Azure, or GCP, monitor the metrics they produce, and then automatically grow or shrink storage capacity as required, with no downtime and no need for DevOps involvement.

Lucidity explains that there is a common storage management cycle across AWS, Azure, and GCP, with four basic processes: plan capacity, monitor disks, set up alerts, and respond to them. Lucidity says it has built an autonomous orchestration layer to manage this cycle, reducing storage costs and DevOps involvement through auto-scaling.

Its GCP and Azure autoscaler documentation claims that “Shrinking a disk without downtime especially has not been possible so far. Lucidity makes all of it possible with no impact on performance.”

The autoscalers rely on a concept of providing logical disks, made up from several smaller disks, which can be added to the logical disk pool or removed from it. A ‘disk’ here means a drive – which can be SSD or HDD-based.

A Lucidity agent runs inside each VM and sends metrics to a Lucidity server which “analyzes the usage metrics for each VM instance to figure out the right storage configuration and keeps tuning it continuously to ensure that the workload always gets the resources it needs.”

Lucidity metrics data flow diagram.

This is an out-of-band process. The Lucidity server calculates when to scale capacity up or down. Lucidity GCP autoscaling documentation states: “The Lucidity Scaler is able to perform expand and shrink operations at a moment’s notice by using a group of disks attached to an VM instance as opposed to a single disk. The group of disks is composed together to form a single disk (or a logical disk).”

When a volume is shrunk, Lucidity’s server software works out which disk to remove, copies its data to other drives to rebalance the system, and then removes the empty disk. There is no performance impact on a running application. 
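A simplified sketch of the kind of decision loop such an autoscaler might run is shown below; the thresholds, disk sizes, and function names are assumptions for illustration, not Lucidity’s implementation.

# Illustrative autoscaler decision loop for a logical disk built from smaller member disks.
from dataclasses import dataclass

@dataclass
class MemberDisk:
    size_gb: int
    used_gb: int

SCALE_UP_AT = 0.80    # expand when the pool is over 80% full (assumed threshold)
SCALE_DOWN_AT = 0.40  # shrink when the pool is under 40% full (assumed threshold)

def rebalance_step(pool: list, new_disk_gb: int = 512) -> str:
    total = sum(d.size_gb for d in pool)
    used = sum(d.used_gb for d in pool)
    utilization = used / total

    if utilization > SCALE_UP_AT:
        pool.append(MemberDisk(size_gb=new_disk_gb, used_gb=0))
        return f"expanded: added {new_disk_gb} GB disk at {utilization:.0%} utilization"

    if utilization < SCALE_DOWN_AT and len(pool) > 1:
        victim = min(pool, key=lambda d: d.used_gb)   # emptiest member disk
        if used <= total - victim.size_gb:            # data still fits without it
            # In the real system the victim's data would be migrated to the remaining disks first.
            pool.remove(victim)
            return f"shrunk: removed {victim.size_gb} GB disk at {utilization:.0%} utilization"

    return "no change"

pool = [MemberDisk(512, 100), MemberDisk(512, 60)]
print(rebalance_step(pool))   # -> "shrunk: ..." since utilization is only ~16%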

Lucidity Autoscaler decision tree.

On Windows, this storage orchestration “is powered by Windows Storage Spaces behind the scenes.” On Linux, Lucidity uses the BTRFS filesystem “to group disks and create a logical disk for the operating system. The operating system would be presented with one single disk and mount point which the applications would continue to use without any impact.”

The AWS autoscaler handles cost optimization by navigating the many and varied AWS EBS storage instances. There are, for example, four SSD-backed EBS volume types and two HDD-backed ones. Lucidity states: “As a result of various performance benchmarks and uses of the various types of EBS volumes, the pricing of these can become challenging to grasp. This makes picking the right EBS type extremely crucial, and can cause 3x inflated costs than required, if not done correctly.” 

Lucidity’s table of AWS EBS volume types and costs.

There are also two types of idle volume and AWS charges on provisioned – not used – capacity. A Lucidity ROI calculator facility can provide more information about this.

Lucidity Community Edition – providing a centrally managed Autoscaler hosted by Lucidity – is available now. A Business Edition – enabling users to “access the Autoscaler via a Private Service Connect that ensures no data is transferred over the internet” – is coming soon. An Enterprise Edition wherein the Autoscaler “is hosted within the cloud account and network of the user for exclusive access” is also coming soon.

Cisco: Move on. Nothing to see here. We’re still selling Select Cohesity services

A Cisco end-of-life notice confirmed it is going to stop selling certain Cisco Select Cohesity Solutions, initially with no replacement option specified but now with alternatives.

Update: Cisco and Cohesity provided an explanatory statement, 6 August 2024.

Cohesity entered Cisco’s Solutions Plus program in March 2019, enabling its internal sales teams to sell Cohesity software products running on UCS servers. There was joint sales, marketing, service and support, and product roadmap alignment between Cohesity and Cisco, which is a strategic investor in Cohesity.

Cisco’s end-of-life notice, emitted at the weekend, announced that November 4, 2024 is the end-of-sale date, and identified Cohesity SiteContinuity Delivered as a Service, DataProtect and Flexible DataProtect Delivered as a Service, and Replica or Retention Delivered as a Service as the affected Cohesity offerings.

An event that may be related to this EOL notice is Cisco’s strategic investment in anti-ransomware startup Halcyon at the start of this month. Janey Hoe, VP at Cisco Investments, stated: “Ransomware has been a perpetual Cybersecurity pain-point for enterprises across the world.

“Halcyon’s deployment of bespoke artificial intelligence models to proactively identify ransomware strains, as well as instantaneous recovery of hijacked data, represents a robust AI enabled response to an endemic problem. We are proud to have Halcyon become a part of our $1B global AI-investment fund, announced earlier this year.”

Only last month, Cisco blogger Jeremy Foster wrote that since formulation of their agreement in 2019: “Cisco and Cohesity have produced five Cisco Validated Designs (CVDs) and multiple open-source contributions in GitHub repositories. These prebuilt integrations, reference architectures, and solution guides cover a number of use cases for simplifying data management and strengthening data security, including backup, archiving, disaster recovery, ransomware protection and recovery, file and object services, dev/test, and analytics.”

Cisco and Cohesity had more than 460 joint customers and their alliance expanded to include Cohesity Cloud Services, described by the pair as “a portfolio of fully managed, cloud-native, as-a-service (aaS) offerings that provide backup and recovery, threat detection and data classification, and cyber vaulting services.”

Foster wrote: “With more apps and workloads generating more data than ever before, there is a historic demand to derive value from this data. But that also means there are more points of complexity and vulnerability than ever before for companies who need to manage and secure all that data across multiple environments.

“As long as this is the reality, Cisco will continue to work closely with strategic partners like Cohesity to ensure that our customers can protect and manage their data better so that they can unlock real value from what is likely to be their most important asset.”

Further back, in August 2023, Cisco integrated Cohesity DataProtect data protection and DataHawk ransomware protection capabilities into its extended detection and response (XDR) offering.

We have asked both Cisco and Cohesity for an explanation of why this EOL notice is being issued and what it means for their relationship. A joint Cisco-Cohesity statement said: “Cisco and Cohesity’s strategic relationship remains unchanged. This EOL notice simply pertains to SKU’s being deprecated as part of a price list update and retirement of certain offers by Cohesity.  

“Four of the offers have been replaced with new SKU’s; the EOL notice is updated to reflect that. The remainder are offers being EOL’d by Cohesity have been rolled into new offers. There are currently 38 Cohesity “as-a-Service” SKUs on the Cisco price list. Additionally, the Cisco price list includes Cohesity Service and Support SKUs including Cohesity data replication and management software, such as Cohesity DataProtect, Replica, SmartFiles and Archive, as well as Cohesity’s catalog of professional services, training and data platform SKUs. 

“Our hundreds of joint customers can expect Cisco and Cohesity to continue their strong relationship, and we’re committed to working closely together to continue to offer innovative solutions. The small number of customers impacted by the EOL announcement will be provided, where appropriate, migration paths to newer offerings.” 

IEEE report: When it comes to SSD and HDD tech roadmaps, money talks

An IEEE report sees no mass takeover of the disk drive market by SSDs because HDD cost/bit is decreasing fast enough to prevent SSDs catching up.

The International Roadmap for Devices and Systems 2023 Update (IRDS) is discussed in Tom Coughlin’s August Digital Storage Technology Newsletter (subscription details here). It covers many mass storage technologies: SCM, NAND/SSDs, HDDs, tape, and emerging memories including DNA.

We’re focusing on SSDs and disk drives here and the IEEE report notes that HDDs have ceded the fast data access market to SSDs: “As flash-based SSD costs have fallen, HDDs have been displaced: first in consumer PCs, and increasingly in data centers, as SSDs have come to occupy the tier of frequently, and randomly, accessed data.”

Nowadays: “HDDs are used for bulk mass storage and SSDs are used for speed. Frequently requested and typically randomly accessed data resides in the SSDs, while less-frequently used and typically sequentially accessed data is kept in high-capacity HDDs.”

However: “SSDs and HDDs continue to coexist because HDDs (and tape) will continue to offer the lowest cost per bit for the foreseeable future.” 

As a consequence: “Today the HDD market continues its decline in unit volume primarily due to displacement by solid state drives.” The IEEE report sees total HDD unit ships increasing from 2023 to 2028 as the nearline mass storage market increases in size: 

The mobile, branded, consumer electronics and retail HDD markets continue to decline over this period, or so the forecast states.

SSD capacity is increasing as layer counts in 3D NAND rise and drive down the cost/bit, but the rate of increase is slowing because adding layers also adds cost. The alternative way of increasing capacity, adding bits to cells, has reached a possibly temporary wall with QLC (4 bits/cell), as the next level – 5 bits/cell, penta-level cell or PLC – has such short endurance as to make it infeasible, for now.

The IEEE report has a table from its IRDS (International Roadmap for Devices and Systems) community showing NAND chip density progression out to 2029: 

IRDS does not commit itself to saying QLC will displace TLC from 2025 onwards, instead using the term TLC+.  We are currently just entering the high 200s layer count area with Micron’s 276-layer technology with a 1Tb die. 

The table shows 300+ layers in the 2027 period and 500+ in the 2029 timeframe, leading to maximum die capacity doubling from 2Tb in 2025 to 4Tb in 2027 and again to 8Tb in 2029.

Despite this capacity increase, SSD cost/bit will not become equal to or go below HDD cost/bit, the report states: “The ultra-high capacity HDD market will remain unavailable to flash for the foreseeable future due to its lower cost per bit.”

No alternative semiconductor-based memory technologies, either volatile or non-volatile, will replace NAND or DRAM. Keeping the fate of Intel’s Optane in mind, this is because their performance and other characteristics do not justify a supplier bearing the cost of ramping production to the volume level at which their cost/bit, performance, endurance, and power consumption would persuade customers to switch from either DRAM or NAND.

Technologies like ReRAM, MRAM, etc. will remain niche embedded market alternatives because of this restriction.

HDDs will increase their capacity, and therefore lower their cost/bit as a table indicates: 

All the suppliers – Seagate, Toshiba and Western Digital – will move to HAMR technology, and use heated-dot patterned media in the 2037 period:

The report states: “Patterned media has discrete elements of the magnetic material distributed in an orderly fashion across the disk surface. These patterned dots become the magnetic bits when the head writes on them. Patterned media will be combined with HAMR recording to create 10 Tb/in2 magnetic recording areal density within the next 15 years.” 

One enduring HDD issue is access density: IO operations per second per unit of capacity. The report says: “As the storage capacity on a surface increases, the time it takes to access a given piece of data increases and the time to read or write all the data on the disk increases as well. Although the average performance of disk drives to access data is increasing by about 10 percent per year … the disk access density (Access density = I/Os per second per gigabyte) of disk drives is continually decreasing as disk drive capacity increases faster than disk drive performance.” 
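A quick worked example makes the access density point clear; the IOPS and capacity figures below are illustrative, not any vendor’s specification.

# Access density = random I/Os per second per gigabyte of capacity (illustrative figures).
def access_density(iops: float, capacity_tb: float) -> float:
    return iops / (capacity_tb * 1000)   # IOPS per GB

older_drive = access_density(iops=150, capacity_tb=4)    # ~0.0375 IOPS/GB
newer_drive = access_density(iops=180, capacity_tb=24)   # ~0.0075 IOPS/GB

print(f"4TB drive:  {older_drive:.4f} IOPS/GB")
print(f"24TB drive: {newer_drive:.4f} IOPS/GB")
# Even with a modest IOPS gain, the 6x capacity jump cuts access density by roughly 5x,
# which is the declining trend the report describes.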

Despite advances such as multi-actuator technology, access density will continue to decline as HDD capacity increases. NAND caching, as with WD’s OptiNAND technology, can help mitigate this.

Altogether the IEEE report makes for fascinating reading and there is much more information in its 117 pages than we can cover here. It is a terrific primer and introduction into current and future mass data storage technologies.

IBM Storage Scale used in Blue Vela AI supercomputer

IBM’s Vela AI supercomputer, described here, was not powerful enough for IBM Research’s AI training needs. In 2023, it began the Blue Vela development to supply the major expansion of GPU compute capacity required to support AI model training needs. To date, Blue Vela is actively being used to run Granite model training jobs.

Update. Performance comparison between Vela and Blue Vela added. 12 Aug 2024.

IBM Blue Vela graphic

Blue Vela is based around Nvidia’s SuperPod concept and uses IBM Storage Scale appliances as we shall see.

Vela is hosted on the IBM Cloud but the Blue Vela cluster is hosted in an IBM Research on-premises datacenter. This means that IBM Research has ownership and responsibility for all system components, from the infrastructure layer to the software stack.

Blue Vela system layers

As the number of GPUs needed to train larger and more connected models increases, communication latency becomes a critical bottleneck. The design of Blue Vela therefore originated with the network; it is built around four distinct purpose-built networks:

  • A compute InfiniBand fabric, which facilitates GPU-to-GPU communication, as shown below
  • A storage InfiniBand fabric, which provides access to the storage subsystem, as shown below
  • An in-band Ethernet host network is used for inter-node communication outside the compute fabric
  • An out-of-band network, also called the management network, which provides access to the management interfaces on the servers and switches

Blue Vela is based on Nvidia’s SuperPod reference architecture. It uses 128-node Compute Pods, each containing 4 x Scalable Units of 32 nodes. The nodes contain Nvidia H100 GPUs. Nvidia’s Unified Fabric Manager (UFM) is used to manage the InfiniBand networks comprising the compute and storage fabrics. UFM can help recognize and resolve single-GPU throttling or non-availability; it is not available for Ethernet networks.

A compute node is based on Dell’s PowerEdge XE9680 server and consists of:

  • Dual 48-core 4th Gen Intel Xeon Scalable Processors
  • 8 Nvidia H100 GPUs with 80GB High Bandwidth Memory (HBM)
  • 2TB of RAM
  • 10 Nvidia ConnectX-7 NDR 400 Gbps InfiniBand Host Channel Adapters (HCA)
    – 8 dedicated to compute fabric
    – 2 dedicated to storage fabric
  • 8 x 3.4TB Enterprise NVMe U.2 Gen4 SSDs
  • Dual 25G Ethernet Host links
  • 1G Management Ethernet Port

IBM “modified the standard storage fabric configuration to integrate IBM’s new Storage Scale System (SSS) 6000, which we were the first to deploy.”

These SSS appliances are integrated scale-up/scale-out – to 1,000 appliances – storage systems with Storage Scale installed. They support automatic, transparent data caching to accelerate queries.

Each SSS 6000 appliance can deliver upwards of 310 GBps throughput for reads and 155 GBps for writes across their InfiniBand and PCIe Gen 5 interconnects. Blue Vela started with two fully populated SSS 6000 chassis, each with 48 x 30 TB U.2 G4 NVMe drives, which provides almost 3 PB of raw storage. Each SSS appliance can accommodate up to seven additional external JBOD enclosures, each with up to 22 TB, to expand capacity. Also, Blue Vela’s fabric allows for up to 32 x SSS 6000 appliances in total.

IBM says the maximum effective capacity is up to 5.4 PB based on FCM drives and 3:1 compression, which will depend upon the characteristics of the data stored in the FCM.
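The raw capacity arithmetic behind those figures, as a quick sketch (drive counts and sizes are taken from the configuration above; the 32-appliance figure excludes expansion shelves):

# Raw capacity arithmetic for the Storage Scale System 6000 configuration described above.
DRIVES_PER_CHASSIS = 48
DRIVE_TB = 30

def raw_capacity_pb(chassis: int) -> float:
    return chassis * DRIVES_PER_CHASSIS * DRIVE_TB / 1000

print(f"Initial 2 chassis: {raw_capacity_pb(2):.2f} PB raw")    # ~2.88 PB, i.e. "almost 3 PB"
print(f"Fabric maximum of 32 appliances: {raw_capacity_pb(32):.1f} PB raw")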

Blue Vela has separate management nodes, using Dell PowerEdge R760XS servers, which run services such as authentication and authorization, workload scheduling, observability, and security.

On the performance front, the paper authors say: “From the onset, the infrastructure has demonstrated good potential in throughput and has already shown a 5 percent higher performance out-of-the-box compared to other environments of the same configuration.” 

“The current performance of the cluster shows high throughputs (90-321B per day depending on the training setting and the model being trained).” 

Blue Vela performance

Vela vs Blue Vela

A comparison between the Vela and Blue Vela systems would use this formula: # Training Days = 8 * #tokens * #parameters / (#gpus * FLOPS per GPU). On this basis:

  • IBM Vela – 1100 x A100 GPUs for training and theoretical performance = 300 teraFLOPs/GPU (bf16)
  • IBM Blue Vela – 5000 x H100 GPUs for training and theoretical performance = 1,000 teraFLOPs/GPU (bf16)

On a per-GPU basis, this makes Blue Vela more than three times faster than Vela; factoring in the larger GPU count, the formula implies total training times around 15 times shorter for the same model and token count.
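Plugging illustrative numbers into that formula shows the gap; the token and parameter counts below are hypothetical, and “FLOPS per GPU” is interpreted as sustained FLOP/s and converted into days.

# Training-days estimate using the formula above:
#   training_time = 8 * tokens * parameters / (gpus * flops_per_gpu), converted to days.
SECONDS_PER_DAY = 24 * 3600

def training_days(tokens: float, parameters: float, gpus: int, flops_per_gpu: float) -> float:
    total_flops = 8 * tokens * parameters          # the constant 8 comes from the formula quoted above
    return total_flops / (gpus * flops_per_gpu) / SECONDS_PER_DAY

# Hypothetical 20B-parameter model trained on 1T tokens, at the theoretical bf16 rates quoted:
vela      = training_days(tokens=1e12, parameters=20e9, gpus=1100, flops_per_gpu=300e12)
blue_vela = training_days(tokens=1e12, parameters=20e9, gpus=5000, flops_per_gpu=1000e12)

print(f"Vela:      {vela:.1f} days")    # ~5.6 days at theoretical peak
print(f"Blue Vela: {blue_vela:.1f} days")  # ~0.4 days at theoretical peak, roughly 15x quicker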

There is much more detailed information about the Blue Vela datacenter design, management features, and software stack in the IBM Research paper.