My notes on the VAST storage platform¶
Estimated time to read: 24 minutes
- Originally Written: July, 2024
Overview¶
I saw an overview video of VAST at one of the Tech Field Days and wanted to find out more. These are some of my notes from reading documentation and watching presentations from the VAST team. Leave a comment if you find an error.
The primary resources I used to create these notes are listed in the Resources section at the end; any other sources I found along the way are referenced inline.
Are storage requirements for AI workloads different from traditional workload requirements?¶
- Based on the references below, AI/ML/Deep learning workloads require:
- High performance storage
- A lot of it (hundreds of terabytes to petabytes, even exabytes)
- Affordable flash, i.e. financially feasible: not petabytes of storage at traditional all-flash pricing
- Shared/parallel access optimized for random IO
- ML/Deep learning requirements:
- Reference: https://www.vastdata.com/resources/forms/analyst-papers/demystifying-ai-io-performance
- I have highlighted a couple of points
- Deep learning (DL) IO patterns are characterized by almost 100% read workloads and are dominated by random reads, in often small to medium IO sizes
- These requirements are well suited for SSD or flash storage but not well suited for HDD or disk technology
- Higher performing storage can keep GPUs (and CPUs) busier, thus helping IT organizations justify the DL GPU hardware expense.
- In addition, DL model training traditionally consumes a lot of data.
- Image recognition models, recommendation engines and the like train on massive datasets
- Large language models that train over protracted periods, require the fastest performing storage
- Moreover, the nature of Neural Network (NN) training requires repeated passes over the entire dataset to calibrate models to perform effective inferencing.
- As such, having the training data dispersed over different tiers of storage may not perform well as this requires data movers plus orchestration and adds complexity to avoid resource conflicts.
- Finally, to speed up AI DL model training and inferencing, enterprises often deploy multiple GPUs, and in many cases, multiple GPU servers working cooperatively to process data or models in parallel, which requires high-performing shared storage optimized for random rather than sequential IO.
- Inferencing considerations that drive IO are simple in comparison with training
- The focus is on the server query transaction or batch file processing rates required
- For data and model parallelism, the number of GPUs in operation will determine bandwidth needs
- These characteristics also drive logging IO bandwidth
- In general, inferencing consumes data at ~10X the training rate for the same models
- Thus, the higher the transaction or file inferencing rate and the higher the number of GPUs in use, the higher transaction read, and logging write bandwidth will be needed
- ML Data Sizes (a quick rule-of-thumb check is sketched after this list)
- Reference: https://developers.google.com/machine-learning/data-prep/construct/collect/data-size-quality
- As a rough rule of thumb, your model should train on at least an order of magnitude more examples than trainable parameters. Simple models on large data sets generally beat fancy models on small data sets.
- Reference: https://www.vastdata.com/blog/a-checkpoint-on-checkpoints-in-llms
- The datasets could be on the order of petabytes in size, and the models themselves are also hundreds of gigabytes in size.
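As a quick sanity check of the "order of magnitude more examples than trainable parameters" rule of thumb quoted above, here is a tiny helper that compares a parameter count against a dataset size. The parameter and example counts are made-up illustrations, not figures from any of the references.

```python
# Quick sanity check of the "10x more examples than trainable parameters"
# rule of thumb quoted above. The counts below are invented for illustration.

def meets_rule_of_thumb(trainable_params: int, num_examples: int, factor: int = 10) -> bool:
    """True if the dataset has at least `factor` times more examples than parameters."""
    return num_examples >= factor * trainable_params

print(meets_rule_of_thumb(1_000_000, 50_000_000))  # True: 1M-parameter model, 50M examples
print(meets_rule_of_thumb(100_000_000, 200_000))   # False: 100M parameters, only 200K examples
```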
VAST 30 second overview¶
- Focused on storage for artificial intelligence (AI) and deep learning infrastructure
- VAST DataStore
- High-performance, scale-out storage solution for unstructured data, designed to eliminate traditional tiered storage (the classic storage-tiering pyramid)
- File/Object/Volume/Table
- Optimized for cost/GB in Petabyte to Exabyte range
- Delivers the sub-millisecond to single-digit millisecond latency seen in all-flash systems
- For microsecond latency you would still use in-memory solutions
- Stated at 1:33:30 here: https://www.youtube.com/watch?v=IcwB7DnObdM&t=1335s
- Limited to a single DC or geographical area
- VAST DataSpace
- Joins VAST clusters into a constellation that can present a global namespace across multiple locations worldwide
- The administrator of that DataSpace can present folders to users and hosts in multiple locations simultaneously
- VAST DataBase
- Database management system (DBMS) that includes the constructs of a traditional database, data warehouse, and data lake
- VAST DataEngine
- Global function execution engine
- Gemini
- Software licensing model which enables customers to separate the hardware purchasing cycle from the software licensing cost
References:
- https://speakerdeck.com/itpresstour/vast-data-it-press-tour-june-2020?slide=8
- https://www.youtube.com/watch?v=1p-AwYBfKC4
VAST Data Platform¶
- With legacy architectures the only way to efficiently store data was with multiple tiers of storage, each providing a unique price/performance proposition
The platform¶
- Scales to the exabytes of data needed to train the most advanced AI models
- Still delivers the sub-millisecond to single-digit millisecond latency and five-nines reliability users expect from an all-flash array
- Supports parallel I/O from thousands of clients and delivers the strict atomic consistency needed to support transactional applications
- Achieves this at a price customers can afford when buying petabytes at a time
- Frontend consists of many stateless controllers which add performance linearly
- Backend consists of JBOF (Just a Bunch Of Flash) enclosures which communicate via NVMe over fabrics to the controllers
- Starting point is 1 PB usable (with data reduction)
- Systems have 600 TB real usable capacity before data reduction
- Data reduction is competitive with HDD based architectures from a cost perspective
- File/Object/Volume/Table storage
How to achieve All-flash performance at HDD cost¶
- Use lowest cost flash
- Consumer grade QLC
- Don't waste any space with metadata
- Metadata located on SCM not on flash storage
- Data reduction
- Deduplication, compression, and similarity reduction
- Global data protection
- Very wide erasure-coding stripes, which reduce wasted overhead
- VAST designed a new class of erasure code borrowing from a new algorithmic concept called locally decodable codes
Architecture¶
- The VAST Data platform is built on the idea of disaggregated, shared-everything (DASE) architecture and includes the following components:
- Disaggregated, Shared-Everything (DASE)
- Separates the storage media from the CPUs that manage that media and provide storage services
- The disaggregated storage, including all the system metadata, is shared by all the VAST Servers in the cluster
- Allows users to scale the capacity of a VAST Cluster independently from the compute resources of the cluster
- Add JBOF (Just a Bunch Of Flash) enclosures for capacity and servers for performance
- NVMe-over-Fabrics (NVMe-oF)
- Provides high-speed, low-latency access to storage over a network, using transports that have been around for years, e.g. Fibre Channel, Ethernet and InfiniBand
- Dr J Metz has a great post on NVMe over Fabrics for Absolute Beginners if you want to learn more.
- Storage Class Memory (SCM) SSDs
- Used as high-performance write buffer and as a global metadata store
- Provide higher performance and endurance than commodity NAND flash
- Fills gap between flash and DRAM in performance
- Originally used Intel Optane (based on 3D Xpoint technology)
- Are now using SCM solutions from other vendors since Intel discontinued its Optane SSDs
- QLC (quad-level cell NAND) Flash Storage
- This is cheaper, consumer-grade flash, which allows higher-density storage at a lower cost per gigabyte than enterprise-grade flash
- VAST can use consumer-grade flash because redundancy and error handling are performed in its software and in other components such as the Storage Class Memory
More Technical Details¶
- These are a few more notes taken from a mix of resources including the whitepaper previously referenced
Shared-nothing vs shared-everything¶
- Shared-nothing
- Legacy scale-out architectures
- Aggregate computing power and capacity into either shared-nothing nodes with a single “controller” per node or shared-media nodes with a pair of controllers and their drives
- Users are forced to buy computing power and capacity together across a limited range of node models while balancing the cost/performance.
- Shared-nothing users face a difficult challenge when their vendors release a new generation of nodes
- Because all the nodes in a storage pool have to be identical, a customer with a 16-node cluster of 3-year-old nodes who needs to expand capacity by 50% faces two choices:
- Buy 8 additional nodes and extended support for the current 16 nodes
- Extended support is only available for a total of 5 or 6 years, so all 24 nodes will need replacing in 2-3 years
- Buy 5 new model nodes with twice the density and create a new pool
- Performance will depend on which pool the data is in
- Multiple pools add complexity
- Lower efficiency from small cluster
- Shared-everything
- Disaggregates computing power into CNodes that are independent of the DBoxes that provide capacity
- See below for CNode vs DNodes/DBoxes
- Every CNode has direct access to every SSD in the cluster
- Examples:
- Customers training their AI models (a workload that accesses a large number of small files) or processing complex queries using the VAST DataBase will use as many as a dozen CNodes per DBox
- Customers using their VAST clusters to store backup repositories, archives, and other less active datasets typically run their clusters with less than one CNode per DBox
- You can expand a VAST cluster by adding more DBoxes without the cost of adding more compute power
- CNodes in a cluster are treated as a pool of computing power, scheduling tasks across the CNodes the way an operating system schedules threads across CPU cores
- When the system assigns background tasks, like reconstructing data after an SSD failure or migrating data from the SCM write buffer to hyperscale flash, tasks are assigned to the servers with the lowest utilization
- Faster CNodes are capable of performing more work and will therefore be assigned more work to do (a toy load-aware assignment is sketched after this list)
- SSDs in the cluster are also managed as pools of available SCM and hyperscale flash capacity
- When a CNode needs to allocate an SCM write buffer, it selects the two SCM SSDs that have the most write buffer available and are as far apart as possible (in different enclosures if the system has more than one)
- Similarly, when the system allocates an erasure-code stripe on the hyperscale flash SSDs it selects the SSDs in the cluster with the most free capacity and endurance for each stripe.
- Since SSDs are chosen by the amount of space and endurance they have remaining, any new SSDs or larger SSDs in a cluster will be selected for more erasure-code stripes than the older, or smaller SSDs until the wear and capacity utilization equalize across the SSDs in the cluster
- The result is that VAST customers are never faced with choosing between buying a little more of the old technology they’re already using, knowing these nodes will have a short working life, or replacing the whole cluster even though their current nodes have a few more years of life
- When a VAST customer’s cluster requires more compute power, they don’t have to upgrade all their nodes to the new model with a faster processor; instead, they can just add a few CNodes.
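The scheduling behaviour described above (give background work to the least-utilized server, so faster CNodes naturally end up doing more) can be illustrated with a simple greedy picker. This is only a toy sketch of the idea, not VAST's scheduler; the node names, speeds and work units are invented.

```python
from dataclasses import dataclass

@dataclass
class CNode:
    name: str
    relative_speed: float     # newer/faster CPUs > 1.0
    queued_work: float = 0.0  # arbitrary units of outstanding work

    @property
    def utilization(self) -> float:
        # Work queued relative to how fast this node can get through it.
        return self.queued_work / self.relative_speed

def assign_background_task(cnodes: list[CNode], task_cost: float) -> CNode:
    """Greedy pick: give the task to the least-utilized CNode.
    Faster nodes look less utilized for the same queue, so they
    naturally end up being assigned more of the work."""
    target = min(cnodes, key=lambda n: n.utilization)
    target.queued_work += task_cost
    return target

cluster = [CNode("cnode-old", 1.0), CNode("cnode-new", 2.0)]
for _ in range(6):
    assign_background_task(cluster, task_cost=1.0)
print({n.name: n.queued_work for n in cluster})
# {'cnode-old': 2.0, 'cnode-new': 4.0} -- the faster node takes on twice the work
```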
Servers/CNodes vs HA Enclosures/DBoxes¶
- VAST Servers
- Also known as CNodes
- VAST Data Platform runs as a group of stateless containers across a cluster of one or more x86 servers
- Containers make it simple to deploy and scale the VAST Data Platform as software-defined microservices
- The upgrade process for VASTOS instantiates a new VASTOS container without restarting the underlying OS, which reduces the time a VAST server is offline to a few seconds
- Stateless
- Any user request or background task that changes the state of the system (e.g. garbage collection or rebuilding after a device failure) is written to multiple SSDs in the cluster’s DBoxes before it is acknowledged or finally committed
- Do NOT cache writes or metadata updates in DRAM
- The ultra-low latency of the direct NVMe-oF connection between CNodes and the SSDs in DBoxes means the CNodes don't need to maintain read or metadata caches in DRAM
- Because CNodes don’t cache, they avoid all the complexity and east-west, node-to-node traffic required to keep a cache coherent across the cluster.
- All CNodes in a cluster mount all SCM and hyperscale flash SSDs in the cluster via NVMe-oF at boot time
- Every CNode can directly access all the data, and metadata, across the cluster.
- In the DASE architecture everything (storage device, metadata structure, the state of every transaction within the system) is shared across all the CNode servers in the cluster
- Nodes don’t own storage devices or the metadata for a volume
- When a CNode needs to read from a file:
- It accesses that file’s metadata from SCM SSDs to find the location of the data on hyperscale SSDs
- Then reads the data directly from the hyperscale SSDs
- There’s no need for that CNode to ask other nodes for access to the data it needs
- Each CNode can process simple storage requests, such as reads or writes, to completion without having to consult the other CNodes in the cluster (a toy sketch of this read path appears at the end of this section)
- More complex requests will be parallelized across multiple CNodes by the VAST DataEngine
- HA Enclosures
- Also known as DBoxes or data boxes
- NVMe-oF storage shelves
- Connect SCM and hyperscale flash SSDs to an ultra-low latency NVMe fabric using Ethernet or InfiniBand
- Two DNodes per box and all enclosures are highly redundant
- Do not execute any of the storage logic of the cluster
- Their CPUs can never become a bottleneck as new capability is added to the VAST Data Platform
- Don’t aggregate SSDs to LUNs or provide data services (unlike other storage systems)
- Two DNodes within a DBox run active-active
- Each DNode presents half the enclosure’s SSDs to the NVMe fabric
- When a DNode goes offline, the surviving fabric module’s PCI switch remaps the PCIe lanes from the failed I/O module to the surviving DNode while retaining atomic write consistency
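Here is the CNode read path from the bullets above as a toy sketch: look up the element's metadata (held on SCM) to find where its chunks live, then read each chunk directly from the flash SSD that holds it, with no CNode-to-CNode coordination. The data structures and names are assumptions for illustration, not VAST's implementation.

```python
# Conceptual sketch of the CNode read path: metadata lookup on SCM, then
# direct reads from hyperscale flash over NVMe-oF. Purely illustrative.

from typing import NamedTuple

class ChunkLocation(NamedTuple):
    ssd_id: str   # which hyperscale flash SSD holds the chunk
    offset: int
    length: int

# Stand-in for the metadata held on SCM: element path -> chunk locations.
scm_metadata = {
    "/datasets/train/shard-0001": [
        ChunkLocation("flash-ssd-17", offset=0, length=1 << 20),
        ChunkLocation("flash-ssd-42", offset=1 << 20, length=1 << 20),
    ]
}

def read_element(path: str, read_chunk) -> bytes:
    """Metadata from SCM tells the CNode where the chunks live, then each
    chunk is read directly from the SSD that holds it. No other CNode is
    consulted at any point."""
    chunks = scm_metadata[path]                      # metadata lookup (SCM)
    return b"".join(read_chunk(c) for c in chunks)   # direct reads (flash)

def fake_read_chunk(loc: ChunkLocation) -> bytes:
    # Stand-in for an NVMe-oF read from the SSD at `loc`.
    return b"\x00" * loc.length

data = read_element("/datasets/train/shard-0001", fake_read_chunk)
print(len(data))  # 2097152 bytes, i.e. the two 1 MiB chunks
```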
The VAST Element Store¶
- Defines how VAST’s data platform stores files and objects and the metadata that describes them
- Neither a traditional file system nor an object store
- Abstracts both, creating a construct that serves both the hierarchical presentation of a file system and the flat namespace of an object store
- This abstraction allows users to access the same element as a file over NFS with one application and as an S3 object from more modern “cloud-native” applications (a small example of this dual access is sketched after this list)
- Uses metadata to assemble the physical layer’s data chunks into the Data Elements like files, objects, tables and volumes
- Users and applications can interact with these elements
- Organizes them into a single global namespace across a VAST cluster (or worldwide with VAST DataSpace)
- Manages three types of data elements:
- File/object
- Stored by the system as strings of data chunks
- Accessed through the NFS, SMB, and S3 protocols
- Volume
- Stored as strings of data chunks.
- Table
- Holds structured, tabular data in a columnar format
- Accessed via SQL or as virtual Parquet files through NFS, SMB, and/or S3
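To make the "one element, two presentations" idea concrete, here is what dual access could look like from a client: the same element read once as a file over an NFS mount and once as an object via the cluster's S3-compatible endpoint. The mount point, bucket, endpoint and implied credentials are invented placeholders; the boto3 calls are standard S3 client usage, not a VAST-specific API.

```python
# Same element, two access paths. Paths/bucket/endpoint are hypothetical.
import boto3

# 1. POSIX/NFS view: the element appears as an ordinary file on a mount.
with open("/mnt/vast/projects/reports/q2.csv", "rb") as f:
    file_bytes = f.read()

# 2. S3 view: the same element addressed as bucket/key against the cluster's
#    S3-compatible endpoint.
s3 = boto3.client("s3", endpoint_url="https://vast-cluster.example.com")
obj = s3.get_object(Bucket="projects", Key="reports/q2.csv")
object_bytes = obj["Body"].read()

assert file_bytes == object_bytes  # both paths reach the same underlying element
```

Whether a given path is actually exported over both NFS and S3 depends on how the share is configured on the cluster; the snippet only shows the client-side idea.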
Asymmetric Expansion¶
- CPU and Storage sizes/performance do not need to be consistent across the cluster
- i.e. you can expand the cluster with higher performance CPUs and larger drives when they're available in the future
- Reference: https://www.vastdata.com/blog/redefining-simplicity-at-scale-welcome-to-the-asymmetric-era
- In a legacy storage architecture:
- Compute needs to be the same at least within a storage pool or storage tier because systems need to provide a consistent performance experience in cases of failover as well as for parallel I/O operations
- Disks need to be the same within a stripe group because performance needs to be consistent across drives and failure recovery mechanisms need assurances that the remaining disks have sufficient capacity to rebuild what has failed within some rigid stripe set
- How VAST differs
- Storage tasks are allocated globally according to system load which makes it possible to allocate tasks such as data reduction or data protection to servers that experience the least I/O load
- VAST’s global load scheduler is resource aware, assigning tasks to VAST Servers in the cluster based on their availability. As you add more powerful VAST Servers to a cluster, those servers automatically take on a larger portion of the work in proportion to the fraction of total CPU horsepower they provide in the cluster. Your new and old servers work together, each pulling its weight, until you decide they should go.
- Disks can be different sizes and when one fails it does not need to be replaced with a disk of equal size
- As new drives come into the cluster – the system simply views the new resources as more capacity that can be used to create data stripes
- When larger SSDs are added to a cluster, those SSDs will be selected for more data stripes than the smaller drives, so the system uses the full capacity of each drive (a simple weighted-selection sketch follows this list)
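The drive-selection behaviour above (new or larger SSDs get picked for more stripes until wear and capacity even out) can be sketched as a weighted pick over free capacity and remaining endurance. This is an illustration of the idea only; the scoring function, drive names and numbers are my assumptions, not VAST's placement logic.

```python
import heapq
from dataclasses import dataclass

@dataclass
class FlashSSD:
    name: str
    free_tb: float         # free capacity remaining
    endurance_left: float  # fraction of rated write cycles remaining (0..1)

def pick_stripe_members(ssds: list[FlashSSD], width: int) -> list[FlashSSD]:
    """Pick the `width` SSDs with the most free capacity and endurance.
    Larger/newer drives score higher, so they land in more stripes until
    capacity use and wear even out across the cluster."""
    return heapq.nlargest(width, ssds, key=lambda d: d.free_tb * d.endurance_left)

ssds = [FlashSSD(f"old-{i}", free_tb=5.0, endurance_left=0.6) for i in range(10)]
ssds += [FlashSSD(f"new-{i}", free_tb=30.0, endurance_left=1.0) for i in range(4)]
print([d.name for d in pick_stripe_members(ssds, width=6)])
# the four new drives are picked first, plus two of the old ones
```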
How can consumer grade QLC Flash be used while still offering the performance and durability required?¶
- System designed for very low-endurance media
- Far fewer writes are required
- Have overcome consumer grade flash challenges such as poor write speed and lack of power-loss protection
- Metadata stored elsewhere (on SCM) so no wasted space
- New data reduction algorithms
- Dedupe, compression, similarity reduction
- New data protection algorithms
- Erasure coding using very wide stripes and locally decodable codes
- Fewer writes
- VAST doesn't need more than 50 write cycles
- Write cycles (endurance) can be amortized over the life of the data
- If you have 50 write cycles and you're keeping something forever, that cell has 49 remaining write cycles (after you've written the data you're keeping forever)
- Those 49 remaining write cycles can be used for more dynamic applications
- You no longer need to move data around
- The reason you moved it was to free up expensive high performance disk space
- How do they deal with Write Amplification?
- Write amplification is
- an “undesirable phenomenon associated with flash memory and solid-state drives (SSDs)".
- "Due to the way flash works, much larger portions of flash must be erased and rewritten than actually required by the amount of new data. This multiplying effect increases the number of writes required over the life of the SSD, which shortens the time it can operate reliably
- https://en.wikipedia.org/wiki/Write_amplification
- Don't let the drive deal with the writing
- VAST writes to the drive in full erase blocks
- 3D Xpoint/SCM buffering ensures full-stripe writes
- VAST has global historical visibility of the file and directory structure which they keep in the metadata (on the SCM)
- This allows them, for example, to put all data that is expected to be overwritten a week from now into the same erase block
- This is called predictive placement
- Consumer grade flash challenges they've overcome with SCM (Optane/3D Xpoint)
- Poor write speed
- Avoided by first writing to SCM and then migrating to flash in the background
- The application gets the SCM low latencies as it's written there first
- No power protection
- Nothing is removed from SCM before confirmation that it's written to flash
- Global Data protection
- Tradeoff between resiliency and capacity efficiency
- How is this tradeoff broken?
- Very wide stripes
- Instead of traditional RAID with 1 or 2 parity strips, they use N+4
- “It’s obvious that an erasure code that writes 8 data strips and 2 parity strips per stripe, abbreviated 8D+2P, has more overhead (20%) than a 14D+2P stripe would (12.5%).” (a quick overhead calculation is sketched at the end of this section)
- Because the DASE architecture uses highly available storage enclosures (DBoxes), VAST systems can stripe across all the hyperscale SSDs in a cluster safe in the knowledge that no single failure will take more than one SSD offline.
- Because there are many more SSDs than enclosures, this means DASE clusters can write erasure code stripes much wider, up to 150 total strips per stripe
- This achieves much lower overhead than a shared-nothing architecture where a node is a unit of failure. For example:
- A small collection of VAST JBOFs can provide redundant access to a collection of drives striped across the enclosure in a 146+4 SSD write stripe
- If a shared-nothing cluster was to implement this same stripe width, a system would require nearly 40 storage servers to achieve this same level of low overhead
- Wide write stripes are the means to maximize storage efficiency, but also increase the probability that multiple devices within a stripe could fail
- While flash SSDs are very reliable devices, especially compared to HDDs, it’s simply more statistically likely that two of the 148 SSDs in a 146+2 stripe will fail than two of the 12 SSDs in a stripe coded 10+2
- To support wide stripes VAST systems always write erasure-code stripes with four parity strips
- This allows a VAST system to operate with as many as four simultaneous SSD failures.
- The other aspect of ensuring high resilience in a wide-write-striping cluster is to minimize the time to recovery
- Recovery in traditional systems required the reading of all surviving data strips in a protection stripe, and one or more parity strips, to regenerate data from a failed drive
- This wouldn't be practical with very wide stripes
- VAST Data designed a new class of erasure code borrowing from a new algorithmic concept called locally decodable codes
- The advantage is that they can reconstruct, or decode, data from a fraction of the surviving data strips within a protection stripe
- That fraction is proportionate to the number of parity strips in a stripe
- A VAST cluster reconstructing a 146D+4P stripe only has to read 38 data strips, just one quarter of the survivors
- Data Reduction
- VAST DataStore uses compression and deduplication and also adds a new technique called similarity reduction
- When data is written to the VAST DataStore, it is first written to the storage-class memory (SCM) write buffer and acknowledged
- The VAST DataStore then performs data reduction as part of the process of migrating data from the SCM write buffer to the hyperscale flash capacity pool
- Since writes are acknowledged to applications when written to the SCM write buffer, VAST systems have plenty of time to perform more meticulous data reduction
- As long as the system drains the write buffer as fast as new data is written to the system, the time it takes to reduce the data has no impact on system performance or write latency
- As with conventional deduplication, when multiple chunks of identical data are written, the system stores a single copy of the data and uses metadata pointers for the rest
- Similarity reduction
- This reduces the amount of storage needed to store data chunks that are similar, but not identical, to existing chunks
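As a quick check of the overhead figures quoted in the Global Data protection notes above, here is the parity-overhead arithmetic for a few stripe geometries, including the 146D+4P stripe. Only the 8D+2P and 14D+2P figures are quoted from the whitepaper; the other rows are added here for comparison.

```python
# Parity overhead = parity strips as a share of the total stripe.

def parity_overhead(data_strips: int, parity_strips: int) -> float:
    return parity_strips / (data_strips + parity_strips)

for d, p in [(8, 2), (10, 2), (14, 2), (146, 4)]:
    print(f"{d}D+{p}P -> {parity_overhead(d, p):.1%} overhead")
# 8D+2P -> 20.0% overhead
# 10D+2P -> 16.7% overhead
# 14D+2P -> 12.5% overhead
# 146D+4P -> 2.7% overhead

# With locally decodable codes, the notes above say a rebuild of a 146D+4P
# stripe only reads ~38 data strips (about a quarter of the survivors),
# rather than essentially all of them as a traditional rebuild would.
print(146 / 4)  # 36.5, roughly the "38 strips" figure quoted above
```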
Erase blocks and Write cycles¶
- I thought this was a good explanation (a tiny simulation of the walkthrough is sketched after this list)
- Reference: https://www.reddit.com/r/explainlikeimfive/comments/9pjd4n/comment/e824i0k/?utm_source=share&utm_medium=web3x&utm_name=web3xcss&utm_term=1&utm_content=share_button
- SSDs are organized in erase blocks, which are the smallest unit that can be erased, which are subdivided in pages, which is the smallest unit that can be written.
- There's a limit to how many times an erase block can be erased. It doesn't correspond directly to either time or gigabytes.
- Any time the disk needs to overwrite anything at all, it needs to clear an entire erase block, and that causes wear. A big problem there is write amplification.
- Suppose a device made of 1 erase block of 4 pages. It starts all erased.
- Write to page 1: no wear.
- Write to page 2: no wear.
- Write to page 3: no wear.
- Write to page 4: no wear.
- Write to page 1: must erase, so the entire block is cleared, wear. Pages 2-4 get rewritten with previous data.
- Write to page 2: must erase, so the entire block is cleared, wear. Pages 1,3,4 get rewritten with previous data
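The 4-page erase-block walkthrough above is easy to turn into a tiny simulation that counts erases and physical page writes. This is purely illustrative of the write-amplification effect being described, not how any real SSD firmware works.

```python
# Toy model of one 4-page erase block: overwriting any single page forces the
# whole block to be erased and the surviving pages to be rewritten.

class EraseBlock:
    def __init__(self, pages: int = 4):
        self.pages = [None] * pages
        self.erase_count = 0               # wear
        self.pages_physically_written = 0  # device-level page writes

    def write(self, page: int, data: str) -> None:
        if self.pages[page] is not None:           # page already holds data
            survivors = list(self.pages)
            self.pages = [None] * len(self.pages)  # erase the whole block
            self.erase_count += 1
            for i, old in enumerate(survivors):    # rewrite surviving pages
                if old is not None and i != page:
                    self.pages[i] = old
                    self.pages_physically_written += 1
        self.pages[page] = data
        self.pages_physically_written += 1

blk = EraseBlock()
for p in range(4):
    blk.write(p, f"v1-page{p}")   # four fresh writes: no erases yet
blk.write(0, "v2-page0")          # overwrite page 1: one erase, pages 2-4 rewritten
print(blk.erase_count, blk.pages_physically_written)
# 1 8 -> 5 logical writes from the host caused 8 physical page writes
```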
Resources¶
- https://www.vastdata.com/whitepaper
- https://techfieldday.com/companies/vast-data/
- https://www.vastdata.com/blog/redefining-simplicity-at-scale-welcome-to-the-asymmetric-era
- https://www.youtube.com/watch?v=1p-AwYBfKC4