This series of posts assumes the reader has some basic knowledge of vSAN and storage technologies but not deep expertise. The intention is not to review vSAN features but to provide guidelines for selecting appropriate hardware for a vSAN solution, for those who are not comfortable doing so. Design errors in storage can be very costly, so the goal here is to get it right the first time.
The Basics
Hardware Compatibility
VMware Compatibility Guide for vSAN
If the hardware selected for the vSAN solution is not in the vSAN VCG, then there should be concern as to whether the solution is production grade. For vSAN to be supported properly by VMware, the components must be validated as compatible. There should be no other alternative when designing a production ready vSAN cluster.
Work with your hardware vendor to determine the proper certified components.
A vSAN Ready Node is a preconfigured vSAN solution that has had its components validated for vSAN. HPE, Dell, SuperMicro, Fujitsu, Lenovo and more offer vSAN Ready Nodes. vSAN Ready Nodes are an easy choice and an ideal starting point as long as they can be customized to fit your needs. Work with the vendor of your choice to customize.
Want a deeper dive into vSAN Ready Node models and descriptions? Visit the following links:
What You Can (and Cannot) Change in a vSAN ReadyNode™ (52084)
vSAN Hardware Quick Reference Guide
Workload Assessment
This section has little direct content addressing the hardware selection, but one cannot design a solution without accurately defining the problem to be solved. This is true when designing any solution in the data center. The workload to be supported determines the appropriate hardware. Workload is king and should be known before starting with any design decisions. Since we are addressing vSAN, there is a high likelihood that this is an all VMware environment.
Designing any storage solution starts with knowing:
- Read/write ratio
- IO size
- Sequential or random IO
- Throughput
- Application requirements
- OS requirements
- Year over year growth expectations (capacity and performance)
Know the application set to be serviced. This is how one can qualify the solution. Is the workload an archive application with a requirement for 5PB of storage replicated to multiple locations? Does it require extreme performance with sub-millisecond latency? There are many considerations, so to start with, the solutions architect should ask whether the proper platform has been selected for such a workload. If you don’t have any assessment tools, check out LiveOptics. We’re not here to provide an overview of LiveOptics, so follow the link for more info. It has a solid reputation for providing usable sizing information and can export data to a vSAN sizer.

*These are your formulas for determining your workload if you do not have access to a detailed analysis tool. One can derive the third piece of info if you have the other two.
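As a minimal sketch, assuming the three pieces of information are IOPS, IO size, and throughput, the standard relationship is throughput = IOPS × IO size. The function names below are illustrative only, and the units (KB for IO size, MB/s for throughput, 1 MB = 1024 KB) are an assumption.

```python
def throughput_mb_s(iops: float, io_size_kb: float) -> float:
    """Throughput in MB/s from IOPS and IO size in KB (1 MB = 1024 KB)."""
    return iops * io_size_kb / 1024


def iops_from(throughput: float, io_size_kb: float) -> float:
    """IOPS from throughput in MB/s and IO size in KB."""
    return throughput * 1024 / io_size_kb


def io_size_kb_from(throughput: float, iops: float) -> float:
    """Average IO size in KB from throughput in MB/s and IOPS."""
    return throughput * 1024 / iops


# Hypothetical workload: 10,000 IOPS at an 8 KB IO size.
print(throughput_mb_s(10_000, 8))     # MB/s
print(iops_from(78.125, 8))           # IOPS
print(io_size_kb_from(78.125, 10_000))  # KB
```

Whichever two values your monitoring tools expose, the third follows from the same relationship.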

Know the expected growth year over year and design the solution to be easily expanded. The design and configuration should be good for at least 3-5 years without major unplanned revision. This means the plan for expansion should be able to address normal and foreseeable growth without extreme redesign. The person responsible for backup administration is usually a good source of information on year over year data growth.
What about performance growth trending? In the past 20 years or so of building different storage solutions, I’ve had access to this info a couple of times, maybe. Assume there is a relationship between capacity and performance requirements and that they will grow together.
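A rough way to turn a year over year growth figure into a 3-5 year target is to compound it. The sketch below assumes a constant annual growth rate applied equally to capacity and IOPS, which is a simplification; the starting values and the 20% rate are hypothetical.

```python
def project_growth(current: float, annual_growth_pct: float, years: int) -> list[float]:
    """Return the projected value for year 0..years, compounding annually."""
    rate = 1 + annual_growth_pct / 100
    return [current * rate ** year for year in range(years + 1)]


# Hypothetical starting point: 100 TB used and 20,000 IOPS,
# both assumed to grow 20% per year.
capacity_tb = project_growth(100, 20, 5)
iops = project_growth(20_000, 20, 5)

for year, (tb, io) in enumerate(zip(capacity_tb, iops)):
    print(f"year {year}: {tb:7.1f} TB  {io:9.0f} IOPS")
```

Sizing to the year 3 or year 5 row, rather than to today's numbers, is what keeps the design from needing an unplanned revision.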
Historically, designing the solution around the 95th percentile IOPS has been reliable, but the granularity of the sample interval should be considered. Is the sample interval every minute? 5 minutes? 10? The longer the time between samples, the more information is lost. 1-2 minutes usually suffices in my experience. I’ve sized for performance using an assessment with a 5-minute sample interval in the past and missed some serious performance spikes that impacted the overall design. It was not an ideal situation for my customer. Get granular if possible or risk an improper design.
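To illustrate why the sample interval matters, here is a small sketch with made-up data. It assumes the collector reports the average over each interval (real tools may differ) and uses the nearest-rank method for the percentile.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def downsample(samples: list[float], factor: int) -> list[float]:
    """Average consecutive groups of `factor` samples, as a coarser collector might report."""
    groups = [samples[i:i + factor] for i in range(0, len(samples), factor)]
    return [sum(group) / len(group) for group in groups]


# Hypothetical hour of 1-minute samples: a steady 2,000 IOPS with a
# one-minute spike to 20,000 IOPS every ten minutes.
one_minute = [20_000 if minute % 10 == 0 else 2_000 for minute in range(60)]
five_minute = downsample(one_minute, 5)

p95_fine = percentile(one_minute, 95)
p95_coarse = percentile(five_minute, 95)
print(p95_fine, p95_coarse)
```

With this data the 1-minute samples put the 95th percentile at the spike level, while the 5-minute averages flatten every spike and report a much lower figure for the same hour.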
Read/Write ratio will influence your capacity! Most storage architects and admins are already aware of this. Even though vSAN is a distributed storage solution, backend IO traffic is still subject to the write amplification of the protection method. RAID 1 usually performs better with write-heavy workloads, but with enhancements in newer versions of vSAN the performance gap is closing, and RAID 5/6 may be able to satisfy a greater range of workloads than just read-heavy ones. For more details, see RAID-5/6 Erasure Coding Enhancements in vSAN 7 U2. The backend I/Os generated by each frontend write are:
- RAID-1 Mirroring (FTT1) – 2x I/Os
  - 2 Writes
- RAID-5 – 4x I/Os
  - Read old data
  - Read old parity
  - Write new data
  - Write new parity
- RAID-6 – 6x I/Os
  - Read old data
  - Read old parity
  - Read old parity
  - Write new data
  - Write new parity
  - Write new parity
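The I/O multipliers above can be combined with the read/write ratio to estimate backend load. This is a simplified sketch: it counts each read as one backend I/O and ignores caching, full-stripe writes, and any other optimizations, so treat the result as a rough upper bound. The function name and sample workload are hypothetical.

```python
# Backend I/Os generated per frontend write, taken from the list above.
WRITE_PENALTY = {"RAID-1": 2, "RAID-5": 4, "RAID-6": 6}


def backend_iops(frontend_iops: float, read_pct: float, raid: str) -> float:
    """Estimate backend IOPS: reads pass through 1:1, writes are multiplied by the penalty."""
    reads = frontend_iops * read_pct / 100
    writes = frontend_iops - reads
    return reads + writes * WRITE_PENALTY[raid]


# Hypothetical workload: 10,000 frontend IOPS at 70% read / 30% write.
for raid in WRITE_PENALTY:
    print(raid, backend_iops(10_000, 70, raid))
```

Shifting the same 10,000 IOPS toward writes widens the gap between RAID 1 and RAID 5/6 quickly, which is why the ratio has to be known before a protection method is chosen.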
Getting back to how R/W ratio impacts capacity, below is the capacity consumed by the various Fault Tolerance Methods in vSAN. Read-heavy workloads are generally less affected by the write penalty of erasure-coded protection schemes like RAID 5/6.
- RAID 1 FTT1: 2x space consumption
- RAID 1 FTT2: 3x space consumption
- RAID 1 FTT3: 4x space consumption
- RAID 5 FTT1: 1.33x space consumption
- RAID 6 FTT2: 1.5x space consumption
So, a write-heavy workload may very well consume more space than a read-heavy workload, because it may need RAID 1, with its higher space consumption, to be serviced properly.
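The space multipliers can be applied directly to a usable capacity target. The sketch below only covers protection overhead; it deliberately leaves out slack space, deduplication and compression, and any other overheads, and the 100 TB figure is hypothetical.

```python
# Raw-to-usable multipliers per (RAID level, FTT), from the list above.
SPACE_MULTIPLIER = {
    ("RAID-1", 1): 2.0,
    ("RAID-1", 2): 3.0,
    ("RAID-1", 3): 4.0,
    ("RAID-5", 1): 4 / 3,  # listed as 1.33x
    ("RAID-6", 2): 1.5,
}


def raw_capacity_tb(usable_tb: float, raid: str, ftt: int) -> float:
    """Raw capacity needed to hold `usable_tb` of data under the given protection policy."""
    return usable_tb * SPACE_MULTIPLIER[(raid, ftt)]


# Hypothetical requirement: 100 TB of VM data.
for (raid, ftt) in SPACE_MULTIPLIER:
    print(f"{raid} FTT{ftt}: {raw_capacity_tb(100, raid, ftt):.1f} TB raw")
```

For the same 100 TB, moving from RAID 5 to RAID 1 at FTT1 raises the raw requirement from roughly 133 TB to 200 TB, which is the capacity cost of choosing mirroring for a write-heavy workload.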
Dell has done some testing on vSAN 7 performance improvements and it is worth a review. Harnessing the Performance of Dell EMC VxRail 7.0.100—A Lab-Based Performance Analysis
Benchmark testing vs real workloads
If you want to understand EXACTLY how the solution will run in your environment with your application, run a PoC with your VMs and workloads. Nothing beats real-world testing, and there is no substitute for seeing your workload run under the stress of users, backup, upgrades, or other Day 2 Operations to determine if the solution really works for you. We see people run benchmark tools on vSAN, other HCI solutions, and other arrays, then declare the fastest to be the best without considering things like upgrades, operations, and additional training requirements. The solution does not have to be the fastest or scale to be the biggest. It must be fast enough, scalable enough, and easy to manage.
Summary
- Know your workload
- Know your Year over Year growth expectations
- Know the applications to support and their requirements
- Know how workload IO profile impacts capacity planning
- There is no substitute for a PoC with real workloads
