Next-gen HPC file system evaluation and rollout
File storage is a core piece of any HPC deployment, one that nearly every other piece depends upon. We are taking the opportunity with the creation of a new cluster to evaluate and deploy a parallel file system that is better suited to HPC use, easier to manage, and provides a path forward as we scale up our infrastructure.
This is a complex task with many requirements and a very high need for reliability. The major phases are as follows:
| Phase | Main components | Status |
|---|---|---|
| Phase 1: Information gathering | Identify characteristics of current storage in use, including size, IO, reliability, and features in use. Identify software candidates for evaluation. | Complete |
| Phase 2: Feature evaluation | Generate a grid of features and shortcomings for all software candidates. Determine which features are mandatory and which are high priority. From this, choose top software candidates. | Near-complete |
| Phase 3: Operational testing | For each software candidate, select hardware for a testbed install. Confirm features from Phase 2 and evaluate software operation. From this, identify the primary choice for storage software. | Pending Phase 2 |
| Phase 4: Final cluster setup | Install and set up hardware, OS, and network. Install and configure storage software. Retest failure modes, ensuring they create actionable alerts and do not cause excessive downtime. | Pending Phase 3 and final hardware |
| Phase 5: Acceptance testing | Perform full-scale performance testing, first with a synthetic load and then with user loads. Ensure that documentation matches the final deployment. | Pending Phase 4 |
Phase 1: Information gathering
We have approximately 20 petabytes of storage serving thousands of cores of compute capacity. Understanding what our storage is doing is crucial.
Our primary approach for file serving uses NFS v4 as a network protocol, and ZFS as an on-disk filesystem. These are well understood and capable but have become harder to maintain and less flexible as we have grown. Filesystem and protocol features in current use include:
| Feature | Purpose / notes |
|---|---|
| RAIDZ (redundant against triple failure) | Data integrity and availability |
| Compression | Lowers storage price for end users |
| Snapshots | Mainly used for management functions such as moving data between servers |
| Quotas | Directory-level, to allow group storage buy-in |
| Hard mounts | Jobs pause instead of failing if storage becomes unavailable |
| MPI-IO | Requires increased filesystem consistency between hosts |
Three clusters were selected to gather representative load and usage information from: Farm, LSSC0, and HPC2. Data was extracted from approximately 90 days of operation to generate the following average figures:
| Metric | Farm (ZFS) | LSSC0 | HPC2 |
|---|---|---|---|
| File servers sampled | 10 | 8 | 1 |
| Storage spindles | 572 | 772 | 116 |
| Total storage | 5.5PB | 5.4PB | 650TB |
| R/W volume per day | 27TB / 28TB | 30TB / 15TB | 1.5TB / 3TB |
| IO volume per second | 6 Gbps | 6 Gbps | 750 Mbps |
| IO volume per day ÷ total storage | 1% | 1% | 0.7% |
| R/W ops per second | 2700 / 700 | 300 / 350 | 100 / 350 |
| R/W op average size | 115KB / 450KB | 1300KB / 570KB | 130KB / 90KB |
| NFS IOPS per spindle | 6 | 1 | 4 |
| Average compression | 1.4:1 | 1.3:1 | 1.8:1 |
From this we can conclude (see the sanity-check sketch below):
- Our average IO volume is roughly 10TB per day per petabyte (1%)
- Our average IO operations are roughly 700 per second per petabyte
- Average IO size is moderate to large (>100KB)
- Effective compression is mandatory for our data
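
As a quick sanity check, the derived rows in the table can be reproduced from the sampled figures. A minimal sketch using only numbers from the table above; note that LSSC0 works out to ~0.8% IO/day versus storage, which the table rounds to 1%, and the ~700 ops/s per petabyte conclusion tracks the two busier clusters:

```python
# Reproduce the derived rows (IO volume vs. storage, ops/s per PB,
# IOPS per spindle) from the sampled figures. No new data.

clusters = {
    #          spindles  storage (PB)  R+W TB/day  R+W ops/s
    "Farm":   (572,      5.5,          27 + 28,    2700 + 700),
    "LSSC0":  (772,      5.4,          30 + 15,    300 + 350),
    "HPC2":   (116,      0.65,         1.5 + 3,    100 + 350),
}

for name, (spindles, pb, tb_day, ops) in clusters.items():
    print(f"{name}: "
          f"IO/day vs storage {tb_day / (pb * 1000):.1%}, "
          f"ops/s per PB {ops / pb:.0f}, "
          f"IOPS per spindle {ops / spindles:.0f}")
```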
Our storage hardware purchase includes:
- 6 storage nodes
- 6× 792TB of disk (44 spindles per node)
Applying these per-petabyte averages to the new hardware with a 100% safety factor, we project our new cluster to be:
| Metric | New cluster |
|---|---|
| File servers | 6 |
| Storage spindles | 264 |
| Total storage (raw) | 4.7PB |
| Total storage (after overhead) | 3.5PB |
| IO volume per day | 70TB |
| R/W ops per second | 6000 |
| R/W op average size | 100-400KB |
| Network IOPS per spindle | 22 |
| Average compression | 1.4:1 |
The limiting factor for performance here is very likely to be network IOPS per spindle: roughly 22 projected, versus 6 or fewer per spindle on the clusters sampled above.
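
To make the projection reproducible, here is the arithmetic behind the table. The 6000 ops/s row is taken directly from the table above; the spindle figure is simply that rate divided across all 264 spindles (the table rounds 22.7 down to 22):

```python
# Derivation of the projected figures above.

SAFETY = 2.0                   # the 100% safety factor
usable_pb = 3.5                # total storage after overhead
spindles = 6 * 44              # 6 file servers x 44 spindles = 264

io_per_day_tb = usable_pb * 1000 * 0.01 * SAFETY  # 1% of storage/day -> 70 TB
ops_per_sec = 6000                                # projected R/W ops/s (table)
iops_per_spindle = ops_per_sec / spindles         # ~22.7, vs 6/1/4 today

print(f"IO volume per day:    {io_per_day_tb:.0f} TB")
print(f"Network IOPS/spindle: {iops_per_spindle:.1f}")
```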
Previous evaluations focused on Quobyte as a parallel filesystem solution. BeeGFS was researched as an alternative, and two widely used competitors in the HPC space (Lustre and Ceph) have been added to create a comprehensive list for consideration.
Phase 2: Feature evaluation
Our requirements can be divided into features which can be evaluated ahead of install time, and metrics which require a testbed to examine behavior. We first evaluate the features, then choose the top candidates.
| Requirement | BeeGFS | Lustre | Quobyte | Ceph | Notes |
|---|---|---|---|---|---|
| Mandatory: Comparable or superior data integrity to current NFS solution (redundant against triple failure) | Yes | Yes | Yes | Yes | BeeGFS and Lustre: ZFS with raidz3. Ceph: 6+4 erasure coding; 4 parity disks are necessary for operation when one node is down. |
| Mandatory: Comparable or superior availability to current NFS solution | Partially | Partially | Yes | Yes | BeeGFS and Lustre require that each disk is attached to multiple hosts to meet our availability requirements. This gives us N+1 at most levels besides the disk chassis. |
| Mandatory: Quotas | Yes | Yes | Yes | Yes | BeeGFS: per user and per group. Ceph: per directory. Current NFS quotas are per directory. |
| Mandatory: Compression | Yes | Yes | Yes | Yes | Current NFS/ZFS solution averages 1.4:1 (we cannot afford to lose this). |
| Recommended: Point-in-time snapshots | No | Yes | Yes | Yes | Required for point-in-time backup integrity. |
| Recommended: Storage rebalancing | Partially | Yes | Yes | Yes | |
| Recommended: Central web UI/dashboard | Partially | Yes | Yes | Yes | |
| Underlying storage layer | XFS | ZFS | Proprietary | ceph-volume | In-house experience is primarily with ZFS. |
| Erasure coding across storage hosts | No | No | Yes | Yes | |
| Hierarchical storage management | Unknown | Partially | Unknown | Partially | |
| MPI support | Yes | Yes | Yes | Unknown | NFS may offer better performance. |
| User-level security and encrypted transport | No | Partially | Partially | Yes | Lustre support is incomplete. Quobyte supports host-level security only, no Kerberos. |
| Encryption at rest | Yes | Yes | Yes | Yes | |
| Substantial community | Yes | Yes | Unknown | Yes | |
From this feature grid we prioritize the best choices for operational testing, ranked here:
| Rank | Candidate | Notes |
|---|---|---|
| 1 | Ceph | Highly redundant and full featured. Lower performing than Lustre, but offers potentially better reliability. |
| 2 | Quobyte | "All in one" solution. Highly redundant, but initial tests showed poor performance; a review of the configuration is needed. Proprietary storage layer with unified disk management. |
| 3 | Lustre | Uses ZFS as a storage layer, which is well understood. Complex but full featured. Likely to be the best performer at scale. The most common HPC filesystem. |
| 4 | BeeGFS | Does not support snapshots. Does not support compression when using XFS. Unlikely to meet our needs. |
Phase 3: Operational testing
Operational testing at this point is based on these assumptions:
- Initial systems will be delivered in the near future (late August)
- Storage and network equipment delivery will take an additional 4-8 weeks
To deliver a solution as soon as possible once all hardware is in place, we must complete as much testing as we can using currently available resources.
For each software candidate we must:
- Identify an area for install/test
- Plan our storage topology
- Install software
- Validate features listed in the Phase 2 table
- Evaluate, score, and add notes for each category
| Test area | Option 1 (TBD) | Option 2 (TBD) |
|---|---|---|
| Lifecycle events: server failure; server added or removed | | |
| Feature matrix validation | | |
| Disk events: failure and replacement; addition and removal | | |
| Misc failures: servers, cables, network; prolonged outage of storage layer | | |
| Storage rebalancing | | |
| Compression | | |
| Max usable capacity | | |
| Management interfaces: "single pane" UI effectiveness; IO analysis and visibility | | |
| Performance (small-scale evaluation) | | |
| Ease of use for technical staff | | |
| User testing/review of options | | |
Based on this, we will identify our primary choice for deployment and document its structure and rollout plan.
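
To keep results comparable across the two testbed options, the grid above can be captured in a small structure during testing. A minimal sketch; the 1-5 scale, the `record` helper, and the example note are placeholder choices, not settled process:

```python
# Capture Phase 3 scoring so the two testbed options compare cleanly.
from dataclasses import dataclass, field

CATEGORIES = [
    "Lifecycle events", "Feature matrix validation", "Disk events",
    "Misc failures", "Storage rebalancing", "Compression",
    "Max usable capacity", "Management interfaces",
    "Performance (small scale)", "Ease of use", "User testing/review",
]

@dataclass
class CandidateScore:
    name: str                                     # e.g. "Ceph"
    results: dict = field(default_factory=dict)   # category -> (score, notes)

    def record(self, category: str, score: int, notes: str = "") -> None:
        assert category in CATEGORIES and 1 <= score <= 5
        self.results[category] = (score, notes)

# Hypothetical usage while testing a candidate:
ceph = CandidateScore("Ceph")
ceph.record("Disk events", 4, "hot swap rebuilt automatically; alert fired")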
Phase 4: Final cluster setup
By this point hardware has arrived and our primary software choice is ready. Setup includes:
- Install and set up network hardware
- Install and set up hardware and OS on storage nodes
- Ensure host monitoring is in place
- Run baseline performance testing to quantify the hardware (see the sketch after this list)
- Install and configure the storage software
- Set up monitoring to receive storage alerts and collect statistics
- Perform feature testing of critical features from Phase 2
- Repeat failure testing at all levels (disk, enclosure, cabling, server, network); ensure that any downtime is as expected and that alerts are operating and documented
- Ensure documentation of architecture and operation is correct and up to date
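
For the baseline performance testing step, something like the following can quantify each node's raw disk performance before the storage software goes on. A sketch only: the fio flags shown are standard, but the target path, sizes, and runtimes are placeholders to be tuned:

```python
# Baseline a storage node's local disks with fio before installing the
# parallel filesystem. TARGET is a placeholder scratch path on the node.
import subprocess

TARGET = "/mnt/baseline-test"   # placeholder

def run_fio(name: str, rw: str, bs: str, iodepth: int = 16) -> str:
    """Run one fio job against TARGET and return fio's report."""
    cmd = [
        "fio", f"--name={name}", f"--directory={TARGET}",
        f"--rw={rw}", f"--bs={bs}", f"--iodepth={iodepth}",
        "--ioengine=libaio", "--direct=1", "--size=4g",
        "--time_based", "--runtime=60", "--group_reporting",
    ]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

# Sequential bandwidth first, then random IOPS at a small block size.
print(run_fio("seq-read", "read", "1m"))
print(run_fio("rand-rw", "randrw", "16k"))
```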
Phase 5: Acceptance testing
This is our final phase. We must ensure these three things before rollout:
- Reliability is as expected
- Performance is sufficient with our load testing
- Responsiveness and overall user experience are good
Reliability is primarily addressed in Phase 4, but we must keep an eye out for anything actionable or unusual during Phase 5.
Performance testing can be done using a synthetic background load and then running A/B tests on the new cluster and an old cluster. The background load should match the worst parts of the IO envelope experienced by Farm and HPC2. This can be produced by using two instances of fio per test client, averaging:
- 2700/s 120KB read ops
- 3500/s 90KB write ops
This will produce roughly 320MB/s each of read and write traffic.
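
A sketch of driving that background load with fio from Python. The IOPS caps shown are the aggregate targets from the list above and would be divided across however many test clients are used; the mount point is a placeholder:

```python
# Launch the two rate-limited fio instances that make up the synthetic
# background load. MOUNT is a placeholder for the client's mount of the
# filesystem under test.
import subprocess

MOUNT = "/mnt/newfs-test"   # placeholder

jobs = [
    # name        pattern      block   IOPS cap (aggregate target)
    ("bg-read",  "randread",  "120k", 2700),   # ~324 MB/s of reads
    ("bg-write", "randwrite", "90k",  3500),   # ~315 MB/s of writes
]

procs = []
for name, rw, bs, iops in jobs:
    cmd = [
        "fio", f"--name={name}", f"--directory={MOUNT}",
        f"--rw={rw}", f"--bs={bs}", f"--rate_iops={iops}",
        "--ioengine=libaio", "--direct=1", "--size=8g",
        "--time_based", "--runtime=3600", "--group_reporting",
    ]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()
```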
A/B testing can be done either with user applications or with synthetic load. Synthetic load should include (parameters sketched after this list):
- Large (512KB) asynchronous reads
- Medium (128KB) asynchronous reads and writes
- Small (16KB) synchronous writes, to represent MPI traffic
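
The same mix expressed as fio parameters. Mapping "asynchronous" to libaio with queue depth and "synchronous" to O_SYNC writes is our interpretation, not prescribed anywhere:

```python
# Print fio command lines for the three A/B synthetic load components.
MOUNT = "/mnt/newfs-test"   # placeholder client mount

ab_jobs = [
    ("large-async-read", "randread",  "512k", "--ioengine=libaio --iodepth=16"),
    ("med-async-rw",     "randrw",    "128k", "--ioengine=libaio --iodepth=8"),
    ("small-sync-write", "randwrite", "16k",  "--ioengine=sync --sync=1"),  # MPI-like
]

for name, rw, bs, extra in ab_jobs:
    print(f"fio --name={name} --directory={MOUNT} --rw={rw} --bs={bs} "
          f"{extra} --direct=1 --size=4g --time_based --runtime=600")
```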
User experience is best addressed by finding “power users” who can stress the storage effectively, and then having them run real-world processing both in an existing cluster and the new cluster. This processing must be repeatable; in the event of a significant difference we will want to re-run it and gather detailed information.