Next-gen HPC file system evaluation and rollout
File storage is a core piece of any HPC deployment, one that nearly every other piece depends upon. We are taking the opportunity with the creation of a new cluster to evaluate and deploy a parallel file system that is better suited to HPC use, easier to manage, and provides a path forward as we scale up our infrastructure.
This is a complex task with many requirements and a very high need for reliability. The major phases are as follows:
| Phase | Main components | Status |
|---|---|---|
| Phase 1: Information gathering | Identify characteristics of current storage in use, including size, IO, reliability, and features in use. Identify software candidates for evaluation. | Complete |
| Phase 2: Feature evaluation | Generate a grid of features and shortcomings for all software candidates. Determine which features are mandatory and which are high priority. From this, choose top software candidates. | Near-complete |
| Phase 3: Operational testing | For each software candidate, select hardware for a testbed install. Confirm features from Phase 2 and evaluate software operation. From this, identify the primary choice for storage software. | Pending Phase 2 |
| Phase 4: Final cluster setup | Install and set up hardware, OS, and network. Install and configure storage software. Retest failure modes, ensuring they create actionable alerts and do not cause excessive downtime. | Pending Phase 3 and final hardware |
| Phase 5: Acceptance testing | Perform full-scale performance testing, first with a synthetic load and then with user loads. Ensure that documentation matches the final deployment. | Pending Phase 4 |
Phase 1: Information gathering
We have approximately 20 petabytes of storage serving thousands of cores of compute capacity. Understanding what our storage is doing is crucial.
Our primary approach for file serving uses NFS v4 as a network protocol, and ZFS as an on-disk filesystem. These are well understood and capable but have become harder to maintain and less flexible as we have grown. Filesystem and protocol features in current use include:
| Feature | Purpose / notes |
|---|---|
| RAIDZ (redundant against triple failure) | Data integrity and availability |
| Compression | Lowers storage price for end users |
| Snapshots | Mainly used for management functions such as moving data between servers |
| Quotas | Directory-level, to allow group storage buy-in |
| Hard mounts | Jobs pause instead of failing if storage becomes unavailable |
| MPI-IO | Requires increased filesystem consistency between hosts |
Three clusters were selected to gather representative load and usage information from: Farm, LSSC0, and HPC2. Data was extracted from approximately 90 days of operation to generate the following average figures:
| Metric | Farm (ZFS) | LSSC0 | HPC2 |
|---|---|---|---|
| File servers sampled | 10 | 8 | 1 |
| Storage spindles | 572 | 772 | 116 |
| Total storage | 5.5PB | 5.4PB | 650TB |
| R/W volume per day | 27TB / 28TB | 30TB / 15TB | 1.5TB / 3TB |
| IO volume per second | 6 Gbps | 6 Gbps | 750 Mbps |
| IO volume per day ÷ total storage | 1% | 1% | 0.7% |
| R/W ops per second | 2700 / 700 | 300 / 350 | 100 / 350 |
| R/W op average size | 115KB / 450KB | 1300KB / 570KB | 130KB / 90KB |
| NFS IOPS per spindle | 6 | 1 | 4 |
| Average compression | 1.4:1 | 1.3:1 | 1.8:1 |
From this we can conclude (see the sanity-check sketch below):
- Our average IO volume is roughly 10TB per day per petabyte (1%)
- Our average IO operations are roughly 700 per second per petabyte
- Average IO size is moderate to large (>100KB)
- Effective compression is mandatory for our data
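
As a quick sanity check, the derived rows in the table can be reproduced from the sampled figures. A minimal sketch using only numbers from the table above; note that LSSC0 works out to ~0.8% IO/day versus storage, which the table rounds to 1%, and the ~700 ops/s per petabyte conclusion tracks the two busier clusters:

```python
# Reproduce the derived rows (IO volume vs. storage, ops/s per PB,
# IOPS per spindle) from the sampled figures. No new data.

clusters = {
    #          spindles  storage (PB)  R+W TB/day  R+W ops/s
    "Farm":   (572,      5.5,          27 + 28,    2700 + 700),
    "LSSC0":  (772,      5.4,          30 + 15,    300 + 350),
    "HPC2":   (116,      0.65,         1.5 + 3,    100 + 350),
}

for name, (spindles, pb, tb_day, ops) in clusters.items():
    print(f"{name}: "
          f"IO/day vs storage {tb_day / (pb * 1000):.1%}, "
          f"ops/s per PB {ops / pb:.0f}, "
          f"IOPS per spindle {ops / spindles:.0f}")
```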
Our storage hardware purchase includes:
- 6 storage nodes
- 6× 792TB of disk (44 spindles per node)
Applying these per-petabyte averages to the new hardware with a 100% safety factor, we project our new cluster to be:
| Metric | New cluster |
|---|---|
| File servers | 6 |
| Storage spindles | 264 |
| Total storage (raw) | 4.7PB |
| Total storage (after overhead) | 3.5PB |
| IO volume per day | 70TB |
| R/W ops per second | 6000 |
| R/W op average size | 100-400KB |
| Network IOPS per spindle | 22 |
| Average compression | 1.4:1 |
The limiting factor for performance here is very likely to be network IOPS per spindle: roughly 22 projected, versus 6 or fewer per spindle on the clusters sampled above.
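
To make the projection reproducible, here is the arithmetic behind the table. The 6000 ops/s row is taken directly from the table above; the spindle figure is simply that rate divided across all 264 spindles (the table rounds 22.7 down to 22):

```python
# Derivation of the projected figures above.

SAFETY = 2.0                   # the 100% safety factor
usable_pb = 3.5                # total storage after overhead
spindles = 6 * 44              # 6 file servers x 44 spindles = 264

io_per_day_tb = usable_pb * 1000 * 0.01 * SAFETY  # 1% of storage/day -> 70 TB
ops_per_sec = 6000                                # projected R/W ops/s (table)
iops_per_spindle = ops_per_sec / spindles         # ~22.7, vs 6/1/4 today

print(f"IO volume per day:    {io_per_day_tb:.0f} TB")
print(f"Network IOPS/spindle: {iops_per_spindle:.1f}")
```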
Previous evaluations focused on Quobyte as a parallel filesystem solution. BeeGFS was researched as an alternative, and two widely used competitors in the HPC space (Lustre and Ceph) have been added to create a comprehensive list for consideration.
Phase 2: Feature evaluation
Our requirements can be divided into features which can be evaluated ahead of install time, and metrics which require a testbed to examine behavior. We first evaluate the features, then choose the top candidates.
| Requirement | BeeGFS | Lustre | Quobyte | Ceph | Notes |
|---|---|---|---|---|---|
| Mandatory: Comparable or superior data integrity to current NFS solution (redundant against triple failure) | Yes | Yes | Yes | Yes | BeeGFS and Lustre: ZFS with raidz3. Ceph: 6+4 erasure coding; 4 parity disks are necessary for operation when one node is down. |
| Mandatory: Comparable or superior availability to current NFS solution | Partially | Partially | Yes | Yes | BeeGFS and Lustre require that each disk is attached to multiple hosts to meet our availability requirements. This gives us N+1 at most levels besides the disk chassis. |
| Mandatory: Quotas | Yes | Yes | Yes | Yes | BeeGFS: per user and per group. Ceph: per directory. Current NFS quotas are per directory. |
| Mandatory: Compression | Yes | Yes | Yes | Yes | Current NFS/ZFS solution averages 1.4:1 (we cannot afford to lose this). |
| Recommended: Point-in-time snapshots | No | Yes | Yes | Yes | Required for point-in-time backup integrity. |
| Recommended: Storage rebalancing | Partially | Yes | Yes | Yes | |
| Recommended: Central web UI/dashboard | Partially | Yes | Yes | Yes | |
| Underlying storage layer | XFS | ZFS | Proprietary | ceph-volume | In-house experience is primarily with ZFS. |
| Erasure coding across storage hosts | No | No | Yes | Yes | |
| Hierarchical storage management | Unknown | Partially | Unknown | Partially | |
| MPI support | Yes | Yes | Yes | Unknown | NFS may offer better performance. |
| User-level security and encrypted transport | No | Partially | Partially | Yes | Lustre support is incomplete. Quobyte supports host-level security only, no Kerberos. |
| Encryption at rest | Yes | Yes | Yes | Yes | |
| Substantial community | Yes | Yes | Unknown | Yes | |
From this feature grid we prioritize the best choices for operational testing, ranked here:
| Rank | Candidate | Notes |
|---|---|---|
| 1 | Ceph | Highly redundant and full featured. Lower performing than Lustre, but offers potentially better reliability. |
| 2 | Quobyte | "All in one" solution. Highly redundant, but initial tests showed poor performance; a review of the configuration is needed. Proprietary storage layer with unified disk management. |
| 3 | Lustre | Uses ZFS as a storage layer, which is well understood. Complex but full featured. Likely to be the best performer at scale. The most common HPC filesystem. |
| 4 | BeeGFS | Does not support snapshots. Does not support compression when using XFS. Unlikely to meet our needs. |
Phase 3: Operational testing
Operational testing at this point is based on these assumptions:
- Initial systems will be delivered in the near future (late August)
- Storage and network equipment delivery will take an additional 4-8 weeks
To deliver a solution as soon as possible once all hardware is in place, we must complete as much testing as we can using currently available resources.
For each software candidate we must:
- Identify an area for install/test
- Plan our storage topology
- Install software
- Validate features listed in the Phase 2 table
- Evaluate, score, and add notes for each category
| Test area | Option 1 (TBD) | Option 2 (TBD) |
|---|---|---|
| Lifecycle events: server failure; server added or removed | | |
| Feature matrix validation | | |
| Disk events: failure and replacement; addition and removal | | |
| Misc failures: servers, cables, network; prolonged outage of storage layer | | |
| Storage rebalancing | | |
| Compression | | |
| Max usable capacity | | |
| Management interfaces: "single pane" UI effectiveness; IO analysis and visibility | | |
| Performance (small-scale evaluation) | | |
| Ease of use for technical staff | | |
| User testing/review of options | | |
Based on this, we will identify our primary choice for deployment and document its structure and rollout plan.
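
To keep results comparable across the two testbed options, the grid above can be captured in a small structure during testing. A minimal sketch; the 1-5 scale, the `record` helper, and the example note are placeholder choices, not settled process:

```python
# Capture Phase 3 scoring so the two testbed options compare cleanly.
from dataclasses import dataclass, field

CATEGORIES = [
    "Lifecycle events", "Feature matrix validation", "Disk events",
    "Misc failures", "Storage rebalancing", "Compression",
    "Max usable capacity", "Management interfaces",
    "Performance (small scale)", "Ease of use", "User testing/review",
]

@dataclass
class CandidateScore:
    name: str                                     # e.g. "Ceph"
    results: dict = field(default_factory=dict)   # category -> (score, notes)

    def record(self, category: str, score: int, notes: str = "") -> None:
        assert category in CATEGORIES and 1 <= score <= 5
        self.results[category] = (score, notes)

# Hypothetical usage while testing a candidate:
ceph = CandidateScore("Ceph")
ceph.record("Disk events", 4, "hot swap rebuilt automatically; alert fired")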
Phase 4: Final cluster setup
By this point hardware has arrived and our primary software choice is ready. Setup includes:
- Install and set up network hardware
- Install and set up hardware and OS on storage nodes
- Ensure host monitoring is in place
- Run baseline performance testing to quantify the hardware (see the sketch after this list)
- Install and configure the storage software
- Set up monitoring to receive storage alerts and collect statistics
- Perform feature testing of critical features from Phase 2
- Repeat failure testing at all levels (disk, enclosure, cabling, server, network); ensure that any downtime is as expected and that alerts are operating and documented
- Ensure documentation of architecture and operation is correct and up to date
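
For the baseline performance testing step, something like the following can quantify each node's raw disk performance before the storage software goes on. A sketch only: the fio flags shown are standard, but the target path, sizes, and runtimes are placeholders to be tuned:

```python
# Baseline a storage node's local disks with fio before installing the
# parallel filesystem. TARGET is a placeholder scratch path on the node.
import subprocess

TARGET = "/mnt/baseline-test"   # placeholder

def run_fio(name: str, rw: str, bs: str, iodepth: int = 16) -> str:
    """Run one fio job against TARGET and return fio's report."""
    cmd = [
        "fio", f"--name={name}", f"--directory={TARGET}",
        f"--rw={rw}", f"--bs={bs}", f"--iodepth={iodepth}",
        "--ioengine=libaio", "--direct=1", "--size=4g",
        "--time_based", "--runtime=60", "--group_reporting",
    ]
    return subprocess.run(cmd, capture_output=True, text=True).stdout

# Sequential bandwidth first, then random IOPS at a small block size.
print(run_fio("seq-read", "read", "1m"))
print(run_fio("rand-rw", "randrw", "16k"))
```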
Phase 5: Acceptance testing
This is our final phase. We must ensure these three things before rollout:
- Reliability is as expected
- Performance is sufficient with our load testing
- Responsiveness and overall user experience are good
Reliability is primarily addressed in Phase 4, but we must keep an eye out for anything actionable or unusual during Phase 5.
Performance testing can be done using a synthetic background load and then running A/B tests on the new cluster and an old cluster. The background load should match the worst parts of the IO envelope experienced by Farm and HPC2. This can be produced by using two instances of fio per test client, averaging:
- 2700/s 120KB read ops
- 3500/s 90KB write ops
This will produce roughly 320MB/s each of read and write traffic.
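
A sketch of driving that background load with fio from Python. The IOPS caps shown are the aggregate targets from the list above and would be divided across however many test clients are used; the mount point is a placeholder:

```python
# Launch the two rate-limited fio instances that make up the synthetic
# background load. MOUNT is a placeholder for the client's mount of the
# filesystem under test.
import subprocess

MOUNT = "/mnt/newfs-test"   # placeholder

jobs = [
    # name        pattern      block   IOPS cap (aggregate target)
    ("bg-read",  "randread",  "120k", 2700),   # ~324 MB/s of reads
    ("bg-write", "randwrite", "90k",  3500),   # ~315 MB/s of writes
]

procs = []
for name, rw, bs, iops in jobs:
    cmd = [
        "fio", f"--name={name}", f"--directory={MOUNT}",
        f"--rw={rw}", f"--bs={bs}", f"--rate_iops={iops}",
        "--ioengine=libaio", "--direct=1", "--size=8g",
        "--time_based", "--runtime=3600", "--group_reporting",
    ]
    procs.append(subprocess.Popen(cmd))

for p in procs:
    p.wait()
```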
A/B testing can be done either with user applications or with synthetic load. Synthetic load should include (parameters sketched after this list):
- Large (512KB) asynchronous reads
- Medium (128KB) asynchronous reads and writes
- Small (16KB) synchronous writes, to represent MPI traffic
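
The same mix expressed as fio parameters. Mapping "asynchronous" to libaio with queue depth and "synchronous" to O_SYNC writes is our interpretation, not prescribed anywhere:

```python
# Print fio command lines for the three A/B synthetic load components.
MOUNT = "/mnt/newfs-test"   # placeholder client mount

ab_jobs = [
    ("large-async-read", "randread",  "512k", "--ioengine=libaio --iodepth=16"),
    ("med-async-rw",     "randrw",    "128k", "--ioengine=libaio --iodepth=8"),
    ("small-sync-write", "randwrite", "16k",  "--ioengine=sync --sync=1"),  # MPI-like
]

for name, rw, bs, extra in ab_jobs:
    print(f"fio --name={name} --directory={MOUNT} --rw={rw} --bs={bs} "
          f"{extra} --direct=1 --size=4g --time_based --runtime=600")
```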
User experience is best addressed by finding “power users” who can stress the storage effectively, and then having them run real-world processing both in an existing cluster and the new cluster. This processing must be repeatable; in the event of a significant difference we will want to re-run it and gather detailed information.