File System Evaluation Project

Next-gen HPC file system evaluation and rollout

File storage is a core piece of any HPC deployment, one that nearly every other piece depends upon.  We are taking the opportunity presented by the creation of a new cluster to evaluate and deploy a parallel file system that is better suited to HPC use, easier to manage, and able to give us a path forward as we scale up our infrastructure.

This is a complex task with many requirements and a very high need for reliability.  The major phases are as follows:

 

Phase 1: Information gathering (Status: Complete)
  Identify characteristics of current storage in use, including size, IO, reliability, and features in use.  Identify software candidates for evaluation.

Phase 2: Feature evaluation (Status: Near-complete)
  Generate a grid of features and shortcomings for all software candidates.  Determine which features are mandatory and which are high priority.  From this, choose the top software candidates.

Phase 3: Operational testing (Status: Pending phase 2)
  For each software candidate, select hardware for a testbed install.  Confirm features from phase 2 and evaluate software operation.  From this, identify the primary choice for storage software.

Phase 4: Final cluster setup (Status: Pending phase 3 and final hardware)
  Install and set up hardware, OS, and network.  Install and configure storage software.  Retest failure modes, ensuring they create actionable alerts and do not cause excessive downtime.

Phase 5: Acceptance testing (Status: Pending phase 4)
  Perform full-scale performance testing, first with a synthetic load and then with user loads.  Ensure that documentation matches the final deployment.

Phase 1: Information gathering

We have approximately 20 petabytes of storage serving thousands of cores of compute capacity.  Understanding what our storage is doing is crucial.

Our primary approach for file serving uses NFS v4 as a network protocol, and ZFS as an on-disk filesystem.  These are well understood and capable but have become harder to maintain and less flexible as we have grown.  Filesystem and protocol features in current use include:

  • RAIDZ (redundant against triple failure): data integrity and availability
  • Compression: lowers the storage price for end users
  • Snapshots: mainly used for management functions such as moving data between servers
  • Quotas: directory level, to allow group storage buy-in
  • Hard mounts: jobs pause instead of failing if storage becomes unavailable
  • MPI-IO: requires increased filesystem consistency between hosts
Three clusters were selected to gather representative load and usage information from: Farm, LSSC0, and HPC2.  Data was extracted from approximately 90 days of operation to generate the following average figures:

 

Metric                            | Farm (ZFS)  | LSSC0        | HPC2
----------------------------------|-------------|--------------|----------
File servers sampled              | 10          | 8            | 1
Storage spindles                  | 572         | 772          | 116
Total storage                     | 5.5PB       | 5.4PB        | 650TB
R/W volume per day                | 27TB/28TB   | 30TB/15TB    | 1.5TB/3TB
IO volume per second              | 6 Gbps      | 6 Gbps       | 750 Mbps
IO volume per day / total storage | 1%          | 1%           | 0.7%
R/W ops per second                | 2700/700    | 300/350      | 100/350
R/W op average size               | 115KB/450KB | 1300KB/570KB | 130KB/90KB
NFS IOPS per spindle              | 6           | 1            | 4
Average compression               | 1.4:1       | 1.3:1        | 1.8:1

From this we can conclude (a short cross-check of the arithmetic follows this list):

  • Our average IO volume is roughly 10TB per day per petabyte (1%)
  • Our average IO operations are roughly 700 per second per petabyte
  • Average IO size is moderate to large (>100KB)
  • Effective compression is mandatory for our data
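These per-petabyte figures can be reproduced directly from the measurement table.  A minimal sketch of the arithmetic, using the Farm column as the example (variable names are ours; all input numbers come from the table above):

```python
# Cross-check of the per-petabyte averages, using the Farm column above.
# All input figures come from the Phase 1 measurement table.

farm_capacity_pb = 5.5               # total storage, PB
farm_rw_tb_per_day = 27 + 28         # read + write volume, TB/day
farm_rw_ops_per_sec = 2700 + 700     # read + write operations per second

io_fraction_per_day = farm_rw_tb_per_day / (farm_capacity_pb * 1000)
ops_per_sec_per_pb = farm_rw_ops_per_sec / farm_capacity_pb

print(f"IO volume per day: {io_fraction_per_day:.1%} of total storage")  # ~1.0%
print(f"IO ops per second per PB: {ops_per_sec_per_pb:.0f}")             # ~620, in line with "roughly 700"
```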

Our storage hardware purchase includes:

  • 6 storage nodes
  • 6x 792TB of raw disk (44 spindles per node)

Adding in a 100% safety factor, we project our new cluster to be:

 

Metric                         | New cluster
-------------------------------|------------
File servers                   | 6
Storage spindles               | 264
Total storage (raw)            | 4.7PB
Total storage (after overhead) | 3.5PB
IO volume per day              | 70TB
R/W ops per second             | 6000
R/W op average size            | 100-400KB
Network IOPS per spindle       | 22
Average compression            | 1.4:1

The limiting factor here for performance is very likely to be network IOPS per spindle.
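For reference, a minimal sketch of how this projection can be reproduced from the hardware purchase and the Phase 1 averages (variable names are ours; the 1% per day IO figure, the 100% safety factor, and the 3.5PB post-overhead capacity are taken from above):

```python
# Sketch of the new-cluster projection.  Inputs are the hardware purchase
# (6 nodes, 44 spindles and 792TB each), the Phase 1 per-PB averages, and
# the 100% safety factor described above.

nodes = 6
spindles_per_node = 44
raw_tb_per_node = 792

spindles = nodes * spindles_per_node                 # 264
raw_pb = nodes * raw_tb_per_node / 1000              # ~4.75PB raw
usable_pb = 3.5                                      # after redundancy/overhead (from the table)

safety = 2                                           # 100% safety factor
io_tb_per_day = usable_pb * 1000 * 0.01 * safety     # ~70TB/day at 1% of capacity per day
projected_ops_per_sec = 6000                         # from the table above
iops_per_spindle = projected_ops_per_sec / spindles  # ~22.7 network IOPS per spindle

print(spindles, round(raw_pb, 2), io_tb_per_day, round(iops_per_spindle, 1))
# 264 4.75 70.0 22.7
```

At roughly 22 network IOPS per spindle, versus the 1-6 NFS IOPS per spindle measured on the existing clusters, the per-spindle load increases several-fold, which is why it is flagged as the likely limiting factor.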

Previous evaluations have been focused on Quobyte as a parallel filesystem solution. BeeGFS was researched as an alternative, and two widely used competitors in the HPC space (Lustre and Ceph) have been added to create a comprehensive list for consideration.

Phase 2: Feature evaluation

Our requirements can be divided into features, which can be evaluated ahead of install time, and metrics, which require a testbed to examine behavior.  We first evaluate the features, then choose the top candidates.

 

Feature | Priority | BeeGFS | Lustre | Quobyte | Ceph
--------|----------|--------|--------|---------|-----
Data integrity comparable or superior to current NFS solution (redundant against triple failure) | Mandatory | Yes | Yes | Yes | Yes
Availability comparable or superior to current NFS solution | Mandatory | Partially | Partially | Yes | Yes
Quotas | Mandatory | Yes | Yes | Yes | Yes
Compression | Mandatory | Yes | Yes | Yes | Yes
Point-in-time snapshots | Recommended | No | Yes | Yes | Yes
Storage rebalancing | Recommended | Partially | Yes | Yes | Yes
Central web UI/dashboard | Recommended | Partially | Yes | Yes | Yes
Underlying storage layer | | XFS | ZFS | Proprietary | ceph-volume
Erasure coding across storage hosts | | No | No | Yes | Yes
Hierarchical storage management | | Unknown | Partially | Unknown | Partially
MPI support | | Yes | Yes | Yes | Unknown
User-level security and encrypted transport | | No | Partially | Partially | Yes
Encryption at rest | | Yes | Yes | Yes | Yes
Substantial community | | Yes | Yes | Unknown | Yes

Notes:

  • Data integrity: BeeGFS and Lustre use ZFS with raidz3.  Ceph uses 6+4 erasure coding; 4 parity disks are necessary for operation when one node is down.
  • Availability: BeeGFS and Lustre require that each disk is attached to multiple hosts to meet our availability requirements.  This gives us N+1 at most levels besides the disk chassis.
  • Quotas: per user and per group with BeeGFS; per directory with Ceph.  Current NFS quotas are per directory.
  • Compression: the current NFS/ZFS solution averages 1.4:1 (we cannot afford to lose this).
  • Point-in-time snapshots: required for point-in-time backup integrity.
  • Underlying storage layer: in-house experience is primarily with ZFS.
  • MPI support: NFS may offer better performance.
  • User-level security and encrypted transport: Lustre support is incomplete.  Quobyte supports host-level security only, no Kerberos.

 

From this feature grid, we prioritize the best choices for operational testing.  They are ranked here:

  1. Ceph: Highly redundant and full featured.  Lower performing than Lustre but offers potentially better reliability.
  2. Quobyte: “All in one” solution.  Highly redundant, but initial tests showed poor performance; a review of the configuration is needed.  Proprietary storage layer with unified disk management.
  3. Lustre: Uses ZFS as a storage layer, which is well understood.  Complex but full featured.  Likely to be the best performer at scale.  The most common HPC filesystem.
  4. BeeGFS: Does not support snapshots.  Does not support compression when using XFS.  Unlikely to meet our needs.

Phase 3: Operational testing

Operational testing at this point is based on these assumptions:

  • Initial systems will be delivered in the near future (late August)
  • Storage and network equipment delivery will take an additional 4-8 weeks

To deliver a solution as soon as possible once all hardware is in place, we must perform as much testing as we can using the resources currently available.

For each software candidate we must:

  • Identify an area for install/test
  • Plan our storage topology
  • Install software
  • Validate features listed in the Phase 2 table
  • Evaluate, score and add notes for each category
 

Test category                                                                      | Option 1 (TBD) | Option 2 (TBD)
-----------------------------------------------------------------------------------|----------------|---------------
Lifecycle events: server failure; server added or removed                          |                |
Feature matrix validation?                                                          |                |
Disk events: failure and replacement; addition and removal                         |                |
Misc failures: servers, cables, network; prolonged outage of storage layer         |                |
Storage rebalancing                                                                 |                |
Compression                                                                         |                |
Max usable capacity                                                                 |                |
Management interfaces: “single pane” UI effectiveness; IO analysis and visibility  |                |
Performance (small scale evaluation)                                                |                |
Ease of use for technical staff                                                     |                |
User testing/review of options                                                      |                |

   

Based on this, we will identify our primary choice for deployment and document the structure and rollout plan.

Phase 4: Final cluster setup

Once hardware arrives and we have our primary software choice ready, setup includes:

  • Install and set up network hardware
  • Install and set up hardware and OS on storage nodes
  • Ensure host monitoring is in place
  • Run baseline performance testing to quantify the hardware
  • Install and configure the storage software
  • Set up monitoring to receive storage alerts and to collect statistics
  • Perform feature testing of critical features from Phase 2
  • Repeat failure testing at all levels (disk, enclosure, cabling, server, network), ensuring that any downtime is as expected and that alerts are operating and documented
  • Ensure documentation of architecture and operation is correct and up to date

Phase 5: Acceptance testing

This is our final phase.  We must ensure these three things before rollout:

  • Reliability is as expected
  • Performance is sufficient with our load testing
  • Responsiveness and overall user experience are good

Reliability is primarily addressed in Phase 4, but we must keep an eye out for anything actionable or unusual during Phase 5.

Performance testing can be done using a synthetic background load and then running A/B tests on the new cluster and an old cluster.  The background load should match the worst parts of the IO envelope experienced by Farm and HPC2.  This can be produced by using two instances of fio per test client, averaging:

  • 2700 read ops/s at 120KB
  • 3500 write ops/s at 90KB

This will produce roughly 320MB/s each of read and write traffic.
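That throughput figure follows directly from the op rates and sizes above; a quick sketch of the arithmetic (variable names are ours):

```python
# Throughput implied by the synthetic background load targets above
# (2700 read ops/s at 120KB, 3500 write ops/s at 90KB per fio pair).

read_mb_per_s = 2700 * 120 / 1000    # ~324 MB/s of reads
write_mb_per_s = 3500 * 90 / 1000    # ~315 MB/s of writes

print(f"reads: {read_mb_per_s:.0f} MB/s, writes: {write_mb_per_s:.0f} MB/s")  # roughly 320 MB/s each
```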

A/B testing can be done either with user applications or with synthetic load.  Synthetic load should include:

  • Large (512KB) asynchronous reads
  • Medium (128KB) asynchronous reads and writes
  • Small (16KB) synchronous writes, to represent MPI traffic

User experience is best addressed by finding “power users” who can stress the storage effectively, and then having them run real-world processing on both an existing cluster and the new cluster.  This processing must be repeatable; in the event of a significant difference we will want to re-run it and gather detailed information.