BeeGFS Parallel File System Explained
Barbara Murphy. August 25, 2020
What is BeeGFS?
BeeGFS is a parallel clustered file system, developed with a strong focus on performance and designed for easy installation and management. It originated in 2005 as an internal project at the Fraunhofer Center for High-Performance Computing and was originally known as the Fraunhofer file system (FhGFS).
If I/O-intensive workloads are your problem, BeeGFS is often proposed as a solution because of its parallelism. A BeeGFS-based storage system is currently ranked #7 on the IO500 list, behind Lustre, WekaFS, and Intel DAOS systems.
BeeGFS transparently spreads user data across multiple servers. By increasing the number of servers and disks in the system, you can scale the performance and capacity of the file system to the level you need, seamlessly from small clusters up to enterprise-class systems with thousands of nodes. Similar to the Lustre file system, BeeGFS separates data services from metadata services. Once a client has received the file's layout information from the metadata servers, it can access the data servers directly; this direct data path is what gives it higher performance than traditional NAS systems.
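To make the striping idea concrete, here is a toy Python model of how round-robin striping maps a byte offset in a file to a storage target. This is not BeeGFS code; the chunk size and target count below are illustrative assumptions, not BeeGFS defaults:

```python
# Toy model of round-robin file striping, in the style of parallel file
# systems such as BeeGFS. Illustrative assumptions, not actual BeeGFS
# internals: 512 KiB chunks striped across four storage targets.

CHUNK_SIZE = 512 * 1024   # bytes per chunk (assumed)
NUM_TARGETS = 4           # storage targets the file is striped across (assumed)

def target_for_offset(offset: int) -> int:
    """Return the index of the storage target holding this byte offset."""
    chunk_index = offset // CHUNK_SIZE   # which chunk the byte falls in
    return chunk_index % NUM_TARGETS     # chunks rotate across targets

# The first four chunks land on targets 0, 1, 2, 3; the fifth wraps to 0,
# so a large sequential read keeps all four targets busy in parallel.
for chunk in range(5):
    offset = chunk * CHUNK_SIZE
    print(f"chunk {chunk} (offset {offset:>8}) -> target {target_for_offset(offset)}")
```

Because consecutive chunks live on different servers, a single large file read is served by all targets at once, which is the source of the aggregate bandwidth described above.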
Disadvantages of BeeGFS
BeeGFS is an open-source project designed to cater to academic HPC environments, but it lacks many of the features required in an enterprise environment. The following is a summary of the limitations BeeGFS suffers from:
- Does not support any kind of data protection, such as erasure coding or distributed RAID
- Does not offer file encryption, at rest or in flight
- No native NVMe-over-Fabrics support; a third-party NVMe-over-Fabrics layer must be purchased separately
- Requires separate management and metadata servers
- Limited to legacy storage interfaces such as SAS, SATA, and FC
- Does not support enterprise features such as snapshots, backup, or data tiering
- Does not support enterprise protocols such as NFS or SMB (requires separate gateway services)
BeeGFS & AI
As noted previously, BeeGFS separates data and metadata into separate services, allowing HPC clients to communicate directly with the storage servers. This was common practice for parallel file systems developed in the past and is similar to both Lustre and IBM Spectrum Scale (GPFS). While separating data and metadata services was a significant improvement for large-file I/O, it created a scenario where the metadata services become the bottleneck. Newer workloads in AI and machine learning (ML) are very demanding on metadata services, and many of the files are tiny (4KB or below). Consequently, the metadata server is often the performance bottleneck, and users will not enjoy the design benefits of a parallel file system like BeeGFS. Studying the IO500 numbers for BeeGFS, it is evident that it cannot reach high IOPS, achieving a lower score on the md (metadata) test than on the bw (bandwidth) test.
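A back-of-the-envelope model shows why tiny files turn the metadata server into the bottleneck. The figures below (a metadata server sustaining 100,000 operations per second, storage servers sustaining 10 GiB/s in aggregate) are illustrative assumptions, not BeeGFS measurements:

```python
# Back-of-the-envelope model of a metadata bottleneck with tiny files.
# Both rates below are assumed, illustrative figures -- not BeeGFS numbers.
MD_OPS_PER_SEC = 100_000          # metadata server: lookups/opens per second
AGG_BANDWIDTH = 10 * 1024**3      # storage servers: 10 GiB/s aggregate

def effective_throughput(file_size: int) -> float:
    """Bytes/s delivered when every file read costs one metadata operation."""
    md_limited = MD_OPS_PER_SEC * file_size   # ceiling set by metadata ops
    return min(md_limited, AGG_BANDWIDTH)     # whichever limit binds first

for size in (4 * 1024, 1024**2, 16 * 1024**2):
    gib = effective_throughput(size) / 1024**3
    print(f"{size:>9} B files -> {gib:6.2f} GiB/s")
```

Under these assumptions, 4 KiB files cap deliverable throughput well below 1 GiB/s even though the storage servers could sustain far more; only at megabyte-scale file sizes does the bandwidth limit take over. This is why metadata-heavy AI/ML workloads see little benefit from a bandwidth-oriented design.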
AI and ML workloads also require small-file access with extremely low latency. Unfortunately, BeeGFS does not support newer data paths like NVMe-over-Fabrics or NVIDIA® GPUDirect® Storage, which deliver extremely low latency to GPU-based systems. The result is that expensive GPU resources are starved of I/O, leading to long epoch times and inefficient utilization.
Additionally, most mainstream enterprise customers expect a level of data protection that BeeGFS was never designed for. BeeGFS is commonly referred to as a scratch-space file system, meaning that if there is a major crash, the analysis is simply restarted, with no consideration for data protection. For many ML use cases, the cost of data acquisition is so high that the data has to be fully protected. Imagine if the entire training set for an autonomous vehicle were lost: it would take millions of dollars and many person-years to replace. Consequently, enterprise customers look for table-stakes features that BeeGFS does not offer.
Some common enterprise tasks that are not possible with BeeGFS include:
- User authentication – imagine a disgruntled employee deleting an entire training set; it happens
- Snapshots – commonly used to save specific training runs for comparison with others
- Backup – immutable copies of data that can be retrieved at a later date
- Disaster recovery – protecting data from a major disaster and ensuring it can be recovered
- Encryption – protecting sensitive data (e.g., patient MRIs or X-rays) from threats or rogue actors
- Containerization – integrating with container services for stateful storage
- Quotas – ensuring groups do not consume excessive storage due to bad practices
Comparing Parallel File Systems BeeGFS vs. WekaFS
| Architecture | ThinkParq (BeeGFS) | WekaIO (WekaFS) |
|---|---|---|
| Small Footprint Configuration | 5 servers in 9 RU | 8 servers in 4 RU |
| # of Server Nodes | 2 to hundreds | 8 to thousands |
| Supported Storage Interfaces | Legacy SAS, SATA, FC | Native NVMe |
| NVMe over Fabric | 3rd-party add-on | Built-in |
| Optimized for Mixed Workloads | No | Yes |
| GPUDirect Storage | No | Yes |
| SMB | No | Yes, SMB 2.1 |
| Directories per Directory | No data from vendor | 6.4 trillion |
| Files per Directory | No data from vendor | 6.4 billion |
| File Size | No data from vendor | 4 PB |
| Filesystem Size | No data from vendor | 8 EB (512 PB on flash) |
| Snapshots | No data from vendor | Thousands |
| CSI Plugin for Kubernetes | No | Yes |
| Data Encryption | No | At rest and in flight |
| Read Throughput | 25.2 GB/s, 20 servers | 56 GB/s, 8 servers |
| Write Throughput | 24.8 GB/s, 20 servers | 20 GB/s, 8 servers |
| Read IOPS | No data from vendor | 5.8M |
| Write IOPS | No data from vendor | 1.6M |
| Single Mount Point, Full Coherency | No data from vendor | 82 GB/s |
| #1 on IO500 and SPEC | No | Yes |
Learn how Weka’s parallel file system delivers the highest performance for the most data-intensive workloads.