Distributed File Systems (DFS)
WEKA. April 27, 2021
What is a Distributed File System?
A distributed file system (DFS) differs from typical file systems (e.g., NTFS and HFS) in that it allows direct host access to the same file data from multiple locations. Indeed, the data behind a DFS can reside in a different location from all of the hosts that access it.
Features of Distributed File System
- Transparent local access – From a host perspective, the data is accessed as if it’s local to the host that is accessing it.
- Location independence – Hosts may have no idea where file data physically resides. Data location is managed by the DFS and not by the host accessing it.
- Coherent access – DFS file data is managed so that it appears to the host(s) as if it’s all within a single file system, even though its data could be distributed across many storage devices/servers and locations.
- Great large-file streaming throughput – DFS systems emerged to supply high streaming throughput for HPC workloads, and most continue to do so.
- File locking – DFSs usually support file locking across or within locations, which ensures that no two hosts can modify the same file at the same time.
- Data-in-flight encryption – Most DFS systems support encrypting data and metadata while they are in transit.
- Diverse storage media/systems – Most DFS systems can make use of spinning disk, SAS SSDs, NVMe SSDs, and S3 object storage, as well as private, on-premises object storage to hold file data. While most DFS systems have very specific requirements for metadata servers, their data or file storage can often reside on just about any storage available, including the public cloud.
- Multi-protocol access – Hosts can access DFS data using standard NFS, SMB, or a POSIX client supplied by the solution provider. Occasionally one can also see NVMe-oF for files and NVIDIA GPUDirect Storage access. This can also mean that the same file can be accessed with all protocols the DFS solution supports.
- Multi-networking access – While all DFSs provide Ethernet access to file system data, some also provide InfiniBand and other high performance networking access.
- Local gateways – DFS systems may require some server and storage resources at each location that has access to its file data. Local gateways often cache metadata and data referenced by hosts. Gateways like this can typically be scaled up or down to sustain performance requirements. In some cases where access and data reside together, gateways are not needed.
- Software-defined solutions – Given all the above, most DFS systems are software-defined solutions. Some DFS systems are also available in appliance solutions, but that is more for purchasing/deployment convenience than a requirement of the DFS solution.
- Scale-out storage solution – Most DFS systems support scale-out file systems in which the performance and capacity of file data and metadata services can be increased by adding more metadata or file data server resources, including gateways.
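As a rough illustration of the file locking feature listed above, the sketch below uses POSIX advisory locks from Python's standard `fcntl` module on a file that is assumed to live on a DFS mount. The helper names `exclusive_lock` and `append_record` are hypothetical. Whether the lock is actually honored across hosts depends on the DFS client and its mount options, and advisory locks only coordinate processes that also ask for the lock.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def exclusive_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of the block.

    Assumption: `path` is on a mount whose client forwards POSIX locks to the
    file system's lock service; on a purely local file this only coordinates
    processes on the same host.
    """
    # Open read/write (required for an exclusive lock), creating if missing.
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX)  # blocks until the lock is granted
        yield fd
    finally:
        fcntl.lockf(fd, fcntl.LOCK_UN)  # release before closing
        os.close(fd)


def append_record(path, record):
    """Append one line to a shared file while holding the exclusive lock."""
    with exclusive_lock(path) as fd:
        os.lseek(fd, 0, os.SEEK_END)
        os.write(fd, (record + "\n").encode("utf-8"))
```

Two cooperating hosts that both call `append_record` on the same shared path would take turns rather than interleave their writes, provided the DFS enforces the lock between them.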
Characteristics of a Modern Distributed File System
- High IOPS/great small-file performance – Some DFS systems support very high IOPS for improved small file performance.
- Cross-protocol locking – Some DFS systems honor a lock taken through one protocol when the same file is accessed through another protocol. This feature prevents a file from being corrupted by multi-host access even when hosts access the file with different protocols.
- Cloud resident services – Some DFS solutions can run in a public cloud environment. That is, their file data storage, metadata services, and any monitoring/management services all run on a public cloud provider’s infrastructure. File data access can then take place within a single cloud availability zone (AZ), across cloud regions, or even from on-premises hosts with access to that cloud data.
- High availability support – Some DFS systems also support very high availability by splitting and replicating their control, metadata, and file data storage systems across multiple sites, AZs or servers.
- (File) data reduction – Some DFS solutions support data compression or deduplication designed to reduce the physical data storage space required to store file data.
- Data-at-rest encryption – Some DFS systems offer encryption of file data and metadata at rest.
- Single name space – Some DFS systems provide the ability to stitch multiple file systems/shares into a single name space, which can be used to access any file directory being served.
- Geo-fencing – Some DFS systems can limit or restrict the physical locations in which data can reside and from which it can be accessed. This capability can be required to support GDPR and other legal restrictions on data movement.
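The data reduction item above can be made concrete with a small sketch. It is not any vendor's implementation: it assumes fixed-size chunks, SHA-256 fingerprints for deduplication, and `zlib` for compression, and the function names are hypothetical. Real systems may use variable-size chunking and different algorithms.

```python
import hashlib
import zlib

CHUNK_SIZE = 4096  # assumed fixed chunk size, for illustration only


def reduce_data(data, chunk_size=CHUNK_SIZE):
    """Deduplicate and compress `data`.

    Returns (store, recipe): `store` maps a chunk fingerprint to its
    compressed bytes, and `recipe` lists fingerprints in file order.
    """
    store = {}
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:  # store each unique chunk only once
            store[digest] = zlib.compress(chunk)
        recipe.append(digest)
    return store, recipe


def restore(store, recipe):
    """Rebuild the original bytes from the chunk store and the recipe."""
    return b"".join(zlib.decompress(store[d]) for d in recipe)


def physical_size(store):
    """Bytes actually kept after deduplication and compression."""
    return sum(len(c) for c in store.values())
```

Comparing `physical_size(store)` with `len(data)` gives the reduction achieved for a given input. Highly repetitive data shrinks a lot, while already compressed or encrypted data may not shrink at all.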
Advantages of Distributed File System
There are several reasons why one might consider a DFS solution for their environment, but they all boil down to a need to access the same data from multiple locations. This need could be due to requiring support for multiple sites processing the same data. An example of a team with such a need would be a multi-site engineering team that has compute resources local to their environment and that has a group of engineers who all participate in different phases of engineering a system while using the same data. It could also be due to cloud bursting, where workloads move from on-premises to in-cloud or even to other AZs within a cloud to gain access to more processing power. Finally, another example might include any situation in which one wants to utilize a hybrid cloud solution that requires access to the same data. All these can benefit from the use of a DFS.
Yes, these capabilities could all be accomplished in some other way, such as copying/moving data from one site/cloud to another. But that takes planning, time, and manual, error-prone procedures. A DFS can do all of this for you automatically, without those manual steps.
Disadvantages of Distributed File System
If data resides on cloud storage and is consumed elsewhere, cloud egress charges may be significant. This cost can be mitigated somewhat by data reduction techniques and the fact that DFS will move only data that is being accessed, but that may still leave a significant egress expense.
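A back-of-the-envelope estimate shows how these factors interact. The sketch below is only a simple model: the function name, the parameters, and the price used in the usage note are assumptions, not published cloud rates.

```python
def estimate_egress_cost(dataset_gb, accessed_fraction, reduction_ratio, price_per_gb):
    """Estimate egress cost for data pulled out of cloud storage.

    dataset_gb        -- logical size of the data set in GB
    accessed_fraction -- share of the data actually read remotely (0.0 to 1.0)
    reduction_ratio   -- logical/physical size ratio on the wire (>= 1.0)
    price_per_gb      -- hypothetical egress price per GB
    """
    if not 0.0 <= accessed_fraction <= 1.0:
        raise ValueError("accessed_fraction must be between 0 and 1")
    if reduction_ratio < 1.0:
        raise ValueError("reduction_ratio must be at least 1")
    transferred_gb = dataset_gb * accessed_fraction / reduction_ratio
    return transferred_gb * price_per_gb
```

For example, with a 10,000 GB data set, a quarter of it accessed remotely, 2:1 data reduction, and an assumed price of 0.09 per GB, the model transfers 1,250 GB and yields a cost of roughly 112.5. Without reduction and with the full data set accessed, the same price gives about 900.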
While DFS systems can cache data to improve local access, none can defeat the speed of light. That is, the access to the first non-cached byte of data may take an inordinate amount of time depending on how far it is from where it is being accessed. DFS solutions can minimize this overhead by overlapping cached data access while fetching the next portion of data, but doing this will not always mask the network latency required to access a non-cached portion of the data.
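The effect can be approximated with a simple model. The sketch below assumes light travels through optical fiber at roughly 200,000 km/s (about two thirds of its speed in a vacuum) and treats prefetching as a two-stage pipeline that fetches the next chunk while the host processes the current one. The function names are hypothetical, and real networks add routing, queuing, and protocol overhead on top of this lower bound.

```python
FIBER_KM_PER_S = 200_000.0  # approximate speed of light in optical fiber


def min_round_trip_s(distance_km):
    """Lower bound on round-trip time over fiber for a given distance."""
    return 2.0 * distance_km / FIBER_KM_PER_S


def serial_read_s(chunks, fetch_s, process_s):
    """Total time when each chunk is fetched, then processed, one by one."""
    return chunks * (fetch_s + process_s)


def prefetched_read_s(chunks, fetch_s, process_s):
    """Total time when the next chunk is fetched while the current one
    is being processed (one chunk of read-ahead)."""
    if chunks == 0:
        return 0.0
    # The first fetch is always exposed; after that, the slower stage
    # sets the pace, and the last chunk still has to be processed.
    return fetch_s + (chunks - 1) * max(fetch_s, process_s) + process_s
```

At 4,000 km the round trip alone costs at least 40 ms. If fetching a chunk takes longer than processing it, `prefetched_read_s` is dominated by the fetch time, which is the case where prefetching cannot hide the network latency.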
DFS systems can sometimes be complex to deploy. When data can reside just about anywhere and hosts can access all of it from anywhere else, getting everything to work together properly and tuning it for high performance can be a significant challenge. Vendor support can help considerably. DFS vendors typically offer professional services to help deploy and configure their systems and get them up and running in a timely fashion. They also often have modeling tools that estimate how many gateway, metadata, and storage resources will be required to meet your performance requirements.
It all boils down to this: DFS systems offer global access to the same data that is very difficult to accomplish effectively in any other way, especially when you have multiple sites, all computing with and consuming the same data. DFS systems can make all of this look easy and seamless, without breaking a sweat. So, if you have multiple sites with a need to access the same data, a DFS system can be a godsend.