Genomic Sequencing and Role of Parallel File System
Shimon Ben David. February 25, 2020
Shimon Ben David, Field Chief Technology Officer at WekaIO, shares his perspective on the company’s file system through an imaginary episode of The X-Files titled “The X-Files: The Truth is Out There.”
Imagine the following unscreened episode of The X-Files:
Agent Mulder and Agent Scully, following an anonymous tip, arrive at a remote location in the Nevada desert and find an abandoned spaceship. “Let’s go in!” says Mulder. With a nod from Scully, both agents enter the UFO and search for clues to its origin. Agent Scully soon discovers a holographic screen that activates when she touches it. After briefly looking through its contents, her eyes go wide, and she says, “Look at this, Mulder! It seems like they were trying to create a catalog of all the living species on Earth. They’re using technology that’s unlike anything I’ve ever seen.” “Let’s try to find out what kind of advanced technology allows them to accomplish all of that,” says Mulder as he opens a door marked (strangely enough, in English) “Engine Room.” Upon opening the door, smoke billows out. “This looks like my car’s engine,” says Mulder. As the smoke clears, both agents notice a little green alien seated by the engine. The alien looks at the two and says (again, strangely enough, in English), “Once we started sequencing the genomes of Earth’s species, the engine couldn’t keep up and we crashed. It worked well for the last 20 years on other planets so we never really thought about replacing the engine…”
This story might sound absurd, but it illustrates a problem that many scientific and pharmaceutical institutes are encountering today. Technology has advanced dramatically in the last twenty years, with processing units such as CPUs, GPUs, FPGAs, and other accelerators roughly doubling their density and performance every 18 months. Networking has also grown and reinvented itself over the same period. Take a look at the progression of Ethernet, which was typically deployed at 100 Mb/s twenty years ago. Today, at 100 Gb/s, it’s the mainstream backbone in data centers, leaving Fibre Channel behind because it’s too complicated, expensive, and slow. InfiniBand, which won over many HPC data centers, now supports HDR (200 Gb/s) and soon NDR (400 Gb/s), while 20 years ago it only supported SDR (2.5 Gb/s per lane, or 10 Gb/s on a comparable 4x link). Hard drives have been replaced by SSDs, and even protocols have changed, with “good old” parallel SCSI giving way to SAS and then to NVMe. Yet even with all of this, some areas haven’t really changed in the last 20 years.
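To put those numbers side by side, here is a quick back-of-the-envelope sketch. It treats the 18-month doubling as an idealized rule of thumb rather than a measured figure, and it compares InfiniBand at the 4x link width (10 Gb/s for SDR versus 200 Gb/s for HDR) so that like is compared with like. The function names are made up for this illustration.

```python
"""Rough growth factors over a 20-year span (illustrative only)."""

MBPS = 1_000_000        # bits per second in one Mb/s
GBPS = 1_000_000_000    # bits per second in one Gb/s


def doubling_growth(years, doubling_period_months=18.0):
    """Growth factor if something doubles every `doubling_period_months`."""
    return 2.0 ** (years * 12.0 / doubling_period_months)


def speedup(old_rate, new_rate):
    """How many times faster the new link rate is than the old one."""
    return new_rate / old_rate


# Assumption: an idealized doubling every 18 months, sustained for 20 years.
compute_growth = doubling_growth(20)

# Ethernet figures as quoted in the text: 100 Mb/s then, 100 Gb/s now.
ethernet_growth = speedup(100 * MBPS, 100 * GBPS)

# InfiniBand at 4x link width: SDR is 4 lanes x 2.5 Gb/s = 10 Gb/s, HDR is 200 Gb/s.
infiniband_growth = speedup(10 * GBPS, 200 * GBPS)

if __name__ == "__main__":
    print(f"compute (idealized): ~{compute_growth:,.0f}x")
    print(f"Ethernet:            {ethernet_growth:,.0f}x")
    print(f"InfiniBand (4x):     {infiniband_growth:,.0f}x")
```

Under these assumptions, compute grows by roughly four orders of magnitude and Ethernet by three. Any layer of the stack that has stayed flat over the same twenty years ends up being the engine that cannot keep up.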
Storage Requirements for Genomic Data Sequencing
Technical improvements have allowed researchers to keep advancing across the life sciences. In genomic sequencing, enormous progress has been made since the 1970s; for example, classical shotgun sequencing has been improved upon by short-read and long-read sequencing. Beyond sequencing, even Cryo-EM is becoming more and more popular. It was inevitable that all of these advancements would strain some of the legacy parts of the architecture.
Genomic sequencers from companies such as Illumina, Thermo Fisher, and Agilent are being asked to produce better and faster results, with output ranging from millions of small files to thousands of big files per sequencing run. Following the sequencing, GATK (Genome Analysis Toolkit) workflows and numerous applications such as Relion, Parabricks, Clara, Gaussian, and Schrödinger, as well as many self-developed software tools, analyze the flood of data and turn it into usable, meaningful information.
This all raises the question, “What about replacing the engine?” The same underlying storage infrastructure (hardware, protocols, and algorithms) that worked well for all these years cannot handle today’s new data types.
Genomic Data Sequencing with Parallel File System
New storage environments need to allow faster access to billions of files, whether small, large, or HUGE, all in the same namespace and often in the same directory. They need to provide microsecond latency (which is incredibly important to CPU-bound workloads such as bcl2fastq conversion) while also being cost-effective and SIMPLE TO MANAGE so that researchers can just concentrate on research.
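To make “latency across many small files in one directory” a little more concrete, here is a minimal sketch of how you might probe it on any mounted file system. It is not a WekaFS benchmark and not a substitute for a real tool; the function names, file count, and file size are all assumptions chosen for illustration. Note also that files read right after being written are usually served from the operating system's page cache, so a serious test would use a much larger data set or drop caches first.

```python
"""Tiny small-file read latency probe (illustrative sketch, not a benchmark)."""

import os
import statistics
import tempfile
import time


def measure_small_file_read_latency(directory, file_count=200, file_size=4096):
    """Write `file_count` small files into one directory, then time each open+read.

    Returns a list of per-file latencies in microseconds.
    """
    payload = os.urandom(file_size)
    paths = []
    for i in range(file_count):
        path = os.path.join(directory, f"read_{i:06d}.dat")
        with open(path, "wb") as handle:
            handle.write(payload)
        paths.append(path)

    latencies_us = []
    for path in paths:
        start = time.perf_counter_ns()
        with open(path, "rb") as handle:
            data = handle.read()
        elapsed_ns = time.perf_counter_ns() - start
        assert len(data) == file_size  # sanity check: the whole file was read
        latencies_us.append(elapsed_ns / 1000.0)
    return latencies_us


def summarize(latencies_us):
    """Reduce raw latencies to a file count, the median, and an approximate p99."""
    ordered = sorted(latencies_us)
    p99_index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return {
        "files": len(ordered),
        "median_us": statistics.median(ordered),
        "p99_us": ordered[p99_index],
    }


if __name__ == "__main__":
    # Assumption: a temporary directory stands in for the file system under test.
    with tempfile.TemporaryDirectory() as scratch:
        print(summarize(measure_small_file_read_latency(scratch)))
```

To try it against a particular mount, pass a directory on that mount instead of the temporary one. The tail (p99) figure is often more telling than the median, because a pipeline that touches millions of files tends to wait on its slowest accesses.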
The WekaFS™ file system was designed with these requirements in mind and has proven at multiple organizations to be easy to manage and able to speed up research and time-to-market. Some relevant use cases that were accelerated simply by placing WekaFS as the underlying file system are Cryo-EM, bcl2fastq conversions, GATK workflows, and GPU-accelerated medical imaging environments.
Now, let’s get back to that scene from The X-Files. Agent Mulder looks at the green alien and slowly moves his hand behind his back like he’s reaching for his gun. Instead, in one quick motion, he takes out his wallet and hands over a card, “This is the number of my WekaIO partner. Call them — the truth is out there…”