An End-to-end Genomics Solution or Just More Infrastructure?
David Hiatt. May 23, 2019
Do you have an end-to-end genomics solution or just more infrastructure? It depends on whether the individual products have been tested together and deliver a holistic value that is greater than the sum of the component parts.
When speaking to prospective customers, vendors frequently refer to their products as “solutions.” While this is common practice in the IT industry, it is really a misnomer because to most customers, virtually all of these “solutions” are merely point products. Of course, it also depends on the question being asked. If the question is narrowly focused, then yes, a particular SSD or software tool could be a solution. However, from an end-user’s perspective, most of these are not a complete solution. So, what is the definition of a solution, and why is it important?
A true solution needs to address all aspects of a particular problem. For example, the life sciences industry struggles with both storing and managing massive amounts of data and efficiently analyzing it. A storage product such as WekaIO’s Matrix™ addresses the storage aspect but not the analysis part. Based on that, Matrix alone would not be a solution. Similarly, the analytical tools important to making sense of the data are not a solution either. A mashup of point products is also not a solution. Products need to be integrated, tested, and supported by each of the participating vendors.
At BioIT this year, WekaIO demonstrated a comprehensive solution for secondary analysis of genomic sequencing data. Secondary analysis is a three-step process in which data coming off the sequencer is read and aligned, compared to a reference genome, and analyzed for variants. This is time-consuming, I/O-intensive work, especially when sequencing multiple genomes.
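To make the three steps concrete, here is a deliberately tiny sketch in Python. It is a toy, not how DNASeq, GATK, or any production pipeline works: real aligners use indexed references and real variant callers use statistical models. The function names, the `min_depth` parameter, and the sample sequences are all hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def align(read, reference):
    """Step 1: return the 0-based offset where the read has the fewest mismatches."""
    best_pos, best_mismatches = 0, len(read) + 1
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches < best_mismatches:
            best_pos, best_mismatches = pos, mismatches
    return best_pos


def call_variants(reads, reference, min_depth=2):
    """Steps 2 and 3: pile aligned reads up against the reference and
    report positions where at least min_depth reads agree on a different base."""
    pileup = defaultdict(Counter)
    for read in reads:
        start = align(read, reference)
        for offset, base in enumerate(read):
            pileup[start + offset][base] += 1

    variants = []
    for pos in sorted(pileup):
        base, count = pileup[pos].most_common(1)[0]
        if count >= min_depth and base != reference[pos]:
            variants.append((pos, reference[pos], base))
    return variants


# Hypothetical data: the sample differs from the reference at position 14 (A -> T).
reference = "ACGTACGTTAGCCGATAGGCTA"
reads = ["ACGTACGTTA", "TAGCCGTTAG", "CGTTAGGCTA"]

print(call_variants(reads, reference))  # [(14, 'A', 'T')]
```

Even in this toy, every read is compared against many positions of the reference, which hints at why the real workload is so compute- and I/O-heavy at genome scale.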
The solution combined best-of-breed products from WekaIO, Western Digital Corporation (WDC), PetaGene, and Sentieon. WekaIO provided the Matrix high-performance parallel file system with a flash primary storage tier for active data; WDC’s ActiveScale™ provided a cost-effective capacity tier for long-term storage and data protection; PetaGene’s PetaSuite™ provided best-in-class, lossless data compression technology; and Sentieon provided its award-winning, highly optimized DNASeq™ secondary analysis pipeline.
Demonstrating an End-to-end Solution
The demonstration showcased the interoperability of these products using actual genomic data in a real-world simulation. First, the source FASTQ file was ingested into the cluster and transparently compressed using PetaSuite, so existing workflows were not affected. The compressed file was then made available to the DNASeq pipeline for processing. The ability to work on compressed data reduces processing time and resource requirements. WekaIO’s Matrix file system significantly accelerates processing by eliminating storage bottlenecks that constrict the flow of data to the analytical pipeline. DNASeq is a highly optimized pipeline based on the Broad Institute’s Genome Analysis Toolkit (GATK), the gold standard.
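The idea of transparent compression can be sketched in a few lines of Python. This article does not describe how PetaSuite does it, so the example uses standard gzip purely as a stand-in; `open_fastq` and `read_fastq` are hypothetical helper names. The point is only that the caller iterates over records the same way whether the file on disk is compressed or not.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_fastq(path):
    """Open a FASTQ file as text, decompressing on the fly if it is gzip-compressed."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def read_fastq(path):
    """Yield (read_id, sequence, quality) tuples from a four-line-per-record FASTQ file."""
    with open_fastq(path) as handle:
        while True:
            header = handle.readline().rstrip("\n")
            if not header:
                break  # end of file
            sequence = handle.readline().rstrip("\n")
            handle.readline()  # '+' separator line, ignored
            quality = handle.readline().rstrip("\n")
            yield header[1:], sequence, quality  # drop the leading '@'
```

A downstream tool would call `read_fastq("sample.fastq")` or `read_fastq("sample.fastq.gz")` (hypothetical file names) and receive identical records. In this sketch the decompression lives in the application code; the demo described above is presented as requiring no change to the tools at all.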
After the data is analyzed, it is retained for long periods of time, sometimes indefinitely, which requires a cost-effective means of storing and protecting it. This is especially challenging as the processed data accumulates. WDC’s ActiveScale object storage is well suited to this, with data durability of 19 nines and geo-replication across up to three sites for hundreds of petabytes. The figure below shows the architecture of the demo system.
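For a rough sense of what a 19-nines durability figure implies, the arithmetic can be sketched as follows. This assumes the figure means an annual per-object durability of 99.99999999999999999%, which is a common reading but is not defined in this article; vendors differ in how they specify it. The function names and the object count are hypothetical.

```python
from fractions import Fraction


def annual_loss_probability(nines):
    """Assumed meaning: N nines of durability = 1 in 10**N chance of losing an object per year."""
    return Fraction(1, 10 ** nines)


def expected_objects_lost(object_count, nines, years=1):
    """Approximate expected number of objects lost over the given period."""
    return object_count * years * annual_loss_probability(nines)


# Hypothetical archive of one trillion objects kept for one year.
print(float(expected_objects_lost(10**12, 19)))  # 1e-07
```

Exact fractions are used because the value is too close to 1 for ordinary floating point: in Python, `1.0 - 0.9999999999999999999` evaluates to `0.0`.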
No Modifications to Established Workflows
Solution Value Stacking
Creating an end-to-end genomics solution requires careful component selection to deliver holistic value that is greater than the sum of the parts. The system constructed here provides synergistic value because all its components have been tested together, it requires no changes to existing workflows, and it is built using best-of-breed, industry-standard genomics tools to ensure accurate and consistent results. The result is a flexible, fully configurable, scalable secondary analysis platform that is far more cost-efficient and performant than other solutions, allowing investigators to complete more research in less time.