An End-to-end Genomics Solution or Just More Infrastructure?

David Hiatt. May 23, 2019

Do you have an end-to-end genomics solution or just more infrastructure? The answer depends on whether the individual products have been tested together and deliver a holistic value that is greater than the sum of the component parts.

When speaking to prospective customers, vendors frequently refer to their products as “solutions.” While this is common practice in the IT industry, it is really a misnomer because to most customers, virtually all of these “solutions” are merely point products. Of course, it also depends on the question being asked. If the question is narrowly focused, then yes, a particular SSD or software tool could be a solution. However, from an end-user’s perspective, most of these are not a complete solution. So, what is the definition of a solution, and why is it important?

A true solution needs to address all aspects of a particular problem. For example, the life sciences industry struggles with both storing and managing massive amounts of data and efficiently analyzing it. A storage product such as WekaIO’s Matrix™ addresses the storage aspect but not the analysis part. Based on that, Matrix alone would not be a solution. Similarly, the analytical tools important to making sense of the data are not a solution either. A mashup of point products is also not a solution. Products need to be integrated, tested, and supported by each of the participating vendors.

At BioIT this year, WekaIO demonstrated a comprehensive solution for secondary analysis of genomic sequencing data. Secondary analysis is a three-step process in which data coming off the sequencer is read and aligned, compared to a reference genome, and analyzed for variants. This is very time-consuming, I/O-intensive work, especially when sequencing multiple genomes.

The solution combined best-of-breed products from WekaIO, Western Digital Corporation (WDC), PetaGene, and Sentieon. WekaIO provided the Matrix high-performance parallel file system with a flash primary storage tier for active data; WDC’s ActiveScale™ provided a cost-effective capacity tier for long-term storage and data protection; PetaGene’s PetaSuite™ provided best-in-class, lossless data compression technology; and Sentieon provided its award-winning, highly optimized DNASeq™ secondary analysis pipeline.

Demonstrating an End-to-end Solution

The demonstration showcased the interoperability of these products using actual genomic data in a real-world simulation. First, the source FASTQ file was ingested into the cluster and transparently compressed using PetaSuite, meaning that there was no impact on existing workflows. The compressed file was then made available to the DNASeq pipeline for processing. The ability to work on compressed data reduces processing time and resource requirements. WekaIO’s Matrix file system significantly accelerates processing by eliminating storage bottlenecks that constrict the flow of data to the analytical pipeline. DNASeq is a highly optimized pipeline based on the Broad Institute’s Genome Analysis Toolkit (GATK), the gold standard for this kind of analysis.
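The idea of transparent compression can be illustrated with a minimal Python sketch. It is not how PetaSuite or DNASeq work internally: gzip is only a stand-in for the compressor, `count_reads` is a toy stand-in for an analysis step, and every function name here is hypothetical. The point is that the analysis code reads ordinary FASTQ lines and never needs to know the file was compressed at ingest.

```python
import gzip
import tempfile
from pathlib import Path


def ingest_compressed(fastq_text: str, destination: Path) -> Path:
    """Store FASTQ text in compressed form (gzip is an assumed stand-in)."""
    with gzip.open(destination, "wt") as handle:
        handle.write(fastq_text)
    return destination


def open_transparently(path: Path):
    """Return a text handle that yields the original, uncompressed lines."""
    # Detect gzip by its two-byte magic number instead of trusting the name.
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def count_reads(path: Path) -> int:
    """Toy analysis step: a FASTQ record occupies four lines."""
    with open_transparently(path) as handle:
        return sum(1 for _ in handle) // 4


if __name__ == "__main__":
    sample = "@read1\nACGT\n+\nIIII\n@read2\nTTGA\n+\nIIII\n"
    with tempfile.TemporaryDirectory() as workdir:
        stored = ingest_compressed(sample, Path(workdir) / "sample.fastq.gz")
        print(count_reads(stored))
```

Because the decompression happens inside `open_transparently`, the same `count_reads` call works on both compressed and uncompressed inputs, which is the sense in which an existing workflow stays unchanged.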

After the data is analyzed, it is retained for long periods of time, sometimes indefinitely, which calls for a cost-effective means of storing and protecting it. This is especially challenging as the processed data accumulates. WDC’s ActiveScale object storage is well suited to this, with data durability of 19 nines and geo-replication across up to three sites for hundreds of petabytes. The figure below shows the architecture of the demo system.
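To put the 19 nines figure in perspective, here is a small worked calculation. It assumes the durability is quoted per object per year, which is the common convention but is not stated in this post; the object count is a made-up illustration, and `expected_annual_losses` is a hypothetical helper name.

```python
from fractions import Fraction


def expected_annual_losses(object_count: int, nines: int) -> Fraction:
    """Expected objects lost per year under an assumed per-object annual durability."""
    # N nines of durability means a loss probability of 10^-N per object per year.
    loss_probability = Fraction(1, 10 ** nines)
    return object_count * loss_probability


if __name__ == "__main__":
    stored_objects = 10 ** 12  # illustrative only: one trillion objects
    losses = expected_annual_losses(stored_objects, 19)
    print(float(losses))  # 1e-07 expected objects lost per year
```

Under that assumption, even a trillion stored objects would see an expected loss of about one ten-millionth of an object per year.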


No Modifications to Established Workflows

Solution Value Stacking


Creating an end-to-end genomics solution requires careful component selection in order to create a holistic value that is greater than the mere sum of the parts. The system constructed here provides synergistic value because all its components have been tested together, it requires no changes to existing workflows, and it is built using best-of-breed, industry-standard genomics tools to ensure accurate and consistent results. The result is a flexible, fully configurable, scalable secondary analysis platform that is far more cost-efficient and performant than other solutions, allowing investigators to complete more research in less time.

For more information, read our joint solutions brief or contact us at Info@Weka.IO.

