Complete Guide to Genomics Data Storage
December 9, 2022
Researchers in genetics are at the cutting edge of medical science and are handling massive quantities of data to do so.
How is genomics data stored? It is stored in secure, resilient hybrid systems that support parallel access, where speed, performance, and privacy are paramount.
What Is a Genome?
The genome is the entirety of an organism’s genetic information. This doesn’t mean the physical genetic material itself. Rather, it is the complete set of sequences that serve as the instructions from which genetic material, and eventually cells, are built.
This typically refers to the DNA molecules, and in some cases RNA molecules, that carry genetic information.
Since the discovery of DNA and its structure, understanding how this information becomes the basic building blocks of life has been a massive and complex question in the life sciences. One of the central questions of these inquiries into DNA and genomics has revolved around how to sequence the information in a genome.
Each genome consists of nucleotides, which scientists label with letters (A, C, G, T), arranged into chromosomes. Between members of the same species, there is generally little deviation from the common chromosome makeup. As populations drift further apart genetically, new species emerge.
Sequencing and understanding DNA sequences could unlock the secrets of mitigating or curing genetic diseases, minimizing hereditary genetic issues (blood disease, high blood pressure, etc.), and building new sequences to help fight evolving viruses and bacteria.
Genomic Data and Cloud Storage
One of the significant challenges of storing genomic information is that the data is massive. The raw sequencing data for a single human genome, for example, can exceed 200 GB, even though the finished sequence itself (literally a series of “A”s, “C”s, “G”s, and “T”s covering roughly 3 billion base pairs) fits in a few gigabytes.
And that’s just one genome. It’s estimated that storing the data generated in processing human genomic sequences worldwide will require about 40 exabytes.
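A rough calculation shows how much the storage needed for a single genome depends on how it is represented. The sketch below is a back-of-the-envelope estimate, not a measurement: the genome length of about 3.1 billion bases, the 30x coverage, and the 2 bytes per sequenced base (one for the base, one for its quality score) are all illustrative assumptions, and the function names are made up for this example.

```python
# Back-of-the-envelope storage estimates for one human genome.
# Every constant here is an illustrative assumption, not a measured value.

GENOME_BASES = 3_100_000_000  # assumed approximate length of one human genome


def plain_text_bytes(bases: int) -> int:
    """One ASCII character (1 byte) per base: A, C, G, or T."""
    return bases


def two_bit_bytes(bases: int) -> int:
    """Four possible letters need 2 bits each, so 4 bases fit in a byte."""
    return (bases + 3) // 4  # round up to a whole byte


def raw_reads_bytes(bases: int, coverage: int, bytes_per_base: float) -> int:
    """Raw reads: each base is sequenced `coverage` times, and every
    sequenced base carries extra data such as a quality score."""
    return int(bases * coverage * bytes_per_base)


def to_gb(n_bytes: int) -> float:
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1e9


if __name__ == "__main__":
    print(f"plain text : {to_gb(plain_text_bytes(GENOME_BASES)):.1f} GB")
    print(f"2-bit pack : {to_gb(two_bit_bytes(GENOME_BASES)):.2f} GB")
    # Assumed: 30x coverage, about 2 bytes per sequenced base.
    print(f"raw reads  : {to_gb(raw_reads_bytes(GENOME_BASES, 30, 2.0)):.0f} GB")
```

Under these assumptions the sequence itself is a few gigabytes, while the raw reads behind it run to well over a hundred gigabytes before any annotations or analysis output are added.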
That massive data requirement comes from a few challenges:
- Metadata: It’s not enough to just store a genome sequence. Scientists must add their annotations, address gaps in the chain, etc., all of which require additional data.
- Sequencing: Scientists are not simply reading DNA and then typing out the sequences. They use powerful analytics and sequencing tools to better understand gaps or aberrations in the sequence. This, in turn, generates a rapidly growing pool of research data.
- Privacy: We often use the term “human genome” as a generic placeholder… and that’s something that scientists are studying broadly. However, they also research the individual sequences of real people with real health issues. The ethical impact of lost genomic sequence information is profound. Privacy and security controls must be in place for any storage solution.
- Workflows: The processing of genomic data involves parallel processing, retrieval, compression/decompression, and verification processes that run multiple times per day. Storing genomic data means having robust and reliable workflows to handle extensive data.
- Chain of Custody: Accountability is a significant challenge when dealing with large data sets where integrity and privacy are essential. A logging and monitoring solution that traces ownership and activity lets scientists trust the data they are working with (and track back through that forensic trail as needed).
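The verification and chain-of-custody ideas above can be illustrated with a small sketch. It is not how any particular storage product works: it assumes SHA-256 checksums for integrity, gzip for compression, and a simple in-memory log in which each entry's hash covers the previous entry, so later tampering becomes detectable. The names `compress_with_checksum`, `decompress_and_verify`, and `CustodyLog` are hypothetical.

```python
import gzip
import hashlib
import json
import time


def sha256_hex(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def compress_with_checksum(raw: bytes) -> tuple[bytes, str]:
    """Compress data and return it with the checksum of the original bytes."""
    return gzip.compress(raw), sha256_hex(raw)


def decompress_and_verify(blob: bytes, expected: str) -> bytes:
    """Decompress and fail loudly if the content no longer matches."""
    raw = gzip.decompress(blob)
    if sha256_hex(raw) != expected:
        raise ValueError("checksum mismatch: data corrupted or altered")
    return raw


class CustodyLog:
    """Append-only log; each entry's hash covers the previous entry's hash."""

    GENESIS = "0" * 64  # placeholder "previous hash" for the first entry

    def __init__(self) -> None:
        self.entries: list[dict] = []

    @staticmethod
    def _digest(body: dict) -> str:
        return sha256_hex(json.dumps(body, sort_keys=True).encode())

    def record(self, actor: str, action: str, checksum: str) -> dict:
        prev = self.entries[-1]["entry_hash"] if self.entries else self.GENESIS
        body = {"actor": actor, "action": action, "checksum": checksum,
                "time": time.time(), "prev_hash": prev}
        entry = {**body, "entry_hash": self._digest(body)}
        self.entries.append(entry)
        return entry

    def verify(self) -> bool:
        """Recompute every hash; any edited or reordered entry breaks the chain."""
        prev = self.GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            if entry["prev_hash"] != prev or self._digest(body) != entry["entry_hash"]:
                return False
            prev = entry["entry_hash"]
        return True
```

A workflow would call `record` each time a file is uploaded, read, or transformed, and run `verify` before trusting the trail. A production system would also need durable, access-controlled storage for the log itself, which this sketch leaves out.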
What Should I Look for in a Genomic Storage Solution?
OK, so the challenges are many, the data is complex, and you know that you need a lot of storage even to begin a genomic project. A cloud storage and processing solution is the right way to go. But not all cloud systems are created equal, and not all functions and features will ensure your project is successful.
So, what should you look for in a genomic storage solution?
- Scalability: Perhaps most importantly, the solution must be able to grow when and how you need it. Any solution you consider should be able to scale out without compromising any of the other features discussed here, which will almost certainly mean a hybrid cloud environment.
- Security and Compliance: Genomic information is healthcare information, and even if your project falls under sanctioned healthcare research, you will still face significant demands from regulatory agencies. And rightly so, considering the ethical issues of storing and using a person’s genetic information. Your solution must be able to secure information against unauthorized disclosure at all points of storage, and you will most likely need to account for Department of Health and Human Services requirements, including HIPAA.
- Object Storage: When handling this kind of data, it’s common to eschew strict organization and labeling until the team is ready to perform specific analytics. That means working with robust object storage, forgoing data warehouses as the primary form of storage, and relying on data lakes for fast and scalable access.
- Parallel File Access: Speaking of access, consider a solution that can support parallel file access. This offers rapid, reliable file retrieval services to help reduce performance bottlenecks.
- Support for Hot and Cold Storage: Maintain backups, resilient recovery systems, and regular archives. A strong candidate for genomic cloud storage should be able to support a hot/cold storage strategy to keep costs, risks, and performance issues down.
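To make the parallel file access point concrete, here is a minimal sketch of reading one large file as several byte ranges at the same time. It uses only the Python standard library and an ordinary local file, so it illustrates the access pattern rather than the behavior of any particular parallel file system. The function names `read_range` and `parallel_read`, the chunk size, and the worker count are assumptions made for this example.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_range(path: str, offset: int, length: int) -> bytes:
    """Read one byte range with its own file handle, so threads never
    share (and fight over) a single seek position."""
    with open(path, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


def parallel_read(path: str, chunk_size: int, workers: int = 4) -> bytes:
    """Split the file into fixed-size ranges and fetch them concurrently."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    size = os.path.getsize(path)
    offsets = range(0, size, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in offset order, not completion order,
        # so the chunks can be joined directly.
        chunks = pool.map(lambda off: read_range(path, off, chunk_size), offsets)
        return b"".join(chunks)
```

On a single local disk this may not be faster than one sequential read. The pattern pays off when the storage layer can serve many ranges independently, which is the situation parallel file access is meant to exploit.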
Meet Your Genomic Data Storage Challenges with WEKA
The sequencing of genomic data is one of the most challenging and important research areas in medicine. Cloud computing has enabled researchers to make massive strides in understanding our genetic information. To continue this progress, data scientists must work with tools that support this amazing mission: fast, reliable, scalable, and always-on data storage.
With WEKA, you get the following benefits:
- Streamlined and fast cloud file systems to combine multiple sources into a single high-performance computing system
- Industry-best GPUDirect performance (113 Gbps for a single DGX-2 and 162 Gbps for a single DGX A100)
- In-flight and at-rest encryption for governance, risk, and compliance requirements
- Agile access and management for edge, core, and cloud development
- Scalability up to exabytes of storage across billions of files
Contact our team of experts to learn more about WEKA and edge computing architecture.