Using GPUs in Genomic Sequencing
Shimon Ben David. July 28, 2020
What is genomic sequencing?
Genomic sequencing is the process of figuring out the DNA sequence of an organism such as a person, animal or even a virus. Genomic sequencing is used to detect diseases, develop vaccines, trace the origins of a person and other use cases. Genomic sequencing is performed by taking a DNA sample and running it through a sequencer that will produce a stream of DNA that will need to be analyzed and sorted.
DNA is the building block of organisms such as people, animals, germs, and other beings. There are two major types of sequencing, the original type is called Sanger sequencing and then there are newer types that are collectively called High Throughput Sequencing (HTS) or Next-Generation Sequencing (NGS). Each type of sequencing has its advantages as well as it’s requirements, some will require longer physical preparations while others would require more compute power to go over the results. The following steps would usually be involved when sequencing. Taking a DNA sample to be sequenced from the subject, preparing it in such a way that it will be ready for the sequencer and then running it through the genome sequencer which will produce a stream of DNA that will need to be aligned, analyzed and sorted. The eventual output is a text file that describes the DNA string. This output can then be compared to different populations to identify different traits, for example compare a person that is suspected of having a certain genetic disease with a population of people that are known to have it, will allow determining if that person has the same disease as well as decide which medication will have the best effect for a possible treatment. Another use case is comparing the sequenced results with different populations to identify ancestry, since populations that are more geographically and physically remote tend to differ more.
GPUs & Genomic Sequencing
The process of genomic sequencing is a lengthy one that requires going over massive amounts of data and therefore it takes a long time to produce results and may require multiple runs and iterations. While most sequencing labs are using the broad’s institute Genome Analysis Tool Kit (GATK) with a mixture of self-written applications that can accelerate it further. A trend that has started in the life science world is the usage of Graphic Processing Units (GPUs) to help accelerate that process significantly. This allows multiple steps within the process to complete faster and therefore allow for cheaper and faster sequencing overall.
The challenge of using GPUs in genomic sequencing is that in order to benefit from their performance there is a need to accommodate for the GPUs data requirements and make sure that the GPU is not waiting on data in order to proceed to the next steps. Since each step is different with different file formats and different access patterns and sizes this is not a trivial task. Additionally, after completing the accelerated workload using the GPUs there is a need to feed the GPUs with the next genome data so that they can immediately analyze that as well.
Conventional solutions where to copy the data to the local GPU server analyze it and the copy the results to a centralized low performance storage while copying new data to the local GPU server storage for new analysis.