The Future of Biology and Artificial Intelligence – Part 3
WekaIO Inc. December 10, 2020
The Need for Flexible Architectures and Organized Data
What concrete steps can a technology architect take to future-proof plans like those I touched upon in the first and second posts in this series? In my opinion, it all comes down to agility and incremental improvement. The agility to deal with changing requirements is, in my experience, the most important property of computing and data storage systems for genomics.
Figure 1: Agility enables rapid reconfiguration to account for changing requirements.
In 2020 this agility drives us towards software-defined solutions that can be agnostic about the hardware infrastructure on which they run. Insisting early in a project that all infrastructure and configuration be defined by code, rather than inextricably bound to hardware, allows us to compensate for the lack of detailed knowledge available at that time, and to reconfigure rapidly as requirements change. In his 1975 classic, The Mythical Man-Month, Fred Brooks says, “In most projects, the first system built is barely usable…plan to throw one away; you will, anyhow.” Software-defined infrastructure is an enabling technology for this approach.
This will be unsurprising to anyone who has worked with online or digital services in the last decade. Software-defined filesystems, networks, and computing infrastructure are the rule among teams who have had the option to embrace these technologies.
Figure 2: Data architectures need to be flexible, growing and changing to fit needs.
Our systems must also be able to represent a rich set of metadata, and that metadata must be capable of change over the life of a project. We need flexible data architectures. In my experience, there is no reliable way to identify the most important files, fields, or identifiers ahead of time. In pure research, in drug development, and in the clinic we always find ourselves re-analyzing a long tail of historical data. Mindlessly migrating older data to slower and cheaper storage (a common data lifecycle in many industries) can create expensive challenges for re-use.
Even worse, we are constantly refreshing our idea of what constitutes the data as opposed to the noise. A blood sample that we used to determine a patient’s genome last year might be used this year as a “liquid biopsy” in which we look for snippets of tumor DNA circulating in the bloodstream. The bits we filtered out and ignored yesterday frequently become today’s valuable insights. This has pushed me towards flexible “NoSQL” database architectures and semantic triple stores. Projects that start off by trying to engineer a perfect relational schema tend not to see the light of day.
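To make the appeal of a triple store concrete, here is a minimal sketch in Python. It is a toy in-memory structure, not any particular product, and the identifiers (`sample:001`, `patient:A`, the `assay` predicate) are hypothetical names chosen for illustration. The point is that a new kind of fact about an old sample is just one more triple, with no schema migration required.

```python
class TripleStore:
    """Toy in-memory store of (subject, predicate, object) facts."""

    def __init__(self):
        self._triples = set()

    def add(self, subject, predicate, obj):
        """Record one fact. Any predicate is allowed; there is no fixed schema."""
        self._triples.add((subject, predicate, obj))

    def match(self, subject=None, predicate=None, obj=None):
        """Return matching triples in sorted order; None acts as a wildcard."""
        return sorted(
            t for t in self._triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        )


store = TripleStore()

# Last year: the blood sample was used to determine a germline genome.
store.add("sample:001", "derived_from", "patient:A")
store.add("sample:001", "assay", "germline_genome")

# This year: the same sample is re-used as a liquid biopsy.
# This is simply a new fact; nothing about the old data has to change.
store.add("sample:001", "assay", "liquid_biopsy")

assays = store.match(subject="sample:001", predicate="assay")
```

A relational design would have had to anticipate that one sample could carry several assays; here the question "what has this sample been used for?" is answered by a pattern match over whatever facts have accumulated.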
It’s Made of People
At a panel about data at this year’s Bio-IT World Conference, I asked the participants for advice about what essential components and capabilities to include in new projects. I expected to hear about some hot new technology, but instead they suggested hiring an ontologist, an expert in terminology and language. This person would own the data dictionary of the project and keep the team from getting tripped up in confusion over their own information.
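As a rough illustration of what owning the data dictionary can mean in practice, the sketch below maps the varied field names that different teams use onto a single canonical term. Everything in it is an assumption made for the example: the canonical terms, the synonyms, and the function names are hypothetical, and a real project would have a person curating this vocabulary rather than a hard-coded table.

```python
# Hypothetical data dictionary: canonical term -> accepted synonyms.
DATA_DICTIONARY = {
    "sample_id": {"sample_id", "sampleid", "specimen", "specimen_id"},
    "patient_id": {"patient_id", "subject", "subject_id", "participant"},
}


def build_lookup(dictionary):
    """Invert the dictionary into synonym -> canonical term, rejecting conflicts."""
    lookup = {}
    for canonical, synonyms in dictionary.items():
        for synonym in synonyms:
            key = synonym.lower()
            if key in lookup and lookup[key] != canonical:
                raise ValueError(f"'{synonym}' maps to two canonical terms")
            lookup[key] = canonical
    return lookup


def normalize_record(record, lookup):
    """Rename known fields to canonical terms; report fields nobody has defined."""
    normalized = {}
    unknown = []
    for field, value in record.items():
        canonical = lookup.get(field.strip().lower())
        if canonical is None:
            unknown.append(field)
        else:
            normalized[canonical] = value
    return normalized, unknown


lookup = build_lookup(DATA_DICTIONARY)
normalized, unknown = normalize_record(
    {"Specimen": "S-001", "Subject": "P-A", "tube_color": "purple"},
    lookup,
)
```

The list of unknown fields is the useful part: it is the queue of terms that someone has to decide about, rather than letting each team silently invent its own meaning.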
It takes careful thought to align the incentives of the data producers with the long-term needs of the organization. Well-annotated, organized data is far from free. The benefit of the work to make data re-usable rarely accrues directly to the creators of that data. High-throughput laboratories, in particular, are constantly under the microscope to reduce costs per sample. It takes sustained effort by leadership to assign the kind of value to data that will survive good, practical, frugal management.
It’s important that we get this stuff right. I believe that there are clues to the understanding and cure of disease languishing in those shelves of disks that I slammed together over the last 20 years. Much data sits unused for lack of the organizational willpower to thaw it out from tape or other “cold” storage media. Data segmentation into silos also holds us back. To be clear, there are good reasons to be measured and careful about sharing and combining data, but mere fear and hoarding are not among them.
I’m impatient for the new therapies and cures that will become available when we figure this out. I’m looking forward to having to re-think everything that I’ve learned so far about this field. That’s what’s got me excited for the next 20 years.
Chris Dwan is the Vice President of Production Bioinformatics at Sema4. Previously, he led the scientific computing team at the Broad Institute, helped to build the New York Genome Center, and led the consulting team at BioTeam. Chris tweets frequently at @fdmts and blogs occasionally at https://dwan.org.