[This is a somewhat long summary of my thoughts from the recently concluded 12th annual Advances in Genome Biology and Technology (AGBT) meeting at the Marco Island Resort. Do note that while I attended the meeting representing the company I work for, all the opinions expressed here are solely my own and do not represent those of my employer or any of our funding agencies. Apologies in advance to those in genomics for the occasional naivete about the field.]
If not the biggest, AGBT certainly seems to be the most popular gathering for genomics research. The fact that it sold out within 36 hours of registration opening bears testimony to this. The beach-front location, with its promise of sunny southwest Florida weather, does help (especially this year, when half the country was under a gazillion inches of snow), as do the wonderful wining and dining opportunities. But in recent years, AGBT has also gained a reputation as the site for major announcements by genomics companies; both Pacific Biosciences (PacBio) and Ion Torrent released their potentially paradigm-shifting sequencers during the 2010 meeting.
As someone who has only recently entered the field of genomics, and from the technology side of DNA sequencing at that, my goal as a first-time attendee was simply to get a good perspective of the field. This was easily achieved, thanks mainly to the wonderful diversity of both the scheduled scientific talks and the attendees. In addition to academic and corporate scientists, the meeting was well attended by various non-scientific professionals, including people from investment and venture capital firms, science writers, computer network specialists etc. The structure of the conference allowed conversing at length with this variety of people during meals or over drinks at the social mixers, resulting in a very gratifying and enriching experience.
The only fly in the ointment was perhaps the slightly unfair blogging/tweeting policy. Though the organizers should be commended for having a clear-cut policy for sharing information presented at the meeting over the internet, they decided it would be opt-in rather than opt-out. So even if someone was okay with their lecture being tweeted, unless this was announced explicitly, one had to maintain Twitter and blog silence. Adding to this, it appeared that many speakers did not fully understand Twitter and blogging and ended up refusing permission even when they presented fully published data! Interestingly, more and more speakers were okay with tweeting/blogging as the conference progressed, suggesting that they were better informed by then.
As such, in this post I will avoid giving a synopsis of every talk I attended and provide a general summary instead.
(If you are interested in more details, Anthony Fejes has done an extraordinary job of compiling on-the-spot notes on talks that did allow dissemination. Additionally, the #AGBT hashtag on Twitter has many real-time observations on the scientific sessions.)
While the meeting was about ‘advanced genomics’, the field is currently dominated by next-generation sequencing (NGS) technologies, i.e. post-Sanger sequencing methods, and their applications. If you’ve been keeping up with the scientific literature, you will know that the cost of NGS has been falling, and its speed rising, remarkably fast; in fact, at a rate faster than Moore’s Law. This ever-reducing cost, and the emerging availability of both turn-key machines and centralized rapid sequencing centers (e.g. Complete Genomics, and recently, Perkin-Elmer), has made NGS readily available for various scientific and clinical uses. A consequence of this so-called ‘democratization of sequencing’ is that a plethora of applications are developing around it, from cancer genomics to microbiome sequencing to pathogen detection.
I will get back to some of these applications shortly. For the moment, looking at the broader perspective, the consensus seems to be that the future challenges (and potential bottlenecks) in genomics will be twofold: data, and the meaning of data.
The problem with data is that there are huge amounts of it. We are talking about terabytes and petabytes (e.g. NHGRI is currently storing about 400TB of data). The phrase ‘data deluge’ was thrown around more than once over the course of three days. The NGS machines produce enormous streams of data (an Illumina HiSeq machine could generate terabases per week), and both storing and transferring them is an issue. Large genome centers seem to be doing well here, by virtue of possessing enough square footage and in-house capabilities for software development and analysis. For smaller centers and individual labs, clouds or other solutions are required. Data transfer companies previously involved in other fields that require similarly large transfers (e.g. media companies, intelligence etc.) are moving to offer solutions in this area.
Additionally, given the short reads produced by many of the instruments, assembling genomic data, especially de novo, is another important issue requiring a bioinformatics solution.
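To get a feel for the transfer side of the problem, here is a back-of-envelope sketch in Python. The 400 TB figure is the NHGRI number quoted above; the link speeds and the 70% efficiency factor are purely my own illustrative assumptions, not anything presented at the meeting, and `transfer_days` is a made-up helper name.

```python
def transfer_days(size_terabytes: float, link_gbps: float, efficiency: float = 0.7) -> float:
    """Estimate the days needed to move a dataset over a network link.

    size_terabytes: dataset size in decimal terabytes (1 TB = 1e12 bytes).
    link_gbps: nominal link speed in gigabits per second.
    efficiency: assumed fraction of nominal bandwidth actually achieved.
    """
    if size_terabytes < 0 or link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("size must be >= 0, link speed > 0, efficiency in (0, 1]")
    bits = size_terabytes * 1e12 * 8                  # terabytes -> bits
    seconds = bits / (link_gbps * 1e9 * efficiency)   # bits / effective bits per second
    return seconds / 86_400                           # seconds -> days


if __name__ == "__main__":
    # Hypothetical link speeds, chosen only for illustration.
    for gbps in (0.1, 1.0, 10.0):
        print(f"400 TB over {gbps:g} Gbit/s: {transfer_days(400, gbps):.1f} days")
```

Under these assumptions, moving 400 TB over a 1 Gbit/s link takes roughly 53 days, which gives some sense of why transfer, and not just storage, kept coming up as a bottleneck.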
During one of their presentations, PacBio stated that the data deluge problem wouldn’t be solved by ‘anyone in the room’, but by people who have worked for Amazon, eBay or Facebook. I do not know if any biostatistician or network professional (there were a few of them at the meeting) felt hurt at this, but PacBio did introduce someone (I’ve forgotten the name) they’ve recruited from Facebook to harness the data. The company’s underlying philosophy is to find a solution that flows as: data → information → knowledge → wisdom. Neat mantra, but they did not divulge exact details of how they were going to do it.
The second challenge, inherent in the PacBio philosophy described above, is how to interpret this data/information. This is a biology-level problem that basically asks, ‘how do genes relate to a particular phenotype?’ (the information → knowledge step of the pipeline). This is a complicated question, and answering it depends on good study designs. However, once again, biostatisticians/bioinformaticians are required to develop tools for solving these puzzles (hint for anyone currently considering a PhD in a bio-related field).
Somewhat related to the data issue was an interesting perspective from Ellen Wright Clayton of Vanderbilt. Her somewhat controversial point was that the ‘tsunami’ of genomics data related to patients could overwhelm the healthcare system. The way I understood her contention: we do not yet know enough biology to interpret genomic information, but given that the information is available, patients may demand access to it. This is particularly relevant for newborn sequencing, where parents may want access to genomic data without the capacity to understand its meaning. Her point is that this is dangerous, since genomic data does not translate directly into phenotypes; epigenetics, metagenomics etc. play important roles (she cited the bad science of predictive genomics in ‘GATTACA’ as an example). Therefore, as an aid to public policy, scientists should try to analyze the genomic data as quickly as possible.
I certainly agree that interpretation of genomic data should be a priority, but as already mentioned, it is not a trivial task. Meanwhile, the release of genomic data to patients is a more complicated ethical and policy question. It is better tackled later in a separate blog post (anyone reading this is welcome to leave a comment with their thoughts on the issue).
Science and Applications:
Cancer biology is the most obvious target of NGS approaches. Several findings were presented where novel mutations, rearrangements etc. were discovered by high-throughput sequencing, especially using the 454 system (which has a very low error rate). Unfortunately, most of these presentations had a no-blogging policy. It was also difficult for me to follow many of these talks anyway, mainly due to the abundance of jargon and unexplained lingo.
The most exciting scientific breakthroughs presented at AGBT, at least for me, were with respect to human microbiomes. Rob Knight from the University of Colorado gave a highly entertaining talk on this topic. Recent publications have demonstrated that the nature of the bacterial fauna we carry in our gut impacts aspects of our health (in particular, obesity) and our interaction with drugs. Additionally, while human beings have 99% similarity in their genomic DNA, our symbiont bacterial genome (the microbiome) could differ by as much as 40%!
Therefore, the Knight lab is developing experimental and bioinformatics tools that use NGS to obtain a biogeographical landscape of microbial populations in the body (and on things we touch, like computer keyboards). Among various interesting results, it seems that different parts of our face have different microbial populations (though the distribution roughly follows facial symmetry), and these may shift over time. Another interesting tidbit was their finding that microbial populations growing in extreme environments are not necessarily genomic outliers, but some of those found in the human gut are.
Joseph Petrosino from Baylor followed Knight with a similar talk, except he was trying to map the viral metagenome in humans. His ‘tweetable’ moment came when he talked about how to distinguish the nose and the ass by the difference in virus genomes from the two orifices. (He also mentioned how Baylor was actively involved with the Houston Zoo in trying to find a cure for elephant herpes, using NGS to determine the full genome of the virus.)
Both Petrosino and Knight did agree on one drawback of their approach: their sequencing methods could overlook certain microbial species that have not been sequenced before or are difficult to sequence. But the conclusion from both these talks seemed to be that personalized medicine and the discovery of therapeutics could come faster from the study of microbiomes than from the human genome.
Perhaps the weirdest application of microbial genomics was presented by Zhong Wang of JGI. They are attempting to discover novel enzymes that can chew cellulose. The approach is to perform NGS on samples taken directly from the rumen of cows fed on switchgrass. To do this, they use a ‘cow with a door’, i.e. a fistulated cow where you can directly insert your hand and pull out stuff from inside! They seem to have found some novel cellulases through the full-scale genomics approach, and this was validated by using the cow rumen in an enzymatic assay against cellulose substrates. However, I believe they are still trying to assemble and validate the full sequences of these cellulase genes.
While on the topic of microbes, pathogen surveillance by NGS methods emerged as an area of great interest. The main advantage of NGS over current pathogen detection assays is that artificial genetic modifications can be detected, and the full sequence also allows identification of the source of the material based on markers.
In this context, PacBio’s Eric Schadt presented their vision for using fast sequencing with the SMRT technology to build a disease ‘weathermap’, which could help predict outbreaks. Their contention is that the rapid sequencing provided by their single-molecule technology is ideal for such purposes. The recent rapid identification of the origins of the Haitian cholera strain by PacBio was presented as an example (I did notice, though, that they required sequence information from Illumina, collected by the CDC, for complete assembly of that cholera genome! But more on this in the technology section). Preliminary data from sequencing of samples from two sewage treatment plants in the Bay Area, as well as swabs taken from various areas in the workplace, from public areas and from employees, were presented as pilot studies. This led to the discovery of H1N1 in nasal swabs of various employees before they had symptoms (though it wasn’t enough to prevent the flu in those cases). They also discovered virus genomes related to pigs and cows on the office refrigerator doors, obviously having originated from lunch meats! A lesson that hands should be washed carefully.
The grand plan, however, is to have such sequencing information from various public places (e.g. restaurants) and private places (down to the level of individual houses) rapidly collected and updated on Google maps, enabling people to make real-time decisions on avoiding certain areas! I am not really convinced of the usefulness of such high-resolution mapping of microorganisms. The logistics of such an endeavor are themselves many years off, especially with the PacBio machine, given its cost and lack of portability.
However, a broader surveillance map (at the level of city blocks, or even cities or counties) might be handy to monitor and check disease progression.
Interest in rapid pathogen detection also comes from US government entities such as the FDA, as well as the military. Lauren McNew from the US military’s Edgewood Chemical Biological Center spoke about how they are developing threat responses based on NGS. Their recent exercises have been based on purified single DNA, mixed DNA, and DNA spiked with bacterial organisms. They have been able to complete sample preparation, sequencing with SOLiD 454 and a complete report on the organism, its origins etc. in 36 hours (and yes, apparently they did stay at the lab for 36 continuous hours. Beat that, you whining grad students).
In addition, there were a couple of interesting talks, about the tomato genome and the genome of a pathogen infesting potatoes (the same one that caused the Irish famine), that demonstrated how NGS approaches could be used for improving food production. Unfortunately, my own notes are sketchy here and I will again refer you to Anthony’s real-time notes.
One aspect of genomic data application that I did not hear much about (perhaps it was covered in sessions I could not attend) was the use of NGS in clinical settings. While it is obvious that further development of turn-key sequencing solutions is required for sequencing to enter the doctor’s office, I was interested in learning whether any current genomic data is sufficient to make diagnostic or treatment calls.
Overall, it was quite exciting to note the wide variety of current and potential applications of next-generation genomics technologies.