What can we learn from the genomes of the novel coronavirus?



Genomic epidemiology of novel coronavirus - Global subsampling, showing 3251 of 3251 genomes sampled between Dec 2019 and Apr 2020. This phylogeny is maintained by the Nextstrain team and enabled by data from GISAID. (Image courtesy: https://nextstrain.org/ncov/global)

COVID-19 is the disease caused by the novel coronavirus SARS-CoV-2. “Bad news wrapped in a protein”, as a recent news story put it, is an apt epithet for this virus, which has brought life as we know it to a standstill. This “bad news” is the genetic material, or genome, of the virus, which carries all the information required for the virus to infect and make copies of itself in a host (in this case, humans).

The genome of this virus is a single-stranded RNA (ribonucleic acid). Think of it as a thread with four kinds of beads. A particular arrangement of the four kinds of beads (which are chemicals called nucleotides, whose names are usually abbreviated as A, U, G, and C) is what defines this virus. In some sense, this is the virus’s signature or unique identifier. However, as the virus makes copies of itself, changes occur in the arrangement of beads (for instance, an A can take the place of a C). Sequencing the viral genome reveals both its unique arrangement and the changes to it (mutations) as it spreads in the population. These changes allow us to track the virus through space and time.
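To make the bead analogy concrete, here is a minimal Python sketch of how changes between two genomes can be listed once the sequences are lined up. The two short sequences and the function name `find_substitutions` are made up for illustration; real SARS-CoV-2 genomes are about 30,000 letters long and must first be aligned with dedicated tools, a step this sketch assumes has already been done.

```python
def find_substitutions(reference, sample):
    """List the positions where two aligned sequences differ.

    Returns tuples of (position, reference_base, sample_base),
    with positions counted from 1.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (index + 1, ref_base, sample_base)
        for index, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Hypothetical toy sequences, not real viral data.
reference = "AUGGCUACGUUAGC"
sample = "AUGGCUACGUCAGC"

for position, ref_base, sample_base in find_substitutions(reference, sample):
    print(f"position {position}: {ref_base} -> {sample_base}")
```

Each reported difference is one candidate mutation. Whether it is a true change in the virus or an error from the sequencing process has to be judged separately.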

The genome of SARS-CoV-2 is a long thread with about 30,000 beads. The timely sharing of this viral sequence (the first sequence from China was available on 11 January 2020) allowed the global scientific community to develop tools for detecting the virus. These include the widely used RT-PCR tests, which look for pieces of the viral genetic material in people suspected of having COVID-19. The sequence of the Spike protein, a protein on the surface of the virus that allows it to bind to and enter human cells, provided information for developing antibody tests. These tests can tell us who has been infected and is possibly immune (although how long immunity lasts is currently not known). The part of the viral genome that encodes the Spike protein has also informed the design of multiple vaccines, including mRNA-1273, a vaccine candidate from Moderna which is in Phase I clinical trials.

As of 9 April 2020, more than 5,800 complete sequences of SARS-CoV-2 from over 65 countries have been made available by scientists worldwide (via the Global Initiative on Sharing Avian Influenza Data, GISAID, and the US National Center for Biotechnology Information, NCBI). Relationships between these different sequences can be visualised using a phylogenetic tree. Much as in a family tree, sequences that are very similar are more closely related to each other and are thought to share a common ancestor. By mapping the relationships among all the sequences, we can track when and where a specific change in a sequence arose.
At the beginning of an outbreak, this helps identify the place from which a particular sequence may have been introduced into a region (importation events), independent of known travel history; this in turn can help monitor the spread of the virus and assess the effect of interventions.
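The idea that similar sequences are closer relatives can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the sequence names and toy sequences are invented, and counting differences between aligned sequences is a crude stand-in for the statistical models that real phylogenetic tools (such as those behind Nextstrain) use to build trees.

```python
from itertools import combinations


def count_differences(seq_a, seq_b):
    """Count the positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(base_a != base_b for base_a, base_b in zip(seq_a, seq_b))


def pairwise_differences(sequences):
    """Map each pair of sequence names to its number of differences."""
    return {
        (name_a, name_b): count_differences(sequences[name_a], sequences[name_b])
        for name_a, name_b in combinations(sorted(sequences), 2)
    }


def closest_relative(name, sequences):
    """Return the name of the sequence with the fewest differences from `name`."""
    others = [other for other in sequences if other != name]
    return min(
        others,
        key=lambda other: (count_differences(sequences[name], sequences[other]), other),
    )


# Hypothetical toy sequences, not real viral data.
sequences = {
    "region_A_1": "AUGGCUACGUUAGC",
    "region_A_2": "AUGGCUACGUCAGC",
    "region_B_1": "AUGACUACGUCAGC",
    "region_B_2": "AUGACUACGUCAGU",
}

for pair, differences in pairwise_differences(sequences).items():
    print(pair, differences)
print("closest to region_B_2:", closest_relative("region_B_2", sequences))
```

In this toy data, each sequence carries the changes of the one before it plus one more, which is the kind of pattern that lets a tree suggest the order and route by which a virus spread.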

Sequencing data and analysis support the emergence of SARS-CoV-2 between November and December 2019 in China and its subsequent spread to other countries. Data from multiple countries, including Iceland (where extensive sequencing has been carried out, resulting in over 340 genomes of the virus) and other parts of Europe and South America, suggest that multiple introductions of the virus occurred in each country. In Iceland, for instance, sequences from initial cases were similar to those from China and South-East Asia. However, as travel restrictions came into place, this trend changed. Most of the sequences in the later stages were similar to sequences from other parts of Europe. Sequences from New York, USA, in March 2020 were found to be related to sequences of the virus in Italy and other parts of Europe, suggesting that after the lockdown in China, Europe became a hub for transmission. Analysis of sequences from the East Coast of the US suggests a West Coast to East Coast spread of the virus, underscoring the importance of sustained local transmission. A detailed reconstruction of possible spread is available at Nextstrain.org in their situation report on COVID-19.


Image showing the global transmissions of the novel coronavirus. (Image courtesy: https://nextstrain.org/ncov/global)
What about the sequences of SARS-CoV-2 from India?

The two initial sequences of SARS-CoV-2 from India were from people with known travel history to Wuhan, China. Not surprisingly, these sequences are closely related to sequences from China. The two sequences, however, differ from each other. As Wuhan, China was the epicenter of the outbreak, it is possible that multiple closely related but distinct strains of the virus were circulating there at the time these people got infected. So the differences between the two genomes probably reflect this diversity. As there was no known onward transmission from these two cases, these viral sequences may not tell us much about the viral strains currently being transmitted in India. More sequencing from India is necessary if we want to understand these trends or characterize the strains present in India at this time. The Centre for Cellular and Molecular Biology (CCMB, Hyderabad) and Institute of Genomics and Integrative Biology (IGIB, Delhi) have been identified as additional centres that will sequence and study the molecular epidemiology of the virus in India.
At the time of writing this article, the National Institute of Virology, ICMR has submitted 28 additional sequences to the GISAID database from varied sources sampled in early March 2020. These are invaluable as we trace the spread of the virus in India.

Is the virus changing? Does India have a milder strain of the virus?

The virus is changing, albeit not very fast. In fact, there are relatively few changes in the viral genome. At present it is hard to distinguish these changes from the noise/errors that are generated by the sequencing process itself. So we must avoid the temptation to read too much into them.
Very roughly, about one of the 30,000 positions in the viral genome changes every 1-2 weeks as the virus continues to replicate and spread. Particular changes have been noted in different geographical clusters, and studies and experiments need to be carried out to understand their significance. For example, do these changes simply reflect a common origin, or have they provided some advantage to the virus: for instance, do they help the virus infect, grow, spread or escape the immune system? While these are important and interesting questions, sequencing viruses alone cannot conclusively answer them. Currently there is no strong evidence for a more virulent or, conversely, a milder strain of the novel coronavirus. We do not yet have enough sequences from India to do any robust analysis.
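The rough pace quoted above can be turned into yearly figures with some back-of-the-envelope arithmetic. The Python sketch below assumes a genome of exactly 30,000 positions, 52 weeks in a year, and one change every one or two weeks along a chain of transmission; the function names are made up for illustration, and the results are only as approximate as those inputs.

```python
GENOME_LENGTH = 30_000  # approximate number of positions in the genome
WEEKS_PER_YEAR = 52


def changes_per_year(weeks_per_change):
    """Expected number of changes in the whole genome over one year."""
    return WEEKS_PER_YEAR / weeks_per_change


def changes_per_position_per_year(weeks_per_change, genome_length=GENOME_LENGTH):
    """The same yearly pace, expressed per position in the genome."""
    return changes_per_year(weeks_per_change) / genome_length


for weeks in (1, 2):
    yearly = changes_per_year(weeks)
    per_position = changes_per_position_per_year(weeks)
    print(
        f"one change every {weeks} week(s): "
        f"about {yearly:.0f} changes per year, "
        f"or {per_position:.1e} per position per year"
    )
```

Under these assumptions the genome picks up a few dozen changes a year out of about 30,000 positions, which is why two sequences sampled weeks apart often differ at only a handful of places.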

Where did the virus come from?

The closest known relatives of SARS-CoV-2 appear to be viruses of the same family found in bats; the SARS-CoV-2 genomes also show some striking similarities to coronaviruses found in pangolins. It remains possible that an animal we haven’t yet sampled was the actual host from which humans got the virus. There are two possible processes that may underlie how the virus became a human pathogen. It may have acquired some beneficial traits in an animal and then infected humans, or it may have become better at infecting and spreading in humans through changes acquired after infecting humans. With the evidence gathered from all the sequences we have so far, it is unlikely that this was an engineered virus (the virus is not similar enough to other “known” members of the family) or that these human adaptations occurred in a lab. Analysis of sequencing data suggests little or no ongoing transmission from animals to humans, and the virus seems to be sustained in the human population by human-to-human spread. The virus is able to infect and grow in cats, ferrets and tigers; however, under experimental conditions it did not successfully infect dogs, pigs or ducks.

Changing the culture of science one genome at a time

Many researchers across the globe, including members of the ARTIC Network, the ZIBRA project and the creators of Nextstrain.org, have been sharing protocols and reagents, generating sequences, rapidly sharing these sequences, analyzing them in real time and creating stunning visualizations and infographics with breathtaking momentum. The trend that they have set for open data sharing and a culture of collaboration is setting the standard for the field. Science has never moved in such a (relatively) inclusive manner and with such transparency before, allowing us to follow the story of SARS-CoV-2 as it spreads to every nook and corner of the world.

Note: The story of the novel coronavirus SARS-CoV-2 is evolving rapidly and as new information comes to light, the ideas in this article may change. Some of the research work I have cited has not yet been vetted by other scientists in a formal manner. I have tried to be mindful of this while writing this article, and request the readers to be mindful while reading this and other articles on COVID-19. I would also like to thank Nithyanand Rao, Smita Jain, Bhagteshwar Singh and Krishnapriya Tamma for their comments and suggestions on this article.

Chitra Pattabiraman (PhD) is an India Alliance Early Career Fellow at the Department of Neurovirology, National Institute of Mental Health and Neurosciences (NIMHANS), Bangalore. She uses sequencing to identify pathogens in human disease.