Graphen explores the mutations of the viruses to construct the evolution tree
Based on the 4,727 genome sequences of the COVID-19 virus (SARS-CoV-2) reported by labs worldwide as of April 7, 2020, the Graphen team, in conjunction with Columbia University, is able to align the virus genomes, determine the canonical form at each genome position, and identify the exact variant(s) of each virus.
Each virus genome has nearly 30k bases, with each position represented by one of A, T, C, or G in the cDNA of the virus. A virtual Canonical form sequence was determined independently at each position rather than taken from a single virus: the letter at each position of the Canonical form is the consensus of all sequenced viruses. The genomic length of the Canonical form is 29,816. Genome sequences must first be aligned. The maximum position number after alignment is 30,532, which includes some head, tail, and empty positions. Once the Canonical form is available, we can identify the exact variation of each virus at each position by comparing it to the Canonical form. So far, the Canonical form sequence has stayed the same since we started analyzing the data from before March 1.
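The consensus step can be illustrated with a minimal sketch. It assumes the sequences are already aligned to equal length, that "-" marks a gap or missing base, and that the consensus is a simple per-position majority vote; the function names are hypothetical, and this is not necessarily how the Graphen pipeline is implemented.

```python
from collections import Counter

BASES = "ATCG"
GAP = "-"  # assumed placeholder for alignment gaps or missing bases


def canonical_form(aligned):
    """Majority vote per column over aligned, equal-length sequences."""
    consensus = []
    for i in range(len(aligned[0])):
        # Count only real bases; gaps and unknown letters do not vote.
        counts = Counter(seq[i] for seq in aligned if seq[i] in BASES)
        consensus.append(counts.most_common(1)[0][0] if counts else GAP)
    return "".join(consensus)


def variants(seq, canonical):
    """Map 1-based position -> (canonical base, observed base) where they differ."""
    return {
        i + 1: (c, s)
        for i, (c, s) in enumerate(zip(canonical, seq))
        if c in BASES and s in BASES and s != c
    }


# Toy alignment of four short "genomes" (illustrative data only).
aligned = ["ACGT-A", "ACGTTA", "ATGTTA", "ACGTTC"]
canon = canonical_form(aligned)
per_virus = [variants(seq, canon) for seq in aligned]
```

In this toy alignment the third sequence differs from the consensus only at position 2 (C to T), and the gap in the first sequence is skipped rather than counted as a mutation.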
The majority of labs reported whole-genome sequences of the viruses. Because genomic variation arises when the virus reproduces itself, the locations of the variants can be treated as evidence of how the viruses evolved. For instance, if Virus A mutates at Position 100 from C to T, and Virus B mutates at Position 100 from C to T and at Position 356 from A to G, then we can infer that Virus B is likely a child of Virus A, which in turn is a child of the Canonical form virus.
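The Virus A / Virus B reasoning above can be sketched as a subset test on mutation sets. This is an illustrative assumption about how such an inference could be coded, not the team's actual method: each mutation is a (position, from, to) tuple, and the likely parent is taken to be the candidate whose mutations form the largest proper subset of the child's. The name infer_parent is hypothetical.

```python
def infer_parent(child, candidates):
    """Return the name of the candidate whose mutation set is the
    largest proper subset of the child's set, or None if there is none."""
    best = None
    for name, muts in candidates.items():
        # `muts < child` is Python's proper-subset test for sets.
        if muts < child and (best is None or len(muts) > len(candidates[best])):
            best = name
    return best


# Mutations as (position, canonical base, mutated base).
canonical = frozenset()
virus_a = frozenset({(100, "C", "T")})
virus_b = frozenset({(100, "C", "T"), (356, "A", "G")})

candidates = {"Canonical": canonical, "Virus A": virus_a}
parent_of_b = infer_parent(virus_b, candidates)
parent_of_a = infer_parent(virus_a, candidates)
```

Here Virus B is assigned to Virus A, and Virus A to the Canonical form, matching the chain described in the text. A subset relation is only suggestive evidence: the same mutation can arise independently in unrelated lineages.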
Graphen then uses its Ardi AI platform, which mimics the full function of the human brain, to store these mutation data and visualize them.
In the visualization, each red node represents a virus. Each green node represents a set of viruses possessing exactly the same genome sequence. A line between two green nodes represents the direction of evolution. A line between a red node and a green node shows which set the virus belongs to. Users can click on a node to see its related information. For instance, clicking a green node shows the exact gene mutation information, and clicking a red node shows information about the virus, including location, gender, age, etc. Users can zoom in and out of the graph and search by region.
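As a rough sketch of the node structure described above, viruses with identical mutation sets can be grouped into one set node, with a membership edge from each virus to its set. The names group_into_sets and Set_1, Set_2 are hypothetical, and identical mutation sets relative to the Canonical form are used here as a stand-in for "exactly the same genome sequence".

```python
from collections import defaultdict


def group_into_sets(virus_mutations):
    """Group viruses by identical mutation set.

    Returns (green_nodes, membership_edges):
      green_nodes: set name -> frozenset of mutations
      membership_edges: list of (virus name, set name) pairs
    """
    groups = defaultdict(list)
    for virus, muts in virus_mutations.items():
        groups[frozenset(muts)].append(virus)

    green_nodes = {}
    membership_edges = []
    for i, (muts, members) in enumerate(groups.items(), start=1):
        set_name = f"Set_{i}"  # hypothetical naming scheme
        green_nodes[set_name] = muts
        membership_edges.extend((virus, set_name) for virus in members)
    return green_nodes, membership_edges


# Toy input: red nodes (viruses) and their mutations.
viruses = {
    "virus_1": {(100, "C", "T")},
    "virus_2": {(100, "C", "T")},
    "virus_3": {(100, "C", "T"), (356, "A", "G")},
}
green_nodes, membership_edges = group_into_sets(viruses)
```

The first two viruses share one green node and the third gets its own. Edges between green nodes would then follow from comparing the sets' mutations.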
Acknowledgement: We thank GISAID for hosting the EpiCoV database and the labs worldwide that shared their sequenced virus information, which made this research possible.
We can make several observations from the graph based on these criteria:
Here is an example of analysis:
First, we can look at the whole evolution pathway graph as of March 10.
We then noticed an obvious cluster at the bottom right of the graph. If we zoom in, we can see this part of the graph.
This subgraph has an obvious root at the top left corner, called 15_Virus_Set. This set differs from the Canonical form at 3 locations: Position 9099 C->T, Position 18397 C->T, and Position 28487 T->C. This virus set includes four viruses: USA WA1 (collected on 1/19/2020), USA WA1-A12 (1/25/2020), USA WA1-F6 (1/25/2020), and China Fujian-8 (1/21/2020).
Then, this 15_Virus_Set mutated to a set in the middle of this image that represents 6 viruses (5 in Washington state and 1 in California). This set further mutated and propagated to create 15 types of variations in Washington and California. The farthest in the tree, a virus in California, shows mutations at 10 locations relative to the Canonical form.
In addition to propagating to Washington and California, 3 variants of this 15_Virus_Set can also be found in 4 viruses in China (2 from China CDC, 1 in Chongqing city, and 1 in Henan province).
Then, if we zoom out, we can find the parent of this 15_Virus_Set.
38_Virus_Set carries two of the three mutations of 15_Virus_Set (Position 9099 C->T and Position 28487 T->C). We can then infer that 15_Virus_Set probably mutated from 38_Virus_Set. We can zoom in to see the details of 38_Virus_Set.
This 38_Virus_Set has 12 viruses: 9 were found in Mainland China, 2 in Taiwan, and 1 in Australia. Among the 9 viruses in China, 7 were from Wuhan and 2 from Guangdong, all collected in January or early February. The 2 viruses in Taiwan were collected on 1/24/2020 and 1/31/2020. The virus in Australia was collected on 1/24/2020.
In this image, we can also see its other variants propagated to many places in China, as well as Belgium and Vietnam.
Users are welcome to explore the graph. Please note that each virus node has gender and age information, if provided by the lab. Our visualization also shows a Completeness value, which is the fraction of the genome sequence that could be compared to the Canonical form. A completeness value of 100% means this virus has exact letter information at all 29,834 positions. A value of 99% means about 1% of the positions (roughly 298 letters) were not compared, possibly because of minor missing data. If a lab (e.g., in Hong Kong, Iran, the Philippines, etc.) reported only a small portion of the whole genome sequence, then this completeness ratio is small.
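A completeness ratio of this kind could be computed as in the following sketch. It assumes the virus sequence is already aligned to the Canonical form and that any character other than A, T, C, or G (for example "-" or "N") marks a position that cannot be compared; the function name is hypothetical and the exact definition used in the visualization may differ.

```python
BASES = "ATCG"


def completeness(seq, canonical):
    """Fraction of canonical positions at which the virus has a definite base."""
    total = sum(1 for c in canonical if c in BASES)
    compared = sum(
        1 for c, s in zip(canonical, seq) if c in BASES and s in BASES
    )
    return compared / total


# Toy data: a 6-base canonical form and three reported sequences.
canonical = "ACGTTA"
full = completeness("ACGTTA", canonical)
partial = completeness("ACG--A", canonical)
empty = completeness("NNNNNN", canonical)
```

A fully sequenced genome gives 1.0 and a sequence with two unreadable positions out of six gives about 0.67. Note that a mutated base still counts as compared, so completeness measures coverage, not similarity.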