Graphen explores virus mutations to construct the evolution tree
Based on the 24,219 reported genome sequences (as of May 12, 2020) of the COVID-19 virus (SARS-CoV-2) from labs worldwide, Graphen Inc., in conjunction with Columbia University, is able to align the virus genomes, determine the canonical form at each genome position, and identify the exact variant(s) of each virus.
Each virus genome has nearly 30k bases, with each position represented by one of the letters A, T, C, or G (the cDNA of the virus). A virtual Canonical form sequence was determined independently at each position rather than taken from any single virus: the letter at each position is the consensus of all sequenced viruses. The genomic length of the Canonical form is 29,816. Genome sequences first need to be aligned; the maximum position number after alignment is 30,532, which includes some head, tail, and empty holes. Once the Canonical form is available, we can identify the exact variation of each virus at each position by comparing it to the Canonical form. So far, this Canonical form sequence has stayed the same since we first analyzed it on the data available before March 1.
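As a rough illustration of the consensus step described above, the following Python sketch derives a per-position majority letter from already-aligned sequences and then lists where a given virus differs from it. The function names, the "-" gap symbol, and the toy sequences are assumptions made for this example; it is not Graphen's actual pipeline, and it does not perform the alignment itself.

```python
from collections import Counter

GAP = "-"  # assumed placeholder for head/tail/empty holes after alignment


def consensus(aligned):
    """Return the majority letter at each position of the aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    letters = []
    for column in zip(*aligned):
        # Each position is decided independently of all other positions.
        letters.append(Counter(column).most_common(1)[0][0])
    return "".join(letters)


def variants(seq, canonical):
    """List (position, canonical_base, observed_base), with 1-indexed positions.

    Positions where either sequence has a gap are skipped.
    """
    return [
        (pos, ref, alt)
        for pos, (ref, alt) in enumerate(zip(canonical, seq), start=1)
        if alt != ref and GAP not in (ref, alt)
    ]


# Toy data only: four short "genomes" that are already aligned.
toy = ["ACGTA", "ACGTA", "ATGTA", "ACGGA"]
canonical = consensus(toy)
print(canonical)                    # consensus of the toy set
print(variants(toy[2], canonical))  # differences of the third toy virus
```

Because every column is voted on separately, the resulting sequence need not match any single sequenced virus, which is the sense in which the Canonical form is "virtual".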
The majority of labs reported whole genome sequencing of the viruses. Because genomic variation happens when the virus reproduces itself, the locations of variants can be considered evidence of how the viruses evolve. For instance, if Virus A mutates at Position 100 from C to T, and Virus B mutates at Position 100 from C to T and at Position 356 from A to G, then we can infer that Virus B is very likely a child of Virus A, which in turn is a child of the Canonical form virus.
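The Virus A / Virus B reasoning above can be sketched as a subset test on mutation sets. This is a simplified heuristic written for illustration, assuming each virus is summarized as a set of (position, from, to) mutations relative to the Canonical form. The function name `infer_parent` is hypothetical, and the sketch ignores complications such as the same mutation arising independently in two lineages.

```python
def infer_parent(child, mutation_sets):
    """Return the likely parent of `child`, or None if there is no candidate.

    A candidate parent is any virus whose mutations are a proper subset of
    the child's; among candidates, the one sharing the most mutations wins.
    """
    child_muts = mutation_sets[child]
    best = None
    for name, muts in mutation_sets.items():
        if name == child:
            continue
        if muts < child_muts:  # proper-subset test on frozensets
            if best is None or len(muts) > len(mutation_sets[best]):
                best = name
    return best


# The example from the text, with the Canonical form carrying no mutations.
mutation_sets = {
    "Canonical": frozenset(),
    "Virus A": frozenset({(100, "C", "T")}),
    "Virus B": frozenset({(100, "C", "T"), (356, "A", "G")}),
}
print(infer_parent("Virus B", mutation_sets))  # Virus A
print(infer_parent("Virus A", mutation_sets))  # Canonical
```

Preferring the candidate with the most shared mutations places Virus B under Virus A rather than directly under the Canonical form, even though both are subsets of Virus B's mutations.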
Graphen then uses its Ardi AI platform, which mimics the full function of the human brain, to store and visualize these mutation data.
In the visualization, each node represents a set of viruses possessing exactly the same genome sequence. The line between two nodes represents the direction of evolution. The size of a node represents the number of viruses belonging to that virus set. The color of a node indicates its cluster, as defined in the findings below. Users can click on a node to see its related information. Users can also zoom in and out of the graph and search by region.
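To make the node definition concrete, here is a minimal sketch of how viruses with identical sequences could be grouped into nodes, with the node size taken as the member count and a simple region lookup. The record layout (virus ID, sequence, region) and the function names are assumptions for this example and do not describe the Ardi platform's internals.

```python
from collections import Counter


def build_nodes(records):
    """Group (virus_id, sequence, region) records by identical sequence."""
    nodes = {}
    for virus_id, sequence, region in records:
        node = nodes.setdefault(sequence, {"members": [], "regions": Counter()})
        node["members"].append(virus_id)
        node["regions"][region] += 1
    for node in nodes.values():
        # Node size = number of viruses sharing exactly this sequence.
        node["size"] = len(node["members"])
    return nodes


def search_by_region(nodes, region):
    """Return the sequences of all nodes with at least one virus from `region`."""
    return [seq for seq, node in nodes.items() if node["regions"][region] > 0]


# Toy records only; the IDs and sequences are made up for illustration.
records = [
    ("v1", "ACGT", "Wuhan"),
    ("v2", "ACGT", "Munich"),
    ("v3", "ATGT", "Munich"),
]
nodes = build_nodes(records)
print(nodes["ACGT"]["size"])
print(search_by_region(nodes, "Munich"))
```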
Acknowledgement: We thank GISAID for hosting the EpiCoV database and the labs worldwide that shared their sequenced virus information, which made this research possible.
We can make several observations from the graph based on these criteria:
The global coronavirus can be divided into eight categories. The more it spreads over time, the more the virus mutates
The virus evolution tree has been expanding since December of last year. According to the current sequence comparison of 24,219 virus strains, the viruses can be roughly divided into eight clusters (A, which comprises A1 and A2, and B, C, D, E, F, G, and H). The two root parent viruses make up Cluster A. These two viruses then spread in their own directions: one mutated into Cluster E and Cluster F, and the other developed into Clusters B, C, and D. Cluster C further mutated into Cluster G and Cluster H.
Each cluster has its own attributes and geographic distribution. Cluster B is mainly found in China; Cluster C is rooted in Europe and later mutated into Cluster G in Europe and Cluster H in the Eastern United States; Cluster D was transmitted across the United Kingdom, the Netherlands, and Hong Kong; Cluster E is mainly distributed in the Western United States; and Cluster F is scattered across China, South Korea, Australia, Spain, etc.
The overall visualization as of May 12, 2020, with the 8 clusters labelled, is shown in the figure below. By May 12, more than 20 thousand virus sequences had been uploaded to GISAID.
Cluster A: The virus sequences most similar to the potential host - bat
There are only 2 viruses in Cluster A, A1 and A2, which are 2 mutation positions apart. They are the L and S types mentioned in an earlier Chinese paper. A1 is more concentrated in Wuhan, while A2 is distributed throughout China, including Wuhan, Jiangxi, Shandong, Zhejiang, etc. The two virus strains separated very early, and the other viruses are their respective offspring, so these two should be the earliest variants of this virus.
Cluster C: rooted in Europe
On Jan 28, the first virus in Europe was collected from a patient in Munich, Germany; it had mutated from the A1 virus. Since then, it has spread across different regions of Europe, including the outbreaks in northern Italy on Feb 20 and in Milan.
Cluster G: Pandemic in Europe and spread to South America
This cluster includes one of the largest virus sets, which is mainly composed of viruses from different regions of Europe, including Portugal, Switzerland, Russia, Ireland, Italy, Belgium, United Kingdom, and Netherlands. Later, it was even transmitted to Brazil and became the source of the pandemic in South America.
Cluster H: From France to Eastern United States
The H virus is currently the main infectious virus strain on the east coast of the United States. The evolutionary tree shows that this mutant strain came from France. It is characterized by S protein and ORF3 protein mutations. The function of ORF3 is to pierce the host cell membrane and allow the replicated virus to spread. Ching-Yung Lin was also surprised by this result, because the "high purity" phenomenon seen in New York, where the H virus alone accounts for 85%, has not yet been seen elsewhere. The mutation may have made the H virus spread faster and more aggressively than other types.
Cluster E: Pandemic in Western United States and Canada
On Jan 19, the first case was confirmed in Washington, USA, in a 35-year-old male who had travelled from Wuhan. The virus later appeared in California at the end of Jan, but it did not break out until Feb 20. The figure below shows Cluster E.