In March 2020, when the WHO declared a pandemic, the public sequence database GISAID held 524 covid sequences. Scientists uploaded 6,000 more over the following months. By the end of May, the total was over 35,000. (By contrast, scientists worldwide added 40,000 flu sequences to GISAID in all of 2019.)

“Without a name, forget about it – we don’t understand what other people are saying,” says Anderson Brito, a postdoc in genomic epidemiology at the Yale School of Public Health, who contributes to the Pango effort.

As the number of covid sequences spiraled, researchers trying to study them were forced to create entirely new infrastructure and standards on the fly. A universal naming system has been one of the most important parts of this effort: without it, scientists would struggle to talk to each other about how the virus’s descendants are traveling and changing, whether to flag a question or, more seriously, to sound the alarm.

Where Pango came from

In April 2020, a handful of prominent virologists from the UK and Australia proposed a system of letters and numbers for naming lineages, or new branches, of the covid family. It had a logic and a hierarchy, even though the names it generated, such as B.1.1.7, were a bit of a mouthful.

One of the authors on that paper was Áine O’Toole, a PhD candidate at the University of Edinburgh. Soon she would become the primary person actually doing the sorting and classifying, eventually combing through thousands of sequences by hand.

She says: “At the very beginning, it was just a matter of who was available to curate the sequences. That ended up being my job for a good while. I guess I never understood the scale we were going to get to.”

She quickly set about building software to assign new genomes to the right lineage. Not long after, another researcher, Emily Scher, built a machine-learning algorithm to speed things up.

They named the software Pangolin, a tongue-in-cheek reference to the discussion about the animal origins of covid. (The whole system is now simply known as Pango.)

The naming system, along with the software to implement it, quickly became a global necessity. Although the WHO has recently started using Greek letters for variants, like delta, those nicknames are for the public and the media. Delta actually refers to a growing family of variants, which scientists call by their more specific Pango names: B.1.617.2, AY.1, AY.2, and AY.3.
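To make the hierarchy concrete, here is a minimal Python sketch of how dotted names like these nest. It is an illustration, not code from the Pangolin software: the function names `expand` and `is_descendant` are hypothetical, and the one-entry alias table (treating "AY" as shorthand for B.1.617.2) is assumed here purely for the example.

```python
# Illustrative alias table (an assumption for this sketch): "AY" is treated
# as shorthand for the longer lineage name B.1.617.2.
ALIASES = {"AY": "B.1.617.2"}


def expand(name: str) -> str:
    """Replace a leading alias (e.g. 'AY') with the full name it stands for."""
    prefix, _, rest = name.partition(".")
    full_prefix = ALIASES.get(prefix, prefix)
    return f"{full_prefix}.{rest}" if rest else full_prefix


def is_descendant(child: str, ancestor: str) -> bool:
    """True if `child` equals `ancestor` or sits below it in the hierarchy."""
    child_parts = expand(child).split(".")
    ancestor_parts = expand(ancestor).split(".")
    # A descendant's expanded name starts with every level of its ancestor.
    return child_parts[:len(ancestor_parts)] == ancestor_parts


delta_family = ["B.1.617.2", "AY.1", "AY.2", "AY.3"]
for name in delta_family:
    print(name, "->", expand(name), is_descendant(name, "B.1.617.2"))
```

Under these assumptions, every name in the delta family expands to something that begins with B.1.617.2, while an unrelated lineage such as B.1.1.7 does not.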

“When alpha emerged in the UK, Pango made it very easy for us to look for those mutations in our genomes and see whether the lineage was in our country too,” says Jolly. “Since then, Pango has been used as the baseline for reporting and surveillance of variants in India.”

Because Pango offers a rational, systematic approach to what would otherwise be chaos, it may forever change how scientists name viral strains, allowing experts from around the world to work together with a shared vocabulary. “Probably, this is the format we’re going to use for tracking any other new virus,” says Brito.

Many of the basic tools for tracking covid genomes have been developed and maintained over the past two years by early-career scientists such as O’Toole and Scher. As the need for global collaboration exploded, scientists rushed to support it with ad hoc infrastructure like Pango. Much of that work has fallen to tech-savvy young researchers in their 20s and 30s. They used informal networks and open-source tools, meaning the tools were free to use and anyone could volunteer to add tweaks and improvements.

“The people on the cutting edge of new technologies tend to be grad students and postdocs,” says Angie Hinrichs, a bioinformatician at UC Santa Cruz, who joined the Pangolin project earlier this year. O’Toole and Scher, for example, work in the genomic epidemiology lab of Andrew Rambaut, who posted the first public covid sequences online after receiving them from Chinese scientists. “They just happened to be perfectly placed to provide these tools, which became quite complex,” says Hinrichs.

Fast building

It hasn’t been easy. For most of 2020, O’Toole shouldered most of the responsibility for identifying and naming new lineages by herself. The university was closed, but she and Verity Hill, a PhD student of Rambaut’s, were allowed to come into the office. Her commute, a 40-minute walk from the apartment where she lived alone, gave her a bit of normalcy.

Every few weeks, O’Toole would download the full covid repository from the GISAID database, which had grown rapidly each time. She would then search for groups of genomes with similar-looking mutations, or for things that looked strange and might have been mislabeled.

When she got particularly stuck, Hill, Rambaut, and other members of the lab would pitch in to discuss the designations. But the grunt work fell on her.

Deciding when a descendant of the virus qualifies for a new lineage name can be as much art as science. It was a painstaking process, sifting through numerous genomes and asking time and again: is this a new lineage of covid or not?

“It was pretty tedious,” she says. “But it was always really humbling. Imagine going through 20,000 sequences from 100 different places in the world. I saw sequences from places I had never even heard of.”

As time went on, O’Toole struggled to keep up with the volume of new genomes to sort and name.

As of June 2020, the GISAID database contained more than 57,000 sequences, and O’Toole had sorted them into 39 lineages. In November 2020, a month after she was due to turn in her thesis, O’Toole made her last solo run through the data. It took her 10 days to go through all the sequences, which by then numbered 200,000. (Although covid has crowded out her other research, she is putting a chapter on Pango in her thesis.)

Fortunately, the Pango software is built to be collaborative, and others have stepped up. An online community, the one Jolly turned to when she noticed the variant sweeping across India, sprouted and grew. This year, O’Toole’s work has been much more hands-off. New lineages are now designated mostly when epidemiologists around the world contact O’Toole and the rest of the team via Twitter, email, or GitHub, whichever method they prefer.

“Now it’s more reactive,” says O’Toole. “If a group of researchers somewhere in the world is working on some data and they believe they have identified a new lineage, they can put in a request.”

The flood of data has continued. This past spring, the team held a “pangothon,” a kind of hackathon in which they sorted 800,000 sequences into about 1,200 lineages.

“We gave ourselves three solid days,” says O’Toole. “It took two weeks.”

Since then, the Pango team has recruited several more volunteers, such as UCSC researcher Hinrichs and Yale researcher Brito, both of whom first got involved by adding their two cents on Twitter and the GitHub page. Chris Ruis, a postdoctoral fellow at Cambridge University, is focused on helping O’Toole clear the backlog of GitHub requests.

O’Toole recently asked them to formally join as part of the newly formed Pango Network Lineage Designation Committee, which discusses and decides on lineage names. Another committee, which includes lab leader Rambaut, makes higher-level decisions.

“We have a website, and an email that’s not just my email,” says O’Toole. “It has become a lot more formal, and I think it will really help to scale it.”

In the future

As the data has grown, some cracks have started to show around the edges. Today, GISAID holds about 2.5 million covid sequences, which the Pango team has divided into 1,300 branches. Each branch corresponds to a variant. Eight of them are ones to watch.

With so much to process, the software is starting to buckle. Things are getting mislabeled. Many strains look alike, because the virus keeps evolving the most advantageous mutations again and again.

As a stopgap measure, the team has built new software that uses a different sorting method and can catch things that Pango may miss.