Ghost clades: a gazillion taxa detected solely by sequencing DNA from the environment (including dolphins’ mouths)

Yesterday I posted about the discovery of a new member of the archaea that was found by sequencing DNA taken from inside a single eukaryotic dinoflagellate (there were three other species inside or associated with that cell, too). The DNA sequence I talked about belonged to what the authors named Candidatus Sukunaarchaeum mirabile. The circular genome of this microbe was unique in having the complete genetic apparatus for self replication (unlike viruses), but (unlike most other prokaryotes) had no genes for metabolism. The authors theorize, and I agree, that it is likely some kind of parasite, commensal, or symbiont that is obligately associated with other species. The question is whether Candidatus Sukunaarchaeum mirabile, lacking the ability to metabolize but able to reproduce, was alive. I have no dog in that fight, but readers differed. It’s bloody hard to define “life”, though I like Richard Dawkins’s concept that life is whatever can evolve via natural selection. And clearly Candidatus Sukunaarchaeum mirabile could.
At least one commenter deemed the DNA sequence of Candidatus Sukunaarchaeum mirabile an artifact of “DNA contamination”, though I don’t understand how that could happen. Further, the assembly of the DNA into a genome was deemed artifactual, although again, given how the authors did this, and that the DNA was circular, I don’t understand that, either.
But, intrigued, I did a bit of digging. It turns out that there are a ton of organisms, mostly archaea and bacteria, that have been identified solely from their DNA sequences, and they cannot really be artifacts because they fall into a good phylogenetic tree. In addition, the authors of the paper below used genes for 16 ribosomal proteins, which tend to be clustered together on the chromosome, and did multiple reads of all sequences to put together overlapping fragments to build coherent genomic sequences.
The object of the paper below was to sample the environment and, without isolating individual organisms, see how many were new to science simply by looking for novel DNA sequences. The summary is in a rather old (2016) paper in Nature Microbiology, and I haven’t looked for any updates. The upshot, which you can see by clicking on the screenshot below or reading the pdf here, is that there are a gazillion new species, mostly prokaryotes (bacteria + archaea) that we didn’t know about before. Indeed, the new species, based on limited sampling, imply that we only know a smallish fraction of the organisms on the planet.
I call these groups “ghost clades” because they are known only from their DNA and not from physical appearance or other evidence.
The method:
The authors got DNA from a variety of locations (indented sections from the paper); bolding is mine:
This study includes 1,011 organisms from lineages for which genomes were not previously available. The organisms were present in samples collected from a shallow aquifer system, a deep subsurface research site in Japan, a salt crust in the Atacama Desert, grassland meadow soil in northern California, a CO2-rich geyser system, and two dolphin mouths. Genomes were reconstructed from metagenomes as described previously. Genomes were only included if they were estimated to be >70% complete based on presence/absence of a suite of 51 single copy genes for Bacteria and 38 single copy genes for Archaea. Genomes were additionally required to have consistent nucleotide composition and coverage across scaffolds, as determined using the ggkbase binning software (ggkbase.berkeley.edu), and to show consistent placement across both SSU rRNA and concatenated ribosomal protein phylogenies.
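To make the completeness cutoff concrete, here is a minimal Python sketch of how such a filter might work. It is my own illustration, not the authors' code: the marker names are placeholders, and I am assuming that "completeness" is simply the fraction of the expected single-copy genes found in a reconstructed genome, which may not be exactly how the paper estimated it.

```python
# Toy completeness filter for reconstructed genomes ("bins").
# ASSUMPTION: completeness = fraction of expected single-copy marker
# genes that were found. Marker names below are hypothetical placeholders.

BACTERIAL_MARKERS = [f"bac_marker_{i}" for i in range(51)]  # 51 genes for Bacteria
ARCHAEAL_MARKERS = [f"arc_marker_{i}" for i in range(38)]   # 38 genes for Archaea


def estimate_completeness(found_markers, expected_markers):
    """Return the fraction of expected marker genes present in a genome."""
    expected = set(expected_markers)
    return len(expected & set(found_markers)) / len(expected)


def passes_completeness_filter(found_markers, domain, threshold=0.70):
    """Keep a genome only if its estimated completeness exceeds the threshold."""
    if domain == "Bacteria":
        expected = BACTERIAL_MARKERS
    elif domain == "Archaea":
        expected = ARCHAEAL_MARKERS
    else:
        raise ValueError(f"unknown domain: {domain}")
    return estimate_completeness(found_markers, expected) > threshold


if __name__ == "__main__":
    # A bacterial genome in which 40 of the 51 marker genes were detected.
    genome = BACTERIAL_MARKERS[:40]
    print(estimate_completeness(genome, BACTERIAL_MARKERS))
    print(passes_completeness_filter(genome, "Bacteria"))
```

Under this assumption a bacterial genome needs at least 36 of the 51 marker genes to clear the 70% bar, and an archaeal one needs at least 27 of 38. The real pipeline also checked nucleotide composition, coverage, and placement on the tree, none of which is modeled here.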
Note that they looked at only six sites, including, yes, two dolphin mouths. Why the dolphins? I don’t know. At any rate, they sequenced the hell out of DNA taken from these sites. They didn’t require complete genomes (each one had to be estimated as more than 70% complete), but got enough to place each organism using DNA sequences coding for 16 different ribosomal proteins: well-known genes that produce proteins that are part of the ribosomes, the sites where messenger RNA is translated into proteins. This was a ton of work because they had to put the separate sequences together into organisms. Here’s their rationale for using ribosomal protein genes:
To render this tree of life, we aligned and concatenated a set of 16 ribosomal protein sequences from each organism. This approach yields a higher-resolution tree than is obtained from a single gene, such as the widely used 16S rRNA gene. The use of ribosomal proteins avoids artefacts that would arise from phylogenies constructed using genes with unrelated functions and subject to different evolutionary processes. Another important advantage of the chosen ribosomal proteins is that they tend to be syntenic and co-located in a small genomic region in Bacteria and Archaea, reducing binning errors that could substantially perturb the geometry of the tree. Included in this tree is one representative per genus for all genera for which high-quality draft and complete genomes exist (3,083 organisms in total).
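For readers who want to see what "aligned and concatenated" means in practice, here is a small Python sketch of the concatenation step. It is a toy under stated assumptions, not the authors' pipeline: the gene labels and sequences are made up, each gene is assumed to be aligned already, and an organism missing a gene is padded with gap characters so every row of the final matrix has the same length.

```python
# Toy concatenation of per-gene protein alignments into one "supermatrix".
# ASSUMPTIONS: each gene's alignment is already computed (all rows the same
# length), and missing genes are filled with gaps. Names are illustrative.


def concatenate_alignments(alignments, gene_order):
    """Join per-gene alignments end to end, one long row per organism.

    alignments: dict mapping gene name -> dict of organism -> aligned sequence
    gene_order: list of gene names giving the order of concatenation
    """
    organisms = sorted({org for gene in gene_order for org in alignments[gene]})
    rows = {org: [] for org in organisms}
    for gene in gene_order:
        per_gene = alignments[gene]
        lengths = {len(seq) for seq in per_gene.values()}
        if len(lengths) != 1:
            raise ValueError(f"sequences for {gene} are not aligned to one length")
        width = lengths.pop()
        for org in organisms:
            # Pad with gaps when this organism lacks the gene.
            rows[org].append(per_gene.get(org, "-" * width))
    return {org: "".join(parts) for org, parts in rows.items()}


if __name__ == "__main__":
    toy = {
        "geneA": {"org1": "MK-T", "org2": "MKAT"},
        "geneB": {"org1": "GG", "org3": "GA"},
    }
    for org, row in concatenate_alignments(toy, ["geneA", "geneB"]).items():
        print(org, row)
```

The long rows that come out are what a tree-building program would take as input. With 16 genes instead of two you get many more informative positions per organism, which is where the higher resolution comes from.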
The observation that these ribosomal protein genes tend to be near each other on the chromosome allows them to get a big chunk of genome. After they sequenced these genes, they aligned and concatenated them: putting all 16 sequences end to end into one big sequence. That big sequence was then subject to phylogenetic (“family tree”) analysis, and, lo and behold, below is the tree they got, taken from the paper (click to enlarge):
The groups that were previously unknown as organisms are indicated with red dots, and the top part of the graph comprises bacteria. The archaea are the smaller group of colored taxa at lower left, while the eukaryotic DNA (and organisms) are at lower right. Note that bacteria are by far the most common new taxa they found (red dots), but a lot of archaea were also new. There were, as expected, no new eukaryotes, as we know most of the sequences of their groups. Also, although the authors say they can’t definitively resolve the placement of eukaryotes in the tripartite group, they do say that eukaryotes seem to have arisen from within archaea, and we now know that is true.
What is most striking about the figure above is the huge radiation in purple at upper right, all of which are new taxa (I believe the authors consider them “phyla”). They call this group the Candidate Phyla Radiation, or CPR. It has hundreds of lineages new to science! And many of the archaea were new, too. Altogether, this shows that the diversity of life, as judged from DNA sequences in the environment, is far greater than we knew. But we expect that, don’t we? There are so many places bacteria can live, not that many people go looking for new ones, and they are small.
Here’s what you get when you put all the prokaryotic species into a conventional phylogenetic tree with branch lengths (click to enlarge). The CPR of bacteria is in purplish-blue at the bottom, all of which are new.
One final remark. Further “metagenomic” analysis showed that members of the CPR are unusual in that, like the new archaeal species I mentioned yesterday, they have relatively small genomes and “restricted metabolic capacities.” None of the CPRs have complete citric acid cycles; they also lack respiratory chains and have little or no capacity to synthesize amino acids or nucleotides. They must get these things (vital for life) from the environment, which may mean that these microbes live as parasites or symbionts. (That, of course, would make them harder to detect.) It’s not clear whether this lack of genetic abilities is a secondary reduction of a formerly complete set of abilities, or an early stage of building up metabolism. (Remember that the archaeon discussed yesterday had no genes for metabolism.)
Here is the authors’ conclusion:
The tree of life as we know it has dramatically expanded due to new genomic sampling of previously enigmatic or unknown microbial lineages. This depiction of the tree captures the current genomic sampling of life, illustrating the progress that has been made in the last two decades following the first published genome. What emerges from analysis of this tree is the depth of evolutionary history that is contained within the Bacteria, in part due to the CPR, which appears to subdivide the domain. Most importantly, the analysis highlights the large fraction of diversity that is currently only accessible via cultivation-independent genome-resolved approaches.
All I can say are two things. First, there is surely more information now that expands these data, but I had no time last night to read more than this single paper. We may know most of the vertebrates on the planet, but as for insects, invertebrates, and bacteria, well, we don’t know jack. But that’s good! More work needed and cool things to discover!
Second, it’s a good thing dolphins don’t brush their teeth. But some of them get help:
@dentistry.everyday Squeaky Clean