The secondary metabolism of bacteria, fungi, and plants constitutes a rich source of bioactive compounds of potential pharmaceutical value, comprising biosynthetic pathways of many chemicals that have been and are being utilized in medicine, food manufacturing, and agriculture. The genes encoding the biosynthetic pathways responsible for the production of these metabolites are very often spatially clustered on the chromosome; these genomic loci are referred to as "biosynthetic gene clusters" (BGCs). This genetic architecture has opened up the possibility of straightforward detection of specialized metabolic capacities, in the form of known and unknown biosynthetic pathways, by locating their gene clusters.
With a drop in the costs of sequencing bacterial and fungal genomes (and the ability to reconstruct large numbers of genomes from metagenomes), large numbers of BGCs can now be found in publicly available data. To analyze which BGCs are similar between organisms, several algorithms have been developed to group BGCs into "gene cluster families" (GCFs), which represent groups of gene clusters from different genomes that are genetically similar and are involved in producing the same or similar compounds. Recently, the BiG-SLiCE algorithm was developed, which, for the first time, allowed reconstructing GCFs from all publicly available (meta)genomic data (ref).
The BiG-FAM database contains GCFs calculated by this tool from over 1.2 million genomes, and allows users to easily search and browse them to analyze patterns of biosynthetic diversity across taxa. Additionally, it allows users to query their own BGCs against all GCFs contained in BiG-FAM, in order to see how their BGCs of interest are related to gene clusters from publicly available genomes.
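To make the notion of a GCF concrete, the following minimal sketch groups BGCs by the distance between numeric feature vectors. It is not the BiG-SLiCE implementation: the greedy "leader" clustering, the toy three-dimensional vectors, the BGC names, and the threshold value are all assumptions chosen purely for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def group_into_gcfs(bgc_features, threshold):
    """Greedy leader clustering (illustrative stand-in, not BiG-SLiCE).

    Each BGC joins the first GCF whose centroid lies within `threshold`;
    otherwise it founds a new GCF of its own.
    """
    gcfs = []  # each GCF is {"centroid": [...], "members": [...]}
    for name, vec in bgc_features.items():
        for gcf in gcfs:
            if euclidean(vec, gcf["centroid"]) <= threshold:
                gcf["members"].append(name)
                n = len(gcf["members"])
                # Incrementally update the centroid as the running mean.
                gcf["centroid"] = [
                    c + (v - c) / n for c, v in zip(gcf["centroid"], vec)
                ]
                break
        else:
            gcfs.append({"centroid": list(vec), "members": [name]})
    return gcfs


# Hypothetical feature vectors (e.g. domain presence scores) for three BGCs.
bgc_features = {
    "genomeA_bgc1": [1.0, 0.0, 1.0],
    "genomeB_bgc4": [1.0, 0.0, 0.9],
    "genomeC_bgc2": [0.0, 1.0, 0.0],
}

gcfs = group_into_gcfs(bgc_features, threshold=0.5)
for i, gcf in enumerate(gcfs, start=1):
    print(f"GCF {i}: {gcf['members']}")
```

In this toy input, the two BGCs with nearly identical vectors end up in one family, while the third founds a family of its own. The threshold plays a role loosely analogous to a clustering radius: the smaller it is, the more (and tighter) families result.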
Ranthipeptides (previously known as "SCIFF peptides") are non-bacteriocin ribosomally synthesized and post-translationally modified peptides (RiPPs) prevalent in the class Clostridia (ref), although GC-content analysis indicated that their genes might be horizontally transferred (ref). A recent analysis suggests that these peptides play an important role in regulating cell populations, i.e., via a quorum-sensing mechanism (ref). During our previous effort in charting the global diversity of 1.2 million BGCs, we captured a large group (6,800) of putative ranthipeptide BGCs with diverse patterns of gene neighborhoods flanking the precursor peptides (ref).
To explore this diversity, we can use BiG-FAM’s "GCF search" function with the two signature domains of this BGC class (AS-TIGR03973 and Radical_SAM) as query baits (Panel A). The search returns 79 GCFs, each representing a distinct pattern of BGCs and their distribution across the taxonomy (Panel B). Following the link to each GCF’s detail page shows the taxonomy, nucleotide length, calculated radius, and biosynthetic features shared by the BGCs within the GCF (Panel C). Furthermore, a comparative multi-gene visualization of those BGCs provides an all-in-one view of the diversity of gene neighborhoods flanking the ranthipeptide precursor genes (Panel D).
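The logic of a domain-bait search can be sketched as a simple set-containment filter: a GCF matches when its domain set includes every bait. This is only an illustration of the idea, not BiG-FAM's actual query code; the GCF identifiers and the accessory domains (`ABC_tran`, `Peptidase_M16`) in the toy table are hypothetical.

```python
def search_gcfs_by_domains(gcf_domains, baits):
    """Return the IDs of GCFs whose domain set contains all bait domains."""
    baits = set(baits)
    return sorted(
        gcf_id
        for gcf_id, domains in gcf_domains.items()
        if baits <= set(domains)  # subset test: every bait must be present
    )


# Hypothetical GCF-to-domain table (identifiers are made up for illustration).
gcf_domains = {
    "GCF_0001": {"AS-TIGR03973", "Radical_SAM", "ABC_tran"},
    "GCF_0002": {"Radical_SAM"},
    "GCF_0003": {"AS-TIGR03973", "Radical_SAM", "Peptidase_M16"},
}

hits = search_gcfs_by_domains(gcf_domains, ["AS-TIGR03973", "Radical_SAM"])
print(hits)
```

Requiring both baits excludes families that carry only the widespread Radical_SAM domain, which is why the precursor-specific domain is combined with it in the query.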
Recently, a draft genome has been published (ref) for Streptomyces tunisialbus, a new streptomycete species isolated from the rhizospheric soil of lavender plants (Lavandula officinalis) in Tunisia (ref).
To showcase how BiG-FAM can be used to assess biosynthetic novelty and capture distant relationships of newly sequenced BGCs, we downloaded the assembled genome from ENA (accession: OKRJ01) and uploaded it to the antiSMASH web server, which returned a unique job ID ("bacteria/fungi-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"). Once the run is done, this ID can be used directly to perform GCF analysis in BiG-FAM (Panel A). The entire analysis for the 36 antiSMASH-predicted BGCs was completed in less than a minute, resulting in a summary table of the best BGC-to-GCF hit pairs (Panel B).
One interesting BGC in this genome is the complete, 46.5-kilobase-pair Type-I PKS protocluster from "Region 15.1", which shows an overall low hit rate in both its ClusterBlast and KnownClusterBlast results. A quick look at the GCF analysis result for the BGC shows a significant hit to only one singleton GCF (Panel C), which on follow-up inspection turned out to come from the NCBI-submitted entry of the same genome (accession: GCA_900290435.1), suggesting the novelty of the PKS BGC in question.
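The reasoning behind such a novelty call can be sketched as a nearest-centroid lookup: the query BGC is compared against every GCF, and the best hit counts as significant only if it falls within a distance threshold. The sketch below is a simplified illustration under assumed inputs (toy feature vectors, hypothetical GCF names, member counts, and threshold); it is not the scoring procedure used by BiG-FAM.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_gcf_hit(query_vec, gcf_centroids, threshold):
    """Return (gcf_id, distance, is_significant) for the closest GCF."""
    best_id, best_dist = None, math.inf
    for gcf_id, centroid in gcf_centroids.items():
        dist = euclidean(query_vec, centroid)
        if dist < best_dist:
            best_id, best_dist = gcf_id, dist
    return best_id, best_dist, best_dist <= threshold


# Hypothetical centroids and member counts for two GCFs.
gcf_centroids = {"GCF_A": [1.0, 0.0, 1.0], "GCF_B": [0.0, 1.0, 0.0]}
gcf_sizes = {"GCF_A": 1, "GCF_B": 250}

hit_id, dist, significant = best_gcf_hit([1.0, 0.0, 0.9], gcf_centroids, 0.5)
# A significant hit to a singleton family is weak evidence of relatedness:
# the single member may be another submission of the same genome.
is_singleton_hit = significant and gcf_sizes[hit_id] == 1
print(hit_id, round(dist, 3), significant, is_singleton_hit)
```

A query whose only significant hit is a singleton family, as in the case above, shares its family with at most one other known cluster, which is what motivates checking where that single member comes from before drawing conclusions about novelty.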
Another useful feature is the "tracking" of biosynthetic domains of the query BGC across hundreds to thousands of distant BGCs, showing the domain architectural similarity shared between the genes (Panel D).