Do you have a question about the database? Feel free to send us an e-mail (please put [BiG-FAM HELP] in the subject line). Please first check the Frequently Asked Questions to see whether your question is already answered there.
Purpose of the database

The secondary metabolism of bacteria, fungi, and plants constitutes a rich source of bioactive compounds of potential pharmaceutical value, comprising biosynthetic pathways of many chemicals that have been and are being utilized in medicine, food manufacturing and agriculture. The genes encoding the biosynthetic pathways responsible for the production of these metabolites are very often spatially clustered on the chromosome; these genomic loci are referred to as "biosynthetic gene clusters" (BGCs). This genetic architecture has opened up the possibility of straightforward detection of specialized metabolic capacities, in the form of known and unknown biosynthetic pathways, by locating their gene clusters.

With a drop in the costs of sequencing bacterial and fungal genomes (and the ability to reconstruct large numbers of genomes from metagenomes), large numbers of BGCs can now be found in publicly available data. To analyze which BGCs are similar between organisms, several algorithms have been developed to group BGCs into "gene cluster families" (GCFs), which represent groups of gene clusters from different genomes that are genetically similar and are involved in producing the same or similar compounds. Recently, the BiG-SLiCE algorithm was developed, which, for the first time, allowed reconstructing GCFs from all publicly available (meta)genomic data (ref).

The BiG-FAM database contains GCFs calculated by this tool from over 1.2 million BGCs, and allows users to easily search and browse them to analyze patterns of biosynthetic diversity across taxa. Additionally, it allows users to query their own BGCs against all GCFs contained in BiG-FAM, in order to see how their BGCs of interest are related to gene clusters from publicly available genomes.

Example use cases

Exploring ranthipeptide BGC diversity

Ranthipeptides (previously known as "SCIFF peptides") are non-bacteriocin ribosomally synthesized and post-translationally modified peptides (RiPPs) prevalent in the class Clostridia (ref), although GC-content analysis indicates that their genes may have been horizontally transferred (ref). A recent analysis shows that these peptides play an important role in regulating cell populations via a quorum-sensing mechanism (ref). During our previous effort to chart the global diversity of 1.2 million BGCs, we captured a large group (6,800) of putative ranthipeptide BGCs with diverse patterns of gene neighborhoods flanking the precursor peptides (ref).

To explore this diversity, we can use BiG-FAM’s "GCF search" function with the two signature domains of this BGC class (AS-TIGR03973 and Radical_SAM) as query baits (Panel A). The search returns 79 GCFs, each representing a distinct pattern of BGCs and their distribution across the taxonomy (Panel B). Following the link to a GCF’s detail page shows the taxonomy, nucleotide length, calculated radius and biosynthetic features shared by the BGCs within that GCF (Panel C). Furthermore, a comparative multi-gene visualization of those BGCs provides an all-in-one view of the diversity of gene neighborhoods flanking the ranthipeptide precursor genes (Panel D).

A. Clicking on the "GCF" page link (box 1) in the main menu opens an interface for searching GCFs based on multiple criteria; in this case we search for "bacterial GCFs harboring AS-TIGR03973 and Radical_SAM biosynthetic domains in at least ~80% of their BGCs" (box 2). B. After the filter function is applied (box 3), BiG-FAM returns a list of 79 GCFs satisfying the criteria. C. Clicking on the "view" button of a GCF (box 4) takes users to a detail page that shows several statistics on the distribution of the GCF’s taxonomy, length, and features (domains). D. On the GCF detail page, users may also choose to view an "arrower" visualization of the BGCs (box 5), which in this case shows the occurrence of neighboring biosynthetic genes (depicted as colored arrows) flanking the queried cysteine-rich precursor + rSAM gene pairs (blue boxes).
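
The search criterion in Panel A can be thought of as a prevalence filter. The following is a minimal sketch of that idea on toy data, not BiG-FAM's actual implementation: the function name `gcfs_with_domains`, the dictionary layout, and all domain names other than the two signature domains are hypothetical, and it assumes that "in at least ~80% of their BGCs" means both domains occur together in the same BGC.

```python
from typing import Dict, List, Set


def gcfs_with_domains(
    gcfs: Dict[str, List[Set[str]]],
    required: Set[str],
    min_fraction: float = 0.8,
) -> List[str]:
    """Return IDs of GCFs in which at least `min_fraction` of the member
    BGCs contain every domain in `required` (hypothetical helper)."""
    hits = []
    for gcf_id, bgcs in gcfs.items():
        if not bgcs:
            continue  # skip empty families to avoid dividing by zero
        # A BGC matches when the required domains are a subset of its domains.
        n_match = sum(1 for domains in bgcs if required <= domains)
        if n_match / len(bgcs) >= min_fraction:
            hits.append(gcf_id)
    return sorted(hits)


# Toy data: each GCF maps to a list of BGCs, each BGC to its set of domains.
toy_gcfs = {
    "GCF_A": [
        {"AS-TIGR03973", "Radical_SAM", "ABC_tran"},
        {"AS-TIGR03973", "Radical_SAM"},
    ],
    "GCF_B": [{"Radical_SAM"}, {"AS-TIGR03973", "Radical_SAM"}],
    "GCF_C": [{"PKS_KS"}],
}

signature = {"AS-TIGR03973", "Radical_SAM"}
print(gcfs_with_domains(toy_gcfs, signature))
```

Lowering `min_fraction` widens the result set; in this toy data a value of 0.5 would also admit "GCF_B", where only one of two BGCs carries both domains.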
GCF analysis on a newly sequenced Streptomyces

Recently, a draft genome has been published (ref) for Streptomyces tunisialbus, a new streptomycete species isolated from the rhizospheric soil of lavender plants (Lavandula officinalis) in Tunisia (ref).

To showcase how BiG-FAM can be used to assess biosynthetic novelty and capture distant relationships of newly sequenced BGCs, we downloaded the assembled genome from ENA (accession: OKRJ01) and uploaded it to the antiSMASH web server, which returned a unique job ID ("bacteria/fungi-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"). Once the run has finished, this ID can be used directly to perform GCF analysis in BiG-FAM (Panel A). The entire analysis of the 36 antiSMASH-predicted BGCs was completed in less than a minute, resulting in a summary table of the best BGC-to-GCF hit pairs (Panel B).

One interesting BGC in this genome is the complete, 46.5 kilobase-pair long Type-I PKS protocluster from "Region 15.1", which shows an overall low hit rate in both its ClusterBlast and KnownClusterBlast results. A quick look at the GCF analysis result for this BGC shows a significant hit to only one singleton GCF (Panel C), which on follow-up inspection turned out to come from the NCBI-submitted entry of the same genome (accession: GCA_900290435.1), suggesting that the PKS BGC in question is novel.

Another useful feature is the "tracking" of biosynthetic domains of the query BGC across hundreds to thousands of distant BGCs, showing the domain architectural similarity shared between the genes (Panel D).

A. When users click on the "Query" section of the main menu (box 1), they are shown an input form in which to enter a finished antiSMASH or fungiSMASH job ID. After "Submit" is pressed, BiG-FAM immediately executes (or queues) the downloading, preprocessing and GCF matching of all BGCs (i.e. regions) included in the submitted run. B. A list is then shown summarizing all best BGC-to-GCF pairings, with distances lower than 900 (the original threshold value) highlighted in green, indicating a good match to at least one GCF in the database. One query BGC, "Region 15.1", was selected for a detailed look (box 3), as mentioned in the main text. C. A list of the five best-matching GCFs and their models’ distances to the query BGC, showing an exact match (d = 0) to a singleton GCF from Streptomyces (GCF_24649, box 4), which turned out to be the same BGC from the same genome. The second closest GCF on the list (GCF_06303, with d = 1,609) can also be visualized (box 5). D. The visualization shows the co-occurrence of protein domains across the distantly related BGCs, where some similar but non-identical PKS genes (the longest multi-domain gene in each GCF) seem to act as an "anchor" that defines the GCF. While this group of anchor genes has a domain architecture similar to that of the PKS gene of the queried BGC (box 6), a quick BLASTp analysis against one example gene (box 7) shows only 52.63% similarity, suggesting that the BGC does not actually belong to the GCF.
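
The way Panels B and C are read can be sketched in a few lines: rank the GCFs by distance to the query BGC, keep the five closest, and flag any distance below the threshold of 900. This is an illustration only, not BiG-FAM's code. The function name `rank_gcf_hits` is hypothetical, and only the first two distances (0 and 1,609) come from the example above; the remaining entries are made-up placeholders.

```python
from typing import Dict, List, Tuple

DISTANCE_THRESHOLD = 900.0  # the "original threshold value" quoted in the caption


def rank_gcf_hits(
    distances: Dict[str, float],
    top_n: int = 5,
    threshold: float = DISTANCE_THRESHOLD,
) -> List[Tuple[str, float, bool]]:
    """Return the `top_n` closest GCFs as (gcf_id, distance, is_good_match),
    where is_good_match means the distance is strictly below `threshold`."""
    ranked = sorted(distances.items(), key=lambda item: item[1])[:top_n]
    return [(gcf_id, d, d < threshold) for gcf_id, d in ranked]


# Distances from one query BGC to several GCF models.
# GCF_24649 and GCF_06303 use the values from the example; the rest are invented.
query_distances = {
    "GCF_06303": 1609.0,
    "GCF_24649": 0.0,
    "GCF_placeholder_1": 1700.0,
    "GCF_placeholder_2": 1850.0,
    "GCF_placeholder_3": 1900.0,
    "GCF_placeholder_4": 2100.0,
}

hits = rank_gcf_hits(query_distances)
for gcf_id, distance, is_good in hits:
    label = "good match" if is_good else "distant"
    print(f"{gcf_id}\t{distance:.0f}\t{label}")
```

In this sketch only the singleton GCF_24649 falls below the threshold, which mirrors the observation that the query BGC has no close relatives other than its own NCBI-submitted copy.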

Frequently Asked Questions (FAQs)

How were the GCFs hosted in BiG-FAM calculated?
GCFs hosted here were reconstructed using the BiG-SLiCE algorithm. A description of this algorithm can be found in the BiG-SLiCE preprint.
How do I compare my own BGC against the GCFs in BiG-FAM?
First, run antiSMASH (or fungiSMASH) on your genome/assembly. It will provide you with a job ID (e.g. bacteria-3db13cf8-3367-4428-b305-6a3ce6d8bb0e). Once the job has finished (and not before!), you can enter this job ID into the input field on the query page and click ‘Submit’. After the query computation finishes, the output page will show you the distances to the GCFs most closely related to your query BGC. You can then view these GCFs to study the genetic architectures and taxonomic distribution of the underlying BGCs.
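If you collect job IDs in a script or spreadsheet, a quick format check before pasting them into the query form can catch copy-and-paste mistakes. The sketch below assumes that a job ID is the prefix "bacteria" or "fungi" followed by a hyphen and five hyphen-separated groups of lowercase hexadecimal digits (8-4-4-4-12), as in the example above; the helper name `looks_like_job_id` is hypothetical, and the check says nothing about whether the job exists or has finished.

```python
import re

# Assumed shape: <bacteria|fungi>-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx (hex digits).
JOB_ID_PATTERN = re.compile(
    r"(bacteria|fungi)-[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)


def looks_like_job_id(text: str) -> bool:
    """Return True if `text` has the assumed job ID shape.

    Surrounding whitespace is ignored; the whole string must match.
    """
    return JOB_ID_PATTERN.fullmatch(text.strip()) is not None


print(looks_like_job_id("bacteria-3db13cf8-3367-4428-b305-6a3ce6d8bb0e"))
print(looks_like_job_id("3db13cf8-3367-4428-b305-6a3ce6d8bb0e"))
```

The first call prints `True` and the second prints `False`, because the taxon prefix is missing.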
How do I search for BGCs or GCFs with certain characteristics (protein domains, taxonomy, etc.)?
By clicking ‘GCFs’ or ‘BGCs’ in the menu on the left, you can view the BiG-FAM data by GCF or by BGC, respectively. At the top of the page, several options are provided to filter/search the data; these are applied when you click ‘Apply’.
How is BiG-FAM related to antiSMASH, MIBiG, BiG-SCAPE and BiG-SLiCE?
MIBiG is a database of BGCs of known function, whereas antiSMASH identifies BGCs in genome sequences. BiG-SCAPE and BiG-SLiCE both group BGCs into GCFs: BiG-SCAPE uses sequence similarity networking (network-based; slower, but more sensitive in capturing copy number variations, protein similarity and synteny), while BiG-SLiCE uses a vectorization approach (much faster, but currently lacking sensitivity for some classes, e.g. RiPPs).
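To give a rough feel for what a "vectorization approach" means, the toy sketch below turns each BGC into a fixed-length vector of domain counts and measures how far a query lies from the average vector (centroid) of a family. It is an assumption-laden illustration, not BiG-SLiCE itself: the real feature set and distance calculation are described in the BiG-SLiCE preprint, and the helper names and the three-domain vocabulary here are made up.

```python
import math
from collections import Counter
from typing import List, Sequence


def domain_vector(domains: Sequence[str], vocabulary: Sequence[str]) -> List[float]:
    """Count how often each vocabulary domain occurs in one BGC."""
    counts = Counter(domains)
    return [float(counts[name]) for name in vocabulary]


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Column-wise mean of equally long vectors (a stand-in for a GCF model)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two equally long vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Illustrative vocabulary and BGCs; names are placeholders.
vocabulary = ["PKS_KS", "PKS_AT", "Radical_SAM"]
family = [
    domain_vector(["PKS_KS", "PKS_AT"], vocabulary),
    domain_vector(["PKS_KS", "PKS_KS", "PKS_AT"], vocabulary),
]
model = centroid(family)
query = domain_vector(["PKS_KS", "PKS_AT", "Radical_SAM"], vocabulary)
distance = euclidean(query, model)
print(model, round(distance, 3))
```

Because every BGC is reduced to a vector once, comparing a query against many family models is a cheap numeric operation, which is one intuition for the speed advantage mentioned above; the price is that information such as gene order is not captured by simple counts.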
Can I set up a copy of this database on my own (local) servers?
Yes. Just follow the instructions provided with the source code for the BiG-FAM database. All code is freely available under the GNU Affero General Public License v3.0.
What is the privacy policy of antiSMASH concerning the sequence data used for query mode?
The data submission uses the antiSMASH server, where the submitted data are temporarily stored. It thus follows the same privacy policy as antiSMASH. We try to keep this site and the data that it analyzes as safe and secure as possible. Your output files will be deleted from the antiSMASH server within one month. However, sending your data to the website is at your own risk. If you are concerned about the sensitivity of your data, please make your own local copy of the data and use BiG-SLiCE for querying.
From which studies were the genomes and MAGs used in BiG-FAM sourced?
Please refer to this table.
How do I (bulk) download GCF data from BiG-FAM?
For BiG-FAM version 1.0.0, we used the same data generated in the BiG-SLiCE study (the data are licensed under a Creative Commons CC-BY license).
How do I cite BiG-FAM?
Please refer to this page.