The GREEN-DB¶
GREEN-DB is a comprehensive collection of potential regulatory regions in the human genome including ~2.4M regions from 16 data sources and covering ~1.5Gb evenly distributed across chromosomes. The regulatory regions are grouped in 5 categories: enhancer, promoter, silencer, bivalent, insulator.
Each region is described by its genomic location, region type, method(s) of detection, data source and closest gene; ~35% of regions are annotated with controlled genes, ~40% with tissue(s) of activity, and ~14% have associated phenotype(s). GREEN-DB is available as an SQLite database, and region information with controlled genes is also provided as extended BED files for easy integration into existing analysis pipelines.
For details on how the database was compiled, please refer to the original publication: https://doi.org/10.1101/2020.09.17.301960
The GREEN-DB database is free for academic use and can be downloaded from a Zenodo repository. The full database is provided as an SQLite file, and a summary of region-based information is provided as BED files.
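Because BED files keep the chromosome, a 0-based start and an exclusive end in their first three columns, a quick positional lookup needs nothing more than awk. The sketch below is only an illustration and not part of the GREEN-DB tooling: the function name `regions_at_position` is hypothetical, and it assumes a tab-separated file whose chromosome names are written the same way as in the query (for example `chr1` versus `1`).

```shell
# Print every BED record that overlaps a single 1-based genomic position.
# Assumes the first three tab-separated columns are chrom, start (0-based)
# and end (exclusive), as in the standard BED layout.
regions_at_position() {
  bed_file=$1   # path to a BED file
  chrom=$2      # chromosome name, written exactly as in the file
  pos=$3        # 1-based position of interest
  # A 0-based, half-open interval [start, end) covers 1-based
  # positions start+1 .. end, hence start < pos <= end.
  awk -F'\t' -v c="$chrom" -v p="$pos" \
    '$1 == c && ($2 + 0) < (p + 0) && (p + 0) <= ($3 + 0)' "$bed_file"
}
```

For example, `regions_at_position regions.bed chr1 1000000` (with `regions.bed` standing in for one of the downloaded files) would print the regions covering that base, with all extended columns left untouched.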
Database region content¶
GRCh38¶
GREEN-DB | N | Bases covered |
---|---|---|
N Enhancer | 1832830 | 1449153178 |
N Promoter | 565323 | 234315553 |
N Silencer | 4302 | 894792 |
N Bivalent | 8409 | 11210309 |
N Insulator | 23 | 17504 |
All regions | 2410887 | 1502180018 |
GRCh37¶
GREEN-DB | N | Bases covered |
---|---|---|
N Enhancer | 1834183 | 1450755698 |
N Promoter | 566102 | 234890654 |
N Silencer | 4306 | 895868 |
N Bivalent | 8413 | 11215000 |
N Insulator | 23 | 17504 |
All regions | 2413027 | 1504116499 |
SQLite database structure¶
The SQLite database contains 16 tables (expected columns are listed in the image):
- GRCh37 / GRCh38 regions
  - GREEN-DB region coordinates; region type; constraint percentile; closest gene symbol, Ensembl ID and distance; PhyloP100 statistics
- Tissues
  - tissue(s) of activity for a region or a region-gene interaction
- Genes
  - controlled gene(s)
- Methods
  - method(s) supporting each region and region-gene interaction. This may correspond to the data source when no specific method information was available.
- Phenotypes
  - potentially associated phenotypes
- GRCh37 / GRCh38 TFBS
  - transcription factor binding sites
- GRCh37 / GRCh38 DNase
  - DNase hypersensitivity peaks
- GRCh37 / GRCh38 dbSuper
  - super-enhancers as defined by dbSuper
- GRCh37 / GRCh38 LoF_tolerance
  - the probability of LoF tolerance for enhancers
- GRCh37 / GRCh38 UCNE
  - ultraconserved noncoding elements
- GRCh37 / GRCh38 TAD
  - TAD domains from TADKB
Main tables (regions, tissues, genes and methods) are linked by the unique region ID. Additionally, a unique interaction ID identifies each gene-region pair in the gene table and is linked to the methods and tissues tables. Linking tables are also included that map the overlap between GREEN-DB region IDs and the region IDs of each of the TFBS, DNase, dbSuper and LoF_tolerance tables, together with the fraction of overlap.
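Since the main tables share the region ID, they can be combined with an ordinary SQL join. The function below is only a sketch of that idea: the table names `GRCh38_regions` and `genes` and the column names `regionID`, `type` and `gene_symbol` are assumptions made for illustration and may not match the released schema, so check the real names first with the `.tables` and `.schema` commands of the `sqlite3` shell.

```shell
# List the controlled genes recorded for one region by joining the regions
# and genes tables on the shared region ID.
# NOTE: every table and column name below is a placeholder assumption.
genes_for_region() {
  db_file=$1     # path to the GREEN-DB SQLite file
  region_id=$2   # region ID to look up; it is pasted into the SQL as-is,
                 # so only pass trusted values
  sqlite3 -separator "$(printf '\t')" "$db_file" <<SQL
SELECT r.regionID, r.type, g.gene_symbol
FROM GRCh38_regions AS r
JOIN genes AS g ON g.regionID = r.regionID
WHERE r.regionID = '$region_id'
ORDER BY g.gene_symbol;
SQL
}
```

The output is one tab-separated line per region-gene pair. The same join pattern would extend to the tissues and methods tables, using the interaction ID instead of the region ID where a gene-region pair is concerned.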
The constraint metric¶
For each region we calculated a constraint metric representing its tolerance to genetic variation. The constraint value ranges from 0 to 1, with higher values indicating stronger constraint against variation. Regions with high constraint values (especially > 0.9) are more likely to control essential genes and genes involved in human diseases. The constraint value is also higher for regions whose controlled genes are intolerant to LoF variants according to the gnomAD oe_lof metric.
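In practice the constraint value lends itself to simple threshold filtering. A minimal sketch, assuming the regions table is called `GRCh38_regions` and stores the metric in a column named `constraint_pct` (both names are placeholders and should be checked against the actual schema), selects the regions above a chosen cutoff:

```shell
# Print region ID and constraint value for regions above a cutoff,
# highest constraint first.
# NOTE: table and column names are placeholder assumptions.
high_constraint_regions() {
  db_file=$1          # path to the GREEN-DB SQLite file
  cutoff=${2:-0.9}    # constraint threshold; defaults to 0.9
  sqlite3 -separator "$(printf '\t')" "$db_file" <<SQL
SELECT regionID, constraint_pct
FROM GRCh38_regions
WHERE constraint_pct > $cutoff
ORDER BY constraint_pct DESC;
SQL
}
```

Calling `high_constraint_regions RegulatoryRegions.db 0.95` would tighten the filter; the cutoff is a free parameter rather than a recommended value.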
Summary of the building process¶
In GREEN-DB we collected and aggregated information from 17 different sources, including
- 8 previously published curated databases
- 6 experimental datasets from recently published articles
- predicted regulatory regions from 3 different algorithms
Four additional datasets were included to integrate region to gene / phenotype relationships. We also collected additional data useful in evaluating the regulatory role of genomic regions, including
- TFBS and DNase peaks
- ultraconserved non-coding elements (UCNE)
- super-enhancer definitions
- enhancer LoF tolerance
Extract database tables¶
Using bash¶
You can extract all tables of the database to tab-separated files using a bash script. In the following example the db file is provided as the first argument and each table is saved as a .tsv file in the current directory.
# path to the SQLite database, passed as first argument
dbfile=$1
# obtain the names of all data tables (SQLite internal tables are skipped)
TS=$(sqlite3 "$dbfile" "SELECT tbl_name FROM sqlite_master WHERE type='table' AND tbl_name NOT LIKE 'sqlite_%';")
# export each table to <table>.tsv, with a header line
for T in $TS; do
sqlite3 "$dbfile" <<!
.headers on
.mode tabs
.output $T.tsv
select * from "$T";
!
done
Using R¶
You can extract tables from the database in R using the RSQLite package. In the example below we read every table into a data frame and store them in a named list (dbtables).
library("RSQLite")
## connect to the SQLite database
con <- dbConnect(drv=RSQLite::SQLite(), dbname="SQlite/RegulatoryRegions.db")
## list all data tables
tables <- dbListTables(con)
## create a data.frame for each table, stored in a list named by table
dbtables <- list()
for (i in seq_along(tables)) {
  dbtables[[tables[i]]] <- dbGetQuery(conn=con, statement=paste0("SELECT * FROM '", tables[i], "'"))
}
## close the connection when done
dbDisconnect(con)