The GREEN-DB

GREEN-DB is a comprehensive collection of potential regulatory regions in the human genome including ~2.4M regions from 16 data sources and covering ~1.5Gb evenly distributed across chromosomes. The regulatory regions are grouped in 5 categories: enhancer, promoter, silencer, bivalent, insulator.

Each region is described by its genomic location, region type, method(s) of detection, data source and closest gene; ~35% of regions are annotated with controlled genes, ~40% with tissue(s) of activity, and ~14% have associated phenotype(s). GREEN-DB is available as an SQLite database, and region information with controlled genes is also provided as extended BED files for easy integration into existing analysis pipelines.

For details on how the database was compiled, please refer to the original publication: https://doi.org/10.1101/2020.09.17.301960

GREEN-DB is free for academic use and can be downloaded from a Zenodo repository. The full database is provided as an SQLite file, and a summary of region-based information is provided as BED files.

Database region content

GRCh38

Region type	N	Bases covered
Enhancer	1832830	1449153178
Promoter	565323	234315553
Silencer	4302	894792
Bivalent	8409	11210309
Insulator	23	17504
All regions	2410887	1502180018

GRCh37

Region type	N	Bases covered
Enhancer	1834183	1450755698
Promoter	566102	234890654
Silencer	4306	895868
Bivalent	8413	11215000
Insulator	23	17504
All regions	2413027	1504116499

Summary statistics on the database

[Figure: Summary of the database. Main information in the database.]

[Figure: Gene-region connections. Summary information on gene-region connections.]

SQLite database structure

The SQLite database contains 16 tables (expected columns are listed in the image):

GRCh37 / GRCh38 regions: GREEN-DB region coordinates; region type; constraint percentile; closest gene symbol, Ensembl ID and distance; PhyloP100 statistics
Tissues: tissue(s) of activity for a region or a region-gene interaction
Genes: controlled gene(s)
Methods: method(s) supporting each region and region-gene interaction. This may correspond to the data source when no specific method information was available.
Phenotypes: potentially associated phenotypes
GRCh37 / GRCh38 TFBS: transcription factor binding sites
GRCh37 / GRCh38 DNase: DNase hypersensitivity peaks
GRCh37 / GRCh38 dbSuper: super-enhancers as defined by dbSuper
GRCh37 / GRCh38 LoF_tolerance: the probability of LoF tolerance for enhancers
GRCh37 / GRCh38 UCNE: ultraconserved noncoding elements
GRCh37 / GRCh38 TAD: TAD domains from TADKB

The main tables (regions, tissues, genes and methods) are linked by the unique region ID. Additionally, a unique interaction ID identifies each gene-region pair in the gene table, and it is linked to the methods and tissues tables. Linking tables are included that map the overlap between GREEN-DB region IDs and each of the TFBS, DNase, dbSuper and LoF_tolerance region IDs, also reporting the fraction of overlap.

[Figure: SQLite DB structure. A schematic representation of GREEN-DB.]
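The region ID link described above can also be followed outside SQLite. The following is a minimal sketch, assuming a regions table and a gene table have each been exported to a tab-separated file with a header row. The key column name `regionID`, the other column names and the demo rows are hypothetical; check the real headers before use. The helper `join_by_column` is not part of GREEN-DB.

```shell
# join_by_column LEFT.tsv RIGHT.tsv KEY
# For every row of RIGHT whose KEY value also occurs in LEFT,
# print the LEFT row followed by the RIGHT row (tab-separated).
join_by_column() {
    awk -F'\t' -v OFS='\t' -v key="$3" '
        # locate the key column in the header of each file
        FNR == 1 {
            k = 0
            for (i = 1; i <= NF; i++) if ($i == key) k = i
            if (!k) { print "column not found: " key | "cat 1>&2"; exit 1 }
        }
        # first file: remember each row by its key value (header included)
        NR == FNR { left[$k] = $0; next }
        # second file: print the pair when the key was seen in the first file
        ($k in left) { print left[$k], $0 }
    ' "$1" "$2"
}

# Demo on made-up rows (hypothetical IDs, columns and gene names)
tmpdir=$(mktemp -d)
printf 'regionID\ttype\nR1\tenhancer\nR2\tpromoter\n' > "$tmpdir/regions.tsv"
printf 'regionID\tgene_symbol\nR1\tGENE_A\nR1\tGENE_B\nR3\tGENE_C\n' > "$tmpdir/genes.tsv"
joined=$(join_by_column "$tmpdir/regions.tsv" "$tmpdir/genes.tsv" regionID)
printf '%s\n' "$joined"
rm -rf "$tmpdir"
```

Regions without any gene row (R2 in the demo) and gene rows pointing to an unknown region (R3) are dropped, so the result behaves like an inner join. The sketch assumes each region ID occurs once in the first file.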

The constraint metric

For each region we calculated a constraint metric representing its tolerance to genetic variation. The constraint ranges from 0 to 1, with higher values indicating stronger constraint against variation. Regions with high constraint values (especially > 0.9) are more likely to control essential genes and genes involved in human diseases. The constraint value is also higher for regions associated with genes intolerant to LoF variants according to the gnomAD oe_lof metric.

[Figure: Constraint metric distribution. Constraint values for regions associated with essential/pathogenic genes.]
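To illustrate how the 0.9 threshold mentioned above might be applied, here is a minimal sketch that keeps the rows of a headered, tab-separated regions file whose constraint value is at least a given cutoff. The column name `constraint_pct`, the helper `filter_by_column` and the demo rows are assumptions made for illustration; the actual column name should be taken from the file header.

```shell
# filter_by_column FILE COLUMN_NAME THRESHOLD
# Print the header plus every row whose COLUMN_NAME value is >= THRESHOLD.
filter_by_column() {
    awk -F'\t' -v col="$2" -v min="$3" '
        NR == 1 {
            # locate the requested column in the header row
            for (i = 1; i <= NF; i++) if ($i == col) idx = i
            if (!idx) { print "column not found: " col | "cat 1>&2"; exit 1 }
            print
            next
        }
        # numeric comparison; matching rows are printed unchanged
        ($idx + 0) >= (min + 0)
    ' "$1"
}

# Demo on made-up rows (hypothetical IDs, column names and values)
demo=$(mktemp)
printf 'regionID\ttype\tconstraint_pct\n' > "$demo"
printf 'R1\tenhancer\t0.95\nR2\tpromoter\t0.40\nR3\tsilencer\t0.91\n' >> "$demo"
high=$(filter_by_column "$demo" constraint_pct 0.9)
printf '%s\n' "$high"
rm -f "$demo"
```

Looking the column up by name rather than by position keeps the filter usable if the column order differs between files.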

Summary of the building process

In GREEN-DB we collected and aggregated information from 17 different sources, including

8 previously published curated databases
6 experimental datasets from recently published articles
predicted regulatory regions from 3 different algorithms

Four additional datasets were included to integrate region to gene / phenotype relationships. We also collected additional data useful in evaluating the regulatory role of genomic regions, including

TFBS and DNase peaks
ultraconserved non-coding elements (UCNE)
super-enhancer definitions
enhancer LoF tolerance

[Figure: Build the database. Summary of the GREEN-DB building process.]

Extract database tables

Using bash

You can extract all tables of the database to tab-separated files using a bash script. In the following example the database file is provided as the first argument, and all tables are saved as .tsv files in the current directory.

# path to the SQLite database file, given as the first argument
dbfile=$1

# obtains all data tables from database
TS=$(sqlite3 "$dbfile" "SELECT tbl_name FROM sqlite_master WHERE type='table' AND tbl_name NOT LIKE 'sqlite_%';")

# exports each table to tsv (header row included)
for T in $TS; do
sqlite3 "$dbfile" <<!
.headers on
.mode tabs
.output $T.tsv
select * from $T;
!
done

Using R

You can extract tables from the database in R using the RSQLite package. In the example below we extract all tables to data frames in a named list (dbtables)

library("RSQLite")

## connect to the SQLite database
con <- dbConnect(drv=RSQLite::SQLite(), dbname="SQlite/RegulatoryRegions.db")

## list all data tables
tables <- dbListTables(con)

## create a data.frame for each table, stored in a named list
dbtables <- list()
for (tbl in tables) {
    dbtables[[tbl]] <- dbGetQuery(conn=con, statement=paste0("SELECT * FROM '", tbl, "'"))
}

## close the connection when done
dbDisconnect(con)