GREEN-VARAN workflow¶
To perform small variant prioritization as described in the GREEN-DB manuscript, GREEN-VARAN needs some annotations to be already present in your input VCF (see Prioritization of small variants).
This Nextflow workflow automates the whole process, first adding the required annotations and then performing the greenvaran annotation. The workflow has been tested on Nextflow >=v20.10.
Usage¶
The typical usage scenario starts with a VCF file that already contains gene consequence annotations from SnpEff or bcftools. From the GREEN-VARAN tool main folder you can then perform all annotations using the following command. This will add a minimum set of information to your VCF, including:
- population allele frequency (AF) from gnomAD genomes v3.1.1 (GRCh38) or v2.1.1 (GRCh37)
- functional region overlaps for TFBS, DNase peaks and UCNE
- prediction score values for ncER, FATHMM-MKL, ReMM
- GREEN-DB information on regulatory variants with prioritization levels
nextflow workflow/main.nf \
-profile local \
--input input_file.vcf.gz \
--build GRCh38 \
--out results \
--scores best \
--regions best \
--AF \
--greenvaran_config config/prioritize_smallvars.json \
--greenvaran_dbschema config/greendb_schema_v2.5.json
If requested annotation files are missing, they will be automatically downloaded to the default location (the resources folder within the main GREEN-VARAN folder).
Note that --input can accept multiple vcf.gz files using a quoted pattern like 'inputdir/*.vcf.gz'.
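A minimal sketch of a multi-file launch is shown below. The helper name `preview_multi_input` is hypothetical and only prints the command for review; it does not run Nextflow. The single quotes around the pattern are there so that the shell does not expand it before the workflow sees it, on the assumption that the workflow resolves the pattern itself.

```shell
# Hypothetical helper: print (do not run) the launch command for a file pattern.
# The pattern is wrapped in single quotes in the printed command so the shell
# passes it to the workflow as one unexpanded argument.
preview_multi_input() {
    pattern=$1
    printf '%s\n' "nextflow workflow/main.nf -profile local \
--input '$pattern' --build GRCh38 --out results \
--scores best --regions best --AF \
--greenvaran_config config/prioritize_smallvars.json \
--greenvaran_dbschema config/greendb_schema_v2.5.json"
}

preview_multi_input 'inputdir/*.vcf.gz'
```

Copy the printed line into your terminal once the options look right.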
Add additional custom annotations¶
If you have additional custom annotations that you want to add to your VCF before greenvaran processing, you can configure them in a .toml file and pass this file to the workflow using --anno_toml.
A toml file is an annotation configuration file used by the vcfanno tool and is described in the vcfanno repository (https://github.com/brentp/vcfanno).
A minimal example is reported below.
[[annotation]]
file="ExAC.vcf" #source file
fields = ["AF", "AF_nfe"] #INFO fields to be extracted from source
ops=["self", "max"] #How to treat source values
names=["exac_af", "exac_af_nfe_max"] #names used in the annotated file
[[annotation]]
file="regions_score.bed.gz"
columns = [4, 5] #When using a BED or TSV files you can refer to values by col index
names=["regions_ids", "score_max"]
ops=["uniq","max"]
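As a rough sketch of how such a file could be created and handed to the workflow, the snippet below writes the BED-based annotation from the example above to a file and prints the matching launch command. The file name `custom_anno.toml` is a hypothetical choice, and the command is only printed, not executed. Note that vcfanno expects annotation sources to be bgzip-compressed and tabix-indexed.

```shell
# Hypothetical file name for the custom annotation config
toml_file=custom_anno.toml

# Write one vcfanno annotation block (same content as the BED example above)
cat > "$toml_file" <<'EOF'
[[annotation]]
file="regions_score.bed.gz"
columns = [4, 5]
names=["regions_ids", "score_max"]
ops=["uniq","max"]
EOF

# Print the launch command for review instead of running it
printf '%s\n' "nextflow workflow/main.nf -profile local \
--input input_file.vcf.gz --build GRCh38 --out results \
--scores best --regions best --AF \
--greenvaran_config config/prioritize_smallvars.json \
--greenvaran_dbschema config/greendb_schema_v2.5.json \
--anno_toml $toml_file"
```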
Resources¶
To perform annotations, the GREEN-VARAN Nextflow workflow requires a series of supporting files.
By default, these resources are expected in the resources folder within the main tool folder.
You can pass an alternative resource folder using the --resource_folder option, but the same structure is expected in this folder.
The expected folder structure is as follows:
.
|-- SQlite
|   `-- GREEN-DB_v2.5.db
|-- GRCh37
|   `-- BED / TSV files used for GRCh37 genome build
`-- GRCh38
    `-- BED / TSV files used for GRCh38 genome build
Use the --list_data option to see the full list of available resources and the expected path for each one.
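Before launching the workflow with a custom resource folder, it can help to confirm that the top-level layout matches the structure described above. The function below is a minimal sketch: `check_resources` is a hypothetical helper, and it only checks the three top-level entries, not the individual BED / TSV files (use --list_data for those).

```shell
# Hypothetical helper: report missing top-level entries in a resource folder.
# Returns 0 when the SQlite database and both genome build folders exist.
check_resources() {
    dir=$1
    status=0
    for item in SQlite/GREEN-DB_v2.5.db GRCh37 GRCh38; do
        if [ ! -e "$dir/$item" ]; then
            echo "missing: $dir/$item"
            status=1
        fi
    done
    return $status
}

# Check the default location, relative to the GREEN-VARAN main folder
check_resources resources || echo "resource folder is incomplete"
```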
Automated download¶
A supporting workflow is provided to automate data download for all resources included in the GREEN-DB collection. You can list the available resources and their resulting download locations using:
nextflow workflow/download.nf --list_data
The recommended set of annotations can be downloaded to the default location using the following command. You can also set an alternative resource folder using the --resource_folder option.
nextflow workflow/download.nf \
-profile local \
--scores best \
--regions best \
--AF \
--db
Otherwise, single files are available for download from the Zenodo repository, and all file locations are listed in the GREENDB_collection.txt file under the resources folder.
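When an alternative resource folder is used, the same --resource_folder value has to be given to both the download workflow and the main workflow. The sketch below prints the two commands for a given folder; `print_commands` is a hypothetical helper, the folder path is a placeholder, and nothing is executed.

```shell
# Hypothetical helper: print the download and annotation commands that share
# one custom resource folder. Commands are printed only, not run.
print_commands() {
    res_dir=$1
    printf '%s\n' \
        "nextflow workflow/download.nf -profile local --scores best --regions best --AF --db --resource_folder $res_dir" \
        "nextflow workflow/main.nf -profile local --input input_file.vcf.gz --build GRCh38 --out results --scores best --regions best --AF --greenvaran_config config/prioritize_smallvars.json --greenvaran_dbschema config/greendb_schema_v2.5.json --resource_folder $res_dir"
}

# Placeholder path; replace with your own location
print_commands /path/to/my_resources
```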
Workflow configuration¶
The workflow has pre-configured profiles for most popular schedulers (sge, lsf, slurm) and also a local profile (local). These profiles determine how many download jobs can be submitted concurrently and the number of threads used for annotation.
You can activate the desired profile using the -profile argument when launching the workflow.
NB. You need to update the queue name parameter to reflect your local settings, see how to edit the config below
The default settings for each profile are reported below:
Editing the profile configuration¶
To adjust the configuration you need to edit the nextflow.config file in the workflow folder.
The main parameters you may need to adjust are:
- ncpus: controls the number of threads requested for annotation
- max_local_jobs: controls the maximum number of concurrent jobs submitted in the local profile (when not submitting jobs to a scheduler)
- queue: the name of the queue to be used when submitting jobs
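To locate these settings before editing, a plain text search of the config file is usually enough. The sketch below assumes the parameter names appear literally in the file; `show_tunables` is a hypothetical helper, and the exact layout of nextflow.config is not assumed.

```shell
# Hypothetical helper: list the lines (with line numbers) of a config file
# that mention the three parameters described above.
show_tunables() {
    config_file=$1
    grep -nE 'ncpus|max_local_jobs|queue' "$config_file"
}

# Run from the GREEN-VARAN main folder; skipped if the file is not found
if [ -f workflow/nextflow.config ]; then
    show_tunables workflow/nextflow.config
fi
```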
Editing the annotation file schema¶
The annotation file schema contains the expected file names, repositories and annotation sources.
If you need to adjust it, you can modify the resources.conf file located in workflow/config in the GREEN-VARAN folder.
Available parameters for main workflow¶
--input INPUT_VCF
    Input VCF file(s), compressed and indexed.
    You can input multiple files from a folder using a quoted pattern like
    --input 'mypath/*.vcf.gz'
--build GENOME_BUILD
    Genome build.
    Accepted values: [GRCh37, GRCh38]
--out output_dir
    Output directory.
--scores SCORE_NAME
    Annotate prediction scores.
    Accepted values: [best, all, name]
    best: annotate ncER, FATHMM-MKL, ReMM
    all: annotate all scores
    name: annotate only the specified score(s) (can be a comma-separated list)
--regions REGIONS_NAME
    Annotate functional regions.
    Accepted values: [best, all, name]
    best: annotate TFBS, DNase, UCNE
    all: annotate all regions
    name: annotate only the specified region(s) (can be a comma-separated list)
--AF
    Annotate global AF from gnomAD genomes.
--greenvaran_config JSON_FILE
    A json config file for the GREEN-VARAN tool.
--greenvaran_dbschema JSON_FILE
    A json db schema file for the GREEN-VARAN tool.
--nochr
    Chromosome names in the input file do not have the chr prefix.
--prioritization_strategy
    Set the prioritization strategy [levels, pileup].
--resource_folder
    Specify a custom folder for the annotation files.
    Default is the resources folder in the GREEN-VARAN main folder.
--anno_toml TOML_FILE
    A custom toml annotation config file, as specified by the vcfanno tool.
    These annotations are added to those defined with scores, regions and AF.
--list_data
    Output the list of available scores / regions and the expected paths.
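For options with a fixed set of accepted values, a wrapper script can check them before launching a long run. The function below is a minimal sketch covering --build and --prioritization_strategy only; `validate_options` is a hypothetical helper and is not part of the workflow.

```shell
# Hypothetical helper: check --build and --prioritization_strategy values
# against the accepted values listed in the parameter reference above.
validate_options() {
    build=$1
    strategy=$2
    case $build in
        GRCh37|GRCh38) ;;
        *) echo "unsupported --build value: $build" >&2; return 1 ;;
    esac
    case $strategy in
        levels|pileup) ;;
        *) echo "unsupported --prioritization_strategy value: $strategy" >&2; return 1 ;;
    esac
    return 0
}

if validate_options GRCh38 levels; then
    echo "options accepted"
fi
```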