GREEN-VARAN workflow
To perform small variant prioritization as described in the GREEN-DB manuscript, GREEN-VARAN needs some annotations to be already present in your input VCF (see Prioritization of small variants).
This Nextflow workflow automates the whole process: it first adds the required annotations and then runs the greenvaran annotation. The workflow has been tested on Nextflow v20.10 or later.
Usage
The typical usage scenario starts with a VCF file that already contains gene consequence annotations from SnpEff or bcftools. From the GREEN-VARAN tool main folder, you can then perform all annotations using the command below. This adds a minimum set of information to your VCF, including:
population allele frequency (AF) from gnomAD genomes v3.1.1 (GRCh38) or v2.1.1 (GRCh37)
functional regions overlap for TFBS, DNase peaks and UCNE
prediction score values for ncER, FATHMM-MKL, ReMM
GREEN-DB information on regulatory variants with prioritization annotation in level mode (see [prioritization modes](https://github.com/edg1983/GREEN-VARAN/tree/master#prioritization-of-small-variants))
nextflow workflow/main.nf \
-profile local \
--input input_file.vcf.gz \
--build GRCh38 \
--out results \
--scores best \
--regions best \
--AF \
--greenvaran_config config/prioritize_smallvars.json \
--greenvaran_dbschema config/greendb_schema_v2.5.json
If the requested annotation files are missing, they are automatically downloaded to the default location (the resources folder within the main GREEN-VARAN folder).
Note that --input can accept multiple vcf.gz files using a quoted pattern like 'inputdir/*.vcf.gz'.
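When the workflow is launched from a script rather than typed in a terminal, the file pattern has to reach Nextflow as a single, unexpanded argument. The sketch below assumes a hypothetical Python wrapper (`build_main_command` is not part of GREEN-VARAN) that assembles the command shown above as an argument list, and then prints the equivalent shell command with the quoting the shell would need.

```python
import shlex

ACCEPTED_BUILDS = {"GRCh37", "GRCh38"}  # values documented for --build


def build_main_command(input_pattern, build, out_dir, profile="local"):
    """Assemble the argument list for the main workflow (hypothetical helper)."""
    if build not in ACCEPTED_BUILDS:
        raise ValueError(f"unsupported build: {build!r}")
    return [
        "nextflow", "workflow/main.nf",
        "-profile", profile,
        "--input", input_pattern,  # kept as one argument, never expanded here
        "--build", build,
        "--out", out_dir,
        "--scores", "best",
        "--regions", "best",
        "--AF",
        "--greenvaran_config", "config/prioritize_smallvars.json",
        "--greenvaran_dbschema", "config/greendb_schema_v2.5.json",
    ]


cmd = build_main_command("inputdir/*.vcf.gz", "GRCh38", "results")
# shlex.join adds the quotes a shell needs around the glob pattern
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` from the GREEN-VARAN main folder. No shell is involved in that case, so the pattern needs no extra quoting.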
Add additional custom annotations
If you have additional custom annotations that you want to add to your VCF before greenvaran processing, you can configure them in a .toml file and pass this file to the workflow using --anno_toml.
A toml file is an annotation configuration file used by the vcfanno tool and is described in the [vcfanno repository](https://github.com/brentp/vcfanno).
A minimal example is reported below:
[[annotation]]
file="ExAC.vcf" #source file
fields = ["AF", "AF_nfe"] #INFO fields to be extracted from source
ops=["self", "max"] #How to treat source values
names=["exac_af", "exac_af_nfe_max"] #names used in the annotated file
[[annotation]]
file="regions_score.bed.gz"
columns = [4, 5] #When using BED or TSV files you can refer to values by column index
names=["regions_ids", "score_max"]
ops=["uniq","max"]
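Each entry in fields (or columns) is paired with one entry in ops and one in names, so a list of the wrong length is an easy mistake to make when writing these blocks by hand. The sketch below is a hypothetical helper (`render_annotation` is not part of GREEN-VARAN or vcfanno) that renders one block as text and, as an assumption about how strict the check should be, rejects lists of unequal length.

```python
import json


def render_annotation(file, names, ops, fields=None, columns=None):
    """Render one vcfanno [[annotation]] block as TOML text (hypothetical helper)."""
    if (fields is None) == (columns is None):
        raise ValueError("give exactly one of fields (VCF source) or columns (BED/TSV source)")
    key, source = ("fields", fields) if fields is not None else ("columns", columns)
    if not (len(source) == len(ops) == len(names)):
        raise ValueError(f"{key}, ops and names must have the same length")
    # For flat lists of strings or integers, json.dumps output is also a valid TOML array
    lines = [
        "[[annotation]]",
        f"file = {json.dumps(file)}",
        f"{key} = {json.dumps(source)}",
        f"ops = {json.dumps(ops)}",
        f"names = {json.dumps(names)}",
    ]
    return "\n".join(lines) + "\n"


toml_text = "\n".join([
    render_annotation("ExAC.vcf", ["exac_af", "exac_af_nfe_max"],
                      ["self", "max"], fields=["AF", "AF_nfe"]),
    render_annotation("regions_score.bed.gz", ["regions_ids", "score_max"],
                      ["uniq", "max"], columns=[4, 5]),
])
print(toml_text)
```

The printed text can be saved to a file and passed to the workflow with --anno_toml.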
Resources
To perform annotations, the GREEN-VARAN Nextflow workflow requires a set of supporting files.
By default, various resources are expected in the resources folder within the main tool folder.
You can pass an alternative resource folder using the --resource_folder option, but the same structure is expected in this folder.
The expected folder structure is as follows:
.
|-- SQlite
|   `-- GREEN-DB_v2.5.db
|-- GRCh37
|   `-- BED / TSV files used for GRCh37 genome build
`-- GRCh38
    `-- BED / TSV files used for GRCh38 genome build
Use the --list_data option to see the full list of available resources and the expected path for each one.
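Before pointing --resource_folder at a custom location, it can help to confirm that the top-level layout matches the tree above. The sketch below is a hypothetical pre-flight check (`missing_resources` is not part of GREEN-VARAN). It only looks for the database file and the folder for the chosen genome build, and it does not verify the individual BED / TSV files, whose names are reported by --list_data.

```python
from pathlib import Path


def missing_resources(resource_folder, build, db_name="GREEN-DB_v2.5.db"):
    """Return the expected top-level paths that are absent (hypothetical helper)."""
    root = Path(resource_folder)
    expected = [
        root / "SQlite" / db_name,  # GREEN-DB SQLite database
        root / build,               # folder holding BED / TSV files for this build
    ]
    return [path for path in expected if not path.exists()]


# "resources" is the default location, relative to the GREEN-VARAN main folder
for path in missing_resources("resources", "GRCh38"):
    print(f"missing: {path}")
```

An empty result means only that the folder skeleton is in place. The workflow itself still downloads any requested annotation files that are missing.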
Automated download
A supporting workflow is provided to automate data download for all resources included in the GREEN-DB collection. You can list the available resources and their download locations using:
nextflow workflow/download.nf --list_data
The recommended set of annotations can be downloaded to the default location using the following command. To download to an alternative resource folder, add the --resource_folder option.
nextflow workflow/download.nf \
-profile local \
--scores best \
--regions best \
--AF \
--db
Alternatively, individual files can be downloaded from the Zenodo repository. All file locations are listed in the GREENDB_collection.txt file under the resources folder.
Workflow configuration
The workflow has pre-configured profiles for the most popular schedulers (sge, lsf, slurm) and a local profile (local). These profiles determine how many download jobs can be submitted concurrently and the number of threads used for annotation.
You can activate the desired profile using the -profile argument when launching the workflow.
NB. When using a scheduler profile, you need to update the queue name parameter to reflect your local settings; see how to edit the configuration below.
The default settings for each profile are reported below:
Editing the profile configuration
To adjust the configuration, edit the nextflow.config file in the workflow folder.
The main parameters you may need to adjust are:
- ncpus: this controls the number of threads requested for annotation
- max_local_jobs: this controls the max number of concurrent jobs submitted in the local profile (when not submitting a job to a scheduler)
- queue: this is the name of the queue to be used when submitting jobs
Editing the annotation file schema
The annotation file schema contains the expected file names, repositories, and annotation sources.
If you need to adjust it, modify the resources.conf file located in workflow/config in the GREEN-VARAN folder.
Available parameters for main workflow
- --input INPUT_VCF
  - Input VCF file(s), compressed and indexed. You can input multiple files from a folder using quotes, like --input 'mypath/*.vcf.gz'
- --build GENOME_BUILD
  - Genome build. Accepted values: [GRCh37, GRCh38]
- --out output_dir
  - Output directory
- --scores SCORE_NAME
  - Annotate prediction scores. Accepted values: [best, all, name]
    - best: annotate ncER, FATHMM-MKL, ReMM
    - all: annotate all scores
    - name: annotate only the specified score(s) (can be a comma-separated list)
- --regions REGIONS_NAME
  - Annotate functional regions. Accepted values: [best, all, name]
    - best: annotate TFBS, DNase, UCNE
    - all: annotate all regions
    - name: annotate only the specified region(s) (can be a comma-separated list)
- --AF
  - Annotate global AF from gnomAD genomes
- --greenvaran_config JSON_FILE
  - A json config file for the GREEN-VARAN tool
- --greenvaran_dbschema JSON_FILE
  - A json db schema file for the GREEN-VARAN tool
- --nochr
  - Chromosome names in the input file do not have the chr prefix
- --prioritization_strategy
  - Set prioritization strategy [levels, pileup]
- --resource_folder
  - Specify a custom folder for the annotation files. Default is the resources folder in the GREEN-VARAN main folder
- --anno_toml TOML_FILE
  - A custom toml annotation config file, as specified by the vcfanno tool. Its annotations are added to those defined with scores, regions and AF.
- --list_data
  - Output the list of available scores / regions and the expected paths