Running ORE¶

Overview¶

Pre-processing data¶

Required inputs

Gene transcription start site (TSS) location with expression as Browser Extensible Data (BED)
Genotypes as Variant Call Format (VCF)

ORE uses the same inputs as FastQTL, bgzip with bcftools and index with tabix:

# Prepare the VCF
bgzip test.vcf
tabix -p vcf test.vcf.gz
# Prepare the expression BED file
bgzip test.bed
tabix -p bed test.bed.gz

Normalize the VCF using bcftools norm. Standardize every allele into its most parsimonious and left aligned form (so that appropriate allele frequencies can be obtained from population databases)

bcftools norm -f human_g1k_v37.fasta -Oz test.vcf.gz -o test.norm.vcf.gz

Download population databases¶

Register with ANNOVAR
You will get an email for downloading ANNOVAR. Download this and unzip/untar the file using the following commands (replacing USER_KEY with what was given in the email)

wget http://www.openbioinformatics.org/annovar/download/USER_KEY/annovar.latest.tar.gz
tar -xzvf annovar.latest.tar.gz

Download the databases (~20 Gb) with these commands

cd annovar
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ensGene humandb/
perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar gnomad_genome humandb/

Set the database directory as an environmental variable

DB_DIR="./human_db/"

Run in Python virtual environment¶

Among many other benefits, running in a virtual environment allows one to install ORE without administrator privileges (useful for working in shared scientific computing environments).

# Create the virtual environment
virtualenv venv_ore
# Enter the environment
source venv_ore/bin/activate
# make sure that pip points to the virtual environment python directory by install the latest version
pip install --upgrade pip
# install ORE
pip install ore
# leave the virtual environment
deactivate
# re-enter the virtual environment
source venv_ore/bin/activate

Specify parameters¶

Required arguments:

-v VCF, --vcf VCF

Location of VCF file. Must be tabixed!

-b BED, --bed BED

Gene expression file location. Must be tabixed!

Optional file locations:

-o OUTPUT, --output OUTPUT

Output prefix

--outlier_output OUTLIER_OUTPUT

Outlier filename

--enrich_file ENRICH_FILE

Output file for enrichment odds ratios and p-values

Optional outlier arguments:

--extrema Only the most extreme value is an outlier

--distribution DISTRIBUTION

Outlier distribution. Options: {normal,rank,custom}

--threshold THRESHOLD

Expression threshold for defining outliers. Must be greater than 0 for normal or (0,0.5) non-inclusive with rank. Ignored with custom

--max_outliers_per_id MAX_OUTLIERS_PER_ID

Maximum number of outliers per ID

Optional variant-related arguments:

--af_rare AF_RARE

AF cut-off below which a variant is considered rare (space separated list e.g., 0.1 0.05)

--af_vcf Use the VCF AF field to define an allele as rare.

--intracohort_rare_ac INTRACOHORT_RARE_AC

Allele COUNT to be used instead of intra-cohort allele frequency. (still uses af_rare for population level AF cut-off)

--gq GQ Minimum genotype quality each variant in each individual

--dp DP Minimum depth per variant in each individual

--aar AAR Alternate allelic ratio for heterozygous variants (provide two space-separated numbers between 0 and 1, e.g., 0.2 0.8)

--tss_dist TSS_DIST

Variants within this distance of the TSS are considered

--upstream Only variants UPstream of TSS

--downstream Only variants DOWNstream of TSS

Optional arguments for using ANNOVAR:

--annovar Use ANNOVAR to specify allele frequencies and functional class

--variant_class

Only variants in these classes will be considered. Options: {intronic,intergenic,exonic,UTR5,UTR3,splicing,upstream,ncRNA}

--exon_class Only variants with these exonic impacts will be considered. Options: {nonsynonymous,intergenic,nonframeshift,frameshift,stopgain,stoploss}

--refgene Filter on RefGene function.

--ensgene Filter on ENSEMBL function.

--annovar_dir ANNOVAR_DIR

Directory of the table_annovar.pl script

--humandb_dir HUMANDB_DIR

Directory of ANNOVAR data (refGene, ensGene, and gnomad_genome)

optional arguments:

-h, --help show this help message and exit

--version show program’s version number and exit

--processes PROCESSES

Number of CPU processes

--clean_run Delete temporary files from the previous run

Run¶

Run ORE using the desired parameters. Currently ORE creates many temporary files that allow for faster re-running or picking up in case of a run-time crash or error.

`-v VCF, --vcf VCF`
	Location of VCF file. Must be tabixed!
`-b BED, --bed BED`
	Gene expression file location. Must be tabixed!

`-o OUTPUT, --output OUTPUT`
	Output prefix
`--outlier_output OUTLIER_OUTPUT`
	Outlier filename
`--enrich_file ENRICH_FILE`
	Output file for enrichment odds ratios and p-values

`--extrema`	Only the most extreme value is an outlier
`--distribution DISTRIBUTION`
	Outlier distribution. Options: {normal,rank,custom}
`--threshold THRESHOLD`
	Expression threshold for defining outliers. Must be greater than 0 for normal or (0,0.5) non-inclusive with rank. Ignored with custom
`--max_outliers_per_id MAX_OUTLIERS_PER_ID`
	Maximum number of outliers per ID

`--af_rare AF_RARE`
	AF cut-off below which a variant is considered rare (space separated list e.g., 0.1 0.05)
`--af_vcf`	Use the VCF AF field to define an allele as rare.
`--intracohort_rare_ac INTRACOHORT_RARE_AC`
	Allele COUNT to be used instead of intra-cohort allele frequency. (still uses af_rare for population level AF cut-off)
`--gq GQ`	Minimum genotype quality each variant in each individual
`--dp DP`	Minimum depth per variant in each individual
`--aar AAR`	Alternate allelic ratio for heterozygous variants (provide two space-separated numbers between 0 and 1, e.g., 0.2 0.8)
`--tss_dist TSS_DIST`
	Variants within this distance of the TSS are considered
`--upstream`	Only variants UPstream of TSS
`--downstream`	Only variants DOWNstream of TSS

`--annovar`	Use ANNOVAR to specify allele frequencies and functional class
`--variant_class`
	Only variants in these classes will be considered. Options: {intronic,intergenic,exonic,UTR5,UTR3,splicing,upstream,ncRNA}
`--exon_class`	Only variants with these exonic impacts will be considered. Options: {nonsynonymous,intergenic,nonframeshift,frameshift,stopgain,stoploss}
`--refgene`	Filter on RefGene function.
`--ensgene`	Filter on ENSEMBL function.
`--annovar_dir ANNOVAR_DIR`
	Directory of the table_annovar.pl script
`--humandb_dir HUMANDB_DIR`
	Directory of ANNOVAR data (refGene, ensGene, and gnomad_genome)

`-h, --help`	show this help message and exit
`--version`	show program’s version number and exit
`--processes PROCESSES`
	Number of CPU processes
`--clean_run`	Delete temporary files from the previous run