Running ORE
===============================

..

Overview
~~~~~~~~~

.. image:: _static/ore_flowchart.png
   :align: left

Pre-processing data
~~~~~~~~~~~~~~~~~~~

Required inputs

1. Gene transcription start site (TSS) locations with expression values, in Browser Extensible Data (BED) format
2. Genotypes in Variant Call Format (VCF)


ORE uses the same inputs as FastQTL_. Compress both files with bgzip_ and index them with tabix_:

.. code-block:: bash

    # Prepare the VCF
    bgzip test.vcf
    tabix -p vcf test.vcf.gz
    # Prepare the expression BED file
    bgzip test.bed
    tabix -p bed test.bed.gz

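tabix expects position-sorted input, so an unsorted expression file needs to be sorted before it is compressed. The following is a minimal sketch, assuming a FastQTL-style BED file whose header line starts with ``#Chr``; the file names, gene IDs, sample IDs and values are made up for illustration:

.. code-block:: bash

    # Build a small, deliberately unsorted toy BED file (hypothetical data)
    printf '#Chr\tstart\tend\tID\tsampleA\tsampleB\n' > toy.bed
    printf '1\t2000\t2001\tgeneB\t0.3\t-1.2\n' >> toy.bed
    printf '1\t1000\t1001\tgeneA\t1.5\t0.1\n' >> toy.bed
    # Keep the header first, then sort the records by chromosome and start
    { head -n 1 toy.bed; tail -n +2 toy.bed | sort -k1,1 -k2,2n; } > toy.sorted.bed

The sorted file can then be compressed with bgzip and indexed with tabix in the same way as ``test.bed`` above.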

Normalize_ the VCF using `bcftools norm`_. This standardizes every allele into its most parsimonious, left-aligned form so that the appropriate allele frequencies can be obtained from population databases.

.. code-block:: bash

    bcftools norm -f human_g1k_v37.fasta -Oz test.vcf.gz -o test.norm.vcf.gz

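Because the VCF passed to ORE must be tabixed, the normalized file most likely needs its own index. A minimal sketch, assuming the output name from the ``bcftools norm`` command above:

.. code-block:: bash

    # Index the normalized VCF (file name taken from the bcftools norm example)
    NORM_VCF="test.norm.vcf.gz"
    if [ -f "$NORM_VCF" ]; then
        tabix -p vcf "$NORM_VCF"
    else
        echo "$NORM_VCF not found; run bcftools norm first" >&2
    fi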

.. _bgzip: http://www.htslib.org/doc/bgzip.html
.. _tabix: http://www.htslib.org/doc/tabix.html
.. _FastQTL: http://fastqtl.sourceforge.net
.. _Normalize: https://genome.sph.umich.edu/wiki/Variant_Normalization
.. _bcftools norm: http://www.htslib.org/doc/bcftools.html


Download population databases
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

1. Register with ANNOVAR_
2. You will receive an email with a download link. Download ANNOVAR and extract the archive using the following commands (replace USER_KEY with the key given in the email):

.. code-block:: bash

    wget http://www.openbioinformatics.org/annovar/download/USER_KEY/annovar.latest.tar.gz
    tar -xzvf annovar.latest.tar.gz

3. Download the databases (~20 GB) with these commands:


.. code-block:: bash

    cd annovar
    perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar refGene humandb/
    perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar ensGene humandb/
    perl annotate_variation.pl -buildver hg19 -downdb -webfrom annovar gnomad_genome humandb/

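4. Set the database directory as an environment variable (the download commands above place the databases in ``humandb/`` inside the ``annovar`` directory):

.. code-block:: bash

    export DB_DIR="./humandb/"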


.. _ANNOVAR: http://www.openbioinformatics.org/annovar/annovar_download_form.php


Run in Python virtual environment
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Among many other benefits, running in a virtual environment allows one to install ORE without administrator privileges (useful for working in shared scientific computing environments).


.. code-block:: bash

    # Create the virtual environment
    virtualenv venv_ore
    # Enter the environment
    source venv_ore/bin/activate
    # Upgrade pip to the latest version; this also confirms that pip points to the virtual environment
    pip install --upgrade pip
    # Install ORE
    pip install ore
    # Leave the virtual environment
    deactivate
    # Re-enter the virtual environment
    source venv_ore/bin/activate


Specify parameters
~~~~~~~~~~~~~~~~~~~

    Required arguments:
      -v VCF, --vcf VCF     Location of VCF file. Must be tabixed!
      -b BED, --bed BED     Gene expression file location. Must be tabixed!

    Optional file locations:
      -o OUTPUT, --output OUTPUT
                            Output prefix
      --outlier_output OUTLIER_OUTPUT
                            Outlier filename
      --enrich_file ENRICH_FILE
                            Output file for enrichment odds ratios and p-values

    Optional outlier arguments:
      --extrema             Only the most extreme value is an outlier
      --distribution DISTRIBUTION
                            Outlier distribution. Options:
                            {normal,rank,custom}
      --threshold THRESHOLD
                            Expression threshold for defining outliers. Must be
                            greater than 0 for normal or (0,0.5)
                            non-inclusive with rank. Ignored with custom
      --max_outliers_per_id MAX_OUTLIERS_PER_ID
                            Maximum number of outliers per ID

    Optional variant-related arguments:
      --af_rare AF_RARE
                            AF cut-off below which a variant is considered rare (space separated list e.g., 0.1 0.05)
      --af_vcf              Use the VCF AF field to define an allele as rare.
      --intracohort_rare_ac INTRACOHORT_RARE_AC
                            Allele COUNT to be used instead of intra-cohort allele
                            frequency. (still uses af_rare for population level AF
                            cut-off)
      --gq GQ
                            Minimum genotype quality for each variant in each individual
      --dp DP
                            Minimum depth per variant in each individual
      --aar AAR
                            Alternate allelic ratio for heterozygous variants
                            (provide two space-separated numbers between 0 and 1,
                            e.g., 0.2 0.8)
      --tss_dist TSS_DIST
                            Variants within this distance of the TSS are
                            considered
      --upstream            Only variants UPstream of TSS
      --downstream          Only variants DOWNstream of TSS

    Optional arguments for using ANNOVAR:
      --annovar             Use ANNOVAR to specify allele frequencies and
                            functional class
      --variant_class
                            Only variants in these classes will be considered. Options:
                            {intronic,intergenic,exonic,UTR5,UTR3,splicing,upstream,ncRNA}
      --exon_class
                            Only variants with these exonic impacts will be
                            considered. Options:
                            {nonsynonymous,intergenic,nonframeshift,frameshift,stopgain,stoploss}
      --refgene             Filter on RefGene function.
      --ensgene             Filter on ENSEMBL function.
      --annovar_dir ANNOVAR_DIR
                            Directory of the table_annovar.pl script
      --humandb_dir HUMANDB_DIR
                            Directory of ANNOVAR data (refGene, ensGene, and
                            gnomad_genome)

    optional arguments:
      -h, --help            show this help message and exit
      --version             show program's version number and exit
      --processes PROCESSES
                            Number of CPU processes
      --clean_run           Delete temporary files from the previous run


Run
~~~

Run ORE using the desired parameters. ORE currently writes many temporary files, which speed up re-runs and allow a run to resume after a run-time crash or error. Use ``--clean_run`` to delete the temporary files from the previous run.

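As an illustration, the options listed under "Specify parameters" could be combined as in the sketch below. It assumes that ``pip install ore`` provides an executable named ``ore`` on the ``PATH``; the file names and parameter values are placeholders, not recommended defaults:

.. code-block:: bash

    # Assumption: an `ore` executable is available inside the virtual environment.
    # Inputs follow the pre-processing examples; all values are illustrative only.
    ORE_ARGS="--vcf test.norm.vcf.gz --bed test.bed.gz --output ore_test"
    ORE_ARGS="$ORE_ARGS --distribution normal --threshold 2"
    ORE_ARGS="$ORE_ARGS --af_rare 0.05 0.01 --tss_dist 10000 --processes 4"
    # Print the full command for review before launching it
    echo "ore $ORE_ARGS"
    # Launch once the inputs are in place (left unquoted so the string
    # splits into separate arguments):
    # ore $ORE_ARGS

Printing the command first makes it easy to keep a record of the exact parameters used for each run.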

..  Run
    ~~~~~~~~~~~~~~~~~
    Re-run with other parameters
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Plot and interpret results
    ~~~~~~~~~~~~~~~~~~~~~~~~~