Pipeline
========

Pipeline function
-----------------

The application file is ``qnrsearch.py`` and can be called at the prompt if installed correctly (either its directory was added to the PATH variable or a symbolic link to the program was placed in ``~/bin``). Calling it without any arguments produces a humble error message along with usage instructions. It tries to conform to standard Linux command line tool conventions, with optional flags to modify its behavior. The options, and how the pipeline works, are described below.

Pipeline layout
---------------

The pipeline can be divided into four main sub-components:

1. HMMER: runs ``hmmsearch`` on the database(s) using a hidden Markov model for the gene(s) of interest.
2. Identification of interesting hits through a classification function that classifies hits based on the length of their alignment to the hidden Markov model together with their domain alignment bitscore, according to a (user modifiable) function. This step also extracts the sequences, either directly from the alignments produced by ``hmmsearch`` or by retrieving (sub)sequences from the databases used in the search.
3. Clustering of the identified sequences using ``blastclust``.
4. Alignment of each cluster to study sequence similarity visually. The clusters can also be aligned against reference gene sequences.

After each of these major steps the pipeline outputs several files, making it possible to resume long runs that for some reason went wrong, or to rerun parts of the pipeline with different settings. This gives the user the choice of running the first step once (usually the most time consuming) and then analysing the results by retrieving sequences at different domain scores and looking at the clustering separately, without having to redo the computationally intensive ``hmmsearch`` (depending on the command line switches used, this might not be the only time consuming step).
In the first step there are few settings for the user to make; the pipeline runs ``hmmsearch`` on the databases the user specifies and stores the output. For details on how ``hmmsearch`` operates, please refer to the `hmmsearch manual`_.

Throughout the second step there is a lot going on "under the hood". The extraction step focuses on the output from ``hmmsearch``. The pipeline goes through the hits that HMMER found and decides whether a ``hmmsearch`` hit is really interesting through the use of a classification function. Here the user can decide whether to extract the hit as it is aligned to the model by HMMER, to extract "a little more" than what aligned in HMMER by extending the region of sequence extracted from the source database (if available), or simply to extract the entire source sequences. The classification function is specially tailored to identify novel qnr-like genes, but can essentially be set to anything the user wants (see :doc:`modifications`).

On its standard settings the classification function works like this: every fragment that matches anything in HMMER has a length, L, and a bit score as a measure of how well it matches the model. The fragment is classified as an interesting sequence if the fragment bitscore is above a threshold determined by the classification function. It has the following parameters::

    k - slope
    m - intercept
    d - long sequence definition
    c - threshold for long sequences

A simplified view of the classification function looks like this (in a kind of pseudo Python)::

    classify(fragment):
        if fragment_score > (fragment_length * k + m) AND fragment_length < d:
            return true
        if fragment_score > c AND fragment_length >= d:
            return true
        else:
            return false

So if an example fragment produces a score higher than the minimum allowed for that fragment's length, it is classified as an interesting hit and is extracted. Such fragments are marked with ``*`` in the diagram below.
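The classification rule above can also be written as a small runnable Python sketch. Note that this is only an illustration, not the pipeline's actual code: the function name and signature are invented for the example, while the default parameter values are the ones listed under the command line options below.

```python
def classify(fragment_score, fragment_length,
             k=0.7778, m=-7.89, c=109.64, d=150.64):
    """Illustrative sketch: return True if a HMMER fragment hit is
    considered interesting under the default classification parameters.

    Defaults are taken from the pipeline's command line options; the
    function itself is not part of qnrsearch.py.
    """
    if fragment_length < d:
        # Short fragments: the threshold grows linearly with length
        return fragment_score > fragment_length * k + m
    # Long fragments (length >= d): fixed bit score cutoff
    return fragment_score > c
```

For example, a 120-residue fragment needs a bit score above roughly 120 * 0.7778 - 7.89 = 85.4 to classify as interesting, while any fragment of 150.64 residues or longer needs a score above the fixed cutoff 109.64.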
Fragments that are not classified as interesting are marked with an ``o`` in the diagram below::

                CLASSIFICATION FUNCTION
    bit score
        ^
        |        *       *
        |
        |    *         *
        |         /--------------------
        |      */    o     o
        |     /    o          slope: k
        |    /   o            intercept: m
        |   /
        |---------------------------> fragment length (L)

The third step takes the extracted sequences and clusters them using ``blastclust``. For this step the user can modify the clustering parameters to ensure a meaningful clustering. The fourth and final step performs multiple alignments of the individual clusters using MAFFT.

Command line options
--------------------

The following information is available by running the program with options ``-h`` or ``--help``::

    Options:
      --version             show program's version number and exit
      -h, --help            show this help message and exit
      -n N, --numcpu=N      integer - the number of CPUs, N, to use for
                            programs with that ability [default: N=4]
      -M MODEL, --hmm=MODEL
                            the MODEL to use in hmmsearch
                            [default: model.hmm]
      -H, --hmmsearchonly   boolean - only run hmmsearch on the databases with
                            the model specified and save results to file. Can
                            be combined with -E, -L or -A (e.g. -HEL).
                            [default: False]
      -E, --extracthitsonly
                            boolean - only extract sequences from hmmsearch
                            output that classify as potential hits and then
                            quit, writing results to disk for inspection. Can
                            be combined with -H, -L and -A (e.g. -EL).
                            [default: False]
      -L, --clusteringonly  boolean - only perform cLustering of previously
                            stored results, requires
                            retrieved_sequences.fasta and pickled.hsseq. Can
                            be combined with option -E or -A (e.g. -ELA).
                            [default: False]
      -A, --alignmentonly   boolean - only perform alignment of sequences
                            within previously created clusters. Can be
                            combined with option -L (e.g. -LA).
                            [default: False]
      -p PI, --percentidentity=PI
                            integer - the percent identity, PI, with which
                            blastclust will perform the clustering, range
                            3-100 [default: PI=90]
      -C CT, --coveragethreshold=CT
                            float - the length coverage threshold, CT, used in
                            blastclust, range 0.1-0.99 [default: CT=0.25]
      -k k, --classifyK=k   float - modify the classification function
                            parameter slope [default: k=0.7778]
      -m m, --classifyM=m   float - modify the classification function
                            parameter intercept [default: m=-7.89]
      -c c, --classifyC=c   float - modify the classification function
                            parameter long sequence fixed cutoff
                            [default: c=109.64]
      -d d, --classifyD=d   float - modify the classification function
                            parameter definition of long sequence (i.e. after
                            what fragment length to use fixed cutoff)
                            [default: d=150.64]
      -a PATH, --addrefseq=PATH
                            add reference database with sequences in FASTA
                            format at PATH. These sequences are added before
                            clustering to improve/simplify cluster analysis.
                            [default: not used]

      Advanced options:
        Caution: some of the options might negatively affect pipeline
        performance!

        -s, --noheuristics  boolean - turn hmmsearch HEURISTICS OFF (max
                            Sensitivity) (not recommended) [default: False]
        -D, --retrdb        boolean - retrieve sequences from original
                            database and not from hmmsearch output file.
                            Slower and not 'safe' for clustering or
                            alignment. Requires a cdbfasta index file for
                            each database [default: False]
        -l EXTENDLEFT, --extendleft=EXTENDLEFT
                            integer - number of residues/nucleotides to
                            extend the HMMER aligned domain with to the left.
                            Requires a cdbfasta index file for each database
                            [default: not used]
        -r EXTENDRIGHT, --extendright=EXTENDRIGHT
                            integer - number of residues/nucleotides to
                            extend the HMMER aligned domain with to the
                            right. Requires a cdbfasta index file for each
                            database [default: not used]
        --minlength=MINLENGTH
                            integer - minimum fragment length allowed,
                            anything below this will be rejected by the
                            classifier [default: 25]
        --minscore=MINSCORE
                            integer - minimum fragment bit score from HMMER3,
                            anything below this will be excluded when parsing
                            the HMMER3 output [default: 0]

      Developer options, use at own risk!:
        Caution: some of the options might negatively affect program
        execution and/or are generally not properly tested!

        --hmmsearch_outdir=OUTDIR
                            modify OUTput DIRectory for hmmsearch
                            [default: ./hmmsearchresults/]
        --resdir=RESDIR     modify the DIRectory where RESulting cluster
                            files are written
                            [default: ./results_clusters/]
        --alignseqpath=ALIGNSEQPATH
                            path to fasta file with sequences to align the
                            clusters against. It is recommended that there
                            are not more than a couple of sequences in this
                            file. The subroutine that performs the parsing is
                            hardcoded to take the five PMQR-variants!
                            [default: False]

Pipeline output folder structure
--------------------------------

It is recommended to run the pipeline in a separate, empty folder, so that resulting files and folders can be created without risk of overwriting anything (the pipeline overwrites existing files without asking). The following folder structure is created upon running the entire pipeline, assuming the program was executed in an empty directory called 'pipeline'::

    pipeline/
    pipeline/hmmsearchresults/
    pipeline/results_clusters/
    pipeline/pipeline_data/

The three created folders essentially contain the information obtained in each of the three major steps of the pipeline (the fourth/alignment step uses the same output folder as the third/clustering step). The folder 'hmmsearchresults' will contain the results from ``hmmsearch``.
The folder 'pipeline_data' will contain the temporary files used in the intermediary steps of the pipeline: files that enable resuming/restarting the clustering and multiple alignment steps, as well as a database file, 'retrieved_sequences.fasta', containing the sequences that classified as potential hits according to the classification criterion. This file might be of interest to the user at some point of the analysis, especially when investigating the singletons from the clustering results, since they do not create cluster files.

The last folder, 'results_clusters', will contain two types of files with the results of the entire pipeline run. The files 'identified_clusters' and 'identified_clusters.scores' contain a flatfile representation of the clusters, with one cluster per line containing sequence identifiers separated by spaces. The second file is very similar, but there each sequence has its domain score appended to the sequence identifier, for simple inspection of the HMMER scores of the sequences in different clusters (note that the reference sequences supplied with the ``-a`` or ``--addrefseq`` switch are not present in this file). This folder will also contain several files with the sequences for each cluster, together with multiple alignments of each cluster, and, if reference genes were specified using the command line switch ``--alignseqpath``, files with the clusters aligned against those reference genes. This is messy and not recommended.

Database preparation
--------------------

The databases to be searched must be in FASTA format. To prepare the databases it can be convenient to use the EMBOSS toolset to convert the sequences of your choice to FASTA format and translate all the sequences into amino acid sequence using the EMBOSS tool 'transeq'. Below follows a simple example of preparing a database for searching with the pipeline.
It is assumed that you have downloaded or otherwise acquired a database in a suitable flat-file format (for example, the 'nucleotide' database, nt, from GenBank can be downloaded in FASTA format from the NCBI FTP server). Assume a directory 'databases' contains a subdirectory 'genbank' that contains a FASTA format file, 'nt.fasta', as the database of interest. Use for example EMBOSS 'transeq' to translate the database into all six frames using the bacterial table (number 11). The following command could be run::

    [~/databases/genbank/]$ transeq -sequence nt.fasta -outseq nt.pfasta -frame=6 -table=11

If the user ever wishes to retrieve full-length sequences from this translated database (switch ``-D`` or ``--retrdb``) or extend the HMMER domain alignments (switches ``-l``, ``--extendleft`` or ``-r``, ``--extendright``), an index needs to be created. For this the pipeline uses the program 'cdbfasta'. Assuming that it is installed properly, creating an index of the database file is as easy as running the following command::

    [~/databases/genbank/]$ cdbfasta nt.pfasta

After successful completion of the two commands above, the following three files should be available in the ``~/databases/genbank/`` directory::

    nt.fasta
    nt.pfasta
    nt.pfasta.cidx

Note that the creation of the index is not mandatory and the pipeline will run just fine without it. However, to use the advanced functionality to extend alignments or retrieve full-length sequences from their source database, the use of ``cdbfasta``/``cdbyank`` is required.
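As a final note, the flatfile cluster representation described above (one cluster per line, sequence identifiers separated by spaces) is easy to post-process in a few lines of Python. The sketch below is only an illustration of reading that format; the function name is invented for the example and is not part of qnrsearch.py.

```python
def read_clusters(path):
    """Illustrative sketch: parse a flatfile with one cluster per line,
    where each line holds space separated sequence identifiers
    (the format of 'identified_clusters')."""
    clusters = []
    with open(path) as infile:
        for line in infile:
            identifiers = line.split()
            if identifiers:  # skip blank lines
                clusters.append(identifiers)
    return clusters
```

This makes it straightforward to, for example, sort clusters by size or look up which cluster a given sequence identifier landed in.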