Pipeline

Pipeline function

The application file is qnrsearch.py and can be called at the prompt if installed correctly (either its directory was added to the PATH-variable or a symbolic link to the program was placed in ~/bin). Calling it without any arguments will produce a humble error message along with usage instructions. It tries to conform to standard Linux command line tool usage with optional flags to modify its behavior. The options and how the pipeline works will now be described.

Pipeline layout

The pipeline can be divided into four main sub-components:

  1. HMMER: runs hmmsearch on database using a hidden Markov model for the gene(s) of interest.
  2. Identification of interesting hits through the use of a classification function that classifies hits based on their length of alignment to the hidden Markov model together with their domain alignment bitscore according to a (user modifiable) function. This step also extracts the sequences, either directly from the alignments produced by hmmsearch or retrieval of sub-/sequences from the databases used in the search.
  3. Cluster the identified sequences using blastclust
  4. Align each cluster to study sequence similarity visually. The clusters can also be aligned against reference gene sequences.

After each of these major steps the pipeline outputs several files so that it is possible to resume long runs that for some reason went wrong, or rerun parts of the pipeline with different settings. This grants the user the choice of running the first step once (most time consuming) and analysing the results by retrieving sequences of different domain scores and look at the clustering separately without having to redo the computationally intensive hmmsearch (depending on the command line switches used, this might not be the only time consuming step).

In the first step there are not much settings to be made by the user, it runs hmmsearch on the databases the user specifies and stores the output. For details on how hmmsearch operates, please refer to the hmmsearch manual.

Throughout the second step there are a lot of things going on “under the hood”. The extraction step focuses on the output from hmmsearch. The pipeline goes through the hits that hmmer found and makes decisions on whether a hmmsearch hit is really interesting through the use of a classification function. Here the user can decide whether to extract the hit as it is aligned to the model by HMMER, to extract “a little more” than what aligned in HMMER by extending the region of sequence to extract from the source database (if available), or finally just extract the entire source sequences. The classification function is specially tailored to identify novel qnr-like genes but can be set to anything the user wants, essentially (see Modifications). On its standard settings the classification function works like this:

Every fragment that matches anything in HMMER has a length, L, and a bit score as a measure of how well it matches the model. The fragment is classified as an interesting sequence if the fragment bitscore is above a threshold determined by the classification function.

It has the following parameters:

k - slope
m - intercept
d - long sequence definition
c - threshold for long sequences

A simplified view of the classification function looks like this (some kind of pseudo Python code going on here):

classify(fragment):
   if fragment_score > (fragment_length * k + m) AND fragment_length <  d:
       return true
   if fragment_score > c AND fragment_length >= d:
       return true
   else:
       return false

So if an example fragment produces a score higher than the minimum allowed for the current fragments length it is classified as an interesting hit and is extracted. Such fragments are marked with * in the diagram below. Fragments that are not classified as interesting are marked with an ‘o’ in the diagram below:

CLASSIFICATION FUNCTION
bit score
   ^
   |          *          *
   |
   |  *            *
   |       /--------------------
   |     */    o           o
   |     /           o              slope: k
   |    /     o                  intercept: m
   |   /
   |--------------------------->
         fragment length (L)

The third step then takes the extracted sequences and clusters them using blastclust. For this step the user can modify the clustering parameters to ensure a meaningful clustering.

The fourth and final step then performs multiple alignments of the individual clusters using MAFFT.

Command line options

The following information is available by running the program with options -h or --help:

Options:
  --version             show program's version number and exit
  -h, --help            show this help message and exit
  -n N, --numcpu=N      integer - the number of CPUs, N, to use for programs
                        with that ability [default: N=4]
  -M MODEL, --hmm=MODEL
                        the MODEL to use in hmmsearch [default: model.hmm]
  -H, --hmmsearchonly   boolean - only run hmmsearch on the databases with the
                        model specified and save results to file. Can be
                        combined with -E, -L or -A (e.g. -HEL). [default:
                        False]
  -E, --extracthitsonly
                        boolean - only extract sequences from hmmsearch output
                        that classify as potential hits and then quit, writing
                        results to disk for inspection. Can be combined with
                        -H, -L and -A (e.g. -EL). [default: False]
  -L, --clusteringonly  boolean - only perform cLustering of previously stored
                        results, requires retrieved_sequences.fasta and
                        pickled.hsseq. Can be combined with option -E or -A
                        (e.g. -ELA). [default: False]
  -A, --alignmentonly   boolean - only perform alignment of sequences within
                        previously created clusters. Can be combined with
                        option -L (e.g. -LA). [default: False]
  -p PI, --percentidentity=PI
                        integer - the percent identity, PI, with which
                        blastclust will perform the clustering, range 3-100
                        [default: PI=90]
  -C CT, --coveragethreshold=CT
                        float - the length coverage threshold, CT, used in
                        blastclust, range 0.1-0.99 [default: CT=0.25]
  -k k, --classifyK=k   float - modify the classification function parameter
                        slope [default: k=0.7778]
  -m m, --classifyM=m   float - modify the classification function parameter
                        intercept [default: m=-7.89]
  -c c, --classifyC=c   float - modify the classification function parameter
                        long sequence fixed cutoff [default: c=109.64]
  -d d, --classifyD=d   float - modify the classification function parameter
                        definition of long sequence (i.e. after what fragment
                        length to use fixed cutoff) [default: d=150.64]
  -a PATH, --addrefseq=PATH
                        add reference database with sequences in FASTA format
                        at PATH. These sequences are added before clustering
                        to improve/simplify cluster analysis. [default: not
                        used]

  Advanced options:
    Caution: some of the options might negatively affect pipeline
    performance!

    -s, --noheuristics  boolean - turn hmmsearch HEURISTICS OFF (max
                        Sensitivity) (not recommended) [default: False]
    -D, --retrdb        boolean - retrieve sequences from original database
                        and not from hmmsearch output file. Slower and not
                        'safe' for clustering or alignment. Requires a
                        cdbfasta index file for each database [default: False]
    -l EXTENDLEFT, --extendleft=EXTENDLEFT
                        integer - number of residues/nucleotides to extend the
                        HMMER aligned domain with to the left. Requires a
                        cdbfasta index file for each database [default: not
                        used]
    -r EXTENDRIGHT, --extendright=EXTENDRIGHT
                        integer - number of residues/nucleotides to extend the
                        HMMER aligned domain with to the right. Requires a
                        cdbfasta index file for each database [default: not
                        used]
    --minlength=MINLENGTH
                        integer - minimum fragment length allowed, anything
                        below this will be rejected by the classifier
                        [default: 25]
    --minscore=MINSCORE
                        integer - minimum fragment bit score from HMMER3,
                        anything below this will be excluded when parsing the
                        HMMER3 output [default: 0]

  Developer options, use at own risk!:
    Caution: some of the options might negatively affect program execution
    and/or are generally not properly tested!

    --hmmsearch_outdir=OUTDIR
                        modify OUTput DIRectory for hmmsearch [default:
                        ./hmmsearchresults/]
    --resdir=RESDIR     modify the DIRectory where RESulting cluster files are
                        written [default: ./results_clusters/]
    --alignseqpath=ALIGNSEQPATH
                        path to fasta file with sequences to align the
                        clusters against. It is recommended that there are not
                        more than a couple of sequences in this file. The
                        subroutine that performs the parsing is hardcoded to
                        take the five PMQR-variants! [default: False]

Pipeline output folder structure

It is recommended to run the pipeline in a separate, empty, folder so that resulting files and folders can be created without risk of overwriting anything (the pipeline overwrites existing files without asking). The following folder structure is created upon running the entire pipeline, assuming the program was exectued in an empty directory called ‘pipeline’:

pipeline/
pipeline/hmmsearchresults/
pipeline/results_clusters/
pipeline/pipeline_data/

The three created folders contain, essentially, the information obtained in each of the three major steps of the pipeline (the fourth/alignment step, uses the same output folder as the third/clustering step).

The folder ‘hmmsearchresults’ will contain the results from hmmsearch.

The folder ‘pipeline_data’ will contain the temporary files used in the intermediary step of the pipeline; files to enable resuming/restarting the clustering and multiple alignment step as well as a database file, ‘retrieved_sequences.fasta’, that will contain the sequences that classified as potential hits according to the classification criterion. This file might be of interest to the user at some point of the analysis, especially when investigating the singletons from the clustering results since they do not create cluster files.

The last folder, ‘results_clusters’, will contain two types of files with the results of the entire pipeline run. The files ‘identified_clusters’ and ‘identified_clusters.scores’ contain a flatfile representation of the clusters, with one cluster per line containing sequence identifiers separated by spaces, and very similarily in the other file but here instead the sequences have gotten their domain score appended to the sequence identifier for simple identification of HMMER scores of the sequences in different clusters (note that the reference sequences supplied with the -a or --addrefseq switches are not present in this file). Also in this last folder will be several files containing the sequences for each cluster together with multiple alignments of each cluster.

Additionally, files with clusters aligned against reference genes specified using the command line switch --alignseqpath. This is messy and not recommended.

Database preparation

The databases to be searched must be in FASTA format. To prepare the databases it can be convenient to use the EMBOSS toolset to convert the sequences of your choice to FASTA format and translate all the sequences into amino-acid sequence using the EMBOSS tool ‘transeq’. Below follows a simple example of preparing the database for search with the pipeline. It is assumed that you have downloaded or otherwise acquired a database in a suitable flat-file format (the database, ‘nucleotide’, nt, from GenBank can be downloaded in FASTA format from the NCBI FTP-server).

Assume a directory ‘databases’ contains a subdirectory ‘genbank’ that contains a FASTA format file, ‘nt.fasta’, as the database of interest. Use for example EMBOSS ‘transeq’ to translate the database into all six frames using the bacterial table (number 11). The following command could be run:

[~/databases/genbank/]$ transeq -sequence nt.fasta -outseq nt.pfasta -frame=6 -table=11

If the user ever wishes to retrieve full-length sequences from this translated database (switch -D or --retrseq) or extend the HMMER domain alignments (switches -l, --extendleft or -r, --extendright, an index needs to be created. For this the pipeline uses the program ‘cdbfasta’. Assuming that it is installed properly, creating an index of the database file is as easy as running the following command:

[~/databases/genbank/]$ cdbfasta nt.pfasta

After successful completion of the two above commands the following three files should be available in the ~/databases/ directory:

nt.fasta
nt.pfasta
nt.pfasta.cidx

Note that the creation of the index is not mandatory and the pipeline will run just fine without it. However, to use the advanced functionality to extend alignments or retrieve full-length sequences from their database, the use of cdbfasta/cdbyank is required.

Table Of Contents

Previous topic

Installation

Next topic

Using the pipeline (example)

This Page