Pipeline
========

Pipeline function
-----------------

The application file is ``qnrsearch.py`` and can be called at the prompt if installed correctly (either its directory was added to the PATH variable or a symbolic link to the program was placed in ``~/bin``). Calling it without any arguments produces a humble error message along with usage instructions. It tries to conform to standard Linux command line tool conventions, with optional flags to modify its behavior. The options, and how the pipeline works, are described below.

Pipeline layout
---------------

The pipeline can be divided into four main sub-components:

1. HMMER: runs ``hmmsearch`` on the database(s) using a hidden Markov model for the gene(s) of interest.
2. Identification of interesting hits through a classification function that classifies hits based on the length of their alignment to the hidden Markov model together with their domain alignment bitscore, according to a (user modifiable) function. This step also extracts the sequences, either directly from the alignments produced by ``hmmsearch`` or by retrieving (sub)sequences from the databases used in the search.
3. Clustering of the identified sequences using ``blastclust``.
4. Alignment of each cluster to study sequence similarity visually. The clusters can also be aligned against reference gene sequences.

After each of these major steps the pipeline outputs several files, making it possible to resume long runs that for some reason went wrong, or to rerun parts of the pipeline with different settings. This gives the user the choice of running the first step once (usually the most time consuming) and then analysing the results by retrieving sequences at different domain scores and looking at the clustering separately, without having to redo the computationally intensive ``hmmsearch`` (depending on the command line switches used, this might not be the only time consuming step).
In the first step there are few settings for the user to make; the pipeline runs ``hmmsearch`` on the databases the user specifies and stores the output. For details on how ``hmmsearch`` operates, please refer to the `hmmsearch manual`_.

Throughout the second step there is a lot going on "under the hood". The extraction step focuses on the output from ``hmmsearch``. The pipeline goes through the hits that HMMER found and decides whether a ``hmmsearch`` hit is really interesting through the use of a classification function. Here the user can decide whether to extract the hit as it is aligned to the model by HMMER, to extract "a little more" than what aligned in HMMER by extending the region of sequence extracted from the source database (if available), or simply to extract the entire source sequences. The classification function is specially tailored to identify novel qnr-like genes, but can essentially be set to anything the user wants (see :doc:`modifications`).

On its standard settings the classification function works like this: every fragment that matches anything in HMMER has a length, L, and a bit score as a measure of how well it matches the model. The fragment is classified as an interesting sequence if the fragment bitscore is above a threshold determined by the classification function. It has the following parameters::

    k - slope
    m - intercept
    d - long sequence definition
    c - threshold for long sequences

A simplified view of the classification function looks like this (in a kind of pseudo Python)::

    classify(fragment):
        if fragment_score > (fragment_length * k + m) AND fragment_length < d:
            return true
        if fragment_score > c AND fragment_length >= d:
            return true
        else:
            return false

So if an example fragment produces a score higher than the minimum allowed for that fragment's length, it is classified as an interesting hit and is extracted. Such fragments are marked with ``*`` in the diagram below.
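The classification rule above can also be written as a small runnable Python sketch. Note that this is only an illustration, not the pipeline's actual code: the function name and signature are invented for the example, while the default parameter values are the ones listed under the command line options below.

```python
def classify(fragment_score, fragment_length,
             k=0.7778, m=-7.89, c=109.64, d=150.64):
    """Illustrative sketch: return True if a HMMER fragment hit is
    considered interesting under the default classification parameters.

    Defaults are taken from the pipeline's command line options; the
    function itself is not part of qnrsearch.py.
    """
    if fragment_length < d:
        # Short fragments: the threshold grows linearly with length
        return fragment_score > fragment_length * k + m
    # Long fragments (length >= d): fixed bit score cutoff
    return fragment_score > c
```

For example, a 120-residue fragment needs a bit score above roughly 120 * 0.7778 - 7.89 = 85.4 to classify as interesting, while any fragment of 150.64 residues or longer needs a score above the fixed cutoff 109.64.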
Fragments that are not classified as interesting are marked with an ``o`` in the diagram below::

                CLASSIFICATION FUNCTION
    bit score
        ^
        |        *       *
        |
        |    *         *
        |         /--------------------
        |      */    o     o
        |     /    o          slope: k
        |    /   o            intercept: m
        |   /
        |---------------------------> fragment length (L)

The third step takes the extracted sequences and clusters them using ``blastclust``. For this step the user can modify the clustering parameters to ensure a meaningful clustering. The fourth and final step performs multiple alignments of the individual clusters using MAFFT.

Command line options
--------------------

The following information is available by running the program with options ``-h`` or ``--help``::

    Options:
      --version             show program's version number and exit
      -h, --help            show this help message and exit
      -n N, --numcpu=N      integer - the number of CPUs, N, to use for
                            programs with that ability [default: N=4]
      -M MODEL, --hmm=MODEL
                            the MODEL to use in hmmsearch
                            [default: model.hmm]
      -H, --hmmsearchonly   boolean - only run hmmsearch on the databases with
                            the model specified and save results to file. Can
                            be combined with -E, -L or -A (e.g. -HEL).
                            [default: False]
      -E, --extracthitsonly
                            boolean - only extract sequences from hmmsearch
                            output that classify as potential hits and then
                            quit, writing results to disk for inspection. Can
                            be combined with -H, -L and -A (e.g. -EL).
                            [default: False]
      -L, --clusteringonly  boolean - only perform cLustering of previously
                            stored results, requires
                            retrieved_sequences.fasta and pickled.hsseq. Can
                            be combined with option -E or -A (e.g. -ELA).
                            [default: False]
      -A, --alignmentonly   boolean - only perform alignment of sequences
                            within previously created clusters. Can be
                            combined with option -L (e.g. -LA).
                            [default: False]
      -p PI, --percentidentity=PI
                            integer - the percent identity, PI, with which
                            blastclust will perform the clustering, range
                            3-100 [default: PI=90]
      -C CT, --coveragethreshold=CT
                            float - the length coverage threshold, CT, used in
                            blastclust, range 0.1-0.99 [default: CT=0.25]
      -k k, --classifyK=k   float - modify the classification function
                            parameter slope [default: k=0.7778]
      -m m, --classifyM=m   float - modify the classification function
                            parameter intercept [default: m=-7.89]
      -c c, --classifyC=c   float - modify the classification function
                            parameter long sequence fixed cutoff
                            [default: c=109.64]
      -d d, --classifyD=d   float - modify the classification function
                            parameter definition of long sequence (i.e. after
                            what fragment length to use fixed cutoff)
                            [default: d=150.64]
      -a PATH, --addrefseq=PATH
                            add reference database with sequences in FASTA
                            format at PATH. These sequences are added before
                            clustering to improve/simplify cluster analysis.
                            [default: not used]

      Advanced options:
        Caution: some of the options might negatively affect pipeline
        performance!

        -s, --noheuristics  boolean - turn hmmsearch HEURISTICS OFF (max
                            Sensitivity) (not recommended) [default: False]
        -D, --retrdb        boolean - retrieve sequences from original
                            database and not from hmmsearch output file.
                            Slower and not 'safe' for clustering or
                            alignment. Requires a cdbfasta index file for
                            each database [default: False]
        -l EXTENDLEFT, --extendleft=EXTENDLEFT
                            integer - number of residues/nucleotides to
                            extend the HMMER aligned domain with to the left.
                            Requires a cdbfasta index file for each database
                            [default: not used]
        -r EXTENDRIGHT, --extendright=EXTENDRIGHT
                            integer - number of residues/nucleotides to
                            extend the HMMER aligned domain with to the
                            right. Requires a cdbfasta index file for each
                            database [default: not used]
        --minlength=MINLENGTH
                            integer - minimum fragment length allowed,
                            anything below this will be rejected by the
                            classifier [default: 25]
        --minscore=MINSCORE
                            integer - minimum fragment bit score from HMMER3,
                            anything below this will be excluded when parsing
                            the HMMER3 output [default: 0]

      Developer options, use at own risk!:
        Caution: some of the options might negatively affect program
        execution and/or are generally not properly tested!

        --hmmsearch_outdir=OUTDIR
                            modify OUTput DIRectory for hmmsearch
                            [default: ./hmmsearchresults/]
        --resdir=RESDIR     modify the DIRectory where RESulting cluster
                            files are written
                            [default: ./results_clusters/]
        --alignseqpath=ALIGNSEQPATH
                            path to fasta file with sequences to align the
                            clusters against. It is recommended that there
                            are not more than a couple of sequences in this
                            file. The subroutine that performs the parsing is
                            hardcoded to take the five PMQR-variants!
                            [default: False]

Pipeline output folder structure
--------------------------------

It is recommended to run the pipeline in a separate, empty folder, so that resulting files and folders can be created without risk of overwriting anything (the pipeline overwrites existing files without asking). The following folder structure is created upon running the entire pipeline, assuming the program was executed in an empty directory called 'pipeline'::

    pipeline/
    pipeline/hmmsearchresults/
    pipeline/results_clusters/
    pipeline/pipeline_data/

The three created folders essentially contain the information obtained in each of the three major steps of the pipeline (the fourth/alignment step uses the same output folder as the third/clustering step). The folder 'hmmsearchresults' will contain the results from ``hmmsearch``.
The folder 'pipeline_data' will contain the temporary files used in the intermediary steps of the pipeline: files that enable resuming/restarting the clustering and multiple alignment steps, as well as a database file, 'retrieved_sequences.fasta', containing the sequences that classified as potential hits according to the classification criterion. This file might be of interest to the user at some point of the analysis, especially when investigating the singletons from the clustering results, since they do not create cluster files.

The last folder, 'results_clusters', will contain two types of files with the results of the entire pipeline run. The files 'identified_clusters' and 'identified_clusters.scores' contain a flatfile representation of the clusters, with one cluster per line containing sequence identifiers separated by spaces. The second file is very similar, but there each sequence has its domain score appended to the sequence identifier, for simple inspection of the HMMER scores of the sequences in different clusters (note that the reference sequences supplied with the ``-a`` or ``--addrefseq`` switch are not present in this file). This folder will also contain several files with the sequences for each cluster, together with multiple alignments of each cluster, and, if reference genes were specified using the command line switch ``--alignseqpath``, files with the clusters aligned against those reference genes. This is messy and not recommended.

Database preparation
--------------------

The databases to be searched must be in FASTA format. To prepare the databases it can be convenient to use the EMBOSS toolset to convert the sequences of your choice to FASTA format and translate all the sequences into amino acid sequence using the EMBOSS tool 'transeq'. Below follows a simple example of preparing a database for searching with the pipeline.
It is assumed that you have downloaded or otherwise acquired a database in a suitable flat-file format (for example, the 'nucleotide' database, nt, from GenBank can be downloaded in FASTA format from the NCBI FTP server). Assume a directory 'databases' contains a subdirectory 'genbank' that contains a FASTA format file, 'nt.fasta', as the database of interest. Use for example EMBOSS 'transeq' to translate the database into all six frames using the bacterial table (number 11). The following command could be run::

    [~/databases/genbank/]$ transeq -sequence nt.fasta -outseq nt.pfasta -frame=6 -table=11

If the user ever wishes to retrieve full-length sequences from this translated database (switch ``-D`` or ``--retrdb``) or extend the HMMER domain alignments (switches ``-l``, ``--extendleft`` or ``-r``, ``--extendright``), an index needs to be created. For this the pipeline uses the program 'cdbfasta'. Assuming that it is installed properly, creating an index of the database file is as easy as running the following command::

    [~/databases/genbank/]$ cdbfasta nt.pfasta

After successful completion of the two commands above, the following three files should be available in the ``~/databases/genbank/`` directory::

    nt.fasta
    nt.pfasta
    nt.pfasta.cidx

Note that the creation of the index is not mandatory and the pipeline will run just fine without it. However, to use the advanced functionality to extend alignments or retrieve full-length sequences from their source database, the use of ``cdbfasta``/``cdbyank`` is required.
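As a final note, the flatfile cluster representation described above (one cluster per line, sequence identifiers separated by spaces) is easy to post-process in a few lines of Python. The sketch below is only an illustration of reading that format; the function name is invented for the example and is not part of qnrsearch.py.

```python
def read_clusters(path):
    """Illustrative sketch: parse a flatfile with one cluster per line,
    where each line holds space separated sequence identifiers
    (the format of 'identified_clusters')."""
    clusters = []
    with open(path) as infile:
        for line in infile:
            identifiers = line.split()
            if identifiers:  # skip blank lines
                clusters.append(identifiers)
    return clusters
```

This makes it straightforward to, for example, sort clusters by size or look up which cluster a given sequence identifier landed in.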