The application file is qnrsearch.py and can be called at the prompt if installed correctly (either its directory was added to the PATH-variable or a symbolic link to the program was placed in ~/bin). Calling it without any arguments will produce a humble error message along with usage instructions. It tries to conform to standard Linux command line tool usage with optional flags to modify its behavior. The options and how the pipeline works will now be described.
The pipeline can be divided into four main sub-components:
After each of these major steps the pipeline outputs several files so that it is possible to resume long runs that for some reason went wrong, or rerun parts of the pipeline with different settings. This grants the user the choice of running the first step once (most time consuming) and analysing the results by retrieving sequences of different domain scores and look at the clustering separately without having to redo the computationally intensive hmmsearch (depending on the command line switches used, this might not be the only time consuming step).
In the first step there are not much settings to be made by the user, it runs hmmsearch on the databases the user specifies and stores the output. For details on how hmmsearch operates, please refer to the hmmsearch manual.
Throughout the second step there are a lot of things going on “under the hood”. The extraction step focuses on the output from hmmsearch. The pipeline goes through the hits that hmmer found and makes decisions on whether a hmmsearch hit is really interesting through the use of a classification function. Here the user can decide whether to extract the hit as it is aligned to the model by HMMER, to extract “a little more” than what aligned in HMMER by extending the region of sequence to extract from the source database (if available), or finally just extract the entire source sequences. The classification function is specially tailored to identify novel qnr-like genes but can be set to anything the user wants, essentially (see Modifications). On its standard settings the classification function works like this:
Every fragment that matches anything in HMMER has a length, L, and a bit score as a measure of how well it matches the model. The fragment is classified as an interesting sequence if the fragment bitscore is above a threshold determined by the classification function.
It has the following parameters:
k - slope
m - intercept
d - long sequence definition
c - threshold for long sequences
A simplified view of the classification function looks like this (some kind of pseudo Python code going on here):
classify(fragment):
if fragment_score > (fragment_length * k + m) AND fragment_length < d:
return true
if fragment_score > c AND fragment_length >= d:
return true
else:
return false
So if an example fragment produces a score higher than the minimum allowed for the current fragments length it is classified as an interesting hit and is extracted. Such fragments are marked with * in the diagram below. Fragments that are not classified as interesting are marked with an ‘o’ in the diagram below:
CLASSIFICATION FUNCTION
bit score
^
| * *
|
| * *
| /--------------------
| */ o o
| / o slope: k
| / o intercept: m
| /
|--------------------------->
fragment length (L)
The third step then takes the extracted sequences and clusters them using blastclust. For this step the user can modify the clustering parameters to ensure a meaningful clustering.
The fourth and final step then performs multiple alignments of the individual clusters using MAFFT.
The following information is available by running the program with options -h or --help:
Options:
--version show program's version number and exit
-h, --help show this help message and exit
-n N, --numcpu=N integer - the number of CPUs, N, to use for programs
with that ability [default: N=4]
-M MODEL, --hmm=MODEL
the MODEL to use in hmmsearch [default: model.hmm]
-H, --hmmsearchonly boolean - only run hmmsearch on the databases with the
model specified and save results to file. Can be
combined with -E, -L or -A (e.g. -HEL). [default:
False]
-E, --extracthitsonly
boolean - only extract sequences from hmmsearch output
that classify as potential hits and then quit, writing
results to disk for inspection. Can be combined with
-H, -L and -A (e.g. -EL). [default: False]
-L, --clusteringonly boolean - only perform cLustering of previously stored
results, requires retrieved_sequences.fasta and
pickled.hsseq. Can be combined with option -E or -A
(e.g. -ELA). [default: False]
-A, --alignmentonly boolean - only perform alignment of sequences within
previously created clusters. Can be combined with
option -L (e.g. -LA). [default: False]
-p PI, --percentidentity=PI
integer - the percent identity, PI, with which
blastclust will perform the clustering, range 3-100
[default: PI=90]
-C CT, --coveragethreshold=CT
float - the length coverage threshold, CT, used in
blastclust, range 0.1-0.99 [default: CT=0.25]
-k k, --classifyK=k float - modify the classification function parameter
slope [default: k=0.7778]
-m m, --classifyM=m float - modify the classification function parameter
intercept [default: m=-7.89]
-c c, --classifyC=c float - modify the classification function parameter
long sequence fixed cutoff [default: c=109.64]
-d d, --classifyD=d float - modify the classification function parameter
definition of long sequence (i.e. after what fragment
length to use fixed cutoff) [default: d=150.64]
-a PATH, --addrefseq=PATH
add reference database with sequences in FASTA format
at PATH. These sequences are added before clustering
to improve/simplify cluster analysis. [default: not
used]
Advanced options:
Caution: some of the options might negatively affect pipeline
performance!
-s, --noheuristics boolean - turn hmmsearch HEURISTICS OFF (max
Sensitivity) (not recommended) [default: False]
-D, --retrdb boolean - retrieve sequences from original database
and not from hmmsearch output file. Slower and not
'safe' for clustering or alignment. Requires a
cdbfasta index file for each database [default: False]
-l EXTENDLEFT, --extendleft=EXTENDLEFT
integer - number of residues/nucleotides to extend the
HMMER aligned domain with to the left. Requires a
cdbfasta index file for each database [default: not
used]
-r EXTENDRIGHT, --extendright=EXTENDRIGHT
integer - number of residues/nucleotides to extend the
HMMER aligned domain with to the right. Requires a
cdbfasta index file for each database [default: not
used]
--minlength=MINLENGTH
integer - minimum fragment length allowed, anything
below this will be rejected by the classifier
[default: 25]
--minscore=MINSCORE
integer - minimum fragment bit score from HMMER3,
anything below this will be excluded when parsing the
HMMER3 output [default: 0]
Developer options, use at own risk!:
Caution: some of the options might negatively affect program execution
and/or are generally not properly tested!
--hmmsearch_outdir=OUTDIR
modify OUTput DIRectory for hmmsearch [default:
./hmmsearchresults/]
--resdir=RESDIR modify the DIRectory where RESulting cluster files are
written [default: ./results_clusters/]
--alignseqpath=ALIGNSEQPATH
path to fasta file with sequences to align the
clusters against. It is recommended that there are not
more than a couple of sequences in this file. The
subroutine that performs the parsing is hardcoded to
take the five PMQR-variants! [default: False]
It is recommended to run the pipeline in a separate, empty, folder so that resulting files and folders can be created without risk of overwriting anything (the pipeline overwrites existing files without asking). The following folder structure is created upon running the entire pipeline, assuming the program was exectued in an empty directory called ‘pipeline’:
pipeline/
pipeline/hmmsearchresults/
pipeline/results_clusters/
pipeline/pipeline_data/
The three created folders contain, essentially, the information obtained in each of the three major steps of the pipeline (the fourth/alignment step, uses the same output folder as the third/clustering step).
The folder ‘hmmsearchresults’ will contain the results from hmmsearch.
The folder ‘pipeline_data’ will contain the temporary files used in the intermediary step of the pipeline; files to enable resuming/restarting the clustering and multiple alignment step as well as a database file, ‘retrieved_sequences.fasta’, that will contain the sequences that classified as potential hits according to the classification criterion. This file might be of interest to the user at some point of the analysis, especially when investigating the singletons from the clustering results since they do not create cluster files.
The last folder, ‘results_clusters’, will contain two types of files with the results of the entire pipeline run. The files ‘identified_clusters’ and ‘identified_clusters.scores’ contain a flatfile representation of the clusters, with one cluster per line containing sequence identifiers separated by spaces, and very similarily in the other file but here instead the sequences have gotten their domain score appended to the sequence identifier for simple identification of HMMER scores of the sequences in different clusters (note that the reference sequences supplied with the -a or --addrefseq switches are not present in this file). Also in this last folder will be several files containing the sequences for each cluster together with multiple alignments of each cluster.
Additionally, files with clusters aligned against reference genes specified using the command line switch --alignseqpath. This is messy and not recommended.
The databases to be searched must be in FASTA format. To prepare the databases it can be convenient to use the EMBOSS toolset to convert the sequences of your choice to FASTA format and translate all the sequences into amino-acid sequence using the EMBOSS tool ‘transeq’. Below follows a simple example of preparing the database for search with the pipeline. It is assumed that you have downloaded or otherwise acquired a database in a suitable flat-file format (the database, ‘nucleotide’, nt, from GenBank can be downloaded in FASTA format from the NCBI FTP-server).
Assume a directory ‘databases’ contains a subdirectory ‘genbank’ that contains a FASTA format file, ‘nt.fasta’, as the database of interest. Use for example EMBOSS ‘transeq’ to translate the database into all six frames using the bacterial table (number 11). The following command could be run:
[~/databases/genbank/]$ transeq -sequence nt.fasta -outseq nt.pfasta -frame=6 -table=11
If the user ever wishes to retrieve full-length sequences from this translated database (switch -D or --retrseq) or extend the HMMER domain alignments (switches -l, --extendleft or -r, --extendright, an index needs to be created. For this the pipeline uses the program ‘cdbfasta’. Assuming that it is installed properly, creating an index of the database file is as easy as running the following command:
[~/databases/genbank/]$ cdbfasta nt.pfasta
After successful completion of the two above commands the following three files should be available in the ~/databases/ directory:
nt.fasta
nt.pfasta
nt.pfasta.cidx
Note that the creation of the index is not mandatory and the pipeline will run just fine without it. However, to use the advanced functionality to extend alignments or retrieve full-length sequences from their database, the use of cdbfasta/cdbyank is required.