Using the pipeline (tutorial)

This section describes, through examples, how to use the pipeline to search for qnr antibiotic resistance genes in metagenomic data. The following subsections assume that the pipeline has been installed as per the instructions (see Installation), and that external software such as the EMBOSS suite (transeq in particular), blastclust and cdbfasta/cdbyank is available on the command line via the user’s $PATH variable.

In the following examples, the Linux command line prompt is shown with the current directory prepended to the dollar sign, like so:

[current_directory/]$
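
A quick way to verify that the required external programs are actually reachable is bash’s command -v builtin, which prints the path of each program it finds and nothing for a missing one:

[current_directory/]$ command -v transeq blastclust cdbfasta cdbyank

If any of the four names produces no output, revisit the installation instructions before continuing.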

Preparing the database

The data needs to be prepared before it can be used with the pipeline. Among the supplied example files there is a FASTA file containing nucleotide sequences of known PMQR genes, as well as the nucleotide sequence of the pentapeptide repeat protein MfpA from GenBank and a reverse translated variant of the same protein from PDB accession 2BM7_C. Some unknown PMQR sequences have been fragmented into shorter sequences.

To prepare the data, translate it into protein sequences in all six reading frames. Make sure you are in the tutorial/database/ directory. I usually use translation table 11 (bacterial) when translating, and store the translated sequences in a *.pfa (protein FASTA) file. This can be done by calling EMBOSS transeq on the FASTA file database.nfa:

[tutorial/database/]$ transeq -frame 6 -table 11 -sequence database.nfa -outseq database.pfa
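
As a quick sanity check, the translated file should contain six times as many sequences as the input, one per reading frame; counting the FASTA headers in both files with grep is one simple way to confirm this:

[tutorial/database/]$ grep -c '>' database.nfa database.pfa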

The database is now translated. To enable the database sequence retrieval features of the pipeline, we create a cdbfasta index of the database:

[tutorial/database/]$ cdbfasta database.pfa
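
The index (written to database.pfa.cidx by default) can be tested right away with cdbyank; here ‘some_sequence_id’ is a hypothetical placeholder for an actual FASTA identifier from database.pfa:

[tutorial/database/]$ cdbyank -a 'some_sequence_id' database.pfa.cidx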

The database is now prepared and ready for use.

Searching the database

We can now search the database using the pipeline and the supplied hidden Markov model constructed from the known PMQR sequences. This is best done from a clean directory; the tutorial/search/ directory, which contains the hidden Markov model file, is recommended for following these examples. The following call performs a default search on the database using the supplied model that is already available in this directory (model.hmm):

[tutorial/search/]$ qnrpipeline.py ../database/database.pfa

This call assumes there exists a hidden Markov model file, ‘model.hmm’, in the current directory. Note that in real-life applications you would probably supply at least one option, -M or --hmm (to specify which model to use), since this makes it much easier to manage which models are used.
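
For example, an equivalent call with the model path given explicitly would be:

[tutorial/search/]$ qnrpipeline.py -M model.hmm ../database/database.pfa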

After running a command like the one above, the output clusters are written to files in tutorial/search/results_clusters/ for manual inspection and analysis. Running the above command on the supplied example files should produce three clusters in the output folder, and inspecting the file identified_clusters should show that there were five groups, two of which were singletons. These groups correspond to:

cluster1 QnrB
cluster2 QnrA
cluster3 QnrS
singleton QnrC
singleton QnrD
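
To inspect the results, listing the output directory and reading identified_clusters directly works well (this assumes identified_clusters is written alongside the cluster files in results_clusters/):

[tutorial/search/]$ ls results_clusters/
[tutorial/search/]$ cat results_clusters/identified_clusters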

Inspecting the file cluster2.aligned, it is evident that a couple of sequence fragments from QnrS3 grouped with the other QnrS sequences. See if you can find where the other unknown fragments come from! Also note that even though the MfpA sequence is detected by hmmsearch, it was not classified as qnr by the classification function and was thus not included in the results.
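
One way to hunt for the fragments is to pull the FASTA headers out of all cluster alignment files at once, assuming the .aligned files are FASTA-formatted alignments:

[tutorial/search/]$ grep '>' results_clusters/cluster*.aligned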

This is the simplest use of the pipeline and uses all default values. It is possible to specify several database files on the command line if there are many databases the user would like to search, for example (the files referenced here do not exist in the tutorial folder):

$ qnrpipeline.py /databases/GenBank/*.pfa /databases/CAMERA/*.pfa

Some users might want to specify which model file should be used, and perhaps add some reference sequences to the clustering to more easily identify which kinds of sequences each cluster contains. The following call could be used; the -M switch supplies the path to a different hidden Markov model, and -a supplies the path to a FASTA file with protein sequences from known PMQR genes:

$ qnrpipeline.py -M ~/hmm_models/othermodel.hmm -a ~/qnrsequences/all_known_PMQR.pfa ~/databases/genbank/nt.pfa

Examples of advanced usage

For more advanced usage, the pipeline can be run in a stepwise fashion. Running the first step of the pipeline (hmmsearch) outputs the results of the hmmsearch of all databases supplied at the command line. These results are stored in ./hmmsearchresults/, one file for each supplied database file. With these results stored, it is possible to rerun the latter parts of the pipeline with, for example, different classification or clustering parameters.
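
For instance, a first run that performs only the hmmsearch step (using the -H switch described below) could look like this, with the results ending up in ./hmmsearchresults/ for later reuse:

$ qnrpipeline.py -H -M model.hmm ~/databases/genbank/nt.pfa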

Extraction

Assuming the pipeline has been run with the -H or --hmmsearchonly switch before (or without any switch at all), the user can rerun the later steps of the pipeline, for example with changed classification parameters. The following call uses previously stored hmmsearch result files, this time adjusting the classification parameters so that a regular cutoff is used instead of the two-part linear classifier that is the default. This means that only sequences above a minimum domain score threshold of our choice are selected; here the cutoff is 345.0. The call runs only the last three steps of the pipeline (-ELA means Extract, cLuster, Align, and could also be supplied as three individual switches, -E -L -A) and sets the classification function to a simple cutoff score at 345.0. The -d switch is required to set the minimum score; with -d 0, the cutoff supplied via -c is applied directly. The required input files are the hmmsearch output files created in the previous pipeline run:

$ qnrpipeline.py -ELA -d 0 -c 345.0 ./hmmsearchresults/*

Clustering

Another example of advanced usage is to rerun just the clustering step with different cluster parameters to obtain a different clustering of the results. The pipeline can be started directly from the clustering step, assuming that it has been run before, either in its entirety or at least the first two steps (e.g. using the -H and -E switches individually, or combined as -HE). The following command changes the clustering parameters from the default values and reruns the pipeline from the clustering step. The -LA switch tells the pipeline to rerun cLustering and Alignment, and the -p and -C switches modify the percent identity and length coverage thresholds of blastclust, respectively:

$ qnrpipeline.py -LA -p 75 -C 0.50

Sequence extension

This feature requires cdbfasta and cdbyank to be installed and available via the user’s Linux environment $PATH variable. It enables the user to instruct the pipeline to try to extend the aligned hits from hmmsearch by going back to the source database and retrieving “a little bit more” at the edges of the alignment. If the user instructs the pipeline to extract 25 more amino acid residues on either side, the pipeline will try its best to extend the retrieved sequence by up to 25 amino acid residues, if available in the source material.

It can be used like this, which will try to extend all hits by 25 amino acid residues in both the left (-l) and right (-r) directions, again assuming there is a file model.hmm in the current directory:

$ qnrpipeline.py -l 25 -r 25 ~/databases/genbank/nt.pfa

Multi CPU information

Note that both hmmsearch and blastclust have the ability to use standard POSIX threads for multiprocessor speed-ups, and that the command line switch -n or --numcpu allows the user to specify the number of processors/threads available to these parts of the pipeline; the default setting is to use four processors. For the programs’ implementations of multi-threading, please refer to their respective documentation.
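
For example, to run the tutorial search on eight threads instead of the default four:

[tutorial/search/]$ qnrpipeline.py -n 8 ../database/database.pfa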

Logging

The pipeline outputs a log to standard output during run time so that the user can inspect what the running program is up to. A copy of this output is also written to the file qnrsearch.log and stored for future reference. The pipeline will never overwrite an old logfile but instead appends the latest run to the end of it (note, however, that it WILL overwrite output files).
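
To follow the progress of a long-running search from another terminal, the standard tail command works well:

$ tail -f qnrsearch.log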
