.. _tutorial:

########
Tutorial
########

This part of the documentation shows examples of how to apply Tentacle to
quantifying the contents of metagenomic samples. The sections below showcase
how Tentacle can be used in different mapping scenarios, depending on the
biological question. The tutorials showcased here come with prepared example
files that are available for download in section :ref:`download`. Note that
the tutorial tarball contains example files for all tutorial examples
showcased below, to provide a complete tutorial experience.

This tutorial is available online at:
http://bioinformatics.math.chalmers.se/tentacle/tutorial.html

Important information about files and file formats
**************************************************

The Tentacle pipeline requires three types of data files:

* Reads (query sequences)
* Reference sequences
* Annotation for the reference sequences

Reads
=====

Reads (or query sequences) are normally metagenomic reads produced with
high-throughput sequencing technologies such as Illumina. The reads can be
supplied to Tentacle as either FASTQ (with quality scores) or FASTA files.
The Tentacle pipeline can figure out what to do with either, but be aware
that quality filtering cannot be performed and will be skipped for input
files in FASTA format (as they lack quality information). It is important
to note that the input read files should preferably be located on a parallel
file system with high throughput that can handle the high load Tentacle will
put on it while running.

References
==========

Reference sequences are normally some kind of gene or predicted ORF data.
Tentacle expects reference sequences to be available in FASTA format, as this
is the format used when indexing the reference sequences for coverage
calculations.

.. note::

    It is important that the reference filename ends in '.fasta', as Tentacle
    expects this filename ending; using reference files not ending in
    '.fasta' will result in an error.

Certain mappers require that the reference sequences are available in a
custom database format (e.g. USEARCH uses ``*.udb`` and bowtie2 has several
indexes in ``*.bt2.*`` files). Note that you must prepare the reference
sequences (the FASTA file plus any database-related files) so that they share
the same **basename** (i.e. the filename up until the first dot "."). For
example, in the USEARCH case, create a tarball containing the following
files::

    $ ls
    references.fasta  references.udb
    $ tar -zvcf reference_tarball.tar.gz references.fasta references.udb

It does not matter what you call the tarball, as long as the basename of the
FASTA file and the database-related files is the same (in the above example
the basename is *references*). Remember to add the basename to the command
line arguments when running Tentacle so that the mapper can find the correct
files later; e.g. for USEARCH the command line flag is ``--usearchDBName``.
Refer to the command line help for each mapper for its specific flag name.
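To make the basename convention concrete, the following is a minimal Python
sketch of a helper for mappers that need a premade database: it checks the
'.fasta' ending and the shared basename before creating the tarball. The
script and its name are hypothetical illustrations and not part of
Tentacle::

    # make_reference_tarball.py -- hypothetical helper, not part of Tentacle.
    # Packages a reference FASTA file and its mapper database files into a
    # tarball, after checking that they share the same basename (the part of
    # the filename up to the first dot).
    import os
    import sys
    import tarfile

    def basename(path):
        # "references.udb" -> "references"; "references.fasta" -> "references"
        return os.path.basename(path).split(".", 1)[0]

    def make_tarball(fasta, db_files, tarball="reference_tarball.tar.gz"):
        if not fasta.endswith(".fasta"):
            sys.exit("Reference file must end in '.fasta': " + fasta)
        names = {basename(f) for f in [fasta] + list(db_files)}
        if len(names) != 1:
            sys.exit("All reference files must share one basename, got: "
                     + ", ".join(sorted(names)))
        with tarfile.open(tarball, "w:gz") as tar:
            for f in [fasta] + list(db_files):
                tar.add(f, arcname=os.path.basename(f))
        print("Wrote", tarball, "with basename", names.pop())

    if __name__ == "__main__":
        # e.g.: python make_reference_tarball.py references.fasta references.udb
        make_tarball(sys.argv[1], sys.argv[2:])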
Annotations
===========

The annotations are used in Tentacle when producing and computing the
coverage output. Tentacle computes the coverage (i.e. how large portions of
the reference sequences have reads aligned to them) and requires the
annotation of the reference sequences to produce that output. The annotation
file is a simple tab-separated text file with one annotated region per line.
The format of the annotation file is as follows::

    reference_name       start   end     strand      annotation
    [ascii; no space]    [int]   [int]   [+ or -]    [ascii; no space]

The first few lines of an example annotation file could read::

    scaffold3_2     899     3862    +   COG3321
    scaffold6_1     0       570     +   COG0768
    scaffold11_2    3       1589    -   NOG08628
    scaffold13_1    1       260     -   NOG21937
    scaffold13_1    880     1035    +   COG0110

As you can see in the example above, a reference sequence can occur on
multiple lines, with different annotations on each line. Note that the start
and end coordinates are 0-based (like Python).

.. note::

    The annotation file is required in order to compute the coverage of each
    annotated region of the reference sequences. Tentacle will not run
    without an annotation file.

There is a special case where the entire length of each reference sequence is
the actual annotated region (e.g. when the reference file contains entire
genes). In such cases it is easy to create a dummy annotation file that
annotates the entire length of each sequence in the reference FASTA file;
just put a + in the strand column.
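As an illustration, the following minimal Python sketch writes such a dummy
annotation file from a reference FASTA file, annotating each sequence over
its full length on the + strand. It assumes that the end coordinate is the
sequence length (Python-style, half-open interval) and reuses the sequence
name as the annotation label; adjust as needed for your data. The script is
a hypothetical helper and not part of Tentacle::

    # make_dummy_annotation.py -- hypothetical helper, not part of Tentacle.
    # Writes one annotation line per sequence in a reference FASTA file,
    # covering the entire sequence length on the + strand.
    import sys

    def sequence_lengths(fasta_path):
        """Yield (name, length) for each sequence in a FASTA file."""
        name, length = None, 0
        with open(fasta_path) as fasta:
            for line in fasta:
                line = line.strip()
                if line.startswith(">"):
                    if name is not None:
                        yield name, length
                    # Use the first whitespace-separated word as the name.
                    name, length = line[1:].split()[0], 0
                else:
                    length += len(line)
        if name is not None:
            yield name, length

    if __name__ == "__main__":
        # e.g.: python make_dummy_annotation.py references.fasta > dummy_annotation.tab
        for name, length in sequence_lengths(sys.argv[1]):
            # Assumes 0-based start and end == sequence length; the annotation
            # column here is simply the sequence name itself.
            print("\t".join([name, "0", str(length), "+", name]))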
.. _tutorial1:

TUTORIAL 1. Mapping reads to contigs (pBLAT)
*********************************************

This mapping scenario is relevant for quantifying the gene content of a
complete metagenome. In this tutorial the mapper ``pBLAT`` will be used.
However, the techniques displayed in this tutorial apply equally to other
mappers that do not require a premade database (i.e. that can map a
FASTA/FASTQ reads file to a FASTA reference), such as RazerS3. First of all,
make sure to install ``pBLAT`` and make the binary available in your
``$PATH``, e.g. by putting it in ``$TENTACLE_VENV/bin``.

Step-by-step tutorial
=====================

To begin this tutorial, extract the tutorial tarball, available from
:ref:`download`. It contains a folder named ``tutorial_1`` which contains the
following files that are relevant for this part of the tutorial::

    tutorial_1/
    tutorial_1/data/annotation_1.tab   tab-delimited file with annotation for contigs_1.fasta
    tutorial_1/data/annotation_2.tab   tab-delimited file with annotation for contigs_2.fasta
    tutorial_1/data/reads_1.fasta      reads in FASTA format
    tutorial_1/data/reads_2.fastq      reads in FASTQ format
    tutorial_1/data/contigs_1.fasta    contigs in FASTA format
    tutorial_1/data/contigs_2.fasta    contigs in FASTA format

In our example, we are mapping reads from two small sequencing projects back
to the contigs that were assembled from the same reads. One of the input read
files is in FASTQ format, and one is in FASTA.

Step 1: Setting up the mapping manifest
---------------------------------------

For Tentacle to know what to do, a *mapping manifest* must be created. The
manifest details which reads file should be mapped to which reference using
which annotation. By using a mapping manifest file, it is easy to go back to
old runs and inspect their mapping manifests to see what was actually run.
The format for the mapping manifest is simple; it consists of three columns
with absolute paths for the different files in the following order::

    {reads}    {reference}    {annotation}

Creating a mapping manifest is easy. The simplest way is probably to use the
standard GNU tools ``find`` and ``paste``. Assuming you are standing in the
``tutorial_1`` directory it could look like this::

    $ find `pwd`/data/r* > tmp_reads
    $ find `pwd`/data/c* > tmp_references
    $ find `pwd`/data/a* > tmp_annotations
    $ paste tmp_reads tmp_references tmp_annotations > mapping_manifest.tab
    $ rm tmp_*

What happens is that ``find`` lists all files matching the pattern ``r*`` in
the data directory under our current working directory (``pwd`` returns the
absolute path to the current working directory), i.e. all read files in the
data directory. We then do the same for the references (contigs in this case)
and the annotation files. After we have produced three files containing
listings of the absolute paths of all our data files, we paste them together
using ``paste`` into a tab-separated file ``mapping_manifest.tab``. This
technique can easily be extended to add files from different folders, for
example by appending (``>>``) to ``tmp_reads``.

There is no need to follow this specific procedure for the creation of the
mapping manifest; you are free to use whatever tools or techniques you want
as long as the end result is the same: the manifest must contain absolute
paths to all files, and each row should contain three entries with a read,
reference, and annotation file.

Step 2: Run Tentacle on cluster using Slurm
-------------------------------------------

.. sidebar:: Running Tentacle locally

    Tentacle can also be run locally, with several instances of the mapper
    running simultaneously on your computer. This is normally not recommended
    as it is not very efficient; the mapper instances will compete for
    resources (disk I/O, memory, CPU). To run Tentacle locally, call the file
    ``tentacle_local.py`` instead of ``tentacle_slurm.py``.

As ``pBLAT`` is only able to read FASTA format files, the reads file in FASTQ
format needs to be converted. Tentacle does this automatically when it
detects that we are using a mapper that does not accept FASTQ input; the user
does not have to do anything here.

For this tutorial we will use the default settings that ``pBLAT`` uses for
mapping. For a list of options that can be modified for the specific mapper
module used in Tentacle, run Tentacle with the ``--pblat --help`` command
line options. For options not available via the mapper module in Tentacle,
please refer to ``pBLAT``'s command line help.

First of all, make sure that the Python virtualenv that we created in the
:ref:`virtualenv` section is activated. Tentacle can be run on the command
line by calling the file ``tentacle_parallel.py``. If you installed Tentacle
according to the instructions in :ref:`installation` it should be available
in your ``$PATH`` variable as well. The call to Tentacle must minimally
include the required command line parameters (in the case of ``pBLAT`` the
only extra mapping-related parameter required is the mapping manifest). If we
use the mapping manifest that we created in Step 1, the command line could
look like this::

    $ tentacle_slurm.py --pblat --mappingManifest tutorial_1/mapping_manifest.tab --distributionNodeCount 2 --slurmTimeLimit 01:00:00 --slurmAccount ACCOUNT2014-0-000 --slurmPartition glenn

A call like this runs Tentacle using the :ref:`slurm launcher`, e.g. in a
cluster environment. Read more about running Tentacle in section
:ref:`Running Tentacle`. Note that you have to adjust the arguments for the
parameters ``--slurmAccount`` and ``--slurmPartition`` to fit the account and
partition names applicable in your specific cluster environment. If you want
to try out Tentacle running locally, see the sidebar in this section.
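Independently of how the run is launched, it can be useful to sanity-check
the mapping manifest against the requirements from Step 1 (three
tab-separated columns per row, absolute paths to existing files). The
following minimal Python sketch is one way to do so; it is a hypothetical
helper and not part of Tentacle::

    # check_manifest.py -- hypothetical helper, not part of Tentacle.
    # Verifies that each row of a mapping manifest has three tab-separated
    # columns, that every path is absolute, and that every file exists.
    import os
    import sys

    def check_manifest(manifest_path):
        ok = True
        with open(manifest_path) as manifest:
            for lineno, line in enumerate(manifest, start=1):
                if not line.strip():
                    continue  # ignore empty lines
                columns = line.rstrip("\n").split("\t")
                if len(columns) != 3:
                    print("line %d: expected 3 columns, found %d" % (lineno, len(columns)))
                    ok = False
                    continue
                for path in columns:
                    if not os.path.isabs(path):
                        print("line %d: not an absolute path: %s" % (lineno, path))
                        ok = False
                    if not os.path.isfile(path):
                        print("line %d: file does not exist: %s" % (lineno, path))
                        ok = False
        return ok

    if __name__ == "__main__":
        # e.g.: python check_manifest.py mapping_manifest.tab
        sys.exit(0 if check_manifest(sys.argv[1]) else 1)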
Step 3: Check results
---------------------

After a successful run, the Tentacle master process shuts down when all nodes
have completed their computations. The results are continuously written to
the output directory, which is either specified when starting the run using
the ``--outputDirectory`` command line option or defaults to
``tentacle_output``. The output directory contains one folder with log files
and one folder with the actual quantification results, as well as a file
called ``run_summary.txt`` that shows an overview of all jobs. The Tentacle
output format is further detailed in section :ref:`output`.

.. _tutorial2:

TUTORIAL 2. Mapping nucleotide reads to amino acid database (USEARCH)
***********************************************************************

This mapping scenario is typical when a reference database (ref DB) of known
genes exists (e.g. known antibiotic resistance genes). Since all metagenomic
samples need to be compared to the same reference genes, a single ref DB is
constructed beforehand. The steps displayed in this tutorial are relevant for
other mappers using a premade ref DB, such as Bowtie2, GEM, BLAST etc.

Introductory remarks
=====================

.. sidebar:: Modification of mapper call

    How the actual command line is constructed in Tentacle is defined in the
    mapper modules, in this case ``usearch.py``; the interested reader should
    have a look there to see how it is constructed. It is available for
    inspection in :ref:`usearch`.

In this example we will use USEARCH as the mapper because of its excellent
performance in the nucleotide-to-amino-acid mapping scenario (translated
search). As we are only interested in identifying the best matches we will
utilize the *usearch_local* algorithm and search both strands of the reads.
We are interested in genes with high sequence identity to the references and
will only pick the best hit. If we boil it down to what we would run on a
single machine, the command line might look like this::

    $ usearch -usearch_local reads.fasta -db references.udb -id 0.9 -strand both -query_cov 1.0

Step-by-step tutorial
=====================

To begin this tutorial, extract the tutorial tarball, available from
:ref:`download`. It contains a folder called ``tutorial_2`` which contains
the following files that are relevant for this part of the tutorial::

    tutorial_2/
    tutorial_2/data/annotation.tab     tab-delimited file with annotation for references.fasta
    tutorial_2/data/reads_1.fasta      reads in FASTA format
    tutorial_2/data/reads_2.fastq      reads in FASTQ format
    tutorial_2/data/references.fasta   references in FASTA format

Step 1: Preparing the ref DB
----------------------------

Prior to running Tentacle, we need to prepare the reference sequences in the
format that ``usearch`` uses for reference databases: ``udb``. Running the
following command in the ``tutorial_2`` directory will produce a ``usearch``
database that we can use::

    $ usearch -makeudb_usearch data/references.fasta -output data/references.udb

There is one more thing that is required: Tentacle needs not only the
database file (for ``usearch`` itself) but also the original FASTA file for
the references, as this is used when computing the coverage of the reference
sequences.
So package all of the reference files (database and FASTA) into one *tar.gz*
archive so that Tentacle can transfer both of them at once::

    $ tar -cvzf data/references.tar.gz data/references*

Note how the basename of all files is the same (this is important!). When we
call Tentacle later, we will have to specify the common basename using the
``--usearchDBName`` command line parameter (see section
:ref:`Run Tentacle usearch`).

Step 2: Setting up the mapping manifest
---------------------------------------

For Tentacle to know what to do, a *mapping manifest* must be created. The
manifest details which reads file should be mapped to which reference using
which annotation. By using a mapping manifest file, it is easy to go back to
old runs and inspect their mapping manifests to see what was actually run.
The format for the mapping manifest is simple; it consists of three columns
with absolute paths for the different files in the following order::

    {reads}    {reference}    {annotation}

Creating a mapping manifest is easy. The simplest way is probably to use the
standard GNU tools ``find`` and ``paste``, as in the previous tutorial.
However, when a single reference database is to be used there is an extra
step to ensure that there are as many lines with the path to the reference
database and the annotation file as there are read files to be mapped.
Assuming you are standing in the ``tutorial_2`` directory it could look like
this::

    $ find `pwd`/data/reads* > tmp_reads
    $ find `pwd`/data/references.tar.gz | awk '{for(i=0;i<2;i++)print}' > tmp_references
    $ find `pwd`/data/annotation.tab | awk '{for(i=0;i<2;i++)print}' > tmp_annotations
    $ paste tmp_reads tmp_references tmp_annotations > mapping_manifest.tab
    $ rm tmp_*

What happens is that ``find`` lists all files matching the pattern ``reads*``
in the data directory under our current working directory (``pwd`` returns
the absolute path to the current working directory), i.e. all read files in
the data directory. For the references and annotations it is a bit different
in this use case with a single reference database and an accompanying single
annotation file: in the example above we pipe the output from ``find`` via
``awk`` to duplicate the line with the path to the reference tarball and the
annotation file (two copies, one per read file), so that we can paste all the
temporary files together and have one row for each read file. After we have
produced three files containing listings of the absolute paths of all our
data files, we paste them together using ``paste`` into a tab-separated file
``mapping_manifest.tab``. This technique can easily be extended to add files
from different folders, for example by appending (``>>``) to ``tmp_reads``.

There is no need to follow this specific procedure for the creation of the
mapping manifest; you are free to use whatever tools or techniques you want
as long as the end result is the same: the manifest must contain absolute
paths to all files, and each row should contain three entries with a read,
reference, and annotation file.
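For example, the same manifest could be produced with a short Python sketch
that writes one row per read file, repeating the single reference tarball and
annotation file on every row. The paths below match this tutorial and assume
you are standing in the ``tutorial_2`` directory; the script itself is a
hypothetical helper and not part of Tentacle::

    # make_manifest.py -- hypothetical helper, not part of Tentacle.
    # Writes a mapping manifest with one row per read file, all rows pointing
    # to the same reference tarball and annotation file (absolute paths).
    import glob
    import os

    reads = sorted(glob.glob("data/reads*"))
    reference = "data/references.tar.gz"
    annotation = "data/annotation.tab"

    with open("mapping_manifest.tab", "w") as manifest:
        for reads_file in reads:
            row = [os.path.abspath(reads_file),
                   os.path.abspath(reference),
                   os.path.abspath(annotation)]
            manifest.write("\t".join(row) + "\n")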
.. _Run Tentacle usearch:

Step 3: Run Tentacle
--------------------

In this example we will map reads to a common reference database using the
mapper ``usearch``. Assuming we want to find the best alignment for each read
to the reference using a 90% identity threshold, the command line for
Tentacle/USEARCH could be the following. Assume you are standing in the
``tutorial_2`` directory::

    $ tentacle_slurm.py --usearch --usearchDBName references.fasta --usearchID 0.9 --mappingManifest mapping_manifest.tab --distributionNodeCount 2

The call to Tentacle when using ``usearch`` must minimally include the
following command line arguments:

* ``--mappingManifest``
* ``--usearch``
* ``--usearchDBName``

For more information about the available command line arguments, call
Tentacle with the ``--help`` argument to display a list of all available
options.

Step 4: Check results
---------------------

After a successful run, the Tentacle master process shuts down after all
nodes have completed their computations. The results are continuously written
to the output directory, which is either specified when starting the run
using the ``--outputDirectory`` command line option or defaults to
``tentacle_output``. The output directory contains one folder with log files
and one folder with the actual quantification results. The Tentacle output
format is further detailed in section :ref:`output`.

Other mapping scenarios
***********************

Different mappers are best suited for different mapping tasks. With Tentacle
it is possible to select the mapper that works best for your specific mapping
scenario. The table below lists some scenarios and examples of which mappers
might be best suited.

============================ ===================== =============================================
Scenario                     Mapper(s)             Comments
============================ ===================== =============================================
Reads to annotated contigs   pBLAT, RazerS3        Many small "reference" files, potentially
                                                   different for each reads file (e.g. assembled
                                                   contigs). No precomputed reference DB.
Reads to nt reference        USEARCH, GEM, Bowtie2 GEM works well with very large reference DBs
Reads to aa reference        USEARCH               BLASTX-like scenario, *translated search*
============================ ===================== =============================================