Internal functions in module fluff.py

This section describes all functions located in the module “fluff.py”. They are mainly for internal use and in many situations lack the input sanity checks needed for general usage.

fluff.classify_qnr(sequence_length, domain_score, func='', longseqcutoff=75, longseqdef=85)

Classifies a sequence as Qnr or not.

Uses the domain_score and a hardcoded (or user-defined) function to classify a given sequence as Qnr or not.

Input:

sequence_length an integer with the sequence length.
domain_score    a float with the domain score for this sequence.
func            an optional function to determine classification.
longseqcutoff   the classification cutoff (minimum score) for long qnr 
                sequences.
longseqdef      the definition for long sequences.

Returns:

classification  a boolean determining whether it should be classified
                as Qnr or not.

Errors:

(none)
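The decision logic described above can be sketched roughly as follows. This is an illustration only, not the code in fluff.py: the name classify_qnr_sketch is hypothetical, the assumption that func maps a sequence length to a minimum score is a guess, and the linear fallback rule is a placeholder rather than the module's hardcoded function.

```python
def classify_qnr_sketch(sequence_length, domain_score, func=None,
                        longseqcutoff=75, longseqdef=85):
    """Return True if the hit would be classified as Qnr (illustrative only)."""
    if func is None:
        # Hypothetical placeholder rule for short sequences: the minimum
        # score grows linearly with length. NOT the hardcoded function
        # used by fluff.py.
        func = lambda length: longseqcutoff * length / float(longseqdef)
    if sequence_length >= longseqdef:
        # Long sequences are compared against the fixed cutoff.
        return domain_score >= longseqcutoff
    # Shorter sequences are compared against the length-dependent threshold.
    return domain_score >= func(sequence_length)
```

With the default arguments a 100-residue sequence needs a domain score of at least 75, while shorter sequences are judged by whatever rule func encodes.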
fluff.cleanup(tmpdir, exitcode=0)

Cleans up (moves) error.log and formatdb.log to the specified tmp directory.

Input:

tmpdir      name of the temporary directory to which to move the two 
            log files.
exitcode    the exitcode to give the shell, defaults to 0.

Returns:

(none)

Errors:

(none)      Will print error messages to stderr if
            the files do not exist or could not
            be moved.
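A minimal sketch of the move-and-report behaviour described above, using only the standard library. The helper name move_logs and its srcdir parameter are hypothetical; unlike the documented function, the sketch returns the list of moved files instead of handing an exit code to the shell.

```python
import os
import shutil
import sys

def move_logs(tmpdir, logfiles=("error.log", "formatdb.log"), srcdir="."):
    """Move the named log files from srcdir into tmpdir; return those moved."""
    moved = []
    for name in logfiles:
        src = os.path.join(srcdir, name)
        try:
            shutil.move(src, os.path.join(tmpdir, name))
            moved.append(name)
        except (IOError, OSError) as err:
            # Report the problem on stderr but carry on with the next file.
            sys.stderr.write("Could not move %s: %s\n" % (src, err))
    return moved
```

A caller that wants the documented exit behaviour could follow this with sys.exit(exitcode).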
fluff.count_nucleotides_from_hmmsearch(hmmsearchfiles)

Takes hmmsearch-files and summarizes the number of nucleotides searched.

Input:

hmmsearchfiles  a list of hmmsearchfiles (paths)

Returns:

total_residues  an integer with the number of nucleotides searched
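One way such a summary could be computed is sketched below. It assumes that each hmmsearch output contains a pipeline statistics line of the form "Target sequences: N (M residues ...)", as HMMER 3 prints; how fluff.py actually locates the count is not documented here. The helper name count_residues is hypothetical, and it takes the file contents as strings rather than paths to stay self-contained.

```python
import re

# Matches e.g. "Target sequences:      2  (1234 residues searched)".
RESIDUES_RE = re.compile(r"^Target sequences:\s+\d+\s+\((\d+) residues")

def count_residues(hmmsearch_texts):
    """Sum the residue counts found in a list of hmmsearch output strings."""
    total_residues = 0
    for text in hmmsearch_texts:
        for line in text.splitlines():
            match = RESIDUES_RE.match(line)
            if match:
                total_residues += int(match.group(1))
    return total_residues
```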
fluff.deuniqueify_seqids(sequenceIDs, readfile=False)

Removes unique identifiers appended to sequence IDs in fasta format by ‘uniqueify_seqids’.

Alternatively, it can read a file and de-unique:ify the sequence identifiers found in that file; the file formats understood are regular fasta and clustalw output.

Input:

sequenceIDs list of sequence IDs (parsed from blastclust
            output)
readfile    boolean determining whether sequenceIDs contains
            a filename to read and correct rather than
            a list of sequence IDs parsed from blastclust.

Returns:

sequenceIDs list of sequence IDs without unique identifiers
            right after the '>' symbol
None        if readfile was true nothing is returned; instead
            changes are written directly to file.

Errors:

ValueError  raised if no unique identifiers could be found 
            and removed
PathError   raised if there was some error regarding
            the file to be read.
fluff.extend_sequences_from_hmmsearch(hmmfilepath, seqid_list, min_score, dbpath, extendleft=0, extendright=0, func='', longseqcutoff=75, longseqdef=85)

Retrieves “a little more” than the matching domain hits from the hmmsearch output.

It also classifies the hits and discards those that do not meet the criteria. It prepends the fasta headers with source information according to source file name and folder name.

Input:

hmmfilepath     path to hmmsearch output file.
seqid_list      a list with sequence IDs to retrieve.
min_score       minimum score, used as a primitive first classification.
dbpath          system path to the source FASTA file/database.
extendleft      an integer with the number of amino acids to retrieve 
                off the lefthand edge of the hmmsearch alignment.
extendright     an integer with the number of amino acids to retrieve 
                off the righthand edge of the hmmsearch alignment.
func            (lambda) function for use in classify_qnr, if this is 
                "" then classify_qnr will use a hardcoded function.
longseqcutoff   longseqcutoff for use in classify_qnr.
longseqdef      long sequence definition for use in classify_qnr

Returns:

sequences       a list of sequences retrieved
message         a string with error message (if any)

Errors:

(none) 
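The idea of retrieving "a little more" than the aligned region can be illustrated with a small slicing helper. This is a sketch under assumptions: the name extend_hit is hypothetical, coordinates are taken to be 1-based and inclusive, and clamping at the sequence ends is a guess at sensible behaviour rather than something stated above.

```python
def extend_hit(sequence, start, end, extendleft=0, extendright=0):
    """Return the aligned region widened by the requested margins.

    start and end are 1-based, inclusive alignment coordinates.
    """
    left = max(start - 1 - extendleft, 0)            # clamp at sequence start
    right = min(end + extendright, len(sequence))    # clamp at sequence end
    return sequence[left:right]
```

For a hit covering positions 5 to 10, extendleft=2 and extendright=3 would return positions 3 to 13 of the source sequence.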
fluff.fixfasta(sequences)

Takes a list of sequences and tries to correct their formatting.

Designed to be used only in the function retrieve_sequences_from_hmmsearch.

Input:

sequences   list of sequences, each in one
            complete string with "\n" markers
            between identifier line and sequence.

Returns:

outsequences   list of sequences with hopefully
               better formatting.

Errors:

(none)     
fluff.limit_sequence_length(sequencefile, MAX_LINES=64)

Takes a sequence file and shortens all sequences in it to the MAX_LINES supplied.

Each long sequence is divided by inserting the sequence ID header “>seqid....” after the maximum sequence length position, thus creating several smaller sequence segments with the same sequence ID and header line. Recommended MAX_LINES is 64 (64 lines of 80 amino acid residues = 5120). As the function splits at “\n” characters it requires the fasta file to keep sequences on several lines - fasta files with sequences on a single line will NOT work.

Input:

sequencefile    filename string.
MAX_LINES       maximum number of lines for one sequence, defaults 
                to 64.

Returns:

(none)      Writes directly to disk to 'sequencefile.shortened'.

Errors:

PathError   raised if there is something wrong with path
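The splitting scheme described above can be sketched on an in-memory string as follows. The helper name limit_fasta_lines is hypothetical, and unlike the documented function it returns the shortened text instead of writing 'sequencefile.shortened' to disk.

```python
def limit_fasta_lines(fasta_text, max_lines=64):
    """Repeat each record's header so no segment exceeds max_lines lines."""
    out = []
    header = None
    lines_in_segment = 0
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # New record: remember its header and start counting afresh.
            header = line
            lines_in_segment = 0
            out.append(line)
        elif line.strip():
            if header is not None and lines_in_segment == max_lines:
                out.append(header)       # start a new segment, same header
                lines_in_segment = 0
            out.append(line)
            lines_in_segment += 1
    return "\n".join(out) + "\n"
```

Because the count is per line, a record stored on one very long line passes through unchanged, which matches the caveat about single-line fasta files.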
fluff.malign_clusters(clusters, resdir, refseqpath, workdir)

Reads clusters and, from file, reads the complete sequences for the sequence IDs specified in the clusters.

Note that it reads sequences from the filename ‘workdir’/retrieved_sequences.fasta. The function assumes that MAFFT is installed and accessible through the PATH variable or in the current directory. If refseqpath is empty the function does not align against any reference sequences. The names of the five plasmid-mediated Qnr genes are hardcoded around line 1029.

Input:

clusters    nested list structure with clusters.
resdir      directory to output results.
refseqpath  path to file with reference qnr sequence in fasta format. 
            Can be an empty string if no reference sequences are to be 
            aligned against.
workdir     working directory with temporary files 
            (unique_retrieved_sequences.fasta.shortened).

Returns:

(nothing)   On success writes multiple alignments to file: 
            resdir/cluster*.aligned
2 (int)     If there is a MAFFT error.  

Errors:

PathError   raised if retrieved_sequences.fasta could not be 
            opened/found.
fluff.parse_blastclust(filename)

Parses blastclust output into a nested list structure

Input:

filename    filename of blastclust output

Returns:

sequenceIDs list of sequence IDs with unique identifiers right after 
            the '>' symbol

Errors:

PathError   raised if the file does not exist
ValueError  raised if no unique identifiers could be found and removed
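A nested list structure of this kind could be built as in the sketch below. It assumes one cluster per line with whitespace-separated sequence IDs; the helper name parse_cluster_lines is hypothetical, and it takes an iterable of lines rather than a filename.

```python
def parse_cluster_lines(lines):
    """Turn an iterable of cluster lines into a list of lists of IDs."""
    clusters = []
    for line in lines:
        ids = line.split()
        if ids:                  # skip blank lines
            clusters.append(ids)
    return clusters
```

An open file object can be passed directly, since iterating over it yields lines.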
fluff.parse_hmmsearch_output(filename, MIN_SCORE=0)

Parses and retrieves sequence score, maximal domain score and sequence IDs from an hmmsearch output file.

Input:

filename    filename string to hmmsearch output
MIN_SCORE   a float with minimum score threshold

Returns:

returntuple  nested tuple with the following contents:
             score_id_tuples, dbpath,
             where score_id_tuples contains triplets with
             sequence score, domain score, sequence id,
             and dbpath contains a string with the path
             to the database searched by hmmsearch.

Errors:

ParseError  raised if the file does not conform with
            hmmsearch output format (i.e. does not begin
            with '# hmmsearch ::').
ValueError  raised if no sequence in the hmmsearch output
            is found with score >= MIN_SCORE.
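To make the shape of the return value concrete, the sketch below unpacks a made-up returntuple. All scores, sequence IDs and the database path are invented for illustration; only the nesting follows the description above.

```python
# Hypothetical return value, shaped as described above:
# ([(sequence score, domain score, sequence id), ...], dbpath)
returntuple = (
    [(210.3, 208.9, "seqA"), (55.1, 40.2, "seqB"), (90.0, 88.7, "seqC")],
    "/path/to/database.fasta",
)

score_id_tuples, dbpath = returntuple

# Split the triplets into parallel lists of IDs and domain scores.
seqid_list = [seqid for _, _, seqid in score_id_tuples]
domain_scores = [domain for _, domain, _ in score_id_tuples]
```

Parallel lists of this kind match the seqid_list and domain_scores arguments that several of the retrieval functions in this module accept.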
fluff.parse_sequence_positions_from_hmmsearch(filepath, seqid_list, min_score=0, func='', longseqcutoff=75, longseqdef=85)

Retrieves aligned sequence positions from the hmmsearch output file for domains that can be classified according to the classification function.

Uses classify_qnr implicitly.

Input:

filepath        path to hmmsearch output file.
seqid_list      list with sequence IDs to retrieve.
min_score       the minimum domain score, default=0.
func            (lambda) function used to classify fragments.
longseqcutoff   parameter for the classification function.
longseqdef      parameter for the classification function.

Returns:

seqsnutts   a dictionary with sequenceIDs as keys to lists containing
            tuples with position information for the domain alignment
            hit.

Errors:

PathError   raised if the hmmsearch output file can
            not be found at the specified location.
ValueError  raised if no sequences are found in the
            hmmsearch output file.
fluff.retrieve_sequences_from_db(dbpath, seqid_list, domain_scores, retr_seq_filepath, func='', longseqcutoff=75, longseqdef=85)

Retrieves the complete source sequences for hits in hmmsearch from their source database files.

It also classifies the hits and discards those that do not meet the criteria. It prepends the fasta headers with source information.

Input:

dbpath          path to database FASTA file.
seqid_list      a list with sequence IDs to retrieve.
domain_scores   a list with domain scores for the sequence IDs.
retr_seq_filepath   path to output file.
func            (lambda) function for use in classify_qnr, if this is 
                "" then classify_qnr will use a hardcoded function.
longseqcutoff   longseqcutoff for use in classify_qnr.
longseqdef      long sequence definition for use in classify_qnr

Returns:

sequences       a list with sequences
errmessages     a list with error messages (if any)

Errors:

(none) 
fluff.retrieve_sequences_from_fasta(fastapath, seqid_list)

Retrieves sequence(s) from a fasta file.

It is a Python implementation and therefore a bit slow; it is not intended for retrieving sequences from large databases.

Input:

fastapath   string with path to fasta file
seqid_list  list with sequence IDs to retrieve

Returns:

sequences   list of strings in fasta format with all sequences to 
            retrieve, containing linebreaks after sequence ID headers.

Errors:

PathError   raised if the supplied path does not exist or there is 
            some related path problem.
ValueError  raised if no sequences are found in the database.
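A plain Python retrieval of this kind might look like the sketch below. It assumes that a sequence ID is the first whitespace-delimited token after '>'; the helper name retrieve_from_fasta_text is hypothetical, and it works on a string rather than a file path to stay self-contained.

```python
def retrieve_from_fasta_text(fasta_text, seqid_list):
    """Return FASTA-formatted strings for the records whose ID is wanted."""
    wanted = set(seqid_list)
    sequences = []
    current = None        # lines of the record being collected, or None
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            if current is not None:
                sequences.append("\n".join(current))
            header_fields = line[1:].split()
            seqid = header_fields[0] if header_fields else ""
            current = [line] if seqid in wanted else None
        elif current is not None and line.strip():
            current.append(line)
    if current is not None:
        sequences.append("\n".join(current))
    if not sequences:
        raise ValueError("No sequences found for the requested IDs")
    return sequences
```

The whole input is scanned line by line in Python, which illustrates why this approach is slow for large databases.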
fluff.retrieve_sequences_from_hmmsearch(filepath, seqid_list, min_score, dbpath, func='', longseqcutoff=75, longseqdef=85)

Retrieves aligned sequence parts from the hmmsearch output file for domains with scores above min_score.

This function potentially requires a lot of memory!

Input:

filepath        path to hmmsearch output file.
seqid_list      list with sequence IDs to retrieve.
min_score       the minimum domain score.
dbpath          path to the database where the sequences were 
                hmmsearched from.
func            (lambda) function for use in classify_qnr.
longseqcutoff   longseqcutoff for use in classify_qnr.
longseqdef      long sequence definition for use in classify_qnr.

Returns:

seqsnutts   a list with sequences parsed from the alignment.

Errors:

PathError   raised if the hmmsearch output file can not be found at the
            specified location.
ValueError  raised if no sequences are found in the hmmsearch output 
            file.
fluff.run_blastclust(filename, PercentIdentity=90, CovThreshold=50, numcores=4)

Run formatdb to create a BLAST database, then run blastclust on that database to cluster all hits.

Input:

filename        filename with sequences with unique ids to cluster.
PercentIdentity  the percent identity to cluster with, default is 90 %.
CovThreshold    the minimum length coverage threshold for clustering, 
                default is 50 %
numcores        optional argument specifying the number of cores to run 
                blastclust on, default is 4 and 0 means all available.

Returns:

(None)      Writes output to a file, 'filename.clusters' that contains 
            all identified clusters on each row.

Errors:

PathError   raised if there is something wrong with the paths to output 
            or input files.
ValueError  raised if there is something wrong
fluff.uniqueify_seqids(sequences, filename)

Unique:ify sequence identifiers in fasta formatted sequences by adding an integer right after the sequence identifier at the first space.

Input:

sequences   filename with sequences in fasta format to unique:ify
filename    the output filename for unique sequences

Returns:

(None)      The list of strings in fasta format with all inputted 
            sequences, now with an added [integer] right after 
            the sequence identifier, is written out to a file for 
            further usage in the OS environment (i.e. in formatdb)

Errors:

ValueError  raised if the input filenames are not valid
PathError   raised if the input filename path is invalid
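The round trip between uniqueify_seqids and deuniqueify_seqids can be illustrated with the sketch below. The exact marker format used by fluff.py is not stated here, so the '_<n>' suffix is an assumption made for illustration, as are the helper names uniqueify_headers and deuniqueify_ids; both work on in-memory lists rather than files.

```python
import re

def uniqueify_headers(fasta_lines):
    """Append a running '_<n>' suffix to the ID on every header line."""
    out = []
    counter = 0
    for line in fasta_lines:
        if line.startswith(">"):
            counter += 1
            # Split at the first space so the suffix lands right after the ID.
            seqid, sep, rest = line.partition(" ")
            line = "%s_%d%s%s" % (seqid, counter, sep, rest)
        out.append(line)
    return out

def deuniqueify_ids(sequence_ids):
    """Strip the trailing '_<n>' suffix added by uniqueify_headers."""
    cleaned = [re.sub(r"_\d+$", "", seqid) for seqid in sequence_ids]
    if cleaned == list(sequence_ids):
        raise ValueError("No unique identifiers found to remove")
    return cleaned
```

Two records that share the ID seq1 become seq1_1 and seq1_2, which lets clustering tools tell them apart, and the suffix is stripped again once the clusters have been parsed.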
