Here are descriptions of all functions located in the module “fluff.py”. They are mainly for internal use and in many situations lack the input sanity checks that general usage would require.
Classifies a sequence as Qnr or not.
Uses the domain_score and a hardcoded (or user-defined) function to classify a given sequence as Qnr or not.
Input:
sequence_length an integer with the sequence length.
domain_score a float with the domain score for this sequence.
func an optional function to determine classification.
longseqcutoff the classification cutoff (minimum score) for long qnr
sequences.
longseqdef the definition for long sequences.
Returns:
classification a boolean determining whether it should be classified
as Qnr or not.
Errors:
(none)
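The classification logic described above can be sketched roughly as follows. This is only an illustration: the default values, the placeholder rule used when no func is given, and the assumption that func maps a sequence length to a minimum score are all hypothetical and are not taken from fluff.py.

```python
def classify_qnr(sequence_length, domain_score, func=None,
                 longseqcutoff=75.0, longseqdef=85):
    """Return True if the sequence should be classified as Qnr.

    The defaults and the fallback rule are illustrative assumptions,
    not the values hardcoded in fluff.py.
    """
    if func is None:
        # Placeholder rule (assumption): scale the cutoff linearly with length.
        func = lambda length: longseqcutoff * length / float(longseqdef)
    if sequence_length >= longseqdef:
        # Long sequences are compared against the fixed long-sequence cutoff.
        return domain_score >= longseqcutoff
    # Shorter fragments are compared against a length-dependent minimum score.
    return domain_score >= func(sequence_length)
```

A user-defined rule can be supplied as, for example, classify_qnr(40, 11.0, func=lambda length: 10.0).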
Cleans up (moves) error.log and formatdb.log to the specified tmp directory.
Input:
tmpdir name of the temporary directory to which to move the two
log files.
exitcode the exitcode to give the shell, defaults to 0.
Returns:
(none)
Errors:
(none) Will print error messages to stderr if
the files do not exist or could not
be moved.
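A minimal sketch of this kind of cleanup is shown below. The function name move_logs and its logfiles parameter are hypothetical, and the sketch returns the list of moved files instead of exiting with an exit code, so that it can be tried in isolation.

```python
import os
import shutil
import sys


def move_logs(tmpdir, logfiles=("error.log", "formatdb.log")):
    """Move each log file into tmpdir and report problems on stderr."""
    moved = []
    for name in logfiles:
        target = os.path.join(tmpdir, os.path.basename(name))
        try:
            shutil.move(name, target)
            moved.append(name)
        except (OSError, shutil.Error) as err:
            # No exception is propagated; the problem is only reported.
            print("Could not move %s: %s" % (name, err), file=sys.stderr)
    return moved
```

A caller that wants the behaviour described above could follow the call with sys.exit(exitcode).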
Takes hmmsearch-files and summarizes the number of nucleotides searched.
Input:
hmmsearchfiles a list of hmmsearchfiles (paths)
Returns:
total_residues an integer with the number of nucleotides searched
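As an illustration, the summation could look like the sketch below. It assumes HMMER3-style output in which the pipeline statistics contain a line of the form "Target sequences: N (M residues searched)"; the function name and the regular expression are assumptions and may need adjusting for other hmmsearch versions.

```python
import re

# Assumed line format: "Target sequences:    20364  (11329970 residues searched)"
RESIDUES = re.compile(r"^Target sequences:\s+\d+\s+\((\d+) residues")


def count_searched_residues(hmmsearchfiles):
    """Sum the residue counts reported in a list of hmmsearch output files."""
    total_residues = 0
    for path in hmmsearchfiles:
        with open(path) as handle:
            for line in handle:
                match = RESIDUES.match(line)
                if match:
                    total_residues += int(match.group(1))
    return total_residues
```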
Removes unique identifiers appended to sequence IDs in fasta format by ‘uniqueify_seqids’.
A second option is to read a file and de-unique:ify the sequence identifiers found in that file; the file formats understood are regular fasta and clustalw output.
Input:
sequenceIDs list of sequence IDs (parsed from blastclust
output)
readfile boolean determining whether sequenceIDs contains
a filename to read and correct rather than
a list of sequence IDs parsed from blastclust.
Returns:
sequenceIDs list of sequence IDs without unique identifiers
right after the '>' symbol
None if readfile was true nothing is returned, instead
changes are written directly to file.
Errors:
ValueError raised if no unique identifiers could be found
and removed
PathError raised if there was some error regarding
the file to be read.
Retrieves “a little more” than the matching domain hits from the hmmsearch output.
It also classifies the hits and discards those that do not meet the criteria. It prepends the fasta headers with source information according to source file name and folder name.
Input:
hmmfilepath path to hmmsearch output file.
seqid_list a list with sequence IDs to retrieve.
min_score a primitive first classification.
dbpath system path to the source FASTA file/database.
extendleft an integer with the number of amino acids to retrieve
off the lefthand edge of the hmmsearch alignment.
extendright an integer with the number of amino acids to retrieve
off the righthand edge of the hmmsearch alignment.
func (lambda) function for use in classify_qnr, if this is
"" then classify_qnr will use a hardcoded function.
longseqcutoff longseqcutoff for use in classify_qnr.
longseqdef long sequence definition for use in classify_qnr
Returns:
sequences a list of sequences retrieved
message a string with error message (if any)
Errors:
(none)
formatting.
Designed to be used only in the function retrieve_sequences_from_hmmsearch.
Input:
sequences list of sequences, each in one
complete string with a line break
between identifier line and sequence.
Returns:
outsequences list of sequences with hopefully
better formatting.
Errors:
(none)
it to the MAX_LINES supplied.
It is divided by inserting the sequence ID header “>seqid....” after the maximum sequence length position, thus creating several smaller sequence segments with the same sequence ID and header line. Recommended MAX_LINES is 64 (64 lines of 80 amino acid residues = 5120). As the function splits at line breaks, it requires the fasta file to keep sequences on several lines; fasta files with sequences on a single line will NOT work.
Input:
sequencefile filename string.
MAX_LINES maximum number of lines for one sequence, defaults
to 64.
Returns:
(none) Writes directly to disk to 'sequencefile.shortened'.
Errors:
PathError raised if there is something wrong with path
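The splitting scheme can be sketched as below. The sketch works on a list of lines in memory rather than reading 'sequencefile' and writing 'sequencefile.shortened', and the name shorten_records is hypothetical.

```python
def shorten_records(lines, max_lines=64):
    """Return FASTA lines with the header repeated every max_lines sequence lines."""
    out = []
    header = None
    count = 0  # sequence lines written since the last header
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            header = line
            count = 0
            out.append(line)
        elif line:
            if header is not None and count == max_lines:
                # Start a new segment that reuses the same header line.
                out.append(header)
                count = 0
            out.append(line)
            count += 1
    return out
```

Because the segments are counted in lines, a record whose whole sequence sits on one line is never split, which matches the limitation noted above.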
Reads clusters and from file reads complete sequences for the sequence IDs specified in the clusters.
Note that it reads sequences from the file ‘workdir’/retrieved_sequences.fasta. The function assumes that MAFFT is installed and accessible through the PATH variable or in the current directory. If refseqpath is empty, the function does not align against any reference sequences. The names of the five plasmid-mediated Qnr genes are hardcoded around line 1029.
Input:
clusters nested list structure with clusters.
resdir directory to output results.
refseqpath path to file with reference qnr sequence in fasta format.
Can be an empty string if no reference sequences are to be
aligned against.
workdir working directory with temporary files
(unique_retrieved_sequences.fasta.shortened).
Returns:
(nothing) On success writes multiple alignments to file:
resdir/cluster*.aligned
2 (int) If there is a MAFFT error.
Errors:
PathError raised if retrieved_sequences.fasta could not be
opened/found.
Parses blastclust output into a nested list structure
Input:
filename filename of blastclust output
Returns:
sequenceIDs list of sequence IDs with unique identifiers right after
the '>' symbol
Errors:
PathError raised if the file does not exist
ValueError raised if no unique identifiers could be found and removed
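As a rough sketch, and assuming the usual blastclust output layout with one cluster per line and whitespace-separated sequence IDs, the nested list can be built like this. The name parse_clusters is hypothetical, and the sketch accepts any iterable of lines (such as an open file) instead of a filename.

```python
def parse_clusters(lines):
    """Return a nested list with one inner list of sequence IDs per cluster."""
    clusters = []
    for line in lines:
        ids = line.split()
        if ids:  # skip empty lines
            clusters.append(ids)
    return clusters
```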
Parses and retrieves sequence score, maximal domain score and sequence IDs from an hmmsearch output file.
Input:
file filename string to hmmsearch output
MIN_SCORE a float with minimum score threshold
Returns:
returntuple nested tuple with the following contents;
score_id_tuples, dbpath,
where score_id_tuples contains triplets with
sequence score, domain score, sequence id,
and dbpath contains a string with the path
to the database searched by hmmsearch.
Errors:
ParseError raised if the file does not conform with
hmmsearch output format (i.e. does not begin
with '# hmmsearch ::').
ValueError raised if no sequence in the hmmsearch output
is found with score >= MIN_SCORE.
Retrieves aligned sequence positions from the hmmsearch output file for domains that can be classified according to the classification function.
Uses classify_qnr implicitly.
Input:
filepath path to hmmsearch output file.
seqid_list list with sequence IDs to retrieve.
min_score the minimum domain score, default=0.
func (lambda) function used to classify fragments.
longseqcutoff parameter for the classification function.
longseqdef parameter for the classification function.
Returns:
seqsnutts a dictionary with sequenceIDs as keys to list containing
tuples with position information for the domain alignment
hit.
Errors:
PathError raised if the hmmsearch output file cannot
be found at the specified location.
ValueError raised if no sequences are found in the
hmmsearch output file.
Retrieves complete source sequences to hits in hmmsearch from their source database files.
It also classifies the hits and discards those that do not meet the criteria. It prepends the fasta headers with source information.
Input:
dbpath path to database FASTA file.
seqid_list a list with sequence IDs to retrieve.
domain_scores a list with domain scores for the sequences IDs.
retr_seq_filepath path to output file.
func (lambda) function for use in classify_qnr, if this is
"" then classify_qnr will use a hardcoded function.
longseqcutoff longseqcutoff for use in classify_qnr.
longseqdef long sequence definition for use in classify_qnr
Returns:
sequences a list with sequences
errmessages a list with error messages (if any)
Errors:
(none)
Retrieves sequence(s) from a fasta file.
This is a pure Python implementation, so it is a bit slow; it is not intended for retrieving sequences from large databases.
Input:
fastapath string with path to fasta file
seqid_list list with sequence IDs to retrieve
Returns:
sequences list of strings in fasta format with all sequences to
retrieve, containing linebreaks after sequence ID headers.
Errors:
PathError raised if the supplied path does not exist or there is
some other problem with it.
ValueError raised if no sequences are found in the database.
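A minimal pure Python sketch of such a lookup is given below. It takes an iterable of lines (for instance an open file) instead of a path, treats the first whitespace-delimited token after '>' as the sequence ID, and uses the hypothetical name retrieve_sequences.

```python
def retrieve_sequences(fasta_lines, seqid_list):
    """Return the requested records as strings with line breaks after headers."""
    wanted = set(seqid_list)
    sequences = []
    current = None  # lines of the record being collected, or None if skipped
    for line in fasta_lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if current is not None:
                sequences.append("\n".join(current))
            fields = line[1:].split()
            seqid = fields[0] if fields else ""
            current = [line] if seqid in wanted else None
        elif current is not None and line:
            current.append(line)
    if current is not None:
        sequences.append("\n".join(current))
    if not sequences:
        raise ValueError("none of the requested sequences were found")
    return sequences
```

The whole file is scanned once, so the cost grows with the size of the database regardless of how few sequences are requested.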
Retrieves aligned sequence parts from the hmmsearch output file for domains with scores above min_score.
This function potentially requires a lot of memory!
Input:
filepath path to hmmsearch output file.
seqid_list list with sequence IDs to retrieve.
min_score the minimum domain score.
dbpath path to the database where the sequences were
hmmsearched from.
func (lambda) function for use in classify_qnr.
longseqcutoff longseqcutoff for use in classify_qnr.
longseqthresh longseqthresh for use in classify_qnr.
Returns:
seqsnutts a list with sequences parsed from the alignment.
Errors:
PathError raised if the hmmsearch output file cannot be found at the
specified location.
ValueError raised if no sequences are found in the hmmsearch output
file.
Run formatdb to create a BLAST database, then run blastclust on that database to cluster all hits.
Input:
filename filename with sequences with unique ids to cluster.
PercentIdentity the percent identity to cluster with, default is 90 %.
CovThreshold the minimum length coverage threshold for clustering,
default is 50 %
numcores optional argument specifying the number of cores to run
blastclust on, default is 4 and 0 means all available.
Returns:
(None) Writes output to a file, 'filename.clusters' that contains
all identified clusters on each row.
Errors:
PathError raised if there is something wrong with the paths to output
or input files.
ValueError raised if there is something wrong
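The two-step invocation could be sketched as follows. The flags are those of the legacy NCBI BLAST tools (formatdb -i/-p, blastclust -d/-o/-S/-L/-a/-p) and should be checked against the installed version. The helper names and the protein parameter are hypothetical, and the conversion of the coverage percentage to a fraction is an assumption.

```python
import subprocess


def build_cluster_commands(filename, percent_identity=90, cov_threshold=50,
                           numcores=4, protein=True):
    """Return the formatdb and blastclust command lines as argument lists."""
    is_protein = "T" if protein else "F"
    formatdb = ["formatdb", "-i", filename, "-p", is_protein]
    blastclust = [
        "blastclust",
        "-d", filename,                       # use the formatted database
        "-o", filename + ".clusters",         # one cluster per output row
        "-S", str(percent_identity),          # percent identity threshold
        "-L", str(cov_threshold / 100.0),     # coverage as a fraction (assumed)
        "-a", str(numcores),                  # number of cores
        "-p", is_protein,
    ]
    return formatdb, blastclust


def run_cluster(filename, **kwargs):
    """Run both commands; raises CalledProcessError if either one fails."""
    for command in build_cluster_commands(filename, **kwargs):
        subprocess.check_call(command)
```

Building the argument lists separately from running them makes it possible to inspect the commands without having the BLAST tools installed.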
Unique:ify sequence identifiers in fasta-formatted sequences by adding an integer right after the sequence identifier, at the first space.
Input:
sequences filename with sequences in fasta format to unique:ify
filename the output filename for unique sequences
Returns:
(None) The list of strings in fasta format with all
input sequences, now with added [integer] right after
the sequence identifier, is written out to a file for
further usage in the OS environment (i.e. in formatdb)
Errors:
ValueError raised if the input filenames are not valid
PathError raised if the input filename path is invalid
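To illustrate the idea, the sketch below appends a running integer to each identifier and shows the matching removal step. The "_<integer>" suffix format is an assumption made for this example, since the exact format used by fluff.py is not stated here. The function names are hypothetical, and the sketch works on lists of lines rather than files.

```python
import re


def uniqueify(lines):
    """Append _<n> to the identifier (text before the first space) of each header."""
    out = []
    counter = 0
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            counter += 1
            seqid, sep, rest = line.partition(" ")
            line = "%s_%d%s%s" % (seqid, counter, sep, rest)
        out.append(line)
    return out


def deuniqueify(seqid):
    """Strip the _<n> suffix again; raise ValueError if there is none."""
    stripped, count = re.subn(r"_\d+$", "", seqid)
    if count == 0:
        raise ValueError("no unique identifier found in %r" % seqid)
    return stripped
```

With this scheme two records that share an ID become distinguishable (for example ">a_1" and ">a_2"), which is the property formatdb needs.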