kpLogo: k-mer probability logo (version 1.0)
    -by Xuebing Wu (wuxbl@wi.mit.edu), Bartel lab, Whitehead Institute

    Identify statistically enriched/depleted short sequences of length k (kmer) at every position
    in a set of aligned sequences, weighted or unweighted. Degenerate nucleotides can be allowed.

Usage: kpLogo input_file [options]

Options

Type of analysis
  (default)            activated by default. Identify significant kmers using Binomial test
                       - input can be fasta/raw/tabular format
  -ranked              Wilcoxon rank-sum test (i.e. Mann-Whitney U test) on ranked sequences
                       - input can be fasta/raw/tabular format, but needs to be sorted
  -weighted            Two-sample Student's t test on weighted sequences
                       - input can only be tabular format, does not need to be sorted
  -predict prefix      use significant kmers from a previous run (-o prefix) to score input sequences
  -gradient INT        divide the data into ranked fractions and perform kpLogo on each fraction
  -simple              only output frequency and information content logo
  -pwm                 visualize a pwm

Input
  -alphabet STR        alphabet for generating kmers, default=ACGT, case insensitive
                       note: 'dna' is equivalent to 'ACGT', 'U' will be converted to 'T'
                             'protein' is equivalent to 'ACDEFGHIKLMNPQRSTVWY'
  -seq INT             for tabular input: sequences are in column INT. Default 1
  -weight INT          for tabular input: weights are in column INT. Default 2
  -region n1,n2        only consider subsequences from position n1 to n2 (start at 1).
                       non-positive numbers interpreted as distance from the end
  -select a,b,c,...    keep sequences containing any of specified kmers a, b, c, etc
                       kmer format: seq:position:shift, e.g. CNNC:47:0
  -remove a,b,c,...    remove sequences containing any of specified kmers.
  -cdf a,b,c,...       perform KS test and generate CDF plots for specified kmers (-weighted mode only)
  -fix maxFreq         fix a position with a specific residue if it occurs in more than maxFreq of the sequences
                       fixed residues will be plotted as 1.1x of height of the position with the highest total height

Kmer counting
  -k INT               use fixed kmer length INT
  -max_k INT           consider all kmers of length 1,2,...,INT. default=4
  -shift INT           max shift (to right) allowed for kmer positions
  -max_shift INT       consider shift from 0 to INT, default=0, i.e. no shift
  -degenerate STR      alphabet to use for degenerate kmers. Subset of all possible IUPAC DNA
                       residues (ACGTRYMKWSBDHVN, equivalent to 'all'). One can use ACGTN to
                       search gapped-kmers. Only works for DNA/RNA sequences
  -gapped              allow gapped kmers, equivalent to '-degenerate ACGTN'
  -pair                also test all possible pairs of positional monomers

Statistics & output
  -o STR               prefix for all output files, default=kpLogo
  -minCount NUM        minimum number of sequences to have this kmer to include in output
                       if smaller than 1, treat as fraction of sequences (default=5)
  -p FLOAT             p-value cut-off, default=1.01 (i.e. output all possible kmers)
  -pc FLOAT            Bonferroni corrected p-value cut-off, default=0.05
  -FDR                 adjust p value by FDR method (default is Bonferroni correction)
  -startPos INT        re-number position INT (1,2,3,..) as 1. The position before it will be -1
  -last_residual       use a kmer's last residue position as the kmer's position. Default is first residue
  -pseudo FLOAT        pseudocount added to background counts. default=1.0. Ignored by -markov
  -fontsize INT        font size for plotting sequence logos, default 20
  -colorblind          use colorblind friendly color scheme
  -email STR           send email notification to this address
  -subject STR         email subject (quoted, default='kpLogo job done')
  -content STR         email content (quoted, default='kpLogo job done')
  -small_sample        correct for small sample size
  -plot STR            which statistics to plot:
                           p: raw p-value (default),
                           b: Bonferroni corrected p,
                           f: FDR,
                           s: statistics
  -stack_order -1/0/1  stack residues by frequency (1) or reverse (-1) or alphabet (0)
  -save                save shuffled sequences (*.shuffled.input) or the learned markov model (*.markov.model)
                       in -predict mode, save feature matrix (*.feat.mat)

Background model for unweighted & unranked sequences (ignore if using -ranked or -weighted)
  (default)            compare to the same kmer at other positions
  -bgfile FILE         background sequence file
  -markov INT          N-th order markov model trained from input or background (with -bgfile)
                       N=0,1,or 2. Default N=1: first order captures up to di-nucleotide bias
  -shuffle N,M         shuffle input N times, preserving M-nucleotide frequency
  -no_bg_trim          no background sequence trimming (-region). valid with -markov and -bgfile
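
Two of the option formats above can be illustrated with a short Python sketch: the kmer list taken by -select/-remove/-cdf (comma-separated items of the form seq:position:shift) and the -minCount rule (values below 1 are read as a fraction of the sequences). This is only an illustration of the formats as described in the help text, not kpLogo's own code. The names KmerSpec, parse_kmer_list and min_count_threshold are hypothetical, and how kpLogo itself rounds a fractional threshold is not stated, so the sketch returns it unrounded.

```python
from typing import List, NamedTuple


class KmerSpec(NamedTuple):
    """One item of a -select/-remove/-cdf list (hypothetical helper type)."""
    seq: str       # kmer sequence, may contain degenerate letters such as N
    position: int  # position field, kept exactly as written
    shift: int     # shift field, kept exactly as written


def parse_kmer_spec(spec: str) -> KmerSpec:
    """Split a single 'seq:position:shift' item, e.g. 'CNNC:47:0'."""
    parts = spec.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected seq:position:shift, got {spec!r}")
    seq, position, shift = parts
    return KmerSpec(seq, int(position), int(shift))


def parse_kmer_list(arg: str) -> List[KmerSpec]:
    """Parse a comma-separated list such as 'CNNC:47:0,GGA:3:1'."""
    return [parse_kmer_spec(item) for item in arg.split(",") if item]


def min_count_threshold(min_count: float, n_sequences: int) -> float:
    """Apply the -minCount rule: below 1 means a fraction of the sequences.

    Assumption: the fractional threshold is returned unrounded.
    """
    if min_count < 1:
        return min_count * n_sequences
    return min_count


if __name__ == "__main__":
    for spec in parse_kmer_list("CNNC:47:0,GGA:3:1"):
        print(spec.seq, spec.position, spec.shift)
    print(min_count_threshold(0.25, 400))  # fraction of 400 sequences
    print(min_count_threshold(5, 400))     # absolute count, the default
```

With these helpers, a kmer would be reported only if the number of sequences containing it is at least min_count_threshold(NUM, total number of input sequences).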