kpLogo: k-mer probability logo (version 1.0)

   -by Xuebing Wu (wuxbl@wi.mit.edu), Bartel lab, Whitehead Institute

   Identify statistically enriched/depleted short sequences of length k (kmer)
   at every position in a set of aligned sequences, weighted or unweighted. 
   Degenerate nucleotides can be allowed.

Usage: kpLogo input_file [options]

Options

Type of analysis
   (default)            identify significant kmers using a binomial test
                        - input can be fasta/raw/tabular format
   -ranked              Wilcoxon rank-sum test (i.e. Mann-Whitney U test) on ranked sequences 
                        - input can be fasta/raw/tabular format, but needs to be sorted
   -weighted            Two-sample Student's t-test on weighted sequences
                        - input can only be tabular format, does not need to be sorted
   -predict prefix      use significant kmers from a previous run (-o prefix) to score input sequences
   -gradient INT        divide the data into ranked fractions and run kpLogo on each fraction
   -simple              only output frequency and information content logo
   -pwm                 visualize a PWM (position weight matrix)
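
Example (illustrative sketch only, not kpLogo's actual code): the default
analysis can be pictured as a one-sided binomial test of a kmer's count at one
position against its frequency at all other positions. The function names,
the upper-tail choice, and the way the background probability is estimated
are assumptions made for this example.

```python
from math import comb

def binom_upper_tail(n, x, p):
    """P(X >= x) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(x, n + 1))

def positional_enrichment(seqs, kmer, pos):
    """Test enrichment of `kmer` starting at 1-based position `pos`.

    Assumes all sequences are aligned and of equal length.
    Returns (count_at_pos, background_prob, p_value).
    """
    k = len(kmer)
    n = len(seqs)
    starts = range(len(seqs[0]) - k + 1)          # 0-based start offsets
    at_pos = sum(s[pos - 1:pos - 1 + k] == kmer for s in seqs)
    # Background: the same kmer at every other start position.
    other = sum(s[i:i + k] == kmer for s in seqs for i in starts if i != pos - 1)
    p_bg = other / (n * (len(starts) - 1))
    return at_pos, p_bg, binom_upper_tail(n, at_pos, p_bg)

seqs = ["ACGT", "ACGA", "ACTT", "TCGA"]
count, p_bg, p_value = positional_enrichment(seqs, "A", 1)
print(count, round(p_bg, 4), round(p_value, 4))
```
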
Input
   -alphabet STR        alphabet for generating kmers, default=ACGT, case insensitive
                        note: 'dna' is equivalent to 'ACGT', 'U' will be converted to 'T'
                              'protein' is equivalent to 'ACDEFGHIKLMNPQRSTVWY'
   -seq INT             for tabular input: sequences are in column INT. Default 1
   -weight INT          for tabular input: weights are in column INT. Default 2
   -region n1,n2        only consider subsequences from position n1 to n2 (start at 1). 
                        non-positive numbers interpreted as distance from the end
   -select a,b,c,...    keep sequences that contain any of the specified kmers a, b, c, etc.
                        kmer format: seq:position:shift, e.g. CNNC:47:0 
   -remove a,b,c,...    remove sequences that contain any of the specified kmers
   -cdf a,b,c,...       perform KS test and generate CDF plots for specified kmers (-weighted mode only)
   -fix maxFreq         fix a position with a specific residual if it occurs in more than maxFreq of the sequences
                        fixed residuals will be plotted at 1.1x the height of the position with the highest total height
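
Example (illustrative sketch only): one possible reading of the -region n1,n2
coordinates. It assumes that 0 refers to the last position, -1 to the one
before it, and so on; the help text only says that non-positive numbers are
distances from the end, so check this against real output. The helper names
are hypothetical.

```python
def resolve_region(n1, n2, length):
    """Convert -region style bounds to 1-based inclusive coordinates.

    Assumption: a non-positive n means `length + n`, so 0 is the last position.
    """
    def to_abs(n):
        return n if n > 0 else length + n

    start, end = to_abs(n1), to_abs(n2)
    if not 1 <= start <= end <= length:
        raise ValueError(f"region {n1},{n2} is outside a sequence of length {length}")
    return start, end

def extract_region(seq, n1, n2):
    """Return the subsequence selected by the region bounds."""
    start, end = resolve_region(n1, n2, len(seq))
    return seq[start - 1:end]

print(extract_region("ACGTACGT", 2, 4))    # positions 2 to 4
print(extract_region("ACGTACGT", -2, 0))   # last three positions
```
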
Kmer counting
   -k INT               use fixed kmer length INT
   -max_k INT           consider all kmers of length 1,2,...,INT. default=4 
   -shift INT           max shift (to right) allowed for kmer positions
   -max_shift INT       consider shift from 0 to INT, default=0, i.e. no shift
   -degenerate STR      alphabet to use for degenerate kmers. Subset of all possible IUPAC DNA 
                        residuals (ACGTRYMKWSBDHVN, equivalent to 'all'). One can use 
                        ACGTN to search for gapped kmers. Only works for DNA/RNA sequences
   -gapped              allow gapped kmers, equivalent to '-degenerate ACGTN' 
   -pair                also test all possible pairs of positional monomers
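
Example (illustrative sketch only): matching a degenerate kmer written as
seq:position:shift, e.g. CNNC:47:0. The IUPAC table is standard; treating
shift as "the kmer may start anywhere from position to position+shift" is an
assumption based on the -shift description, and the function names are
hypothetical.

```python
# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "W": "AT", "S": "CG",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def parse_kmer_spec(spec):
    """Split 'CNNC:47:0' into (kmer, 1-based position, shift)."""
    kmer, position, shift = spec.split(":")
    return kmer.upper(), int(position), int(shift)

def matches_at(seq, kmer, start):
    """True if degenerate `kmer` matches `seq` at 0-based offset `start`."""
    window = seq[start:start + len(kmer)].upper().replace("U", "T")
    if len(window) < len(kmer):
        return False
    return all(base in IUPAC[code] for code, base in zip(kmer, window))

def contains_kmer(seq, spec):
    """True if the kmer occurs at its position, allowing shifts to the right."""
    kmer, position, shift = parse_kmer_spec(spec)
    return any(matches_at(seq, kmer, position - 1 + s) for s in range(shift + 1))

print(contains_kmer("GGCATCGG", "CNNC:3:0"))
print(contains_kmer("GGCATCGG", "CNNC:2:1"))
```
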
Statistics & output
   -o STR               prefix for all output files, default=kpLogo
   -minCount NUM        minimum number of sequences that must contain a kmer for it to be included in the output
                        if smaller than 1, treated as a fraction of sequences (default=5)
   -p FLOAT             p-value cut-off, default=1.01 (i.e. output all possible kmers)
   -pc FLOAT            Bonferroni corrected p-value cut-off, default=0.05
   -FDR                 adjust p-values by the FDR method (default is Bonferroni correction)
   -startPos INT        re-number position INT (1,2,3,..) as 1. The position before it will be -1
   -last_residual       use a kmer's last residual position as the kmer's position. Default is first residual
   -pseudo FLOAT        pseudocount added to background counts. default=1.0. Ignored by -markov
   -fontsize INT        font size for plotting sequence logos, default 20
   -colorblind          use colorblind friendly color scheme
   -email STR           send email notification to this address
   -subject STR         email subject (quoted, default='kpLogo job done')
   -content STR         email content (quoted, default='kpLogo job done')
   -small_sample        correct for small sample size
   -plot STR            which statistic to plot: p: raw p-value (default), b: Bonferroni-corrected p, f: FDR, s: statistics
   -stack_order -1/0/1  stack residuals by frequency (1) or reverse (-1) or alphabet (0)
   -save                save shuffled sequences (*.shuffled.input) or the learned markov model (*.markov.model)
                        in -predict mode, save feature matrix (*.feat.mat)
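
Example (illustrative sketch only): how the two p-value adjustments differ.
The help text does not say which FDR procedure is used; Benjamini-Hochberg is
assumed here, and both function names are hypothetical.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

pvals = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(pvals))
print(benjamini_hochberg(pvals))
```

Bonferroni is the more conservative of the two, so a kmer that passes the
-pc cut-off under Bonferroni also passes it under this FDR adjustment.
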
Background model for unweighted & unranked sequences (ignore if using -ranked or -weighted)
   (default)            compare to the same kmer at other positions 
   -bgfile FILE         background sequence file
   -markov INT          N-th order markov model trained from input or background (with -bgfile)
                        N=0, 1, or 2. Default N=1: first order captures up to di-nucleotide bias
   -shuffle N,M         shuffle input N times, preserving M-nucleotide frequency
   -no_bg_trim          no background sequence trimming (-region). valid with -markov and -bgfile
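
Example (illustrative sketch only): what a first-order Markov background
amounts to. The model is estimated from mono- and di-nucleotide counts and
then gives the expected probability of a kmer. No pseudocount is applied,
in line with the note that -pseudo is ignored by -markov; everything else
(function names, handling of unseen transitions) is an assumption and not
kpLogo's actual implementation.

```python
from collections import Counter

def train_markov1(seqs, alphabet="ACGT"):
    """Estimate initial and transition probabilities from sequences."""
    mono = Counter(ch for s in seqs for ch in s)
    total = sum(mono[a] for a in alphabet)
    p0 = {a: mono[a] / total for a in alphabet}

    di = Counter(s[i:i + 2] for s in seqs for i in range(len(s) - 1))
    p1 = {}
    for a in alphabet:
        row = sum(di[a + b] for b in alphabet)
        for b in alphabet:
            # Transitions out of a never-observed context get probability 0.
            p1[a + b] = di[a + b] / row if row else 0.0
    return p0, p1

def kmer_probability(kmer, p0, p1):
    """P(kmer) = P(first base) * product of transition probabilities."""
    prob = p0[kmer[0]]
    for a, b in zip(kmer, kmer[1:]):
        prob *= p1[a + b]
    return prob

p0, p1 = train_markov1(["ACGT", "ACGA"])
print(kmer_probability("ACG", p0, p1))
print(kmer_probability("CGT", p0, p1))
```
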