Given a set of sequences of identical length, which may be unweighted, weighted, or ranked, kpLogo enumerates all possible kmers of user-specified lengths, evaluates their presence at each position in all input sequences, and reports their enrichment and depletion at each position using an appropriate statistical model. To visualize the results, kpLogo generates a new type of logo plot called the kmer logo, in which the most significant kmer at each position is plotted vertically, with the total height scaled to its P value (-log10 transformed) or test statistic, as appropriate. In addition to the kmer logo, kpLogo also generates logo plots for monomer frequency and information content.
The diagram below highlights two overlapping CNNC motifs at specific positions within primary transcripts of human microRNAs, which are important for microRNA processing (see Examples):
Limitations of current motif tools
Existing motif discovery and visualization tools each have fundamental limitations, but there is also strong synergy between them:
Motif visualization (sequence logos): 1) cannot handle weighted or ranked sequences; 2) models positions at single-nucleotide resolution but lacks the ability to model and detect positional interdependence in motifs.
Motif discovery: handles weighted or ranked sequences and models positional interdependence, but typically ignores positional information and thus misses ultra-short motifs (1-4 letters long) or other information-poor motifs whose specificities are conferred by both sequence identity and relative position.
Unique features of kpLogo
kpLogo combines motif discovery with visualization. Compared to other tools that generate sequence logos, kpLogo is unique in several ways:
Users paste or upload sequences to the server, choose the sequence type (DNA/RNA, protein, or other) and analysis type (unweighted, weighted, or ranked), then submit the job using the default settings. Once the job is submitted, users are redirected to the result page, which refreshes every second until the job is finished.
Note that user data will be removed 10 days after the job is submitted. Please download the data as soon as it is available. If the user wishes, all data related to the job can be deleted immediately using a button in the result page. No copy of the data will be stored or used for other purposes.
We recommend providing an email address (optional) to receive notifications. If provided, two emails will be sent: one immediately after the job is submitted, and the other after results are available. Both emails contain a link to the result page. The second email also contains a PDF file with all logo plots. Note that the result page must stay open in a web browser for the second email to be sent. If the result page is closed before the job is finished, the second email will not be sent, but users can still open the result page using the link in the first email.
Command-line options: When -alphabet is not specified by users, kpLogo uses the default -alphabet ACGT. Note that RNA is converted to DNA by default, so for RNA input there is no need to specify -alphabet ACGU. To run kpLogo on protein sequences, use -alphabet protein, which is equivalent to -alphabet ACDEFGHIKLMNPQRSTVWY. Similarly, to run kpLogo on other types of sequences, use -alphabet to specify all allowed residues.
Tabular / Column file
kpLogo treats any of the following characters as a delimiter: tab ("\t"), comma (","), vertical bar ("|"), and space (" ").
Command-line options: -seq 2 -weight 3
FASTA
Here is an example of a FASTA file:
FASTA input can be used with -region and -ranked, but not with -weighted (weighted input needs the tabular format).
Command-line options: -region start,end
A positional kmer (or simply kmer in this document) is a short sequence of length k associated with a specific position. Throughout this document, a positional kmer has the format sequence:position:shift, or simply sequence:position if no shift is allowed (default). For example, most introns start with a GU dinucleotide, and this dinucleotide motif (k = 2) can be represented as GU:1:0, i.e. GU at position 1 with no shift. The position of the first residue is used as the position of the kmer.
Biological sequence signals can tolerate some mismatches, leading to degeneracy in the motif. For example, in addition to AAUAAA, AUUAAA is also used as a polyadenylation signal. Therefore a more general form of the polyadenylation signal is AWUAAA, where W is either A or U. kpLogo supports all degenerate codes defined in the IUPAC code for nucleic acids (see table below).
The default of the webserver is to allow N (any of A/C/G/T/U) in the middle of a kmer, which represents a gapped kmer (-degenerate ACGTN or simply -gapped). Users can choose not to include degenerate residues (the command-line default), use all IUPAC-defined residues (-degenerate all), or allow only a subset of IUPAC residues by choosing 'Other' in the dropdown menu.
For example, to allow only purine (R = A or G) and pyrimidine (Y = C or T), enter ACGTRY in the text box, including the non-degenerate residues.
Symbol  Description                      Bases represented  Number of bases
A       Adenine                          A                  1
C       Cytosine                         C                  1
G       Guanine                          G                  1
T       Thymine                          T                  1
U       Uracil                           U                  1
W       Weak                             A, T               2
S       Strong                           C, G               2
M       aMino                            A, C               2
K       Keto                             G, T               2
R       puRine                           A, G               2
Y       pYrimidine                       C, T               2
B       not A (B comes after A)          C, G, T            3
D       not C (D comes after C)          A, G, T            3
H       not G (H comes after G)          A, C, T            3
V       not T (V comes after T and U)    A, C, G            3
N       No idea                          A, C, G, T         4
In many cases, sequence motifs are well positioned but can tolerate some offset. Such positional flexibility is modeled by the positional shift in kpLogo. For example, the canonical polyadenylation signal, AAUAAA, which is typically located about 18 to 23 nucleotides upstream of the cleavage site, can be represented as AAUAAA:-23:5, i.e. the hexamer motif can be found at position -23, allowing a shift of 5, which means the motif can also occur at 5 other positions: -22, -21, -20, -19, and -18. In kpLogo, the position of a kmer is defined as the position of the first residue in the kmer, and positional shift is only allowed downstream of the start site, i.e. shift to the right. Positional shift is an experimental feature still under development. For example, it is currently not fully supported by some background models (see below).
By default no positional shift is allowed (shift = 0). In the webserver users can choose a shift of up to 6. The command-line tool allows larger shifts.
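The matching rule described above (a kmer anchored at the position of its first residue, optionally shifted to the right, with IUPAC degenerate codes) can be sketched in Python as follows. This is only an illustration, not kpLogo's actual code: the function names are hypothetical, and it assumes positive 1-based coordinates and that U is treated as T.

```python
# Illustrative sketch only; names are hypothetical, not part of kpLogo.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def matches_at(seq, kmer, start):
    """True if kmer (possibly degenerate) matches seq at 1-based position start."""
    i = start - 1
    if i < 0 or i + len(kmer) > len(seq):
        return False
    # Assumption: RNA is converted to DNA before matching.
    window = seq[i:i + len(kmer)].upper().replace("U", "T")
    return all(base in IUPAC[code] for code, base in zip(kmer.upper(), window))

def has_positional_kmer(seq, spec):
    """spec is 'sequence:position' or 'sequence:position:shift'."""
    parts = spec.split(":")
    kmer, pos = parts[0], int(parts[1])
    shift = int(parts[2]) if len(parts) > 2 else 0
    # The shift only extends to the right of the stated position.
    return any(matches_at(seq, kmer, pos + s) for s in range(shift + 1))
```

For instance, has_positional_kmer("AAGTCC", "GT:2:1") checks for GT starting at position 2 or 3.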
kpLogo allows users to filter input sequences by the presence (-select) or absence (-remove) of a list of positional kmers. This feature enables one to study dependence between positional kmers.
A positional kmer is specified in the form of sequence:position:shift, or sequence:position using a default shift of 0. For example, GT:3:0 or GT:3 means GT starting at position 3 with no shift. Note that the sequence can contain degenerate residues. Position coordinates start at 1 for the first residue in each sequence.
To filter by multiple positional kmers, separate each by a comma (,). For example, GT:3:0,ACT:11:0,GGG:45:0 (no space!). Note that when multiple positional kmers are specified, the presence of any positional kmer is defined as a match.
Here are two more examples:
-select G:35:3,CNC:47:0
Select (keep) sequences with G at position 35, 36, 37, or 38 (note shift = 3), OR with CNC at position 47. Here N means any nucleotide, so CNC can be any of CAC, CGC, CTC, and CCC. This removes sequences that contain none of these positional kmers.
-remove G:35:0,CNC:47:0
Remove sequences containing either G at position 35 or CNC at position 47. Sequences that contain neither of these two kmers will be analyzed by kpLogo.
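The select / remove semantics above (a sequence matches if it contains any of the listed positional kmers) can be sketched as below. This is a hedged illustration with hypothetical function names, assuming DNA input and positive 1-based positions; it is not kpLogo's implementation.

```python
# Illustrative sketch only; names are hypothetical, not part of kpLogo.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "S": "CG", "M": "AC", "K": "GT", "R": "AG", "Y": "CT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def has_kmer(seq, spec):
    """Test one positional kmer given as sequence:position[:shift]."""
    kmer, pos, *rest = spec.split(":")
    shift = int(rest[0]) if rest else 0
    for start in range(int(pos), int(pos) + shift + 1):
        window = seq[start - 1:start - 1 + len(kmer)]
        if len(window) == len(kmer) and all(
            base in IUPAC[code] for code, base in zip(kmer, window)
        ):
            return True
    return False

def filter_sequences(seqs, specs, mode="select"):
    """Keep (mode='select') or drop (mode='remove') sequences matching ANY kmer."""
    wanted = specs.split(",")  # comma-separated, no spaces
    kept = []
    for seq in seqs:
        hit = any(has_kmer(seq, spec) for spec in wanted)
        if hit == (mode == "select"):
            kept.append(seq)
    return kept
```

With shorter toy sequences, filter_sequences(seqs, "G:3:1,CNC:6:0", "select") keeps sequences with G at position 3 or 4, or CNC at position 6; mode="remove" keeps the rest.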
For a given value of k, kpLogo enumerates all possible kmers (allowing degenerate residues when specified), determines their presence / absence at every possible position in all input sequences, and reports their enrichment / depletion using statistical models specified by users.
When neither -ranked nor -weighted is specified, kpLogo assumes the input is unweighted. kpLogo uses one-sided binomial tests to determine the significance of a kmer at a position. Specifically, the number of trials is the total number of sequences, and the number of successes is the number of sequences containing this kmer at this position. The hypothesized probability of success, i.e. the expected / background probability of a sequence containing this kmer at this position, can be calculated in several ways as specified by users (see details below). A z-score is also reported in the output, assuming the normal approximation. The sign of the z-score indicates enrichment or depletion.
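As a rough illustration of the test described above, the following Python sketch computes both one-sided binomial tail probabilities and the normal-approximation z-score for a single kmer at a single position. The function name is hypothetical and the code is not taken from kpLogo; it assumes the background probability p0 lies strictly between 0 and 1.

```python
import math

def binomial_test(n, x, p0):
    """One-sided binomial tests for x successes in n trials.

    n  : total number of sequences (trials)
    x  : sequences containing the kmer at this position (successes)
    p0 : background probability, assumed to satisfy 0 < p0 < 1
    Returns (P(X >= x), P(X <= x), z-score).
    """
    def pmf(i):
        # Work in log space so large n does not overflow.
        log_p = (math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
                 + i * math.log(p0) + (n - i) * math.log(1 - p0))
        return math.exp(log_p)

    p_enriched = sum(pmf(i) for i in range(x, n + 1))
    p_depleted = sum(pmf(i) for i in range(0, x + 1))
    # Normal approximation: positive z means enrichment, negative means depletion.
    z = (x - n * p0) / math.sqrt(n * p0 * (1 - p0))
    return p_enriched, p_depleted, z
```

For example, binomial_test(100, 10, 0.25) gives a negative z-score, indicating depletion relative to a 25% background.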
Background sequences. Users can provide a set of background sequences (-bgfile) that can be aligned in the same way. In this case, the expected probability of a sequence with a particular kmer at a particular position is estimated by the fraction of background sequences with this kmer at this position.
The potential advantage of using carefully selected background sequences is that it may enrich desired signals by normalizing out known or even unknown sequence bias. However, a background sequence set may not be available, or may not be large enough for reliable estimation of the expected probability of success. A small set of background sequences can lead to noisy output, especially for longer kmers. To avoid a zero estimate of the expected probability, a pseudocount is added to kmer counts, which can be changed using -pseudo. The default value of the pseudocount is 1.
-shuffle n,m (shuffle each sequence n times, preserving m-residue frequency). kpLogo uses the ushuffle code to shuffle input sequences while preserving the frequency of all possible residue combinations of a length specified by users. For example, in many genomic applications, DNA sequences are shuffled while preserving dinucleotide frequencies. Each sequence can be shuffled multiple times (default = 100) to enable reliable estimation of the background probability of sequences containing the kmer at a particular position. However, this leads to a linear increase in run time. In addition, for relatively short sequences, it is impossible to preserve even dinucleotide frequencies. As when using background sequences, a pseudocount is used to avoid a zero estimate of the expected probability.
Markov model from input sequences. This is activated by -markov, which specifies the order of the Markov model. The model summarizes residue frequency and dependence between neighboring residues in input sequences. Currently kpLogo supports zeroth-, first-, and second-order Markov models. The hypothesized probability of a sequence containing a particular kmer at a particular position is then determined by the Markov model.
For example, the first-order Markov model learned from input sequences includes all mono- and di-residue frequencies, which directly serve as the hypothesized probabilities for kmers of length 1 and 2. For longer kmers, the hypothesized probability is calculated using the Markov property. For example, under a first-order model the probability of observing the kmer ABCD at a particular position is $P(ABCD) = P(AB)\,P(BC)\,P(CD) / (P(B)\,P(C))$.
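A minimal sketch of the first-order case is shown below, assuming mono- and di-residue frequencies are pooled over all positions of the input sequences and no pseudocount is applied. The function names are hypothetical and this is not kpLogo's code; it simply applies the chain rule P(ABCD) = P(AB) * P(BC)/P(B) * P(CD)/P(C).

```python
from collections import Counter

def first_order_model(seqs):
    """Learn mono- and di-residue frequencies pooled over all positions."""
    mono, di = Counter(), Counter()
    for s in seqs:
        mono.update(s)
        di.update(s[i:i + 2] for i in range(len(s) - 1))
    n_mono, n_di = sum(mono.values()), sum(di.values())
    p1 = {r: c / n_mono for r, c in mono.items()}
    p2 = {d: c / n_di for d, c in di.items()}
    return p1, p2

def kmer_probability(kmer, p1, p2):
    """Probability of a kmer under the first-order model."""
    if len(kmer) == 1:
        return p1.get(kmer, 0.0)
    prob = p2.get(kmer[:2], 0.0)
    for i in range(1, len(kmer) - 1):
        current = p1.get(kmer[i], 0.0)
        if current == 0.0:
            return 0.0
        # Multiply by P(next | current) = P(current, next) / P(current).
        prob *= p2.get(kmer[i:i + 2], 0.0) / current
    return prob
```

For kmers of length 1 and 2 the function returns the learned frequencies directly, matching the description above.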
Markov model from background sequences. This is activated when both -markov and -bgfile are specified. The only difference from -markov alone is that the Markov model is learned from the background sequences specified by -bgfile.
Average across all positions (default). This is activated by default if none of the above models is specified (-bgfile, -shuffle, -markov). It is preferred because it is fast, less noisy, and handles both positional shift and degenerate residues. In this model, the hypothesized probability of a sequence with a certain kmer at a certain position is the observed success rate (fraction of input sequences containing this kmer at a given position) averaged over all positions.
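The default background can be sketched as follows for an exact (non-degenerate, unshifted) kmer. The function name is hypothetical, and the sketch assumes all sequences have the same length and that only positions where the full kmer fits are counted; kpLogo's own handling of shifts and degenerate residues is not reproduced here.

```python
def average_background(seqs, kmer):
    """Per-position success rates of a kmer and their average over positions."""
    length, k = len(seqs[0]), len(kmer)
    n_positions = length - k + 1  # assumption: only positions where the kmer fits
    rates = []
    for start in range(n_positions):
        hits = sum(1 for s in seqs if s[start:start + k] == kmer)
        rates.append(hits / len(seqs))
    return rates, sum(rates) / n_positions
```

The returned average would then serve as the hypothesized probability of success at every position in the binomial test.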
When -weighted is specified, kpLogo searches for kmers at certain positions whose presence / absence in a sequence is associated with higher / lower weights, respectively. For each kmer at each position, all input sequences are divided into two sets: those with and those without this kmer at this position. A one-sided two-sample Student's t test is used to compare the weights of the two sets of sequences. The sign of the t statistic determines whether it is an enrichment (sequences with this kmer at this position have larger weights) or a depletion.
With -ranked, kpLogo assumes the input sequences (FASTA or tabular) are ranked, and enrichment / depletion of a kmer at a position means sequences with this kmer at this position tend to be ranked at the top / bottom, respectively. The Wilcoxon rank-sum test, also called the Mann-Whitney U test, is used to determine the significance of positional enrichment / depletion of a kmer. Assuming a sample size larger than 8, the distribution of the U statistic is approximated by a normal distribution. The resulting z-score and P value from the normal approximation are reported in the output. Again, the sign of the z-score determines enrichment or depletion.
Weighted input can also be analyzed with -ranked, which will ignore -weight if specified. Also note that ranked sequences need to be sorted prior to kpLogo analysis.
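The normal approximation to the rank-sum test can be sketched as below for one kmer at one position. This is an illustration under stated assumptions rather than kpLogo's implementation: the function name is hypothetical, the input is a list of booleans in ranked order (first element is the top-ranked sequence), the sign convention (positive z for kmers ranked near the top) and the one-sided P value are choices made for this example, and no tie or continuity correction is applied.

```python
import math

def rank_sum_z(has_kmer):
    """has_kmer: booleans in ranked order, index 0 being the top-ranked sequence."""
    n1 = sum(has_kmer)            # sequences with the kmer
    n2 = len(has_kmer) - n1       # sequences without the kmer
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups must be non-empty")
    rank_sum = sum(rank for rank, flag in enumerate(has_kmer, start=1) if flag)
    u1 = rank_sum - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    # Assumed convention: small ranks (top of the list) give a positive z.
    z = (mean_u - u1) / sd_u
    p_one_sided = 0.5 * math.erfc(abs(z) / math.sqrt(2))
    return z, p_one_sided
```

For example, rank_sum_z([True] * 5 + [False] * 5) returns a positive z-score because all sequences containing the kmer are ranked at the top.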
Min count/fraction (-minCount): minimum number or fraction of sequences that must contain a kmer for it to be included in the output. Default is 5. A value between 0 and 1 is interpreted as the minimum fraction.
Pseudocount (-pseudo): pseudocount added to background counts. Default is 1. Ignored by Markov models (-markov).
Small-sample correction (-small): used for generating information content logos.
kpLogo outputs both sequence logo plots and tabular data files.
In frequency logos, residues are scaled relative to their frequencies at each position, and then stacked on top of each other with the more frequent residue on top of less frequent residue.
Similar to frequency logos, in an information-content logo (conventional sequence logo) residues are scaled by frequency and stacked, and the total height of each position is further scaled relative to the information content, which is defined as $I = \log_2 N + \sum_{i=1}^{N} p_i \log_2 p_i$.
Here N is the number of residues in the alphabet (all possible residues that can occur at a position), and $p_i$ is the frequency of residue i at this position.
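As a small worked example, the sketch below computes the information content of one alignment column using the standard Shannon form, log2(N) plus the sum of p_i * log2(p_i). The function name is hypothetical, the small-sample correction is omitted, and the exact formula used by kpLogo is assumed rather than confirmed here.

```python
import math
from collections import Counter

def information_content(column, alphabet="ACGT"):
    """Information content (bits) of one column, without small-sample correction."""
    counts = Counter(column)
    total = len(column)
    ic = math.log2(len(alphabet))  # maximum information: log2(N)
    for residue in alphabet:
        p = counts[residue] / total
        if p > 0:
            ic += p * math.log2(p)  # adds a non-positive term (minus the entropy)
    return ic
```

A fully conserved DNA position yields 2 bits, while a position with all four nucleotides at equal frequency yields 0 bits.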
The probability logo was first proposed and implemented in pLogo for unweighted sequences, especially protein sequences. kpLogo extends the concept to ranked and weighted sequences. In probability logos, residues are scaled relative to the statistical significance (-log10(P value)) of each residue at each position. Enriched residues stack on the top, whereas depleted residues stack on the bottom. Significant positions have coordinates colored in red.
kpLogo generates a new type of logo called kmer logo to visualize motifs in addition to single residues. Stacked residues at a position represent a single motif (the most significant) starting (or ending) at this position. The total height is scaled relative to the significance of this motif. At each position, two motifs are shown, the most enriched on the top, and the most depleted on the bottom. Both read from top to bottom.
In probability logo and kmer logo plots, positions containing residues with Bonferroni corrected P value smaller than a predefined cutoff will be highlighted by red coordinates. The default cutoff is 0.01 in the webserver.
In many applications the input sequences may contain almost invariant residues at certain positions, such as GG in Cas9 guide RNA target sequences (see example below). The statistical model does not work well for such extremely imbalanced samples, or the P values are so small that it is hard to see signal at other positions. kpLogo allows users to fix positions where a single residue occurs in more than 75% of the sequences. This maximum frequency can be changed using the -fix option. P values for fixed positions are ignored in probability logos, and the height of the fixed residue is 1.1 times the maximum total height of non-fixed positions. The coordinates of fixed positions are shown in black.
In all logo plots other than the kmer logo, multiple residues can stack on top of each other. By default, residues with larger weights are placed on top of residues with smaller weights. This can be reversed with -stack_order 1. To ignore weight information and use the order specified by the alphabet, use -stack_order 0.
By default residues or kmers are scaled by -log10 transformed P values. To emphasize the most significant motifs, one can plot Bonferroni-corrected P values instead. One can also plot test statistics. In cases where P values are too small to be represented by the computer (P < 1E-308), kpLogo will automatically switch to test statistics.
In many cases it makes more sense to designate a position other than the first residue as position 1. In that case the positions before it are numbered -1, -2, etc. (there is no position 0).
Regardless of the type of input (unweighted/ranked/weighted), the first six columns of the output data file are always the same:
Output from ranked inputs has no columns other than the six above. Note that the statistic and P value in columns 4 and 5 are based on the normal approximation. See:
Unweighted inputs will generate four additional columns in the output:
For typical DNA/RNA analysis, the job should finish within a minute. Run time increases dramatically with increased kmer length, and even more with protein sequences. The program imposes a limit of 10 million tests per job, which equals the number of kmers multiplied by the number of positions (length of each sequence). On the webserver, the largest number of kmers to be tested for protein is 20^1+20^2+20^3=8420 for kmer of length (1,2,3), which limits the maximum length of each input sequence to 1188 (10,000,000 / 8420), a number large enough for most applications. This limit becomes much larger if users choose to use shorter kmers. Similarly, for DNA/RNA sequence analysis, the limit on input sequence length is 1832 for kmer length of (1,2,3,4,5,6) [=10,000,000/(4^1+4^2+4^3+4^4+4^5+4^6)] without considering degenerate letters. The limit will vary dramatically if users choose to use different degenerate letters. If the limit is reached, the webserver reports an error for too many tests, and users can go back and choose shorter kmers and / or omit degenerate letters.
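The arithmetic behind these limits can be reproduced with a short Python sketch. The function names are hypothetical; the 10 million test limit and the kmer lengths come from the text above, degenerate letters are not considered, and the result is rounded to the nearest integer to match the figures quoted.

```python
def n_kmers(alphabet_size, lengths):
    """Number of distinct non-degenerate kmers over all requested lengths."""
    return sum(alphabet_size ** k for k in lengths)

def max_sequence_length(alphabet_size, lengths, max_tests=10_000_000):
    """Approximate longest input sequence allowed under the test limit."""
    # Tests = number of kmers x number of positions; rounded as in the text.
    return round(max_tests / n_kmers(alphabet_size, lengths))

protein_kmers = n_kmers(20, [1, 2, 3])           # 20 + 400 + 8000
dna_kmers = n_kmers(4, [1, 2, 3, 4, 5, 6])       # 4 + 16 + ... + 4096
protein_limit = max_sequence_length(20, [1, 2, 3])
dna_limit = max_sequence_length(4, [1, 2, 3, 4, 5, 6])
```

Shorter kmer lengths shrink the kmer count and therefore raise the allowed sequence length, as noted above.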
Here are a few examples:
42,481 protein sequences, each 40 amino acids long: It took 8 seconds, 1.5 minutes, and 34 minutes to run with a kmer length up to 1, 2, and 3, respectively.
186 RNA sequences, each 50 nt long: It took 6 seconds to run with a kmer length up to 4, allowing gaps (i.e. the degenerate letter N), but 14 minutes if allowing all degenerate letters (ACGTRYMKWSBDHVN).
kpLogo has been wrapped as a Galaxy tool and is available in the Galaxy Tool Shed. To run kpLogo in a local instance of Galaxy, one still needs to compile and install kpLogo locally (see next section) and add the executable kpLogo to the path, so that kpLogo can be called directly by typing "kpLogo" in a terminal window.
kpLogo is an open-source project; users can download the source code, then compile and install it locally to enable all functionality.
Compiling requires a C++ compiler that supports -std=c++11.
To compile, go to the folder src, then type make. The executables are placed in the folder bin.
To recompile, type make clean then make.
Xuebing Wu and David Bartel (2017) kpLogo: positional kmer analysis reveals hidden specificity in biological sequences, Nucleic Acids Research, in press.
Web: no need to change default settings. If you want to renumber the coordinates such that the two nucleotides flanking the cleavage site are numbered as -1 and 1, respectively, set 'Position 1' to be 26 in the 'Output' option groups.
Full command line:
kpLogo premir50p3.fa -gapped -startPos 26 -pc 0.01
kpLogo outputs four types of logos:
Web: select 'Weighted' or 'Ranked' in the 'Input' option groups. They give similar results. If you want to renumber the coordinates such that the first nucleotide of the guideRNA match is numbered as 1, set 'Position 1' to be 5 in the 'Output' option groups.
Full command line:
kpLogo gRNA.txt -weighted -gapped -startPos 5 -pc 0.01
kpLogo gRNA.txt -ranked -gapped -startPos 5 -pc 0.01
kpLogo outputs four types of logos:
The command-line version of kpLogo supports scoring a list of new sequences using positional kmer motifs learned from a previous run. Briefly, the score of a given sequence is the sum of log10(P value) of all positional kmers present in the sequence.
As a toy example, here we will score the guide RNA sequences using the positional motifs we learned from themselves:
Full command line:
# training
kpLogo gRNA.txt -weighted -o model_training
# prediction
kpLogo gRNA.txt -predict model_training
# the output file model_training.score contains the score for each sequence
Copyright (c) 2017 Xuebing Wu.
kpLogo reuses the following library/code:
Boost library (version 1.57.0)
The ushuffle code written by Minghui Jiang
kpLogo is developed and maintained by Xuebing Wu in the Bartel lab at Whitehead Institute.
I would like to thank the Whitehead Institute IT systems team (especially Paul McCabe, Andy Nutter-Upham, Craig Andrew, and Alexan Mardigian) for help with the server setup.