This post is part of the “Algos in Plain English” post series.
The Knuth-Morris-Pratt (KMP) string search algorithm aims to make searching for a word w in a body of text s efficient. This algorithm is expected to run in O(n), where n is the size of s.
The problem statement that KMP aims to solve is:

Given a word w and a body of text s, find the first index in s where w is found. If w is not found, return -1.
The naive approach repeatedly attempts to find w starting at each index of s, and has a polynomial runtime. The naive approach can be written as follows:
def search(w: str, s: str) -> int:
    m, n = len(w), len(s)
    # Go through the indices of s to start checking
    # for w at each start index
    for i in range(n - m + 1):
        found = True
        # Start going through w starting at index i
        for j in range(m):
            # If there's a mismatch, break out early, as
            # w was definitely not found at index i
            if w[j] != s[i + j]:
                found = False
                break
        if found:
            return i
    return -1
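As a quick sanity check, here are a few example calls (the expected outputs in the comments are mine, for illustration):

print(search("AB", "CAB"))          # 1
print(search("Z", "CAB"))           # -1
print(search("XXXY", "X" * 17 + "Y"))  # 14: w only appears at the very end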
In the worst case, the naive approach above runs in O(m*(n-m+1)). A breakdown of this is:

- O(n-m+1) comes from the outer for-loop
- O(m) comes from the inner for-loop, where w is attempted to be located as a substring of s

In the worst case, the algorithm fully evaluates the inner loop to completion over and over again.
Examples of worst-case inputs are as follows:
# Example 1: w is not found until the very end of s
w = "XXXY"
s = "XXXXXXXXXXXXXXXXXY"
# Example 2: w is repeatedly almost found, but not found in s
w = "XXXY"
s = "XXXXXXXXXXXXXXXXXX"
The naive approach evaluated above repeats work by performing a complete reset after every mismatch, re-examining characters of s it has already seen. This re-examination makes the naive approach inefficient, as it forces the addition of a factor of m to the time complexity.
KMP’s algorithm design is contingent on being able to answer the following while traversing the body of text s, whenever w is deemed to be not found:

w is not found in s based on finding a mismatching character. How much of w have we found, based on where the mismatching character was found in w and s?
To make this clearer, let’s take the following example inputs:
w = "YYYZ"
s = "YYYYZ"
If the naive algorithm were used, the window "YYYY" in s would be evaluated and found not to match w. To find w in s, the naive algorithm would inevitably re-examine the suffix "YYY" of "YYYY". However, a smarter algorithm should be able to recognize that "YYYY"’s suffix "YYY" is a prefix of w. This would save the algorithm the time expenditure of re-examining characters in s.
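To make that observation concrete, here's a tiny illustrative check (the variable name examined_window is mine):

w = "YYYZ"
examined_window = "YYYY"  # the window of s the naive scan just examined
# The suffix "YYY" of the examined window is also a prefix of w, so a
# smarter scan could resume matching at w[3] without re-reading s
assert examined_window[1:] == w[:3]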
KMP aims to enable these time savings. KMP enables string search by performing two steps.
Step 1: Build T
The goal of this step is the following:

Build a data structure T where index i describes the length of the longest proper prefix of w[0..i] that’s also a suffix.
Of note: a “proper” prefix is just a prefix that excludes the whole word (e.g. the proper prefixes of "ABC" are "", "A", and "AB"; "ABC" itself is excluded).
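To make this definition concrete, here's a brute-force sketch that computes T directly from the definition (quadratic and purely illustrative; the name compute_T_naive is mine, and the efficient version appears later):

from typing import List

def compute_T_naive(w: str) -> List[int]:
    T = []
    for i in range(len(w)):
        window = w[: i + 1]
        # Try every proper prefix length k, longest first, and keep
        # the first one that is also a suffix of the window
        for k in range(len(window) - 1, -1, -1):
            if window[:k] == window[len(window) - k:]:
                T.append(k)
                break
    return T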
T would look like the following for the string "YYYY":
T = [0, 1, 2, 3]
T would look like the following for the string "ZZYZZXZZYZZ":
T = [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]
Step 2: Find w in s
The goal of this step is the following:

Find w in s, using T as a way to prevent re-examination of characters in w and s.
The way this is accomplished is by matching characters in w and s in the same way this is performed in the naive algorithm. When a mismatch is found, T is used to determine how much re-examination can be “skipped”: we know exactly how much of the beginning of w we’ve found, and we can continue scanning s from where we left off.
Below is sample code for KMP. This is a heavily-commented and tweaked version of some of the code found in the detailed GeeksforGeeks article on KMP.
from typing import List

def compute_T(w: str) -> List[int]:
    # Handle edge-case where w is empty
    if not w:
        return []
    T = [0] * len(w)
    prefix_len = 0
    # Starts at 1 b/c T[0] = 0 always; it's impossible for a
    # proper prefix to have length > 0 if the string is only one
    # character
    i = 1
    while i < len(w):
        # If the current character in the prefix matches the
        # character at the end of the w[0..i] window, extend
        # the candidate prefix length and set T[i] = prefix_len
        if w[prefix_len] == w[i]:
            prefix_len += 1
            T[i] = prefix_len
            i += 1
        else:
            # "Back-up" in T to attempt to find a match and continue
            # advancing
            if prefix_len > 0:
                prefix_len = T[prefix_len - 1]
            # If prefix_len is 0, just set T[i] = 0 and advance i
            else:
                T[i] = 0
                i += 1
    return T
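# Sanity check (illustrative, mine): compute_T reproduces the examples
# shown earlier in this post
assert compute_T("YYYY") == [0, 1, 2, 3]
assert compute_T("ZZYZZXZZYZZ") == [0, 1, 0, 1, 2, 0, 1, 2, 3, 4, 5]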
def kmp(w: str, s: str) -> int:
    # Handle edge-case where w is empty; an empty word is trivially
    # found at index 0 (matching the naive search's behavior)
    if not w:
        return 0
    T = compute_T(w)
    # i = position in s being inspected
    # j = position in w being inspected
    i, j = 0, 0
    while i < len(s):
        # Advance the positions in w and s; same as in the
        # naive algorithm
        if w[j] == s[i]:
            i += 1
            j += 1
        # The characters in w and s do not match and some of
        # w has been found
        elif j > 0:
            # "Back-up" in T to grab the length of w that WAS matched
            # so far
            j = T[j - 1]
        else:
            # None of w has been found, so just advance where in s
            # is being considered for character matching
            i += 1
        # Do this last so that we can break out if the string was found;
        # this avoids having to do an extra check after the while-loop
        if j == len(w):
            # No +1 because i will be one ahead of where
            # the substring was found
            return i - j
    # No match was found in the while-loop, so return -1 as a
    # failure indicator
    return -1
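As a quick usage example (the expected outputs in the comments are mine, tied to the examples used in this post):

print(kmp("ABCDABD", "ABCDABYABCDABD"))  # 7 (the walkthrough input below)
print(kmp("XXXY", "X" * 17 + "Y"))       # 14 (Example 1's shape)
print(kmp("XXXY", "X" * 18))             # -1 (Example 2's shape)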
How is T being used?

The most nuanced part of both the pre-processing algorithm and KMP itself is the part where the algorithm “backs up” and grabs a value from T to attempt future matching. This is the main optimization KMP has over the naive approach, and it’s worth understanding.
Let’s take an example:
s = "ABCDABYABCDABD"
w = "ABCDABD"
s will match w until the "Y" != "D" mismatch. What KMP will then do is “back up” and set j = T[j-1]. Why? Because w[0..j-1] represents "ABCDAB" at this point. KMP is grabbing the length value from T that represents what we HAVE matched of w. The length will be 2 at this point, representing the proper prefix and suffix "AB".
KMP will then try to match "Y" to "C", resulting in a failed match:

"ABCDABYABCDABD"
    "ABCDABD"
       ^
This failure causes KMP to set the value of j to T[j-1], which is going to be 0. This leads to one more comparison, "Y" to "A", resulting in a failed match:

"ABCDABYABCDABD"
      "ABCDABD"
       ^
KMP then advances the window being considered, since "Y" isn’t going to match with any characters in w. Comparison resumes in this way:

"ABCDABYABCDABD"
       "ABCDABD"
        ^
The above logic happens the same way for building T itself. It uses prior knowledge of T to build future entries in T. If we consider the current, working prefix as an expanding string that is attempted to be found in w, the above reasoning can be applied in the same way to the T-building algorithm. Thinking through this nuance is a good intellectual exercise, and I encourage you to draw up some examples!
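For instance, here's what compute_T produces for the walkthrough pattern above (the expected output in the comment is mine):

print(compute_T("ABCDABD"))  # [0, 0, 0, 0, 1, 2, 0]
# The walkthrough's first mismatch happened at j = 6, and
# T[6 - 1] = T[5] = 2: the length of the matched prefix/suffix "AB"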
The overall KMP algorithm has a time complexity of O(n), where n = len(s). This assumes s is larger than w. The biggest thing to understand for this is: how can the algorithm have this linear time complexity if we’re “backing up” through previous indices sometimes (e.g. the j = T[j - 1] step, when i is not advanced)?
The answer: in order to “expand” j to some size, we need to make progress via matching (a.k.a. working through s). This means we can only “contract” j at most O(n) times across the entire algorithm. This leaves us with an overall runtime of O(2n), which is just O(n). The same logic can be used to understand why the building of T also has linear time complexity.
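As a rough empirical illustration of this argument (not a proof), here's a sketch that counts the while-loop iterations in kmp; the kmp_iterations helper is mine and assumes a non-empty w:

def kmp_iterations(w: str, s: str) -> int:
    # Same loop as kmp(), but counts iterations instead of
    # returning a match index
    T = compute_T(w)
    i, j, iterations = 0, 0, 0
    while i < len(s):
        iterations += 1
        if w[j] == s[i]:
            i += 1
            j += 1
        elif j > 0:
            j = T[j - 1]
        else:
            i += 1
        if j == len(w):
            break
    return iterations

# Even on a worst-case input, iterations stay within 2 * len(s)
assert kmp_iterations("XXXY", "X" * 1000) <= 2 * 1000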
Through a clever pre-computed data structure T that stores the length of the shared proper prefix and suffix of w[0..i] at index i, KMP enables linear string-search time complexity. Understanding this algorithm is powerful because it builds up reusable concepts involving prefixes and suffixes of substrings.
If you want to practice KMP, a good exercise is LeetCode’s “Implement strStr()” question.