Community for developers to learn, share their programming knowledge. Register!
String Matching Algorithms

Knuth-Morris-Pratt (KMP) Algorithm


You can get training on our article to understand one of the fundamental algorithms in string matching: the Knuth-Morris-Pratt (KMP) algorithm. The KMP algorithm is an efficient solution to the problem of finding all occurrences of a pattern within a text. This algorithm was developed by computer scientists Donald Knuth, Vaughan Pratt, and James H. Morris in 1977. It is considered groundbreaking because it introduced a systematic way to avoid unnecessary comparisons in string matching, making it significantly faster than naive approaches.

At its core, the KMP algorithm preprocesses the pattern to create a data structure called the prefix function (or failure function). This preprocessing allows the algorithm to skip rechecking characters that are already known to match. As a result, KMP is particularly useful in scenarios where you need to search for a pattern multiple times in a large text, such as in text editors, search engines, or bioinformatics applications.

How KMP Algorithm Works

To understand how the KMP algorithm operates, let’s break it down into two key phases:

  • Preprocessing Phase: The algorithm first processes the input pattern to construct the prefix function. This function records the length of the longest proper prefix of the pattern that is also a suffix for every position in the pattern. This preprocessing step enables the algorithm to efficiently skip redundant comparisons later in the matching phase.
  • Search Phase: Once the prefix function is built, the actual search begins. The algorithm compares the pattern with the text from left to right. If there is a mismatch, instead of restarting the comparison from the beginning of the pattern, KMP uses the information stored in the prefix function to determine how far to shift the pattern.

Let’s consider a quick example. Suppose we want to find the pattern "ABABC" in the text "ABABDABABC". The naive approach would involve re-evaluating already checked characters, but KMP skips unnecessary comparisons using the prefix function.

Prefix Function in KMP Algorithm

The prefix function is central to the efficiency of the KMP algorithm. It is represented as an array, where each entry at index i stores the length of the longest proper prefix of the pattern that matches a suffix ending at position i. A proper prefix is defined as a prefix that does not include the entire string.

For example, consider the pattern "ABABC". The prefix function for this pattern is computed as follows:

  • At index 0 (A): There is no proper prefix, so the value is 0.
  • At index 1 (AB): No proper prefix matches the suffix, so the value is 0.
  • At index 2 (ABA): The prefix "A" matches the suffix "A", so the value is 1.
  • At index 3 (ABAB): The prefix "AB" matches the suffix "AB", so the value is 2.
  • At index 4 (ABABC): No proper prefix matches the suffix, so the value is 0.

Thus, the prefix function array for "ABABC" is [0, 0, 1, 2, 0].

Here’s Python code to compute the prefix function:

def compute_prefix_function(pattern):
    m = len(pattern)
    prefix = [0] * m
    j = 0  # Length of the previous longest prefix suffix

    for i in range(1, m):
        while j > 0 and pattern[i] != pattern[j]:
            j = prefix[j - 1]

        if pattern[i] == pattern[j]:
            j += 1
        prefix[i] = j

    return prefix

pattern = "ABABC"
print(compute_prefix_function(pattern))  # Output: [0, 0, 1, 2, 0]

Advantages of KMP Algorithm

The KMP algorithm offers several advantages over other string matching techniques:

  • Efficiency: The worst-case time complexity for both preprocessing and search is linear, making it highly efficient for large-scale text processing.
  • No Backtracking: Unlike naive string matching, KMP avoids unnecessary re-comparisons by leveraging the prefix function.
  • Deterministic Behavior: KMP has a predictable performance regardless of the input text or pattern, unlike algorithms like Boyer-Moore, which depend on the structure of the pattern.

Time Complexity of KMP Algorithm

The KMP algorithm is celebrated for its linear time complexity, which can be broken down as follows:

  • Preprocessing Phase: Computing the prefix function takes O(m) time, where m is the length of the pattern.
  • Search Phase: The search phase also runs in O(n) time, where n is the length of the text.

Thus, the overall time complexity of the KMP algorithm is O(n + m). This efficiency makes it particularly suitable for searching patterns in massive texts or repeated searches with the same pattern.

Space Complexity of KMP Algorithm

The space complexity of the KMP algorithm is determined by the storage required for the prefix function array. As this array has a size equal to the length of the pattern, the space complexity is O(m). In addition, the algorithm uses a constant amount of extra space for variables during execution, making it efficient in terms of memory usage.

Summary

The Knuth-Morris-Pratt (KMP) algorithm is a cornerstone in the field of string matching algorithms, offering a highly efficient and deterministic approach to pattern searching. By leveraging the prefix function to eliminate redundant comparisons, KMP achieves a linear time complexity of O(n + m) and a space complexity of O(m). This makes it an excellent choice for real-world applications in text processing, data analytics, and computational biology.

In this article, we explored the workings of the KMP algorithm in detail, covering its preprocessing and search phases, the role of the prefix function, and its computational efficiency. Whether you’re a developer working on text-based software or an enthusiast delving into algorithms, mastering KMP will enhance your problem-solving toolkit for efficient string matching. For further learning, consider consulting resources like "Introduction to Algorithms" by Cormen et al., which provides additional insights into this powerful algorithm.

Last Update: 25 Jan, 2025

Topics:
Algorithms