Searching for multiple strings in packet payloads

This section of our chapter excerpt on network algorithms explains how to use the Aho-Corasick algorithm for searching for multiple strings in one pass and the the classical approach of the Boyer-Moore algorithm.


Solution provider takeaway: Being able to detect suspicious strings in packet payloads is vital for proper intrusion detection in a client's network. This section of our chapter excerpt from the book "Network Security: Know It All" describes how the Aho-Corasick and Boyer-Moore algorithm search for multiple strings and the ways in which they differ.

Download the .pdf of the chapter here.

The first few sections tackle a problem of detecting an attack by searching for suspicious strings in payloads. A large number of attacks can be detected by their use of such strings. For example, packets that attempt to execute the Perl interpreter have perl.exe in their payload. For example, the arachNIDS database of vulnerabilities contains the following description.

An attempt was made to execute perl.exe. If the Perl interpreter is available to Web clients, it can be used to execute arbitrary commands on the Web server. This can be used to break into the server, obtain sensitive information, and potentially compromise the availability of the Web server and the machine it runs on. Many Web server administrators inadvertently place copies of the Perl interpreter into their Web server script directories. If perl is executable from the cgi directory, then an attacker can execute arbitrary commands on the Web server.

This observation has led to a commonly used technique to detect attacks in so-called signature-based intrusion detection systems such as Snort. The idea is that a router or monitor has a set of rules, much like classifiers. However, the Snort rules go beyond classifiers by allowing a 5-tuple rule specifying the type of packet (e.g., port number equal to Web traffic) plus an arbitrary string that can appear anywhere in the packet payload.

Thus the Snort rule for the attempt to execute perl.exe will specify the protocol (TCP) and destination port (80 for Web) as well as the string "perl.exe" occurring anywhere in the payload. If a packet matches this rule, an alert is generated. Snort has 300 such augmented rules, with 300 possible strings to search for.

Early versions of Snort do string search by matching each packet against each Snort rule in turn. For each rule that matches in the classifier part, Snort runs a Boyer-Moore search on the corresponding string, potentially doing several string searches per packet. Since each scan through a packet is expensive, a natural question is: Can one search for all possible strings in one pass through packet?

There are two algorithms that can be used for this purpose: the Aho-Corasick algorithm and a modified algorithm due to Commentz-Walter, which we describe next.

4.1.1 Integrated String Matching Using Aho-Corasick

A trie can be used to search for a string that starts at a known position in a packet. Thus Figure 4.2 contains a trie built on the set of two strings "babar" and "barney"; both are well-known characters in children's literature. The trie is built on characters and not on arbitrary groups of bits. The characters in the text to be searched are used to follow pointers through the trie until a leaf string is found or until failure occurs.

The hard part, however, is looking for strings that can start anywhere in a packet payload. The naivest approach would be to assume the string starts at byte 1 of the payload and then traverses the trie. Then if a failure occurs, one could start again at the top of the trie with the character that starts at byte 2.

However, if packet bytes form several "near misses" with target strings, then for each possible starting position, the search can traverse close to the height of the trie. Thus if the payload has L bytes and the trie has maximum height h, the algorithm can take Lh memory references.

For example, when searching for "babar" in the packet payload shown in Figure 4.2, the algorithm jogs merrily down the trie until it reaches the node corresponding to the second "a" in "babar." At that point the next packet byte is a "b" and not the "r" required to make progress in the trie. The naive approach would be to back up to the start of the trie and start the trie search again from the second byte "a" in the packet.

The Aho-Corasick algorithm builds an alphabetical trie on the set of strings to be searched for. A search for the string "barney" can be found by following the "b" pointer at the root, the "a" pointer at the next node, etc. More interestingly, the trie is augmented with failure pointers that prevent restarting at the top of the trie when failure occurs and a new attempt is made to match, shifted one position to the right.

However, it is not hard to see that backing up to the top is an obvious waste (P1) because the packet bytes examined so far in the search for "babab" have "bab" as a suffix, which is a prefix of "babar." Thus, rather than back up to the top, one can precompute (much as in a grid of tries) a failure pointer corresponding to the failing "b" that allows the search to go directly to the node corresponding to path "bab" in the trie, as shown by the leftmost dotted arc in Figure 4.2 .

Thus rather than have the fifth byte (a "b") lead to a null pointer, as it would in a normal trie, it contains a failure pointer that points back up the trie. Search now proceeds directly from this node using the sixth byte "a" (as opposed to the second byte) and leads after seven bytes to "babar."

Search is easy to do in hardware after the trie is precomputed. This is not hard to believe because the trie with failure pointers essentially forms a state machine. The Aho-Corasick algorithm has some complexity that ensues when one of the search strings, R, is a suffix of another search string, S. However, in the security context this can be avoided by relaxing the specification (P3). One can remove string S from the trie and later check whether the packet matched R or S.

Another concern is the potentially large number of pointers (256) in the Aho-Corasick trie. This can make it difficult to fit a trie for a large set of strings in cache (in software) or in SRAM (in hardware). One alternative is to use, say, Lulea-style encoding to compress the trie nodes.

Integrated Boyer-Moore by shifting a character.

4.1.2 Integrated String Matching Using Boyer-Moore

The famous Boyer-Moore algorithm for single-string matching can be derived by realizing that there is an interesting degree of freedom that can be exploited (P13) in string matching: One can equally well start comparing the text and the target string from the last character as from the first.

Thus in Figure 4.3 the search starts with the fifth character of the packet, a "b," and matches it to the fifth character of, say, "babar" (shown below the packet), an "r." When this fails, one of the heuristics in the Boyer-Moore algorithm is to shift the search template of "babar" two characters to the right to match the rightmost occurrence of "b" in the template.1 Boyer--Moore's claim to fame is that in practice it skips over a large number of characters, unlike, say, the Aho-Corasick algorithm.

To generalize Boyer-Moore to multiple strings, imagine that the algorithm concurrently compares the fifth character in the packet to the fifth character, "e," in the other string, "barney" (shown above the packet). If one were only doing Boyer-Moore with "barney," the "barney" search template would be shifted right by four characters to match the only "b" in barney.

When doing a search for both "barney" and "babar" concurrently, the obvious idea is to shift the search template by the smallest shift proposed by any string being compared for. Thus in this example, we shift the template by two characters and do a comparison next with the seventh character in the packet.

Doing a concurrent comparison with the last character in all the search strings may seem inefficient. This can be taken care of as follows. First, chop off all characters in all search strings beyond L, the shortest search string. Thus in Figure 4.3, L is 5 and "barney" is chopped down to "barne" to align in length with "babar."

There is a second heuristic in Boyer-Moore, but studies have shown that this simple Horspool variation works best in practice.

Having aligned all search string fragments to the same length, now build a trie starting backwards from the last character in the chopped strings. Thus, in the example of Figure 4.3 the root node of the trie would have an "e" pointer pointing toward "barne" and an "r" pointer pointing towards "babar." Thus comparing concurrently requires using only the current packet character to index into the trie node.

On success, the backwards trie keeps being traversed. On failure, the amount to be shifted is precomputed in the failure pointer. Finally, even if a backward search through the trie navigates successfully to a leaf, the fact that the ends may have been chopped off requires an epilogue, in terms of checking that the chopped-off characters also match. For reasonably small sets of strings, this method does better than Aho-Corasick.

The generalized Boyer--Moore was proposed by Commentz-Walter. The application to intrusion detection was proposed concurrently by Coit, Staniford, and McAlerney and Fisk and Varghese. The Fisk implementation has been ported to Snort.

Unfortunately, the performance improvement of using either Aho-Corasick or the integrated Boyer-Moore is minimal, because many real traces have only a few packets that match a large number of strings, enabling the naive method to do well. In fact, the new algorithms add somewhat more overhead due to slightly increased code complexity, which can exhibit cache effects.

While the code as it currently stands needs further improvement, it is clear that at least the Aho-Corasick version does produce a large improvement for worst-case traces, which may be crucial for a hardware implementation. The use of Aho-Corasick and integrated Boyer-Moore can be considered straightforward applications of efficient data structures (P15).

Network Security Algorithms
  4.1 Searching for multiple strings in packet payloads
  4.2 Approximate string matching
  4.3 IP traceback via probabilistic marking
  4.4 IP traceback via logging
  4.5 Detecting worms

About the book
The chapters contain compiled knowledge from recognized industry experts on topics such as the theory and practice of network security technology, analysis and problem-solving techniques and core security concepts.

Printed with permission from Morgan Kaufmann, a division of Elsevier. Copyright 2008. Network Security: Know It All by James Joshi. For more information about this title and other similar books, please visit

Dig Deeper on Managed network services technology