

In a previous Damn Cool Algorithms post, I talked about BK-trees, a clever indexing structure that makes it possible to search for fuzzy matches on a text string based on Levenshtein distance - or any other metric that obeys the triangle inequality. Today, I'm going to describe an alternative approach, which makes it possible to do fuzzy text search in a regular index: Levenshtein automata.

The basic insight behind Levenshtein automata is that it's possible to construct a finite state automaton that recognizes exactly the set of strings within a given Levenshtein distance of a target word. We can then feed in any word, and the automaton will accept or reject it based on whether the Levenshtein distance to the target word is at most the distance specified when we constructed the automaton. Further, due to the nature of FSAs, it will do so in O(n) time in the length of the string being tested. Compare this to the standard dynamic programming Levenshtein algorithm, which takes O(mn) time, where m and n are the lengths of the two input words! It's thus immediately apparent that Levenshtein automata provide, at a minimum, a faster way for us to check many words against a single target word and maximum distance - not a bad improvement to start with!
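For reference, here is a minimal sketch of the dynamic programming algorithm being compared against; checking a single pair of words this way costs O(mn) time, and it is this per-candidate cost that the automaton avoids:

```python
def levenshtein(a, b):
    # Standard O(m*n) dynamic programming edit distance, computed row by row.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution or match
        prev = cur
    return prev[-1]
```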
Of course, if that were the only benefit of Levenshtein automata, this would be a short article. There's much more to come, but first let's see what a Levenshtein automaton looks like, and how we can build one.

The diagram on the right shows the NFA for a Levenshtein automaton for the word 'food', with maximum edit distance 2. As you can see, it's very regular, and the construction is fairly straightforward. States are named using an n^e style notation, where n is the number of characters consumed so far, and e is the number of errors. Horizontal transitions represent unmodified characters, vertical transitions represent insertions, and the two types of diagonal transitions represent substitutions (the ones marked with a *) and deletions, respectively.

Let's see how we can construct an NFA such as this given an input word and a maximum edit distance. I won't include the source for the NFA class here, since it's fairly standard; for the gory details, see the Gist. Here's the relevant function in Python:
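What follows is a sketch of that function, assuming the NFA class from the Gist exposes add_transition(src, input, dest), add_final_state(state), and the special input markers NFA.ANY and NFA.EPSILON:

```python
def levenshtein_automata(term, k):
    # A state (i, e) means: i characters of term consumed, e errors so far.
    nfa = NFA((0, 0))
    for i, c in enumerate(term):
        for e in range(k + 1):
            # Correct character: consume it without adding an error.
            nfa.add_transition((i, e), c, (i + 1, e))
            if e < k:
                # Extra character in the candidate: consume input, stay at i.
                nfa.add_transition((i, e), NFA.ANY, (i, e + 1))
                # Missing character: skip term[i] without consuming input.
                nfa.add_transition((i, e), NFA.EPSILON, (i + 1, e + 1))
                # Substitution: consume one input character in place of term[i].
                nfa.add_transition((i, e), NFA.ANY, (i + 1, e + 1))
    for e in range(k + 1):
        if e < k:
            # Allow trailing extra characters once the whole term is matched.
            nfa.add_transition((len(term), e), NFA.ANY, (len(term), e + 1))
        # Accept in any state where the whole term has been consumed.
        nfa.add_final_state((len(term), e))
    return nfa
```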

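To exercise the construction without pulling in the Gist, here is a minimal, hypothetical stand-in for the NFA class with the same interface (this is not the Gist's class, just enough to run an example), plus a direct simulation of the automaton built by levenshtein_automata above on a few candidate words:

```python
class NFA:
    EPSILON = object()   # consume no input
    ANY = object()       # match any single input character

    def __init__(self, start_state):
        self.start_state = start_state
        self.transitions = {}        # state -> {label: set of next states}
        self.final_states = set()

    def add_transition(self, src, label, dest):
        self.transitions.setdefault(src, {}).setdefault(label, set()).add(dest)

    def add_final_state(self, state):
        self.final_states.add(state)

    def _expand(self, states):
        # Epsilon closure of a set of states.
        frontier = list(states)
        while frontier:
            for dest in self.transitions.get(frontier.pop(), {}).get(NFA.EPSILON, ()):
                if dest not in states:
                    states.add(dest)
                    frontier.append(dest)
        return states

    def accepts(self, string):
        # Straightforward NFA simulation: track the set of live states.
        states = self._expand({self.start_state})
        for c in string:
            next_states = set()
            for state in states:
                table = self.transitions.get(state, {})
                next_states |= table.get(c, set()) | table.get(NFA.ANY, set())
            states = self._expand(next_states)
        return bool(states & self.final_states)


lev = levenshtein_automata('food', 2)   # the automaton from the diagram
for word in ['food', 'good', 'flood', 'fod', 'fooood', 'fxxd', 'xxxx']:
    # Prints True exactly for the words within edit distance 2 of 'food'.
    print(word, lev.accepts(word))
```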
Working perfectly! Finding the 23 fuzzy matches for 'nice' in the dictionary of nearly 235,000 words required 142 probes (len(_) is 23 and m.probes is 142 in the interpreter session). Note that if we assume an alphabet of 26 characters, there are 4+26*4+26*5=238 strings within Levenshtein distance 1 of 'nice' (4 deletions, 26*4 substitutions and 26*5 insertions), so we've made a reasonable saving over exhaustively testing all of them. With larger alphabets, longer strings, or larger edit distances, this saving should be more pronounced. We could also use a couple of subsets of the dictionary, such as words10, for testing how things change with scale.

It may be instructive to see how the number of probes varies as a function of word length and dictionary size, by testing it with a variety of inputs:

[Table: string length, 'max strings', and the number of probes for the small, med, and full dictionaries]

In this table, 'max strings' is the total number of strings within edit distance one of the input string, and the values for small, med and full dict are the number of probes required to search the three dictionaries (consisting of 1%, 10% and 100% of the web2 dictionary, respectively). The sample input strings consisted of prefixes of the word 'abracadabra'; all the following rows, at least up to 10 characters, required a similar number of probes as row 5.
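The 'max strings' column presumably follows the same counting argument used for 'nice' above; here is a quick sketch of that arithmetic (the probe counts themselves depend on the dictionary index, so they are not recomputed here):

```python
def max_strings(length, alphabet_size=26):
    # deletions + substitutions + insertions for a word of the given length,
    # the same arithmetic as 4 + 26*4 + 26*5 = 238 for the four-letter 'nice'
    return length + alphabet_size * length + alphabet_size * (length + 1)

# Prefixes of 'abracadabra', as used for the table's sample inputs.
for length in range(1, 6):
    print(length, 'abracadabra'[:length], max_strings(length))
```

For a four-character input this gives the 238 figure quoted earlier.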
