When it works properly, autocorrect and text-completion is one of the most useful communication tools that many of us use on a daily basis. The magic behind these tools are a tree-like structure that is able to find the next full word, or the closest word to the word that was being attempted without using an extravagant amounts of processing power or memory to store all the words in a dictionary. The structure behind all of this is called a Trie.
Assumptions
- Proficiency with <a href="https://www.devmaking.com/learn/data-structures/binary-search-tree/" target="_blank" style="color:inherit;">Binary Search Trees</a>
- Knowledge of <a href="https://www.devmaking.com/learn/algorithms/big-o-notation/" target="_blank" style="color:inherit;">Big-O Notation</a>
- Familiarity with <a href="https://www.devmaking.com/learn/data-structures/hashing/" target="_blank" style="color:inherit;">Hashing</a> is a plus!
What is a Trie Tree?
A trie is an ordered tree data structure, consisting of nodes that usually contain characters. A trie is a kind of key value structure, except instead of each node being a key, there is a chain of nodes starting from the root that make up the key, which is most commonly represented by a string. For instance, inserting the key "hello" with a value of 12 would be five nodes, one for each character, with the last node containing the value of that key.
<div style="width:100%; margin:auto;text-align:center;"> <img src="https://www.devmaking.com/img/topics/datastructures/Trie_01.png" alt="Trie header." style="width:500px;max-width:95%;"> </div>
> Tries are also commonly referred to as radix trees or prefix trees.
The Trie Node
Unlike a binary search tree node that contains references to its right and left sub-trees, a trie node can have n sub-trees, often referred to as children. Basic implementations of Trie node children can be made using arrays if the maximum amount of child nodes is already known, such as the letters in the alphabet. However, this can cause a large amount of unused memory, so representing the children by a linked list or a map can be advantageous.
A trie node also might contain a boolean "flag" such as isKey
, isLeaf
, or isWord
to signal that a certain node is the end of a key.
/* pseudo-code */
class TrieNode {
// The character key of the node (1).
char c;
// Here, we will be using a Map to contain the children.
Map<Character, TrieNode> children;
// A flag that represents if this is the end of a key.
boolean flag;
}
> (1) if we wanted to be space conservative, we could actually do without holding the character value in each trie node by only having it stored as a key in the parent node's map!
Conceptualizing Tries
Imagine a program that checks for misspelled words in a string of text. To figure out of a word is misspelled, you could check through all the words in a dictionary, and if the word doesn't exist, then it can be marked as being incorrect. In theory, this would work, however checking every word in the dictionary is a very impractical measure to take for each word in a sentence.
If you recall a nice property of balanced binary trees; every time you explore a subtree, you are effectively splitting the number of things that you need to look at in half from the last iteration.
Being that Tries are a tree-like structure, they can also experience some of that property as well. Let's say we implement the dictionary as a trie instead of a list, and the first version of our dictionary contains only 4 words: "Hello", "Friends", "Want", and "Fries".
Our dictionary now looks like this:
<div style="width:100%; margin:auto;text-align:center;"> <img src="https://www.devmaking.com/img/topics/datastructures/Trie_02.png" alt="trie dictionary example." style="width:500px;max-width:95%;"> </div>
To test out our program with the dictionary, we insert the sentence "Hello friends, want fres?" and our program processes each word. First, it checks that "hello" is in the dictionary; starting from the root node, it checks for "H" in the children nodes. From there it checks that nodes children for "E", then "L", and so on until it reaches the last character. Upon reaching this last letter, it checks to see if the node's variable isWord
is set to true. This checks out, so we can proceed to the next word.
The program continues checking the words until it reaches "fres". Looking at this, we can tell it's supposed to say "fries", however the program doesn't know that yet. So starting at the root node, the program traverses to the "F" node, then finds the "R" node. Now that it's here, it will look for the next character in that word, "E". The only problem is that "R" doesn't have a node "E" that comes after it! Knowing this, the program is able to deduce that the word must be misspelled, and lets the user know this is the case. With a message
"fres is not a word!"
.
As the dictionary fills up with more words, more sentences can be inserted and correctly evaluated for correct spelling. For some bonus points, you can try to think out how the program might then be able to suggest a correct word when it finds a misspelling.
> How much time with respect to the length of the sentence does the program take to check all the words?
Searching Tries
The dictionary example covered above is an excellent example of a real world use case of searching through a trie. To recap how this was done, we started at the root
trie node, then searched the node's children to find the first letter in the word. If it was found, then we move to that node, and search its children until the next letter in the word is found. This continues until the end of the word is reached, or until there is no node with the next character in the word.
To model this in pseudocode:
/* pseudo-code */
// The root of the Trie structure.
TrieNode root;
function contains(String word){
// Creating a pointer to traverse the tree.
TrieNode tmp = root;
// Traverse the length of the word.
for (character ch in word){
// For each letter, check that the next exists.
if( tmp.children contains ch ){
// if it does, go to the next node.
tmp = tmp.children.get(ch);
}
else {
// Otherwise we know it is not in the trie.
return false;
}
}
/* if we've made it here, we know the trie contains the letters
but we still need to see if that is actually a valid key.
checking the flag solves this. */
return tmp.isWord;
}
> Notice that we don't check the character of the root node? This is because the root node doesn't have a value by design; setting it to a character would mean all words would have to start with the same letter. Leaving it blank allows all the children to be the first letter of their respective words.
Trie Insertion
Inserting new keys into a trie follows the same pattern of logic as searching for a key in a trie. Starting from the root node, the tree is traversed until it reaches either a dead end or ends on an already existing node.
In the case that there is an already existing node as part of another key —if we had "hello" in the trie and wanted to put "he"— we would set the node's flag
to true
.
In the other case where we hit a dead end, we would continuously add new nodes and child nodes until the entire word has been entered into the trie.
/* pseudo-code */
TrieNode root;
function put(String word){
Node tmp = root;
// Iterate through the word.
for (character ch in word){
// If the children contain the next letter..
if( tmp.children contains ch ){
// .. Iterate to that child node.
tmp = tmp.children.get(ch);
}
// Otherwise ..
else {
// Create a new node..
TrieNode newNode = new TrieNode();
// Remember Map<Character, TrieNode>
tmp.children.put(ch, newNode);
// .. And iterate to that node.
tmp = tmp.children.get(ch);
}
}
// Now that we're at the last letter, set its flag to true.
tmp.isWord = true;
}
Trie Deletion
When it comes to trees, inserting and checking for elements are usually straightforward at the cost of removals being a bit tricky. When it comes to removing keys from a trie, there are a few scenarios that you might find:
- The key to be removed doesn't exist
- The key to be removed has children nodes
- The key exists and there are no children nodes from the key to be removed.
These three cases cover a majority of the complexity involved when removing from a trie. To tackle the first case is pretty obvious; if, while traversing to the end of the key, we find that it doesn't exist, we exit the function. In the next case, if the key has child nodes, then we mark isWord
to false, and exit the function. Lastly, if the key does not have any child nodes, we go back up the tree, removing nodes as we go until we reach a node that is flagged isWord
. Since there is no way to go to the parent node as is, we need to perform this recursively.
<div style="width:100%; margin:auto;text-align:center;"> <img src="https://www.devmaking.com/img/topics/datastructures/Trie_03.png" alt="removing help from a trie." style="width:500px;max-width:95%;"> </div>
/* pseudo-code */
TrieNode root;
function remove(String word){
remove_aux(root, word, 0);
}
function remove_aux(TrieNode current, String word, int depth){
// Base case check.
if( current is null ) return null;
// If we are at the word length.
if( depth equals word.length ){
current.isWord = false;
}
// (1)
else {
char ch = word.charAt(depth);
// (2)
TrieNode childNode = remove(current.children.get(ch), word, depth + 1);
// (3)
if( childNode is null ){
current.children.remove(ch);
}
}
// (4)
if( current.children is not empty ){
return current;
}
else return null;
}
This code might not be as intuitive to reason as with insert and search, however this code is able to handle all three cases gracefully. Analyzing the notes left in the code comments:
- In this else statement, it will always trigger if the key has not yet been reached.
- Making the recursive call here covers our first case; when the word does not exist. If the next character of the word does not exist, the call
current.children.get(ch)
will be null. WhenchildNode
is null, then the function will exit at the bottom if block. - This third step may seem confusing; if the child node is null, then why bother trying to remove something that isn't there? To find out why the code is this way, think about case 3, when the word length is reached, the code will jump down to the comment labeled
(4)
. Upon realizing that the list of child elements is empty, it will return null. Backing up to the previous recursive call, thechildNode
will subsequently be null as well. This means that the last letter will be removed from the trie. This can best be realized if you draw out the problem and follow the code! - As touched on, if the current node has children (after the recursive calls have been made), then we do not want to delete any of those elements as that would destroy parts of the tree that we want to remain in tact!
When to use a Trie
Tries are efficient data structures to use, especially when storing keys that are strings. They can also be used to store other data types such as floating point numbers. With a skilled hand, one could also figure out how to store other data types in a trie. Some of the practical use cases of tries are:
- Dictionaries: storing words as keys and the definitions of the words as the key's values.
- Misspellings: if a word is not within a trie that is serving as a dictionary, we can know very quickly if it is or is not a word.
- Autocorrect: if a word is not in the trie, you can back up from its most recent letter and recommend a word similar to what they were looking for.
- A phone book: with each phone number as a key, the value can be the holder's contact information.