Sometimes we are in the position in which we want to search quickly if a word is inside a set, or in this case a dictionary, of defined words. Tries could be the easy and fast way to achieve this.
What is a trie? A trie is a efficient data structure, basically a tree, where every key is split into characters and placed inside the tree in a specific way. Here an example, let say that our dictionary is composed by these words:
at
test
the
The trie constructed with them is:
So for each character we create a node and connect that with the next character, and at the end place a “ending node”. If the character is already present we create just a link with the next character or just descend the next node if already present. The End node is needed to understand if a word is present or not; without we could say that the word test is inside the trie when in reality we only put the word tests.
Building blocks
The struct
The definition of the trie is the following:
A Node is just an enum that represent a node inside the trie. We have only two values:
Value that contains the single byte;
End that represent the end of a sequence of bytes, this is to mark the end of a word inside the trie.
The trie struct is using an ArenaTree to make it as compact as possible in memory. The other two values (min and max) are used to optimise the search process, checking if the input value could be inside the trie, a smaller or bigger key, lengthwise, it’s not possible to be inside the trie.
Note in our case we use bytes to handle utf-8 characters in an easier way
Add a key
Adding each byte of the key is easy:
Navigate the tree;
If the Value is not present create it otherwise move to the next;
After all the bytes are added to the trie add the End node as a child of the last byte.
Search for a key
Searching for a key is even easier than the previous function:
Check if the key is inside the size of the values added in the trie, if not it means that the key is not part of the dictionary;
If it is, navigate the tree bytewise;
If all the bytes are present as well as the End node the key has been found.
Note the code duplication is left in place on purpose because makes the code faster and is not so much that could affect development time in the future