Let’s code it: Fast dictionary search

Sometimes we are in the position in which we want to search quickly if a word is inside a set, or in this case a dictionary, of defined words. Tries could be the easy and fast way to achieve this.

What is a trie? A trie is a efficient data structure, basically a tree, where every key is split into characters and placed inside the tree in a specific way. Here an example, let say that our dictionary is composed by these words:

  • at
  • test
  • the

The trie constructed with them is:

Trie

So for each character we create a node and connect that with the next character, and at the end place a “ending node”. If the character is already present we create just a link with the next character or just descend the next node if already present. The End node is needed to understand if a word is present or not; without we could say that the word test is inside the trie when in reality we only put the word tests.

Building blocks

The struct

The definition of the trie is the following:

pub enum Node<T>
where
    T: Display,
{
    Value(T),
    End,
}

pub struct Trie {
    root: ArenaTree<Node<u8>>,
    min: usize,
    max: usize,
}

impl Trie {
    pub fn new() -> Self {
        let mut root = ArenaTree::new();
        // Adds the first node for the root, this will not be considered in the search
        root.add_child(None, Node::Value(0)).unwrap();
        Self {
            root,
            // Min needs to be the maximum value we can have
            min: usize::MAX,
            // Max needs to be the minimum value we can have
            max: 0,
        }
    }
    }
}

A Node is just an enum that represent a node inside the trie. We have only two values:

  • Value that contains the single byte;
  • End that represent the end of a sequence of bytes, this is to mark the end of a word inside the trie.

The trie struct is using an ArenaTree to make it as compact as possible in memory. The other two values (min and max) are used to optimise the search process, checking if the input value could be inside the trie, a smaller or bigger key, lengthwise, it’s not possible to be inside the trie.

Note in our case we use bytes to handle utf-8 characters in an easier way

Add a key

Adding each byte of the key is easy:

  1. Navigate the tree;
  2. If the Value is not present create it otherwise move to the next;
  3. After all the bytes are added to the trie add the End node as a child of the last byte.
pub fn add_key(&mut self, key: &[u8]) {
    // Checks if the length of the new key is smaller then the one already saved
    if self.min > key.len() {
        self.min = key.len();
    }
    // Checks if the length of the new key is bigger then the one already saved
    if self.max < key.len() {
        self.max = key.len();
    }
    // Starts with the root node, this has been forced in the construction process
    let mut node = 0;
    // For each byte inside the key
    for b in key {
        let mut found = false;
        // Starts with the first child in the children list
        let mut child = self.root[node].first_child;
        // Until there is a child
        while let Some(c) = child {
            // Checks if the node is a Value node
            match self.root[c].data {
                Node::Value(v) => {
                    // If the byte is the same that the one inside the Value node
                    if v == *b {
                        // Move the node to the found child
                        node = c;
                        found = true;
                        // Stop the loop a node for the current byte has been found
                        break;
                    }
                }
                _ => {}
            }
            // Moves to the next child
            child = self.root[c].next_node;
        }

        // If the node has not been found a new node needs to be created as a child of the previous found node
        if !found {
            node = self.root.add_child(Some(node), Node::Value(*b)).unwrap();
        }
    }

    // Marks the sequence with an End node
    let mut found = false;
    // Starts with the first child in the last node of the sequence
    let mut child = self.root[node].first_child;
    // Until there is a child
    while let Some(c) = child {
        // Checks if there is an ending node
        match self.root[c].data {
            Node::End => {
                found = true;
                // Stops the loop for the current child, a End node has been found
                break;
            }
            _ => {}
        }
        // Moves to the next child
        child = self.root[c].next_node;
    }

    // If the End node has not been found a new one needs be created as a child of the previous found node
    if !found {
        self.root.add_child(Some(node), Node::End).unwrap();
    }
}

Search for a key

Searching for a key is even easier than the previous function:

  1. Check if the key is inside the size of the values added in the trie, if not it means that the key is not part of the dictionary;
  2. If it is, navigate the tree bytewise;
  3. If all the bytes are present as well as the End node the key has been found.
pub fn is_key_present(&self, key: &[u8]) -> bool {
    // If the key is outside of the minimum and maximum values contained in the data structure
    if key.len() < self.min || key.len() > self.max {
        // Returns that is not present
        return false;
    }
    // Starts with the root node, this has been forced in the construction process
    let mut node = 0;
    // For each byte inside the key
    for b in key {
        let mut found = false;
        // Starts with the first child in the children list
        let mut child = self.root[node].first_child;
        // Checks if the node is a Value node
        while let Some(c) = child {
            match self.root[c].data {
                Node::Value(v) => {
                    // If the byte is the same that the one inside the Value node
                    if v == *b {
                        // Move the node to the found child
                        node = c;
                        found = true;
                        // Stop the loop a node for the current byte has been found
                        break;
                    }
                }
                _ => {}
            }
            // Moves to the next child
            child = self.root[c].next_node;
        }

        // If not found it means that the byte is not present inside the data structure
        if !found {
            return false;
        }
    }

    
    // Searches for the End node in the sequence
    let mut found = false;
    // Starts with the first child in the last node of the sequence
    let mut child = self.root[node].first_child;
    // Until there is a child
    while let Some(c) = child {
        // Checks if there is an ending node
        match self.root[c].data {
            Node::End => {
                found = true;
                // Stops the loop for the current child, a End node has been found
                break;
            }
            _ => {}
        }
        // Moves to the next child
        child = self.root[c].next_node;
    }

    // Returns the status of the flag, this should be true only if all bytes and an End node have been found
    found
}

Note the code duplication is left in place on purpose because makes the code faster and is not so much that could affect development time in the future

See you next time!

Updated: