Let’s code it: Sentence/word generations

Are you planning to create a name generator or a sentence generator? Generate strings using Markov chains could be the easy solution.

Let’s say that we have the following sentences:

  • I really like coding
  • I really love chatting

Below you can see the chain created splitting the sentence using 3 characters and adding weights for the number of repeated links (the underscore character is used instead of the space character just for clarity).

Markov chain

Note: the # character is used to mark the end of the string, this will be clear in the implementation.

Now with the chain created we can create new sentences:

  • I really like chatting
  • I really love coding

Actual implementation

The current implementation is slightly different from the diagram above. This is the chain generated from it:

Note: a list was used instead of the graph because it was confusing, the first elements in the list are the values extracted from the the sentences, the second ones are the links between the first and the next elements.

chain: {
    " ch": {
        "cha": 1.0,
    },
    " co": {
        "cod": 1.0,
    },
    " li": {
        "lik": 1.0,
    },
    " lo": {
        "lov": 1.0,
    },
    " re": {
        "rea": 2.0,
    },
    "I r": {
        " re": 2.0,
    },
    "all": {
        "lly": 2.0,
    },
    "att": {
        "tti": 1.0,
    },
    "cha": {
        "hat": 1.0,
    },
    "cod": {
        "odi": 1.0,
    },
    "din": {
        "ing": 1.0,
    },
    "e c": {
        " ch": 1.0,
        " co": 1.0,
    },
    "eal": {
        "all": 2.0,
    },
    "hat": {
        "att": 1.0,
    },
    "ike": {
        "ke ": 1.0,
    },
    "ing": {
        "ng#": 2.0,
    },
    "ke ": {
        "e c": 1.0,
    },
    "lik": {
        "ike": 1.0,
    },
    "lly": {
        "ly ": 2.0,
    },
    "lov": {
        "ove": 1.0,
    },
    "ly ": {
        "y l": 2.0,
    },
    "odi": {
        "din": 1.0,
    },
    "ove": {
        "ve ": 1.0,
    },
    "rea": {
        "eal": 2.0,
    },
    "tin": {
        "ing": 1.0,
    },
    "tti": {
        "tin": 1.0,
    },
    "ve ": {
        "e c": 1.0,
    },
    "y l": {
        " li": 1.0,
        " lo": 1.0,
    },
},

As you can see the substrings are generated chaning two of the previous characters and adding a new one instead of using the next n new characters, this will allow a much better end result.

The code

This is the base struct to collect the data for the Markov chain

pub struct TextGenerator {
    starts: BTreeMap<String, f32>,
    chain: BTreeMap<String, BTreeMap<String, f32>>,
}

It’s divided in two fields starts that contains all the values for the starting input strings and chain that contains all the chains. starts is useful so we can start from sensible rather than random values.

Next we need to fill the data structure using the following algorithm:

pub fn new(sentences: &[String], n: u32) -> Self {
    let mut starts = BTreeMap::new();
    let mut chain: BTreeMap<String, BTreeMap<String, f32>> = BTreeMap::new();
    let n = n as usize;
    for sentence in sentences {
        let mut word_terminated = sentence.to_owned();
        // Append the end mark character to the chain
        word_terminated.push('#');
        let chars = word_terminated.chars().collect::<Vec<_>>();
        // This is needed because we are using group of n characters
        for i in 0..chars.len() - n {
            // Extract the current group of n characters
            let tuple: String = chars[i..i + n].iter().collect();
            // If it's the first iteration, the start of the string
            if i == 0 {
                // Search inside the starts map if the value is already there
                // get it, otherwise add default (0.0) and add 1 anyway
                *starts.entry(tuple.clone()).or_default() += 1.0;
            }

            // Extract the next group of n characters
            let next: String = chars[i + 1..=i + n].iter().collect();
            // Search inside the chains map, if present get the tuple value from it,
            // otherwise add a new empty map, in both cases repeat the same process for
            // the starts this will link the tuple with the next value
            *chain.entry(tuple).or_default().entry(next).or_default() += 1.0;
        }
    }
    Self { starts, chain }
}

Note: the implementation uses BTreeMap because in my cases the creation of the chain happens at the start of the application. If you need dynamic chains a HashMap it’s probably better because of the sorting nature of the BTreeMap, you need to balance between creations and sentence generation.

After all the values are collected we can generate a value with the following method:

pub fn generate(&self, random: &mut StdRng) -> String {
    // Select a tuple from the starts to initiate the generation
    let mut tuple = TextGenerator::select_random_item(&self.starts, random);
    let mut result = String::with_capacity(20);
    // Add the tuple to the result
    result.push_str(&tuple);

    // While there is a valid tuple to extract let's continue
    while let Some(is) = self.chain.get(tuple) {
        /*
        Select a new tuple from the chain we selected.
        For example if we have started from the sentences "I li" and "I lo"
        the chain extracted will be like this
        "I l": {
            " li": 1.0,
            " lo": 1.0,
        },
        because the repetition (1.0) is the same we have 50% to select one of them
        (we use a random value to avoid this)
        */
        tuple = TextGenerator::select_random_item(is, random);
        // If the tuple ends with # means that we have reached the end of the generated
        // sentence and we can break
        if tuple.ends_with('#') {
            break;
        } else {
            // Otherwise get the last character and append that into the result.
            // Other than the start we are building the generated sentence one character at a time
            let last_character = tuple.chars().next_back().unwrap();
            result.push(last_character);
        }
    }
    result
}

fn select_random_item<'a>(items: &'a BTreeMap<String, f32>, random: &mut StdRng) -> &'a String {
    // This will generate a random number between 0.0 and 1.0 and multiply that with the sum of all the
    // sub-chain values, allowing us to vary the possibility of getting only the highest ones
    let mut random_value = random.gen::<f32>() * items.iter().map(|(_, v)| v).count() as f32;
    for (k, v) in items {
        // Decrease the current item from the random value
        random_value -= *v as f32;
        // To see if the value is the correct one
        if random_value < 0.0 {
            return k;
        }
    }
    panic!("No item found");
}

Note: The values generated are not perfect but tweaking the input dataset and checking the results could make it a lot better than the plain version.

See you next time!

Updated: