Let’s code it: Sentence/word generations
Are you planning to create a name generator or a sentence generator? Generate strings using Markov chains could be the easy solution.
Let’s say that we have the following sentences:
- I really like coding
- I really love chatting
Below you can see the chain created splitting the sentence using 3 characters and adding weights for the number of repeated links (the underscore character is used instead of the space character just for clarity).
Note: the # character is used to mark the end of the string, this will be clear in the implementation.
Now with the chain created we can create new sentences:
- I really like chatting
- I really love coding
Actual implementation
The current implementation is slightly different from the diagram above. This is the chain generated from it:
Note: a list was used instead of the graph because it was confusing, the first elements in the list are the values extracted from the the sentences, the second ones are the links between the first and the next elements.
chain: {
" ch": {
"cha": 1.0,
},
" co": {
"cod": 1.0,
},
" li": {
"lik": 1.0,
},
" lo": {
"lov": 1.0,
},
" re": {
"rea": 2.0,
},
"I r": {
" re": 2.0,
},
"all": {
"lly": 2.0,
},
"att": {
"tti": 1.0,
},
"cha": {
"hat": 1.0,
},
"cod": {
"odi": 1.0,
},
"din": {
"ing": 1.0,
},
"e c": {
" ch": 1.0,
" co": 1.0,
},
"eal": {
"all": 2.0,
},
"hat": {
"att": 1.0,
},
"ike": {
"ke ": 1.0,
},
"ing": {
"ng#": 2.0,
},
"ke ": {
"e c": 1.0,
},
"lik": {
"ike": 1.0,
},
"lly": {
"ly ": 2.0,
},
"lov": {
"ove": 1.0,
},
"ly ": {
"y l": 2.0,
},
"odi": {
"din": 1.0,
},
"ove": {
"ve ": 1.0,
},
"rea": {
"eal": 2.0,
},
"tin": {
"ing": 1.0,
},
"tti": {
"tin": 1.0,
},
"ve ": {
"e c": 1.0,
},
"y l": {
" li": 1.0,
" lo": 1.0,
},
},
As you can see the substrings are generated chaning two of the previous characters and adding a new one instead of using the next n new characters, this will allow a much better end result.
The code
This is the base struct to collect the data for the Markov chain
pub struct TextGenerator {
starts: BTreeMap<String, f32>,
chain: BTreeMap<String, BTreeMap<String, f32>>,
}
It’s divided in two fields starts
that contains all the values for the starting input strings and chain
that contains all the chains. starts
is useful so we can start from sensible rather than random values.
Next we need to fill the data structure using the following algorithm:
pub fn new(sentences: &[String], n: u32) -> Self {
let mut starts = BTreeMap::new();
let mut chain: BTreeMap<String, BTreeMap<String, f32>> = BTreeMap::new();
let n = n as usize;
for sentence in sentences {
let mut word_terminated = sentence.to_owned();
// Append the end mark character to the chain
word_terminated.push('#');
let chars = word_terminated.chars().collect::<Vec<_>>();
// This is needed because we are using group of n characters
for i in 0..chars.len() - n {
// Extract the current group of n characters
let tuple: String = chars[i..i + n].iter().collect();
// If it's the first iteration, the start of the string
if i == 0 {
// Search inside the starts map if the value is already there
// get it, otherwise add default (0.0) and add 1 anyway
*starts.entry(tuple.clone()).or_default() += 1.0;
}
// Extract the next group of n characters
let next: String = chars[i + 1..=i + n].iter().collect();
// Search inside the chains map, if present get the tuple value from it,
// otherwise add a new empty map, in both cases repeat the same process for
// the starts this will link the tuple with the next value
*chain.entry(tuple).or_default().entry(next).or_default() += 1.0;
}
}
Self { starts, chain }
}
Note: the implementation uses BTreeMap because in my cases the creation of the chain happens at the start of the application. If you need dynamic chains a HashMap it’s probably better because of the sorting nature of the BTreeMap, you need to balance between creations and sentence generation.
After all the values are collected we can generate a value with the following method:
pub fn generate(&self, random: &mut StdRng) -> String {
// Select a tuple from the starts to initiate the generation
let mut tuple = TextGenerator::select_random_item(&self.starts, random);
let mut result = String::with_capacity(20);
// Add the tuple to the result
result.push_str(&tuple);
// While there is a valid tuple to extract let's continue
while let Some(is) = self.chain.get(tuple) {
/*
Select a new tuple from the chain we selected.
For example if we have started from the sentences "I li" and "I lo"
the chain extracted will be like this
"I l": {
" li": 1.0,
" lo": 1.0,
},
because the repetition (1.0) is the same we have 50% to select one of them
(we use a random value to avoid this)
*/
tuple = TextGenerator::select_random_item(is, random);
// If the tuple ends with # means that we have reached the end of the generated
// sentence and we can break
if tuple.ends_with('#') {
break;
} else {
// Otherwise get the last character and append that into the result.
// Other than the start we are building the generated sentence one character at a time
let last_character = tuple.chars().next_back().unwrap();
result.push(last_character);
}
}
result
}
fn select_random_item<'a>(items: &'a BTreeMap<String, f32>, random: &mut StdRng) -> &'a String {
// This will generate a random number between 0.0 and 1.0 and multiply that with the sum of all the
// sub-chain values, allowing us to vary the possibility of getting only the highest ones
let mut random_value = random.gen::<f32>() * items.iter().map(|(_, v)| v).count() as f32;
for (k, v) in items {
// Decrease the current item from the random value
random_value -= *v as f32;
// To see if the value is the correct one
if random_value < 0.0 {
return k;
}
}
panic!("No item found");
}
Note: The values generated are not perfect but tweaking the input dataset and checking the results could make it a lot better than the plain version.
See you next time!