Let’s code it: Sentence/word generations
Are you planning to create a name generator or a sentence generator? Generate strings using Markov chains could be the easy solution.
Let’s say that we have the following sentences:
- I really like coding
- I really love chatting
Below you can see the chain created splitting the sentence using 3 characters and adding weights for the number of repeated links (the underscore character is used instead of the space character just for clarity).
Note: the # character is used to mark the end of the string, this will be clear in the implementation.
Now with the chain created we can create new sentences:
- I really like chatting
- I really love coding
Actual implementation
The current implementation is slightly different from the diagram above. This is the chain generated from it:
Note: a list was used instead of the graph because it was confusing, the first elements in the list are the values extracted from the the sentences, the second ones are the links between the first and the next elements.
chain: {
" ch": {
"cha": 1.0,
},
" co": {
"cod": 1.0,
},
" li": {
"lik": 1.0,
},
" lo": {
"lov": 1.0,
},
" re": {
"rea": 2.0,
},
"I r": {
" re": 2.0,
},
"all": {
"lly": 2.0,
},
"att": {
"tti": 1.0,
},
"cha": {
"hat": 1.0,
},
"cod": {
"odi": 1.0,
},
"din": {
"ing": 1.0,
},
"e c": {
" ch": 1.0,
" co": 1.0,
},
"eal": {
"all": 2.0,
},
"hat": {
"att": 1.0,
},
"ike": {
"ke ": 1.0,
},
"ing": {
"ng#": 2.0,
},
"ke ": {
"e c": 1.0,
},
"lik": {
"ike": 1.0,
},
"lly": {
"ly ": 2.0,
},
"lov": {
"ove": 1.0,
},
"ly ": {
"y l": 2.0,
},
"odi": {
"din": 1.0,
},
"ove": {
"ve ": 1.0,
},
"rea": {
"eal": 2.0,
},
"tin": {
"ing": 1.0,
},
"tti": {
"tin": 1.0,
},
"ve ": {
"e c": 1.0,
},
"y l": {
" li": 1.0,
" lo": 1.0,
},
},
As you can see the substrings are generated chaning two of the previous characters and adding a new one instead of using the next n new characters, this will allow a much better end result.
The code
This is the base struct to collect the data for the Markov chain
It’s divided in two fields starts
that contains all the values for the starting input strings and chain
that contains all the chains. starts
is useful so we can start from sensible rather than random values.
Next we need to fill the data structure using the following algorithm:
Note: the implementation uses BTreeMap because in my cases the creation of the chain happens at the start of the application. If you need dynamic chains a HashMap it’s probably better because of the sorting nature of the BTreeMap, you need to balance between creations and sentence generation.
After all the values are collected we can generate a value with the following method:
Note: The values generated are not perfect but tweaking the input dataset and checking the results could make it a lot better than the plain version.
See you next time!