Russian version is here.
Let’s say we had an idea to create a JS script that would generate random words (nicknames).
Let’s start with the simplest approach first. If we just generate random letters and make words using these letters, they will look unnatural and unsightly. Here are some examples:
- srjxdq
- moyssj
- ywtckmw
- wjvzw
- xtwey
etc.
As you can see, this approach does not allow us to generate words that even remotely resemble natural ones – at the output, we simply generate a set of meaningless letters that looks more like passwords than words. Here are two points how to make the generated random words look more natural (in my opinion):
- Avoid occurrences of more than two vowels/consonants when generating a word. This problem is trivial and I will not consider it.
- Pick random letters for a word based on their weight. Weights in this case will be the frequency of letters in English. This way we have to reduce/increase the chance that a certain letter will end up in our generated word, and rarely used letters such as Q, Z and X will appear in our words much less often than E, T, A, O , I, which are statistically the most frequent in English words.
Using just these two approaches, we generate much more “natural” words. Here are some examples:
Now it’s much better. Let’s analyze the 2nd point in more detail.
Algorithm for selecting random array elements based on weights in JS
A relatively simple implementation of such an algorithm is the transformation of a series of rational numbers s1 (array), which are weights for elements, into a series of numbers s2, which is obtained by cumulative addition of numbers:
const items = [ 'a', 'b', 'c' ];
const weights = [ 3, 7, 1 ];
Prepare an array of weights via cumulative addition (i.e. a cumulativeWeights list that will have the same number of elements as the original weights list of weights). In our case, such an array will look like this:
cumulativeWeights = [3, 3 + 7, 3 + 7 + 1] = [3, 10, 11]
We now generate a randomNumber from 0 to the highest cumulative weight value. In our case, the random number will be in the range [0..11]. Let’s say randomNumber = 8.
Loop through the cumulativeWeights array from left to right and select the first element that is greater than or equal to randomNumber. We will use the index of such an element to select an element from an array of elements.
The idea behind this approach is that higher weights will “occupy” more numerical space. Therefore, there is a higher probability that the random number will end up in the “number bucket” with a higher weight.
I’ll try to demonstrate this using the example of my script:
const weights = [3, 7, 1 ];
const cumulativeWeights = [3, 10, 11];
// In a pseudo-view, we can represent cumulativeWeights like this:
const pseudoCumulativeWeights = [ 1, 2, 3, // <-- [3] numbers
4, 5, 6, 7, 8, 9, 10, // <-- [7] numbers
11, // <-- [1] number
];
As you can see, heavier weights occupy a higher numerical space and therefore have a higher chance of being randomly selected. The percentage of selection chance for the weights elements will be as follows:
Element 3: ≈ 27%,
Element 7: ≈ 64%,
Element 1: ≈ 9%
In general, the function looks something like this:
function weightedRandom(items, weights) {
if (items.length !== weights.length) {
throw new Error('Arrays of elements and weights must be the same size');
}
if (!items.length) {
throw new Error('Array elements must not be empty');
}
const cumulativeWeights = [];
for (let i = 0; i < weights.length; i += 1) {
cumulativeWeights[i] = weights[i] + (cumulativeWeights[i - 1] || 0);
}
const maxCumulativeWeight = cumulativeWeights[cumulativeWeights.length - 1];
const randomNumber = maxCumulativeWeight * Math.random();
for (let itemIndex = 0; itemIndex < items.length; itemIndex += 1) {
if (cumulativeWeights[itemIndex] >= randomNumber) {
return items[itemIndex];
}
}
}
How can the word generation algorithm be even better?
This script is more of an example of using an algorithm for selecting a random element of an array based on their weight, so I did not go deep into linguistics and artificial intelligence algorithms. But offhand, unsightly combinations of some vowel and consonant pairs that do not occur in real words and looks unnatural. Examples:
- satlenl
- tohhi
- tiowh
- aahepw
etc.
The simplest solution to this issue is to limit the alternation of more than two vowels / consonants in a row:
if (vowelCounter >= maxVowelsInRow) { i -= 1; continue;}
and
if (consonantCounter >= maxConsonantsInRow) { i -= 1; continue;}
Let the values maxConsonantsInRow = 1 and maxVowelsInRow = 1, then the generated words will look something like this:
Note here that “th” and “ae” are digrams, and count as one letter.
The obvious disadvantage of this approach is that the generated words are more of the same type and with much less variative potential. Therefore, in this problem there is a huge scope for improving the algorithm.
The full version of the script can be found here: https://github.com/bernd32/nickname-generator