Example 5.3#

Word level simulation of English using Markov chains. The data used in this example can be downloaded from here.

import re
from collections import defaultdict, Counter
import numpy as np
from random import randint, choice
import matplotlib.pyplot as plt
import time

rng = np.random.default_rng(19)

# Reading the file
with open('holmes.txt') as f:
    words = re.split(' +', f.read())

# Assign an index to each unique word
word_to_index = {word: idx for idx, word in enumerate(set(words))}
index_to_word = {idx: word for word, idx in word_to_index.items()}

# Create transition counts matrix
num_words = len(word_to_index)
transition_counts = np.zeros((num_words, num_words), dtype=int)

for w0, w1 in zip(words[:-1], words[1:]):
    transition_counts[word_to_index[w0], word_to_index[w1]] += 1

# Convert counts to probabilities to get the transition matrix
transition_matrix = transition_counts / transition_counts.sum(axis=1, keepdims=True)

# Handle potential NaNs (rows with no transitions)
transition_matrix[np.isnan(transition_matrix)] = 0

# visualise the transition matrix

# plt.imshow((transition_matrix + 1e-8))
# plt.colorbar()
# plt.show()

# simulate a sentence
T = 100
# w0 = choice(words[:-1])
w0 = 'Sherlock'
print('The first word is', w0, '\n')

print(w0)

for _ in range(T):
    # get the index of w0
    idx = word_to_index[w0]
    # sample the next index using the transition matrix
    idx_next = rng.choice(num_words, p=transition_matrix[idx, :])
    # convert the index back to the word
    w1 = index_to_word[idx_next]
    print(w1, end=' ')
    w0 = w1
The first word is Sherlock 

Sherlock
Holmes sprang into the hotel, and unslinging of paper stuck in you, I believe in his heart which Drebber failed to the Church.”

“It will be here.”

“Yes. It will have to have perplexed you provide a conviction against an assumed a fair-haired girl was in a gentleman,
well dressed, and the blood is subject of donations received the case,
Doctor. You can only knows. Ye see, when the E. J. Drebber said
that he has
 agreed to come?”

“Yes, if you all the
mazes that Mr. Bender, he had returned. He had done. And yet a
slip would puzzle the money. I heard nothing, and over six 
<ipython-input-1-9aaed35cf61a>:26: RuntimeWarning: invalid value encountered in divide
  transition_matrix = transition_counts / transition_counts.sum(axis=1, keepdims=True)