Example 5.2#

Character-level simulation of English using Markov chains. The data loaded in this notebook can be downloaded from here.

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)

def sample(s, w):  # draw one element of states `s` according to weights `w`
    u = rng.uniform(0, 1)
    cdf = np.cumsum(w)                # cumulative distribution of the weights
    sample_ind = np.argmax(cdf >= u)  # first index where the CDF reaches u
    return s[sample_ind]
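
The sampler implements standard inverse-CDF sampling: draw u uniformly on [0, 1] and return the first state whose cumulative weight reaches u. As a quick sanity check (illustrative only, not part of the original notebook), its empirical frequencies should match the weights, and it agrees with NumPy's built-in categorical sampler rng.choice:

w_test = np.array([0.2, 0.5, 0.3])
draws = np.array([sample(np.arange(3), w_test) for _ in range(10_000)])
print(np.bincount(draws) / 10_000)         # roughly [0.2, 0.5, 0.3]
print(rng.choice(np.arange(3), p=w_test))  # built-in equivalent of sample()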

# load the transition matrix, extracted beforehand from `War and Peace`, from CSV
TranMatrix = np.genfromtxt('transitions.csv', delimiter=',')

# the states: the 26 lowercase letters plus a dot (standing in for spaces and punctuation)
states = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n',
          'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z', '.']
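
The notebook loads a precomputed transitions.csv, but such a matrix is easy to estimate: count character bigrams over this 27-symbol alphabet and normalize each row to sum to one. Below is a minimal sketch, assuming a plain-text copy of the novel in war_and_peace.txt (a hypothetical filename; any large English text works):

import re

# hypothetical reconstruction of the transition matrix from raw text
raw = open('war_and_peace.txt').read().lower()
clean = re.sub(r'[^a-z]+', '.', raw)  # collapse every non-letter run into '.'
counts = np.zeros((27, 27)) + 1e-12   # tiny prior so no row is all zeros
for c1, c2 in zip(clean, clean[1:]):
    counts[states.index(c1), states.index(c2)] += 1
TranMatrix_est = counts / counts.sum(axis=1, keepdims=True)  # rows sum to 1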

# simulate T steps of this chain
T = 1000
X = np.zeros(T, dtype=int)
Y = []

initState = 6  # start from state 6, i.e. 'g'
X[0] = initState
Y.append(states[initState])

s = np.arange(len(states))  # state indices 0..26

for k in range(1, T):
    # draw the next state from the transition-matrix row of the current state
    X[k] = sample(s, TranMatrix[X[k-1], :])
    Y.append(states[X[k]])

# merge Y into a single string
Y = ''.join(Y)
print(Y)
g.l.se.t.bin.s.lese.ry..wolked.t.hered.e.ly.hr.impefatrt.mofe.mouroreand.thame.ourithad.mbld.limonde.wil.tuned.lensht.thioteranfed.ctonsa.muche.orenoocr.bld.gus.bysckit.h.gom.can.wn.whe.spath.fe.cet.dr..ndald.blpoly.ted.imir.g.arsed..pust.af.thevionglowhelutecld.hayondeacearthes.teelessched.ayow..t.s.bll.meytherepherth.whes.s.othig.uevate.by.pr..ftevesunousto.bed.is.h.thende.list.tin.t.buretsan.gngh.te.we.b.s.awald.d.scaclys.at.ts.ald.towathe..paraveedi.murof.he.s.blindenerrssshenchellls.sourionout.nouly.be.ameenye.und.be..thiss.alldht.ow.bind.llditelerorce.s.s.anon.mutitowerofo.rendazensampo.rmutu.t.brint..g.thacor.m.ises.wodyed.t.ve.alowappan.siththere.canus.gies.swh.mandibiqun.tle.en.oursmend.gr.the.s.woknd..nga.acechencancoleriag.apler.thellome.anthaivalafulllyaure.ul.lm.ithoshenddsais.th.paishe.ab.of.whilare.tare.furirury.inatind.ise.ce.wervinous.of.vidisa.is.insuns.meriee.h.fe.weffrlean.stele.w....aimof.roghcroulin.th.budowanghit.heled.wornoried.m.ckelag.herind.twhay.an.ry.d.loa.

We can, however, still use this model to compute the probability of a given character sequence. Below we compare three sequences of equal length: a genuine English fragment, the opening of the gibberish generated above, and a string of repeated 'e's.
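
Under the first-order Markov assumption, the probability of a character sequence $x_1, \dots, x_T$ factorizes as

$$p(x_1, \dots, x_T) = p(x_1) \prod_{t=2}^{T} P(x_{t-1}, x_t),$$

where $P$ is the transition matrix. A product of many numbers below one underflows floating point quickly, so we sum logarithms instead; the code below also drops the initial term $\log p(x_1)$ and compares the transition terms only.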

sentence = 'he.walked.about.in.front.of.the.line.and.at'   # genuine English fragment
sentence2 = 'g.l.se.t.bin.s.lese.ry..wolked.t.hered.e.ly'   # opening of the generated output above
sentence3 = 'eeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeeee'   # degenerate repetition, same length

state_inds = [states.index(c) for c in sentence]
state_inds2 = [states.index(c) for c in sentence2]
state_inds3 = [states.index(c) for c in sentence3]

log_prob = 0
log_prob_2 = 0
log_prob_3 = 0
# compute log-probabilities (the three sequences have equal length, so one loop suffices)
for t in range(1, len(state_inds)):
    log_prob += np.log(TranMatrix[state_inds[t-1], state_inds[t]])
    log_prob_2 += np.log(TranMatrix[state_inds2[t-1], state_inds2[t]])
    log_prob_3 += np.log(TranMatrix[state_inds3[t-1], state_inds3[t]])

print(log_prob)
print(log_prob_2)
print(log_prob_3)
-82.9320202715183
-92.65260179128019
-144.94757135101216
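
Because the three sequences have the same length, these totals are directly comparable. For sequences of different lengths one would normalize by the number of transitions, a common convention (not something this notebook does):

n_transitions = len(state_inds) - 1
print(log_prob / n_transitions)    # average log-probability per transition
print(log_prob_2 / n_transitions)
print(log_prob_3 / n_transitions)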

Here we can see that, while these log-probabilities are not meaningful per se, the model separates the three cases clearly: the genuine English fragment scores highest, the Markov-generated gibberish somewhat lower, and the repeated 'e' string lowest of all. Even this crude character-level model can therefore distinguish probable from improbable sequences.