Introduction

Up to this point, we have used Markov chains to create a trail of agent locations distributed according to a predefined steady-state probability distribution. Now we want to use this tool to solve other problems.

Using Markov chains to minimize a function

Imagine a function \(f(x)\) defined on an interval \((x_\text{min},x_\text{max})\).
We want to find the minimum value of \(f\).

Start with a simple function:

f <- function(x){ x^2 }

Define a reasonably large number of states:

N <- 100

Define an interval:

xmin <- -3
xmax <-  3

Now create the \(x\)-values. These are the states the agent will visit.

xvals <- seq(xmin, xmax,len=N)

Take a peek at the plot of the values of the function on the sequence we just defined. The minimum is easy to see. We want to find it using a Markov chain!

plot(xvals, f(xvals), cex=.3)

Here’s the key step: From \(f(x)\), we can create a probability distribution on the xvals, via \[ P(x) = e^{-f(x)/\sigma^2} \] where \(\sigma^2\) represents the “variance”. In other words, \(1/\sigma^2\) is our precision. (Strictly speaking, \(P\) is only proportional to a probability distribution, but the normalizing constant never matters here because we only ever use ratios of \(P\).)

The important thing here is that \(P\) is maximal where \(f\) is minimal.

Let’s see what this means in practice. Create the probability function from \(f(x)\). Note that the probability function depends on the choice of \(\sigma^2\).

prob <- function(x,sig2) {
    exp(-f(x)/sig2)
}

Let’s plot it for a few values of \(\sigma^2\).

sig2 <- 1
plot(xvals, prob(xvals, sig2), cex=.3)

Now with a smaller, or equivalently more precise, choice of \(\sigma^2\).

sig2 <- 0.1
plot(xvals, prob(xvals, sig2), cex=.3)

Notice how these plots look like normal distributions. Also notice how much “tighter” the second plot is around its maximum value.

Implement a Markov chain sampler.

We will start by creating a doMove function which depends on both the current state and the value of \(\sigma^2\).

This function will work much like our move function from the last class. The algorithm will propose a transition to a new state, and then accept or reject that proposal probabilistically.

Let \(P_1\) represent the current probability and \(P_2\) the proposed probability: \[ \begin{aligned} P_1 &= e^{-\frac{f(x_1)}{\sigma^2}} \\ P_2 &= e^{-\frac{f(x_2)}{\sigma^2}} \end{aligned} \]

Notice how simple the ratio \(P_2/P_1\) becomes.
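
Writing the algebra out explicitly: \[ \frac{P_2}{P_1} = \frac{e^{-f(x_2)/\sigma^2}}{e^{-f(x_1)/\sigma^2}} = e^{-\frac{f(x_2)-f(x_1)}{\sigma^2}}. \]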

As before, we will have the rule that if \(P_2 \ge P_1\), then automatically make the transition. If \(P_2 < P_1\), then make the transition with probability \[ \rho = \frac{P_2}{P_1} = e^{-\frac{f(x_2)-f(x_1)}{\sigma^2}}. \]

Notice that we can compute this key ratio without computing the probabilities directly. We only need the difference in function values.

These two steps can be combined into a single statement.

  • Regardless of the values of \(P_1\) and \(P_2\), define \(\rho\) as above. Note that if \(P_2 > P_1\), then \(\rho > 1\).
  • Generate a uniformly distributed random number \(r \in [0,1]\). If \(r < \rho\), then make the transition!

Implementation

Here’s an implementation of this strategy. In this case, we will think of the state as one of the \(N\) equally spaced values in the interval (xmin, xmax).

First create a function that defines a proposal transition.

proposal <- function(currState){
  # propose a step of one grid spacing, (xmax - xmin)/N, left or right
  delta <- sample(c(-1,1),1)*(xmax-xmin)/N
  propState <- currState + delta
  # avoid going off the end of the interval
  if(propState <= xmin)
    propState <- xmin
  if(propState >= xmax)
    propState <- xmax
  propState
}
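
A quick sanity check (not part of the original notes) confirms that proposals stay inside the interval:

proposal(0)      # one step of one grid spacing, left or right
proposal(xmin)   # never returns a value below xmin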

Now we can build a move. If you set things up right, this part almost always looks like this:

doMove <- function(currState,sig2){
  # propose a transition
  propState<-proposal(currState)
  # get function value difference
  dFunc <- f(propState) - f(currState)
  # calculate rho and make a move based on result
  rho <- exp(-dFunc/sig2)
  if(runif(1,0,1) < rho){
    # return the proposed state (this will be the new state)
    return(propState)
  }
  # return the old state (no transition)
  return(currState)
}

Try it out!

Set up: Define the number of steps (varies with the application), a vector to keep the states, an initial state, and a value of \(\sigma^2\).

M <- 100000
vals <- rep(0,M)
currState <- runif(1, xmin, xmax)
sig2 <- 1.0

Now the main loop.

for(m in 1:M){
  vals[m] <- currState
  currState <- doMove(currState, sig2)
}
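
Before looking at results, one quick diagnostic (a sketch, not part of the original code) is the fraction of steps on which the chain actually moved, which approximates the acceptance rate:

mean(diff(vals) != 0)   # proportion of steps where the state changed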

Take a peek at the histogram of these values. It should look roughly normal with mean \(0\); in fact, since \(P(x) \propto e^{-x^2/\sigma^2}\) here, the corresponding normal distribution has variance \(\sigma^2/2\). The larger \(M\) is, the more normal this should look.

hist(vals,breaks=100,xlim=c(xmin,xmax))
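
As a check on the claim above (a sketch, using the \(\sigma^2/2\) variance worked out earlier), we can overlay the matching normal density on a density-scaled histogram:

hist(vals, breaks=100, xlim=c(xmin,xmax), freq=FALSE)
curve(dnorm(x, mean=0, sd=sqrt(sig2/2)), add=TRUE)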

A more revealing plot is the “trace” of the Markov chain states.

plot(vals, cex=.1)

The trace allows us to see how the values of \(x\) move around. Overall they represent a sample from the distribution determined by the fixed value of \(\sigma^2\). The histogram peaks at the minimizer of \(f(x)\), but there is a better way to identify this critical value.

Simulated Annealing

Here is a clever idea: as the Markov chain moves along, slowly decrease \(\sigma^2\), so that the probability distribution gets more and more tightly clustered around its maximum value. This is a way of “trapping” \(x\) near the point where \(P\) is maximal, which is exactly where \(f\) is minimal.

This process is called simulated annealing. The key difference from the fixed-\(\sigma^2\) sampler above is that \(\sigma^2\) is slowly decreased as the chain moves along.

Here is another version of the previous code. Note the only difference is that now \(\sigma^2\) slowly decreases.

M <- 100000
vals <- rep(0,M)
sig2 <- 1
decFac <- .9999
currState <- runif(1, xmin, xmax)
for(m in 1:M){
  vals[m] <- currState
  currState <- doMove(currState,sig2)
  # slowly decrease sig2
  sig2 <- sig2*decFac
}
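
Once \(\sigma^2\) has become small, the chain is essentially frozen near the minimizer, so one simple way to read off an estimate (a sketch; averaging the last 1000 states is one choice among many) is:

xhat <- mean(tail(vals, 1000))   # average of the late, "frozen" states
xhat
f(xhat)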

Note: The method by which \(\sigma^2\) decreases is an art, not a science! Even in this simple case, you have to play around with both the initial value of \(\sigma^2\) and the factor by which it decreases each step.
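
For a sense of scale: with sig2 = 1, decFac = 0.9999, and \(M = 100000\) steps, the final value is \(\sigma^2 = 0.9999^{100000} \approx e^{-10} \approx 4.5 \times 10^{-5}\), so by the end the chain is sampling from an extremely tight distribution.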

The histogram:

hist(vals,breaks=100,xlim=c(xmin,xmax))

The trace…this is most interesting:

plot(vals,cex=.1,type="l")

Notice now that the trace of the Markov chain funnels down to the value \(x=0\), which is where \(f(x) = x^2\) attains its minimum on the interval \((-3,3)\).

You might think this is a hard way to find the minimum of \(x^2\). You’re right, but the idea has wonderful applications in more complicated minimization problems.

The key components are the same in every application: a proposal, an acceptance rule, and a cooling schedule. Implementation of simulated annealing can usually be broken into three steps.

  • Define a proposal function that suggests a transition from the current state.
  • Define a move function that accepts the proposal with probability \(\rho = e^{-(f(x_2)-f(x_1))/\sigma^2}\), always accepting when \(\rho \ge 1\).
  • Run the chain, slowly decreasing \(\sigma^2\) as it moves along.

With any luck, and some fiddling around, you can usually design this process so that the minimum value of \(f(x)\) is identified.

Exercise: Repeat with a different function

This process can work for any function. We find a minimum of \(f(x)\) by running the Markov chain agent model while slowly decreasing the value of \(\sigma^2\), hoping the chain locks onto the correct value.

Try it with the following function on the interval \((-3,3)\).

f <- function(x) {
  (x-2)*(x-1)*x*(x-2.4)
}
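
As before, it helps to plot the function first (reusing the xvals defined above). This quartic has two local minima, and the chain can get temporarily stuck in the shallower one:

plot(xvals, f(xvals), cex=.3)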

Run your Markov chain several times and compare the output.

If you finish this before the end of class, try some other functions. Can you consistently find the minimum values?
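
For example (just one possibility among many), here is a function with several local minima on the same interval:

f <- function(x) {
  sin(5*x) + 0.1*x^2
}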