Jesse Fagan, PhD

Automating Middle Management

Jesse Fagan — Fri, 25 Sep 2020 08:50:33 GMT

A talk given to the Ex-AI club at the University of Exeter. 25th of September, 2020.

Organizations are a form of technology primed for disruption. The fundamental roles of human organizations (e.g. coordinate activity, distribute resources, allocate tasks, etc) are quickly being handed over to algorithmic control piece by piece. In this talk I will address the advancement of AI in the domain of middle management and the coming rise of self-driving organizations. In my work on email networks, social network analysis, and natural language processing in organizations I have discovered the accuracy and usefulness of using machine learning to predict employee outcomes. It’s a short step from this research to the deployment of predictive or prescriptive systems which automate key staffing decisions. These changes could lead to a healthier working environment where systems optimize for human well-being, or they could go a different direction in which human needs are sidestepped in the interest of alternative goals. The goal of this talk is to continue this important discussion.

Bayes from Scratch, Part 2: Bayesian Regression using Grid Search

Jesse Fagan — Tue, 01 Sep 2020 22:19:26 GMT

This is Part 2 in the Bayes from Scratch series. Read Part 1 - Bayesian Grid Mean here.

In Part 1 I wanted to understand how Bayes Theorem could be used to compute the mean for a single variable without using MCMC, instead I used grid search. A brute force method that computes the posterior for a grid of every combination of parameters within some range.

Miles per Gallon by Engine Displacement

In Part 2 I'm going to try a regression. We just need to add one new parameter, and a structural equation. Miles-per-gallon mpg is distributed by a normal distribution with mean m and standard deviation s. The mean m is determined by the equation $m = \beta \cdot \text{disp} + \alpha$. The engine displacement variable is centered at the mean, so when disp is 0 the alpha parameter should represent the expected mpg for the average vehicle.

$$
mpg \sim Normal(m, s) \\
m = \beta \cdot \text{disp} + \alpha \\
\beta \sim Normal(0,2) \\
\alpha \sim Normal(18, 3) \\
s \sim Uniform(0, 20)
$$

Estimating Parameters using Grid Search

When I first attempted this, I had two predictors in the model (thus 4 parameters), for, you know, fun. This was a mistake. I was quickly reminded the curse of dimensionality. I pared it back to one new parameter. My laptop, with 16 GB of ram, can't actually compute a grid with N = 200.

N <- 200
g <- expand_grid(alpha = seq(1, 100, length.out = N)
                 , beta = seq(-10, 10, length.out = N)
                 , s = seq(0.1,20, length.out = N))

One of the problems with grid searches is you have to specify the bounds the grid a priori. Looking at the plot it's easy to see the slope will be negative, and we already know the mean value of mpg is somewhere between 15 and 25.

Priors: Advantages of Log Probabilities

The priors are pretty simple. The big difference between this run and the previous is that I'm going to be computing log probabilities. There are two great benefits to using log probabilities 1) floating point values can actually store log probabilities, and 2) we can take advantage of the Product Rule for Logarithms.

mpg_alpha_prior <- dnorm(g$alpha, mean = 18, sd = 3, log = TRUE)
mpg_s_prior <- dunif(g$s, min = 0, max = 20, log = TRUE)
mpg_beta_prior <- dnorm(g$beta, mean = 0, sd = 2, log = TRUE)

Some of the priors or posterior values in raw probabilities will be teeny tiny - very close to zero. Some of the values can be $10^{-100}$. This can be a real problem for floating point computation.

The second advantage is the product rule of logarithms. The sum of two logarithms is equal to the logarithm of a product. Bayes Theorem and probability theory relies the product of probabilities in many places, and by using the log probability we can sum the values instead.

$$
\log_{a} xy = \log_{a} x + \log_{a} y
$$

Estimating the Likelihood

The most computationally difficult part is computing the likelihood.

mm <- cbind(1, dcars$disp) # the model matrix
mpg_log_likelihood <- apply(g, 1, FUN = function(r) {
  params <- as.numeric(r[c('alpha', 'beta')])
  m <- as.numeric(mm %*% params)
  sum(dnorm(mtcars$mpg, mean = m, 
            sd = as.numeric(r['s']), log = TRUE))
})

The g data frame has all the grid parameter values. I computed the likelihood at each combination of parameters on the grid. First we compute the value of the linear model using matrix multiplication: m <- as.numeric(mm %*% params) The matrix mm is the design matrix with a column of the data for the engine displacement, disp, and a column of 1's for the intercept. Then, using the logarithm product rule, I compute the likelihood for that combination of parameters: sum(dnorm(mtcars$mpg, mean = m, sd = as.numeric(r['s']), log = TRUE)).

Computing the Posterior

Then computing the posterior is pretty simple - the likelihood times the prior. Or in our case - the log likelihood plus the log prior. The raw posterior probability can be computed and normalized after computing the log posterior.

d <- g %>% 
  mutate(mpg_ll = mpg_log_likelihood) %>% 
  mutate(post = exp(mpg_log_post) / sum(exp(mpg_log_post)))

Posterior Checks

Let's see if this posterior makes sense. If we just randomly sample 1000 lines from the grid data frame and plot the different lines on a plot of the data we get something like this.

i <- sample(1:nrow(d), size = 1000, replace = TRUE)

And when I weight the grid samples based on the posterior probability we get something much more reasonable:

i <- sample(1:nrow(d), size = 1000, prob = d$post, replace = TRUE)
# i <- sample(1:nrow(d), size = 1000, replace = TRUE)
ggplot() +
  geom_abline(data = d[i,], aes(slope = beta, intercept = alpha, color = beta),
              alpha = 0.3) +
  scale_color_viridis_c(option = 'B') +
  geom_point(data = dcars, aes(x = disp, y = mpg), ) +
  coord_cartesian(xlim = c(min(dcars$disp), max(dcars$disp))) +
  theme_fivethirtyeight()

Thoughts

My first thought - hey, this works! My second thought - this is incredibly imprecise and computationally expensive! I did the last few plots on my laptop, so I had to reduce the grid size dramatically compared to my beefy desktop machine. This is a really simple regression and the thought that it can't be computed on a relatively powerful laptop is a good indication that this approach is a bad choice.

But it shows that using Bayes Theorem to estimate models can be done simply - even if it is expensive. When you learn Bayesian statistics there is often so much tied up in trying to understand MCMC. This approach ignores MCMC entirely. Soon though I will write MCMC from scratch as well and pick it apart.

Bayes From Scratch, Part 1: Computing the Posterior Mean using Grid Search

Jesse Fagan — Sat, 13 Jun 2020 11:35:16 GMT

I think the best way to learn new statistical methods is to code them myself from scratch. Crack open the papers and books, read the formulas, and build my own software. I don't plan on making software for release, this is a learning exercise. Here is the roadmap of some things I want to figure out in Bayesian statistics:

Grid search posterior mean
Grid search posterior regression
Highest Posterior Density Interval
Metropolis-Hastings algorithm for Markov Chain Monte Carlo
Evaluating convergence (Rhat, $\widehat{R}$ and effective sample size, EFF)
Multilevel models
Hamiltonian MCMC

I feel like if I can code those things from scratch I will have a much better understanding of the problems that come up with convergence, or how to interpret results.

The data

For this, I'm doing a super boring dataset that's built into base-R datasets library: mtcars. I'm going to compute the mean miles-per-gallon using my prior understanding of what the average miles per gallon should be. So here's the data, there are 32 cases, a mean of 20.09 and median of 19.20.

Grid Search

In Richard McElreath's book^[1] he demonstrates estimating the posterior density of the proportion of the Earth that's covered in water. Kruschke's book^[3] also does an example with a binomial distribution with one parameter. I quickly realized why they chose this approach rather than computing a Bayesian mean. For a binomial density estimate you only need to know the number of attempts, the number of successes, and the probability of success. Thus it's pretty simple. The data is two numbers and you only have one parameter. But to compute the posterior mean you need two parameters the mean, $\mu$, and the standard deviation, $\sigma$, and the data is the vector of all the measurements of the variable.

It wasn't clear to me about how to adapt grid search to two parameters from one. The focus of only one parameter left me with a misconception. Because of the very simple introductions I was reading I thought maybe the $\theta$ in ye olde Bayes rule, seen below, was a single parameter. But it represents all the parameters in the entire model.

From [3]

So here is the model I'm going to estimate. The $mpg$ is normally distributed with a mean of $m$ and a standard deviation of $s$. And I have defined some prior distributions for $m$ and $s$ that I discuss m

$$
mpg \sim Normal(m, s) \\
m \sim Normal(13, 3) \\
s \sim Uniform(0, 20)
$$

Following in McElreath's path (Rethinking, pg 40), we're going to take the grid search steps he outlined, with some minor modifications.

Define a grid. All the possible combinations of parameters we might want to use in estimating a posterior all laid out in a grid.
Compute the value of the prior for each pair of parameter values on the grid
Compute the likelihood at each pair of parameter values.
Compute the unstandardized posterior at each pair of parameter values by multiplying the prior and likelihood.
After all that, we standardize the posterior by dividing each value by the sum of all values

So let's just start with what the grid looks like. It's 100 values between 1 and 100 for the mean. And 100 values between 0.1 and 10 for the standard deviation. Already we are showing some of the downsides of this approach. There are only two parameters, and I'm only taking 100 values, but already this table has 10,000 rows ($100 \times 100$). Also what range should we reasonably choose for the values? These are arbitrary decisions. There are some reasonable values for it, but the values should represent what we expect to be the very extremes of the posterior.

N <- 100
g <- expand_grid(m = seq(1, 100, length.out = N),
            s = seq(0.1,10, length.out = N))

> head(g)
# A tibble: 6 x 2
      m     s
   
1     1   0.1
2     1   0.2
3     1   0.3
4     1   0.4
5     1   0.5
6     1   0.6

Priors

Bayesian statistics deviates from classical, frequentist statistics in a very key way - it considers prior knowledge of the systems we are studying as part of its calculation. The choice of the prior can be from any information you may have prior to examining the data.

The priors have some good information. I choose a mean of 18 for the prior mpg because I had a car that regularly had 18 miles per gallon (now I live in the UK where that would be horrible, but it was somewhat common in the US). A frequentist approach would ignore prior knowledge about how the real world would work. For instance, without having any real values of fuel economy, we know that miles-per-gallon won't be negative. So our prior in this case greatly reduces the chances of seeing that.

mpg_m_prior <- dnorm(g$m, mean = 18, sd = 3)
mpg_s_prior <- dunif(g$s, min = 0, max = 20)

The prior density is computed for every value of m and s in the grid.

I think one issue with this prior is that it's probably too strong. In the UK, for instance, my tiny Yaris gets 40 mpg easily. If I had collected data in the UK using my prior knowledge from the US, the prior would push me to the wrong answer.

Likelihood

The likelihood is the probability of the data given some parameter values. I got hung up here for some time. None of the books really make it up-front and clear. It wasn't until I saw a post on Cross Validated that I figure out what to do. I had always understood that the likelihood was the observations, but I didn't know how you could transform that into a distribution that could be multiplied by the prior. So here's what we do:

Choose two of the parameter values from the grid (the mean and standard deviation)
For each value of the data compute the probability using the density function using those two parameter values
Multiply all of these probabilities together
That's your likelihood probability density estimate for that point in the parameter space, for that pair of the mean and standard deviation

$$f(\mathbf{Y}|\theta)= \prod_{i=1}^n f(y_i|\theta) $$

It seems obvious in retrospect how to do this, but it didn't make sense at first.

mpg_likelihood <- apply(g, 1, FUN = function(r) {
  prod(dnorm(mtcars$mpg, mean = r['m'], sd = r['s']))
})

The R code does exactly what was describe above. It uses the apply function to use each row of the grid table. Then it computes the probability of observing each mpg value we observed if the mean and standard deviation were the values we got from the grid.

Log probabilities

One thing I found was that the probabilities these functions produce were super itty bitty - even using values that were very close to what the actual values were. This presents some difficult problems with floating point arithmetic.

> prod(dnorm(mtcars$mpg, mean = 18, sd = 7))
[1] 3.767108e-46

Which is one of the reasons to use log probabilities. When you use the log probability density the values are far more manageable. You also sum each of values instead of multiplying them (as a consequence of the math of logarithms).

> sum(dnorm(mtcars$mpg, mean = 18, sd = 7, log = TRUE))
[1] -104.5926

However, this isn't explicitly called out in the math of the basic bayesian introductions. I will be doing this from now on, after this experiment.

The Posterior

Now that the likelihood and prior have been computed.... we just multiply them! That creates the unstandardized posterior, then we divide it by the sum of itself. Bam. Posterior.

d <- tibble(m = g$m, s = g$s, 
       mpg_m_prior, mpg_s_prior, mpg_likelihood) %>% 
  mutate(mpg_post_unstd = mpg_m_prior * mpg_s_prior * mpg_likelihood) %>% 
  mutate(post = mpg_post_unstd / sum(mpg_post_unstd))

And here's a visualization of what that looks like. The posterior here is two-dimensional, so I used geom_tile to get sense of it.

ggplot(d, aes(x = m, y = s, fill = post, z= mpg_post_unstd)) +
  geom_tile() +
  scale_fill_viridis_c() +
  theme_minimal()

And the Bayesian mean here is 20.

> d[which.max(d$post),]
# A tibble: 1 x 7
      m     s mpg_m_prior mpg_s_prior mpg_likelihood mpg_post_unstd   post
                                       
1    20   5.9       0.106        0.05       3.43e-45       1.83e-47 0.0201

The precision of this estimate is restricted by the resolution of the grid. It's exactly 20 because the grid doesn't have values for 20.1 or such.

I increased N to 1000 instead of 100 (thus there were 1,000,000 combinations of parameter values), and I estimated the mean as 19.8.

> d[which.max(d$post),]
# A tibble: 1 x 7
      m     s mpg_m_prior mpg_s_prior mpg_likelihood mpg_post_unstd     post
                                         
1  19.8  5.94       0.110        0.05       3.34e-45       1.85e-47 0.000199

Highest Posterior Density Interval

The posterior is not a point though - it's a distribution. I just used the highest value of the posterior in the grid to select a point value. This gives us some options, do we choose the maximum? - the median? How do we define the credible interval? That's what I wanted to explore in the next post.

[1]: McElreath, R. 2020. Statistical rethinking: A Bayesian course with examples in R and Stan. Boca Raton, FL: Chapman & Hall/CRC Press.

[2]: Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., & Rubin, D. B. (2013). Bayesian data analysis. CRC press.

[3]: Kruschke, J. (2014). Doing Bayesian data analysis: A tutorial with R, JAGS, and Stan. Academic Press.