Category: statistics

Junk (filter) science

This is a mirror of a PLOS blogpost. Formatting is usually nicer there.

This is part 4 of a series of introductory posts about the principles of climate modelling. Others in the series: 1 | 2 | 3

In the previous post I said there will always be limits to our scientific understanding and computing power, which means that “all models are wrong.” But it’s not as pessimistic as this quote from George Box seems, because there’s a second half: “… but some are useful.” A model doesn’t have to be perfect to be useful. The hard part is assessing whether a model is a good tool for the job. So the question for this post is:

How do we assess the usefulness of a climate model?

I’ll begin with another question: what does a spam (junk email) filter have in common with state-of-the-art predictions of climate change?

Wall of SPAM
Modified from a photo by freezelight

The answer is they both improve with “Bayesian learning”. Here is a photo of the grave of the Reverend Thomas Bayes, which I took after a meeting at the Royal Statistical Society (gratuitous plug of our related new book, “Risk and Uncertainty Assessment for Natural Hazards”):

Bayes' grave

Bayesian learning starts with a first guess of a probability. A junk email filter has a first guess of the probability of whether an email is spam or not, based on keywords I won’t repeat here. Then you make some observations, by clicking “Junk” or “Not Junk” for different emails. The filter combines the observations with the first guess to make a better prediction. Over time, a spam filter gets better at predicting the probability that an email is spam: it learns.

The filter combines the first guess and observations using a simple mathematical equation called Bayes’ theorem. This describes how you calculate a “conditional probability”, a probability of one thing given something else. Here this is the probability that a new email is spam, given your observations of previous emails. The initial guess is called the “prior” (first) probability, and the new guess after comparing with observations is called the “posterior” (afterwards) probability.
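
For the curious, here is a toy version of that calculation for a single keyword. This is a sketch with invented numbers (the function name, the 60%/2% rates and the 50-50 prior are all my own illustration, nothing like a real filter's vocabulary):

```python
# A toy illustration of Bayes' theorem for one keyword.  All numbers
# are invented for illustration, not taken from any real spam filter.
def posterior_spam(p_word_given_spam, p_word_given_ham, p_spam):
    """P(spam | word): probability an email is spam, given it contains the word."""
    p_ham = 1.0 - p_spam
    # The denominator is the overall probability of seeing the word at all.
    evidence = p_word_given_spam * p_spam + p_word_given_ham * p_ham
    return p_word_given_spam * p_spam / evidence

# Prior: half of all mail is spam.  The keyword appears in 60% of spam
# but in only 2% of legitimate mail.
p = posterior_spam(0.60, 0.02, 0.50)
print(round(p, 3))  # roughly 0.968: seeing the word makes spam much more probable
```

The posterior from one batch of emails can then be used as the prior for the next batch, which is exactly the "learning" part.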

The same equation is used in many state-of-the-art climate predictions. We use a climate model to make a first guess at the probability of future temperature changes. One of the most common approaches for this is to make predictions using many different plausible values of the model parameters (control dials): each “version” of the model gives a slightly different prediction, which we count up to make a probability distribution. Ideally we would compare this initial guess with observations, but unfortunately these aren’t available without (a) waiting a long time, or (b) inventing a time machine. Instead, we also use the climate model to “predict” something we already know, to make a first guess at the probability of something in the past, such as temperature changes from the year 1850 to the present. All the predictions of the future have a twin “prediction of the past”.

We take observations of past temperature changes – weather records – and combine them with the first guess from the climate model using Bayes’ theorem. The way this works is that we test which versions of the model from the first guess (prior probability) of the past are most like the observations: which are the most useful. We then apply those “lessons” by giving these the most prominence, the greatest weight, in our new prediction (posterior probability) of the future. This doesn’t guarantee our prediction will be correct, but it does mean it will be better because it uses evidence we have about the past.
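
Here is a toy sketch of that weighting, with everything invented (five "versions", their hindcasts of the past, their paired forecasts of the future, and a single observation; real ensembles, observations and likelihoods are far more elaborate):

```python
import math

# Hypothetical numbers: five model versions, each with a hindcast of past
# warming (degC since 1850) and a paired "twin" prediction of future warming.
past_hindcasts   = [0.6, 0.8, 1.0, 1.2, 1.5]
future_forecasts = [1.8, 2.4, 3.0, 3.6, 4.5]

observed_past = 1.0   # observed past warming (invented)
obs_sigma     = 0.15  # observational uncertainty (invented)

# Bayes' theorem with a uniform prior over versions: the posterior weight
# of each version is proportional to how well its hindcast of the past
# matches the observations (here, a simple Gaussian likelihood).
weights = [math.exp(-0.5 * ((h - observed_past) / obs_sigma) ** 2)
           for h in past_hindcasts]
total = sum(weights)
weights = [w / total for w in weights]

# Posterior mean of future warming: the versions that matched the past
# best get the greatest prominence in the prediction of the future.
posterior_mean = sum(w * f for w, f in zip(weights, future_forecasts))
```

The version whose hindcast sits closest to the observation ends up dominating the average, which is the "giving these the most prominence" step in miniature.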

Here’s a graph of two predictions of the probability of a future temperature change (for our purposes it doesn’t matter what) from the UK Climate Projections:

The red curve (prior) is the first guess, made by trying different parameter values in a climate model. The predicted most probable value is a warming of about three degrees Celsius. After including evidence from observations with Bayes’ theorem, the prediction is updated to give the dark blue curve (posterior). In this example the most probable temperature change is the same, but the narrower shape reflects reduced uncertainty: more of the probability is concentrated around that value.

Probability in this Bayesian approach means “belief” about the most probable thing to happen. That sounds strange, because we think of science as objective. One way to think about it is to contrast the probability of something happening in the future with the frequency of something that happened in the past. In the coin flipping test, three heads came up out of four tosses. That’s the past probability, the frequency of how often it happened. What about the next coin toss? Based on the available evidence – if you don’t think the coin is biased, and you don’t think I’m trying to bias the toss – you might predict that the probability of another head is 50%. That’s your belief about what is most probable, given the available evidence.
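
The coin example can be sketched with a standard Beta-Binomial update (my choice of illustration, not from the post; the prior "pseudo-counts" are invented):

```python
# Updated belief that the next toss is heads, using a Beta prior.
# The prior is expressed as pseudo-counts of imagined heads and tails.
def posterior_mean_heads(prior_heads, prior_tails, seen_heads, seen_tails):
    """Mean of the Beta posterior after observing some real tosses."""
    return (prior_heads + seen_heads) / (
        prior_heads + prior_tails + seen_heads + seen_tails)

# Weak prior, Beta(1, 1): three heads out of four pulls belief to 4/6.
print(posterior_mean_heads(1, 1, 3, 1))   # about 0.667

# Strong prior belief in a fair coin, Beta(50, 50): four tosses barely
# move it, so the predicted probability of heads stays close to 50%.
print(posterior_mean_heads(50, 50, 3, 1)) # about 0.510
```

Whether four tosses should change your belief much depends entirely on how strong your prior was, which is the "given the available evidence" caveat in equation form.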

My use of the word belief might trigger accusations that climate predictions are a matter of faith. But Bayes’ theorem and the interpretation of “probability” as “belief” are not only used in many other areas of science, they are thought by some to describe the entire scientific method. Scientists make a first guess about an uncertain world, collect evidence, and combine these together to update their understanding and predictions. There’s even evidence to suggest that human brains are Bayesian: that we use Bayesian learning when we process information and respond to it.

The next post will be the last in the introductory series on big questions in climate modelling: how can we predict our future?


A Sensitive Subject

Grab yourself a cup of tea. This is a long one.

Yesterday was historic for me. It was the first time I presented a result about ‘climate sensitivity’ (more on this later). This is how it felt to get that result two weeks ago:

In June 2006, we were just a bunch of bright-eyed, bushy-tailed researchers, eager to make a difference in the brave new world of “using palaeodata to reduce uncertainties in climate prediction”. Little did we know the road would be so treacherous, so winding, and so, so long….

Many years later, when we were all old and grey, we finally reached the first of our goals: a preliminary result. On Thursday 12th April 2012, Dr Jonathan C. Rougier produced a plot named, simply, ‘sensitivity.pdf’…

I was presenting these results in a session at the big (11 000 participants) annual European Geosciences Union conference in Vienna. The first speaker was James Hansen, who is rather big in climate circles, and the second was David Stainforth, who is first author of the first big result (they dish out climate models for people to run in the background on their computers). I was third, slightly rattled from finishing my slides just before the session and running through them only once.

If any of you were at the session, I’d prefer you not to talk about our final result, at least for now…it is so preliminary, and I’d prefer this to be a ‘black box’ discussion of our work without prejudice or assumptions from our preliminary numbers.

A little history…

Our project (my first climate job since leaving particle physics) had the rather lovely name of PalaeoQUMP * and the aim of reducing uncertainty about climate sensitivity. By ‘reducing uncertainty’ I mean making the error bars smaller, pinning down the range in which we think the number lies. Climate sensitivity is the global warming you would get if you doubled the concentration of carbon dioxide in the atmosphere. The earth is slow at reacting to change, so you have to wait until the temperature has stopped changing. Svante Arrhenius (Swedish scientist, 1859-1927) had a go at this in 1896. He did “tedious calculations” by hand and came up with 5.5 degC. He added that this was probably too high, and in 1901 revised it to 4 degC.

The idea was to reproduce the method of the original lovely-named project QUMP, the internal name given to the Met Office Hadley Centre research into Quantifying Uncertainty in Model Predictions. They compared a large group of climate model simulations with observations of recent climate, to see which were the most realistic and therefore which were more likely to be reliable for predicting the future. QUMP was the foundation for the UK Climate Projections, which provide “information designed to help those needing to plan how they will adapt to a changing climate”. We planned to repeat their work, but looking much further back in time – using what knowledge we have of the climate 6000 years ago (the ‘Mid-Holocene’) and 21 000 years ago (the height of the last ice age, or ‘Last Glacial Maximum’), instead of the recent past.

Fairly early into this project I wrote – with Michel Crucifix and Sandy Harrison – a review paper about people’s efforts to estimate climate sensitivity, which I’ve just put online because I support open science.

PalaeoQUMP ended in 2010 without us publishing any scientific results, for a variety of reasons: ambitious aims, loss of collaborators from the project, and my own personal reasons. Two of the original members – Jonty Rougier (statistician) and Mat Collins (climate modeller, formerly at the Met Office Hadley Centre) – and I continued to work with our climate simulations when we found time. We got distracted along the way from the original goal of climate sensitivity by interesting questions about how best to learn about past climates, but pootled along happily.

But late last year a group of scientists led by Andreas Schmittner published a result that was very similar to our original plan: comparing a large number of climate model simulations to information about the Last Glacial Maximum to try and reduce the uncertainty in climate sensitivity. Their result certainly had a small uncertainty, and it was also much lower than most people had found previously: a 90% probability of being in the range 1.4 to 2.8 degC. This sent a mini-ripple around people interested in climate sensitivity, palaeoclimate and future predictions. The authors were quite critical of their own work, making the possible weak points clear. One of the main weaknesses was that their method needed a very large number of simulations, so they had to use a climate model with a very simple representation of the atmosphere (because it is faster to run). They invited others to repeat their method and test it.

So we took up the gauntlet…

We have a group, an ensemble, of 17 versions of a climate model. The model is called HadCM3, which is a fairly old (but therefore quite fast and well-understood) version of the Hadley Centre climate model. It has a much better representation of the atmosphere than the one used by Andreas Schmittner. In this case “better” is not too controversial: we have atmospheric circulation, they don’t.

We created the different model versions by changing the values of the ‘input parameters’. These are control dials that change the way the model behaves. Unfortunately we don’t know the correct setting for these dials, for lots of reasons: we don’t have the right observations to test them with, or a setting that gives good simulations of temperature might be a bad setting for rainfall. So these are uncertain parameters and we use lots of different settings to create a group of model versions which are all plausible. This group is known as a perturbed parameter ensemble.
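
To make the "control dials" idea concrete, here is a toy sketch. The dial names and ranges are entirely invented for illustration; real HadCM3 parameters and the real (much more careful) experimental design are different:

```python
import random

random.seed(0)  # reproducible toy example

# Hypothetical dials with plausible ranges (names invented, not HadCM3's).
param_ranges = {
    "cloud_fraction_dial": (0.4, 0.8),
    "ocean_mixing_dial":   (0.5, 2.0),
    "ice_albedo_dial":     (0.5, 0.7),
}

def make_ensemble(n_versions):
    """Draw n plausible settings of the uncertain dials: each draw is one
    model version, and together they form a perturbed parameter ensemble."""
    return [{name: random.uniform(lo, hi)
             for name, (lo, hi) in param_ranges.items()}
            for _ in range(n_versions)]

ensemble = make_ensemble(17)  # the ensemble in this post has 17 versions
```

Each dictionary in the list is one "version" of the model; running the model once per version gives the ensemble of simulations discussed below.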

We use the ensemble to simulate the Last Glacial Maximum (LGM), the preindustrial period (as a reference), and a climate the same as the preindustrial but with double the CO2 concentrations (to calculate climate sensitivity). We can then compare the LGM simulations to reconstructions of the LGM climate. These reconstructions are based on fossilised plants and animals: by looking at the kinds of species that were fossilised (e.g. something that likes cold climates) and where they lived (e.g. further south than they live today), it is possible to get a surprisingly consistent picture of climates of the past. Reconstructing past climates is difficult, and it’s even harder to estimate the uncertainty, the error bars. I won’t discuss these difficulties in this particular post, and generalised attacks on you know who will not be tolerated in the comments! We used reconstructions of air temperature based on pollen ** and reconstructions of sea surface temperatures based on numerous bugs and things. Andreas Schmittner and his group used the same.

We’re using a shiny new statistical method from Jonty Rougier and two collaborators, which has not yet been published (still in review) but is available online if you want to deluge yourself with charmingly written but quite tricky statistics. It’s a general and simple (compared with previous approaches) way to deal with the very title of this blog: the wrongness of models. The description below is full of ‘saying that’, ‘judge’, ‘reckon’ and so on. Statistics, and science, are full of ‘judgements’: yes, subjectivity. We have to simplify, approximate, and guess-to-the-best-of-our-abilities-and-knowledge. A lot of the statements below are not “This Is The Truth” but more “This Is What We Have Decided To Do To Get An Answer And In Future Work These Decisions May Change”. Please bear this in mind!

Think of an ensemble of climate simulations of temperature. These might be from one model with lots of different values for the control parameters, or they might be completely different models from different research institutes. Most of them look vaguely similar to each other. One is a bit of an oddity. Two look nearly identical. Here is a slightly abstract picture of this:

The crosses in the picture are mostly the same sort of distance from the centre spot, but in different places. One is quite a lot further out. Two are practically on top of each other.

How should we combine all these together to estimate the real temperature? A simple average of everything? Do we give the odd-one-out a smaller contribution? Do we give the near-identical ones smaller contributions too? What if a different model is an oddity for rainfall? Even if we come up with different contributions, different weightings, for each model, the real problem is often relating these back to the original “design” of the ensemble. If our model only has one uncertain parameter, it’s easy. We can steadily increase that control dial for each of the different simulations. Then we compare all the simulations to the real world, find the “best” setting for that parameter, and use this for predicting future climate. This is easy because we know the relationship between each version of the model: each one has a slightly higher setting of the parameter. But if we have a lot of uncertain parameters, it is much harder to find the best settings for all of them at once. It is even worse if we have an ensemble of models from different research institutes, which each have a lot of different uncertain parameters and it is impossible to work out a relationship between all the models.

These problems have given statisticians headaches for several years. We like statisticians, so we want to give them a nice cup of tea and an easier life.

Jonty and Michael and Leanna’s method tries to make life easier, and begins by asking the question the other way round. Can we throw out some of the models so that the ones that are left are all similar to each other? Then we can stop worrying about how to give them different contributions: we can stop using the individual crosses and just use the average of the rest (the centre spot).

We also don’t need to know the relationship between different models. Instead of using observations of the real world to pick out the “best” model, we will take the average of all of them and let the observations “drag” this average towards reality (I will explain this part later).

How do you decide which models to throw out? This is basically a judgement call. One way is to look at the difference between a model and the average of the others. If any are very far away from the average, chuck them. Another is to squint and look at the simulations and see if any look very different from the others. Yes, really! The point is that it is easier to do this, to justify the decisions, and to use the average, than to decide what contribution to give each model.
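
As a toy sketch of the throwing-out step (my own illustration of the idea, not the published method; the temperatures and the threshold are made up):

```python
# Discard any simulation that sits far from the average of the others.
def trim_ensemble(sims, threshold):
    """Keep simulations whose distance from the mean of the *other*
    simulations is below the threshold (the threshold is a judgement call)."""
    kept = []
    for i, s in enumerate(sims):
        others = sims[:i] + sims[i + 1:]
        mean_others = sum(others) / len(others)
        if abs(s - mean_others) < threshold:
            kept.append(s)
    return kept

# Toy global-mean temperatures (degC): one obvious odd-one-out.
sims = [13.9, 14.1, 14.0, 14.2, 16.5]
print(trim_ensemble(sims, threshold=1.0))  # drops 16.5, keeps the rest
```

In practice the "squint at the maps" version of this is just as legitimate: the point is that justifying a keep/discard decision is easier than justifying a full set of weights.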

The next part of their cunning is reckoning that all the models are equally good – or equally bad, depending on the emptiness or fullness of your glass – at simulating reality. In other words, the models are closer to the ensemble average than reality is. We can add a red star for “reality” outside the cluster of models:

(Notice I’ve now thrown away the outlier model and one of the two near-identical ones.) This is saying that models are probably more like each other than they are like the real world. I think most visitors to this blog would agree…

There is one more decision. The difficulty is not just in combining models but also in interpreting the spread in results. Does the ensemble cover the whole range of uncertainty? We think it probably doesn’t, no matter how many models you have or how excellent and cunning your choices in varying the uncertain input parameters. We will say that it does have the same kind of “shape”: maybe the ensemble spread is bigger for Arctic temperatures than for tropical temperatures, so we’ll take that as useful information for the model uncertainty. But we think it should be scaled up, should be multiplied by a number. How much should we scale it up? More on this later…

All of this was just to turn the ensemble into a prediction of LGM temperatures (from the LGM ensemble) and climate sensitivity (from the doubled CO2 ensemble), with uncertainties for each. We will now compare and then combine the LGM temperatures with the reconstructions.

Here is the part where we inflate – actually the technical term, like a balloon – the ensemble spread to give us model uncertainty. How far? The short answer is: until the prediction agrees with the reconstruction. The long answer is a slightly bizarre analogy that comes to mind. Imagine you and a friend are standing about 10 feet apart. You want to hold hands, but you can’t reach. This is what happens if your uncertainties are too small. The prediction and the reconstruction just can’t hold hands; they can’t be friends. Now imagine that you so much want to hold their hand that your arms start growing….growing…growing… until you can reach their hand, perhaps even far enough for a cuddle. You are the model ensemble, and we have just inflated your arms / uncertainty. Your friend is the reconstruction. Your friend’s arms don’t change, because we (choose to) believe the estimates of uncertainty that the reconstruction people give us. But luckily we can inflate your arms, so that now you “agree” with each other. [ For those who want more detail, the hand-holding is a histogram of standardised predictive errors that looks sensible: centred at zero, with most of the mass in [-3, 3]. ]
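
Here is a toy version of the inflation step. Everything numerical is invented (three pretend sites, made-up cooling values and uncertainties), and the real method chooses the inflation much more carefully:

```python
import math

# Standardised errors: prediction-minus-reconstruction divided by the
# combined uncertainty, where the model spread is multiplied by an
# inflation factor before being combined.
def standardised_errors(predictions, reconstructions, model_sd, recon_sd, inflation):
    combined_sd = math.sqrt((inflation * model_sd) ** 2 + recon_sd ** 2)
    return [(p - r) / combined_sd for p, r in zip(predictions, reconstructions)]

# Toy LGM cooling values (degC) at three sites: hypothetical numbers.
pred  = [-4.0, -6.5, -3.0]  # ensemble-average prediction
recon = [-2.0, -4.0, -1.5]  # reconstruction from fossil evidence

# With no inflation the errors are implausibly large: arms too short to hold hands.
print(standardised_errors(pred, recon, model_sd=0.3, recon_sd=0.4, inflation=1.0))
# After inflating the model spread, the errors fall comfortably inside [-3, 3].
print(standardised_errors(pred, recon, model_sd=0.3, recon_sd=0.4, inflation=4.0))
```

Only the model's arms grow: the reconstruction uncertainty (`recon_sd`) is left alone, exactly as in the analogy.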

Now we combine the reconstructions with the ensemble “prediction” of the LGM. This gives the best of both worlds. The reconstructions give us information from the real world (albeit more indirect than we would like). The model gives us the link between LGM temperatures and climate sensitivity. The model ensemble and reconstructions are combined in a “fair” way, by taking into account the uncertainties on each side. If the model ensemble has a small uncertainty and the reconstructions have a large uncertainty, then the combined result is closer to the model prediction, and vice versa. This is a weighted average of two things, which is easier than a weighted average of many things (the approach I described earlier). [ Those who want more details: this is essentially a Kalman Filter, but in this context it is known as Bayes Linear updating. ]
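
A minimal sketch of that uncertainty-weighted combination, as a scalar Kalman-style update (all numbers invented; the real update is multivariate):

```python
# Combine a model prediction and a reconstruction of the same quantity,
# weighting each by how uncertain it is.
def bayes_linear_update(model_mean, model_var, obs_mean, obs_var):
    """Return the combined mean and variance.  The 'gain' says how far the
    model estimate gets dragged towards the observation."""
    gain = model_var / (model_var + obs_var)
    updated_mean = model_mean + gain * (obs_mean - model_mean)
    updated_var = (1.0 - gain) * model_var
    return updated_mean, updated_var

# Confident model, uncertain reconstruction: result stays near the model.
print(bayes_linear_update(model_mean=-5.0, model_var=0.25, obs_mean=-3.0, obs_var=4.0))
# Uncertain model, confident reconstruction: result is dragged to the data.
print(bayes_linear_update(model_mean=-5.0, model_var=4.0, obs_mean=-3.0, obs_var=0.25))
```

Either way the combined variance is smaller than the model's alone, which is where the "reducing uncertainty" comes from.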

To recap:

Reconstructions – we use plant- and bug-based reconstructions of LGM temperatures.

Model prediction – after throwing out models that aren’t very similar, we take the average of the others as our “prediction” of Last Glacial Maximum (LGM) temperatures and climate sensitivity.

Model uncertainty – we multiply the spread of the ensemble by a scaling factor so that the LGM prediction agrees with the reconstructions.

LGM “prediction” – we combine the model prediction with the reconstructions. The combination is closer to whichever has the smallest uncertainty, model or reconstruction.

Now for climate sensitivity. The climate sensitivity gets “dragged” by the reconstructions in the same way as the LGM temperatures. (For this we have to assume that the model uncertainty is the same in the past as the future: this is not at all guaranteed, but inconveniently we don’t have any observations of the future to check). If the LGM “prediction” is generally colder than the LGM reconstructions, it gets dragged to a less-cold LGM and the climate sensitivity gets dragged to a less-warm temperature. And that’s…*jazz hands*….a joint Bayes Linear update of a HadCM3 perturbed parameter ensemble by two LGM proxy-based reconstructions under judgements of ensemble exchangeability and co-exchangeability of reality.
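
A toy sketch of the dragging (all numbers invented): climate sensitivity is never observed directly, but the ensemble's covariance between LGM temperature and sensitivity carries the update across.

```python
# Joint Kalman-style update of (LGM temperature, climate sensitivity)
# from an observation of LGM temperature alone.  Toy numbers throughout.
def joint_update(lgm_mean, sens_mean, lgm_var, cov, obs_mean, obs_var):
    gain_lgm  = lgm_var / (lgm_var + obs_var)
    gain_sens = cov / (lgm_var + obs_var)   # the covariance carries the drag
    innovation = obs_mean - lgm_mean        # reconstruction minus prediction
    return (lgm_mean + gain_lgm * innovation,
            sens_mean + gain_sens * innovation)

# Ensemble first guess: LGM is 5 degC colder, sensitivity is 3 degC.
# Negative covariance: versions with a colder LGM have a higher sensitivity.
lgm, sens = joint_update(lgm_mean=-5.0, sens_mean=3.0, lgm_var=1.0,
                         cov=-0.8, obs_mean=-4.0, obs_var=1.0)
# The reconstruction is less cold than the prediction, so the LGM estimate
# warms (towards -4.5) and the sensitivity is dragged downward (towards 2.6).
```

This is the one-line version of the assumption flagged above: the same model-uncertainty judgements made for the past are carried over, via the covariance, to the quantity we actually care about.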

I’m afraid the result itself is going to be a cliffhanger. As I said at the top, I want to talk about the method without being distracted by our preliminary result. But if you’ve got this far…thank you for persevering through my exploratory explanations of some state-of-the-art statistics in climate prediction.

Just as I post this, I am beginning my travels home from Vienna, so apologies for comments getting stuck in moderation while I am offline.

Update: I’ve fixed the link to the Rougier et al. manuscript.

Caveat 1. Please note that my descriptions may be a bit over-simplified or, to use the technical term, “hand-wavy”. Our method is slightly different from the statistics manuscript I linked to above, but near enough to be worth reading if you want the technical details. If anyone is keen to see my incomprehensible and stuffed-to-bursting slides, I’ve put them on my page. I’ve hidden the final result of climate sensitivity (and the discussion of it)…

Caveat 2. This work is VERY PRELIMINARY, so don’t tell anyone, ok? Also please be kind – I stayed up too late last night writing this, purely because I am all excited about it.

* Not listed on the PalaeoQUMP website is Ben Booth (who has commented here about his aerosol paper), an honorary member who helped me a lot with the climate modelling.

** N.B. if you want to use the pollen data, contact Pat “Bart” Bartlein for a new version because the old files have a few points with “screwed up missing data codes”, as he put it. These are obvious because the uncertainties are something like 600 degrees.***

*** No jokes about palaeoclimate reconstruction uncertainties please.