A Sensitive Subject

Grab yourself a cup of tea. This is a long one.

Yesterday was historic for me. It was the first time I presented a result about ‘climate sensitivity’ (more on this later). This is how it felt to get that result two weeks ago:

In June 2006, we were just a bunch of bright-eyed, bushy-tailed researchers, eager to make a difference in the brave new world of “using palaeodata to reduce uncertainties in climate prediction”. Little did we know the road would be so treacherous, so winding, and so, so long….

Many years later, when we were all old and grey, we finally reached the first of our goals: a preliminary result. On Thursday 12th April 2012, Dr Jonathan C. Rougier produced a plot named, simply, ‘sensitivity.pdf’…

I was presenting these results in a session at the big (11 000 participants) annual European Geosciences Union conference in Vienna. The first speaker was James Hansen, who is rather big in climate circles, and the second David Stainforth, who is first author of the first big climateprediction.net result (they dish out climate models for people to run in the background on their computers). I was third, slightly rattled from finishing my slides just before the session and running through them only once.

If any of you were at the session, I’d prefer you not to talk about our final result, at least for now…it is so preliminary, and I’d prefer this to be a ‘black box’ discussion of our work without prejudice or assumptions from our preliminary numbers.

A little history…

Our project (my first climate job since leaving particle physics) had the rather lovely name of PalaeoQUMP * and the aim of reducing uncertainty about climate sensitivity. By ‘reducing uncertainty’ I mean making the error bars smaller, pinning down the range in which we think the number lies. Climate sensitivity is the global warming you would get if you doubled the concentrations of carbon dioxide in the atmosphere. The earth is slow at reacting to change, so you have to wait until the temperature has stopped changing. Svante Arrhenius (Swedish scientist, 1859-1927) had a go at this in 1896. He did “tedious calculations” by hand and came up with 5.5 degC. He added that this was probably too high, and in 1901 revised it to 4 degC.

The idea was to reproduce the method of the original lovely-named project QUMP, the internal name given to the Met Office Hadley Centre research into Quantifying Uncertainty in Model Predictions. They compared a large group of climate model simulations with observations of recent climate, to see which were the most realistic and therefore which were more likely to be reliable for predicting the future. QUMP was the foundation for the UK Climate Projections, which provide “information designed to help those needing to plan how they will adapt to a changing climate”. We planned to repeat their work, but looking much further back in time – using what knowledge we have of the climate 6000 years ago (the ‘Mid-Holocene’) and 21 000 years ago (the height of the last ice age, or ‘Last Glacial Maximum’), instead of the recent past.

Fairly early into this project I wrote – with Michel Crucifix and Sandy Harrison – a review paper about people’s efforts to estimate climate sensitivity, which I’ve just put on arxiv.org because I support open science.

PalaeoQUMP ended in 2010 without us publishing any scientific results, for a variety of reasons: ambitious aims, loss of collaborators from the project, and my own personal reasons. Two of the original members – Jonty Rougier (statistician) and Mat Collins (climate modeller, formerly at the Met Office Hadley Centre) – and I continued to work with our climate simulations when we found time. We got distracted along the way from the original goal of climate sensitivity by interesting questions about how best to learn about past climates, but pootled along happily.

But late last year a group of scientists led by Andreas Schmittner published a result that was very similar to our original plan: comparing a large number of climate model simulations to information about the Last Glacial Maximum to try and reduce the uncertainty in climate sensitivity. Their result certainly had a small uncertainty, and it was also much lower than most people had found previously: a 90% probability of being in the range 1.4 to 2.8 degC. This sent a mini-ripple around people interested in climate sensitivity, palaeoclimate and future predictions. The authors were quite critical of their own work, making the possible weak points clear. One of the main weaknesses was that their method needed a very large number of simulations, so they had to use a climate model with a very simple representation of the atmosphere (because it is faster to run). They invited others to repeat their method and test it.

So we took up the gauntlet…

We have a group, an ensemble, of 17 versions of a climate model. The model is called HadCM3, which is a fairly old (but therefore quite fast and well-understood) version of the Hadley Centre climate model. It has a much better representation of the atmosphere than the one used by Andreas Schmittner. In this case “better” is not too controversial: we have atmospheric circulation, they don’t.

We created the different model versions by changing the values of the ‘input parameters’. These are control dials that change the way the model behaves. Unfortunately we don’t know the correct setting for these dials, for lots of reasons: we don’t have the right observations to test them with, or a setting that gives good simulations of temperature might be a bad setting for rainfall. So these are uncertain parameters and we use lots of different settings to create a group of model versions which are all plausible. This group is known as a perturbed parameter ensemble.
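For anyone who likes to think in code, here is a minimal sketch of what building a perturbed parameter ensemble amounts to. The parameter names and ranges are invented for illustration, and I use naive uniform random draws rather than a proper experiment design:

```python
import random

# Invented for illustration: three uncertain "control dials" and the
# plausible range we allow each one to take.
PARAM_RANGES = {
    "entrainment_rate": (0.6, 9.0),
    "ice_fall_speed": (0.5, 2.0),
    "ocean_diffusivity": (100.0, 1000.0),
}

def make_ensemble(n_members, seed=0):
    """Return one dict of parameter settings per model version."""
    rng = random.Random(seed)
    return [
        {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}
        for _ in range(n_members)
    ]

# Each member is a plausible version of the model: run them all, and the
# spread of their simulations tells you about parameter uncertainty.
ensemble = make_ensemble(17)
print(len(ensemble))  # 17
```

Real ensembles, including ours, are designed much more carefully than this (e.g. with space-filling sampling schemes), and some parameter combinations get rejected because the model drifts or crashes.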

We use the ensemble to simulate the Last Glacial Maximum (LGM), the preindustrial period (as a reference), and a climate the same as the preindustrial but with double the CO2 concentrations (to calculate climate sensitivity). We can then compare the LGM simulations to reconstructions of the LGM climate. These reconstructions are based on fossilised plants and animals: by looking at the kinds of species that were fossilised (e.g. something that likes cold climates) and where they lived (e.g. further south than they live today), it is possible to get a surprisingly consistent picture of climates of the past. Reconstructing past climates is difficult, and it’s even harder to estimate the uncertainty, the error bars. I won’t discuss these difficulties in this particular post, and generalised attacks on you know who will not be tolerated in the comments! We used reconstructions of air temperature based on pollen ** and reconstructions of sea surface temperatures based on numerous bugs and things. Andreas Schmittner and his group used the same.

We’re using a shiny new statistical method from Jonty Rougier and two collaborators, which has not yet been published (still in review) but is available online if you want to deluge yourself with charmingly written but quite tricky statistics. It’s a general and simple (compared with previous approaches) way to deal with the very title of this blog: the wrongness of models. The description below is full of ‘saying that’, ‘judge’, ‘reckon’ and so on. Statistics, and science, are full of ‘judgements’: yes, subjectivity. We have to simplify, approximate, and guess-to-the-best-of-our-abilities-and-knowledge. A lot of the statements below are not “This Is The Truth” but more “This Is What We Have Decided To Do To Get An Answer And In Future Work These Decisions May Change”. Please bear this in mind!

Think of an ensemble of climate simulations of temperature. These might be from one model with lots of different values for the control parameters, or they might be completely different models from different research institutes. Most of them look vaguely similar to each other. One is a bit of an oddity. Two look nearly identical. Here is a slightly abstract picture of this:

The crosses in the picture are mostly the same sort of distance from the centre spot, but in different places. One is quite a lot further out. Two are practically on top of each other.

How should we combine all these together to estimate the real temperature? A simple average of everything? Do we give the odd-one-out a smaller contribution? Do we give the near-identical ones smaller contributions too? What if a different model is an oddity for rainfall? Even if we come up with different contributions, different weightings, for each model, the real problem is often relating these back to the original “design” of the ensemble. If our model only has one uncertain parameter, it’s easy. We can steadily increase that control dial for each of the different simulations. Then we compare all the simulations to the real world, find the “best” setting for that parameter, and use this for predicting future climate. This is easy because we know the relationship between each version of the model: each one has a slightly higher setting of the parameter. But if we have a lot of uncertain parameters, it is much harder to find the best settings for all of them at once. It is even worse if we have an ensemble of models from different research institutes, which each have a lot of different uncertain parameters and it is impossible to work out a relationship between all the models.

These problems have given statisticians headaches for several years. We like statisticians, so we want to give them a nice cup of tea and an easier life.

Jonty and Michael and Leanna’s method tries to make life easier, and begins by asking the question the other way round. Can we throw out some of the models so that the ones that are left are all similar to each other? Then we can stop worrying about how to give them different contributions: we can stop using the individual crosses and just use the average of the rest (the centre spot).

We also don’t need to know the relationship between different models. Instead of using observations of the real world to pick out the “best” model, we will take the average of all of them and let the observations “drag” this average towards reality (I will explain this part later).

How do you decide which models to throw out? This is basically a judgement call. One way is to look at the difference between a model and the average of the others. If any are very far away from the average, chuck them. Another is to squint and look at the simulations and see if any look very different from the others. Yes, really! The point is that it is easier to do this, to justify the decisions, and to use the average, than to decide what contribution to give each model.
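As a toy illustration of the first approach (distance from the average of the others), here is a hypothetical Python sketch. The three-standard-deviation cut is my own invented threshold for the example, not a rule from the method; in practice it is a judgement call:

```python
def screen_ensemble(values, threshold=3.0):
    """Return indices of members to keep, judged by each member's distance
    from the mean of the *others* (leave-one-out), in standard deviations."""
    kept = []
    for i, v in enumerate(values):
        others = values[:i] + values[i + 1:]
        mean = sum(others) / len(others)
        sd = (sum((x - mean) ** 2 for x in others) / (len(others) - 1)) ** 0.5
        if sd == 0 or abs(v - mean) <= threshold * sd:
            kept.append(i)
    return kept

# Four similar members and one oddity: the oddity gets chucked.
print(screen_ensemble([1.0, 1.1, 0.9, 1.05, 5.0]))  # [0, 1, 2, 3]
```

Here `values` would be some scalar summary of each simulation, e.g. a global mean temperature; squinting at maps is the multidimensional version of the same idea.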

The next part of their cunning is reckoning that all the models are equally good – or equally bad, depending on the emptiness or fullness of your glass – at simulating reality. In other words, the models are closer to the ensemble average than reality is. We can add a red star for “reality” outside the cluster of models:

(Notice I’ve now thrown away the outlier model and one of the two near-identical ones.) This is saying that models are probably more like each other than they are like the real world. I think most visitors to this blog would agree…

There is one more decision. The difficulty is not just in combining models but also interpreting the spread in results. Does the ensemble cover the whole range of uncertainty? We think it probably doesn’t, no matter how many models you have or how excellent and cunning your choices in varying the uncertain input parameters. We will say that it does have the same kind of “shape”: maybe the ensemble spread is bigger for Arctic temperatures than for tropical temperatures, so we’ll take that as useful information for the model uncertainty. But we think it should be scaled up, should be multiplied by a number. How much should we scale it up? More on this later…

All of this was just to turn the ensemble into a prediction of LGM temperatures (from the LGM ensemble) and climate sensitivity (from the doubled CO2 ensemble), with uncertainties for each. We will now compare and then combine the LGM temperatures with the reconstructions.

Here is the part where we inflate – actually the technical term, like a balloon – the ensemble spread to give us model uncertainty. How far? The short answer is: until the prediction agrees with the reconstruction. The long answer is a slightly bizarre analogy that comes to mind. Imagine you and a friend are standing about 10 feet apart. You want to hold hands, but you can’t reach. This is what happens if your uncertainties are too small. The prediction and the reconstruction just can’t hold hands; they can’t be friends. Now imagine that you so much want to hold their hand that your arms start growing….growing…growing… until you can reach their hand, perhaps even far enough for a cuddle. You are the model ensemble, and we have just inflated your arms / uncertainty. Your friend is the reconstruction. Your friend’s arms don’t change, because we (choose to) believe the estimates of uncertainty that the reconstruction people give us. But luckily we can inflate your arms, so that now you “agree” with each other. [ For those who want more detail, the hand-holding is a histogram of standardised predictive errors that looks sensible: centred at zero, with most of its mass in [-3, 3]. ]
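In code, the arm-growing might look something like this hypothetical sketch: inflate the ensemble spread by successively larger factors until the standardised errors all land in [-3, 3]. In the real analysis this is judged from histograms rather than a hard cut-off, and the candidate factors and toy numbers here are invented:

```python
def standardised_errors(pred_mean, pred_var, obs, obs_var, inflation):
    """One standardised error per location: (reconstruction - prediction)
    divided by the total uncertainty, with the model-spread part inflated."""
    return [(o - m) / ((inflation * pv + ov) ** 0.5)
            for m, pv, o, ov in zip(pred_mean, pred_var, obs, obs_var)]

def smallest_ok_inflation(pred_mean, pred_var, obs, obs_var,
                          candidates=(1, 2, 5, 10, 20, 50)):
    """Smallest inflation factor whose standardised errors all sit in [-3, 3]."""
    for k in candidates:
        errs = standardised_errors(pred_mean, pred_var, obs, obs_var, k)
        if all(abs(e) <= 3 for e in errs):
            return k
    return candidates[-1]

# Toy numbers: the raw ensemble spread is far too small to "reach" the
# reconstruction, so a large inflation factor is needed.
print(smallest_ok_inflation([0.0, 0.0], [1.0, 1.0], [8.0, -8.0], [1.0, 1.0]))  # 10
```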

Now we combine the reconstructions with the ensemble “prediction” of the LGM. This gives the best-of-both-worlds. The reconstructions give us information from the real world (albeit more indirect than we would like). The model gives us the link between LGM temperatures and climate sensitivity. The model ensemble and reconstructions are combined in a “fair” way, by taking into account the uncertainties on each side. If the model ensemble has a small uncertainty and the reconstructions have a large uncertainty, then the combined result is closer to the model prediction, and vice versa. This is a weighted average of two things, which is easier than a weighted average of many things (the approach I described earlier). [ Those who want more details: this is essentially a Kalman Filter, but in this context it is known as Bayes Linear updating. ].
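In its simplest one-number form, the weighted combination looks like the sketch below. Treat it as hypothetical: the real update works on whole temperature fields at once, but the weighting logic is the same.

```python
def bayes_linear_update(model_mean, model_var, recon_mean, recon_var):
    """Combine a model prediction with a reconstruction, weighting each
    by the other's uncertainty. Returns (updated mean, updated variance)."""
    # The "gain" is how much we trust the reconstruction relative to the model.
    gain = model_var / (model_var + recon_var)
    mean = model_mean + gain * (recon_mean - model_mean)
    var = (1 - gain) * model_var  # combining always shrinks the uncertainty
    return mean, var
```

For example, a model prediction of -4 degC with variance 4, combined with a reconstruction of -6 degC with variance 1, gives an updated mean of -5.6 degC: the result sits closer to the less uncertain reconstruction, exactly as described above.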

To recap:

Reconstructions – we use plant- and bug-based reconstructions of LGM temperatures.

Model prediction – after throwing out models that aren’t very similar, we take the average of the others as our “prediction” of Last Glacial Maximum (LGM) temperatures and climate sensitivity.

Model uncertainty – we multiply the spread of the ensemble by a scaling factor so that the LGM prediction agrees with the reconstructions.

LGM “prediction” – we combine the model prediction with the reconstructions. The combination is closer to whichever has the smallest uncertainty, model or reconstruction.

Now for climate sensitivity. The climate sensitivity gets “dragged” by the reconstructions in the same way as the LGM temperatures. (For this we have to assume that the model uncertainty is the same in the past as the future: this is not at all guaranteed, but inconveniently we don’t have any observations of the future to check). If the LGM “prediction” is generally colder than the LGM reconstructions, it gets dragged to a less-cold LGM and the climate sensitivity gets dragged to a less-warm temperature. And that’s…*jazz hands*….a joint Bayes Linear update of a HadCM3 perturbed parameter ensemble by two LGM proxy-based reconstructions under judgements of ensemble exchangeability and co-exchangeability of reality.
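The dragging step can be sketched in the same hypothetical one-number style: the ensemble covariance between LGM temperature and climate sensitivity carries the reconstruction’s pull across to the sensitivity. All the numbers here are invented; the covariance is negative because model versions with a colder LGM tend to have a higher sensitivity.

```python
def drag_sensitivity(lgm_mean, lgm_var, sens_mean, cov_lgm_sens,
                     recon_mean, recon_var):
    """Joint Bayes Linear-style update (sketch): adjust climate sensitivity
    using the mismatch between LGM prediction and LGM reconstruction."""
    gain = cov_lgm_sens / (lgm_var + recon_var)
    return sens_mean + gain * (recon_mean - lgm_mean)

# A too-cold LGM prediction (-6 vs -4.5 reconstructed), with a negative
# covariance, drags the sensitivity from 3.5 down towards 3.0 degC.
print(drag_sensitivity(-6.0, 1.0, 3.5, -0.5, -4.5, 0.5))  # 3.0
```

This is the step that has to assume the model uncertainty is the same in the past as in the future, as noted in the paragraph above.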

I’m afraid the result itself is going to be a cliffhanger. As I said at the top, I want to talk about the method without being distracted by our preliminary result. But if you’ve got this far…thank you for persevering through my exploratory explanations of some state-of-the-art statistics in climate prediction.

Just as I post this, I am beginning my travels home from Vienna so apologies for comments getting stuck in moderation while I am offline.

Update: I’ve fixed the link to the Rougier et al. manuscript.

Caveat 1. Please note that my descriptions may be a bit over-simplified or, to use the technical term, “hand-wavy”. Our method is slightly different from the statistics manuscript I linked to above, but near enough to be worth reading if you want the technical details. If anyone is keen to see my incomprehensible and stuffed-to-bursting slides, I’ve put them on my Academia.edu page. I’ve hidden the final result of climate sensitivity (and the discussion of it)…

Caveat 2. This work is VERY PRELIMINARY, so don’t tell anyone, ok? Also please be kind – I stayed up too late last night writing this, purely because I am all excited about it.

* Not listed on the PalaeoQUMP website is Ben Booth (who has commented here about his aerosol paper), an honorary member who helped me a lot with the climate modelling.

** N.B. if you want to use the pollen data, contact Pat “Bart” Bartlein for a new version because the old files have a few points with “screwed up missing data codes”, as he put it. These are obvious because the uncertainties are something like 600 degrees.***

*** No jokes about palaeoclimate reconstruction uncertainties please.


  1. James Monk

    Is it really valid to “chuck out” the outlying models with no a priori reason? If by “model” here you mean a different variation on the same set of model parameters (rather than a completely different model that has different parameters), then by chucking out the outliers are you not simply finding the region close to where the differential of the model output (temperature or whatever) in the parameters is close to zero? Isn’t this just the density of the models Vs. temperature (or whatever other output)?

  2. Tamsin Edwards

    I’ve probably made things confusing by conflating the aims of Jonty’s paper, which is targeted at multi-model ensembles (different research groups) with this work, which adapts it to our perturbed parameter ensemble. *We* don’t chuck out any of our models. You make a good point, but we are varying more parameters (31!) than we have ensemble members (17) so we can’t see local minima/maxima. Don’t blame us for the ensemble design, it’s not ours 🙂

    Here’s an excerpt from a document I wrote for Jonty, which he edited:

    “We have a priori grounds for treating the ensemble as exchangeable, due to the large number of parameters that were perturbed, and the consequent distance between any two members of the ensemble in 31-dimensional parameter space. We judge that distance to be at least as large as the correlation length of our judgements.”

    I am still trying to understand what the correlation length of my judgement is 🙂

    With the multi-model ensemble, the a priori reason is the distance from the mean of the others (p21 in Jonty’s manuscript).

    • James Monk

      31 parameters!? How many data points are you fitting?

      I agree there’s an important difference between parameter ensembles (we’d call a parameter set a “tune”) and completely different models that don’t even share parameters (e.g. you mentioned that some models don’t include circulation). It tells you whether you are missing important model features out completely.

      I am trying to digest your excerpt. I think it means you have no a priori reason for thinking any tune better than any other tune, and therefore they should be symmetrical to being swapped. An outlier would presumably break this symmetry, so you chuck it (except that *you* don’t). That seems a bit weird to me – what if one of the outliers got lucky and completely nailed the data (unlikely, but possible). Why would you want to reject it? Or have I misunderstood?

      Looking forward to your seminar in a few weeks. Is this going to be in it?

      • Tamsin Edwards

        I don’t have the number of points to hand but you can get an idea from the maps in my talk – 2nd to last page I think. But we’re not trying to estimate those parameters. We’re only varying them to estimate the model uncertainty.

        Your question is related to Nick’s. I suppose we could think of an extreme example. Say we had a multi-model ensemble where one was a state-of-the-art ultra-high resolution fancy pants climate model and all the rest were simple and low resolution. Yes, the more complex one will probably be closer to the real climate, but we need replications for uncertainty estimation so we can’t ditch the others. And if we use them all, the exchangeability theorem that lets us represent an ensemble with its mean doesn’t apply (intuitively it wouldn’t feel right to do an unweighted mean). So we have to, painfully, exclude the good model and use the rest. That means we will have to have a large uncertainty (inflate the variance a lot). But that’s better than (a) a wrong uncertainty estimate, if we use them all, or (b) no uncertainty estimate, if we use one good model.

        Does that convince you?

        • James Monk

          Not really 🙂 If you have a sophisticated model and a simpler (often older) model and the more sophisticated model agrees better with the data, well then that’s why the model builders added the new or different features in the first place! Treating the models as equals when we know full well that some are more sophisticated than others doesn’t seem right to me.

          Maybe we mean different things by model uncertainty. What I mean is this: each model has an optimum set of parameter values, which maximise the agreement with data (or minimise the disagreement 😉). To find the model uncertainty, I would ask the question “how far can I flap the parameter values about without significantly worsening the (dis)agreement of the model with data?” Now when I come to make a new prediction, I can use the outputs of those flapped-about parameter sets as a starting estimate of my model uncertainty. Maybe some model parameters aren’t well constrained by existing data, in which case they can roam freely and any prediction that is sensitive to them is going to have a large uncertainty. Maybe some parameters are well constrained, and then the opposite is true.

  3. Mark M


    I saw your post over at Judith’s. Thanks for the reference! I can’t wait for part two as your approach to the problem(s) of which is the best model(s) to use seems spot on to me.

  4. BillC


    Maybe it’s because you wrote the latter late at night in an excited state, or maybe it’s because I’ve sunk to a new low of only being able to comprehend academe-speak (sigh) but I thought your abstract on Academia.edu was rather clearer than your blog post.

    I’ve been trying to think of an intelligent question – really! – and the best I can do is – given the use of higher-detail models and multiple reconstruction timeframes (vs Schmittner), how prominently does the use of a new statistical method affect your result?

    • Tamsin Edwards

      Yes, if you’re used to the jargon it’s often much quicker to understand it that way. But I was aiming to make it understandable to friends and family that read this, and anyone that hasn’t seen much science since school. I agree it could be explained better though (and adding in the maps from my talk would have helped).

      I’m very interested to find out which is more important, a different climate model or a different statistical method. (we haven’t included the Mid-Holocene yet). I expect they interplay in important ways. For example, the model they use couldn’t get the difference between land and sea temperatures right so it could only match the land (pollen) data or the ocean data at one time, not both at once. But a large enough estimate of model uncertainty should account for this, I think. If our model is better *and* we add more model uncertainty then perhaps this mismatch between the data sets will go away (being coy about prelim result, sorry!). Also, I think their statistical model is only really suited to very smooth climate models, which theirs is. So even if we had the same ensemble size etc, if we used their method on our climate model I think it wouldn’t be reliable.

      This is speculation right now though.

  5. Nick

    I have to agree that throwing out models because of inconsistency, while logically tempting, might not be getting the full picture if, say, reality actually has more than one characteristic behaviour (as in nonlinear systems there can exist multiple solutions).

    And well done on the result Tamsin!

    On a side note, I always prefer talking about “certainty” without the “un-“. I think the Manchester Physics Department “Error Bar” tagline “there are no mistakes in our coffee” missed a trick in not playing on this with something like “we’re confident about the quality of our coffee”…

    • Tamsin Edwards

      Mm, yes. I guess my first reply is that this is a pragmatic and simple-to-use approach rather than a most-sophisticated approach. I will have a chat to Jonty about it.

  6. mrsean2k

    Forgive these questions for jumping around a bit;

    1) In terms of constraints on processing time, is the number of model runs / parameter variants the dominating factor, as opposed to the process of removing, inflating and averaging?

    2) What sort of spread in terms of resolution do your dependent proxies have?

    3) As far as the subjective elements of the process are concerned, how is the degree of subjectivity mitigated? Are multiple people involved in each of the decisions? If so, what process do you follow, or is it ad-hoc in every case?

    • Tamsin Edwards

      Good questions mrsean2k.

      The model runs are much slower. At the time (2007-2009) they ran at about 3 model years per day, sometimes less, and all my LGM simulations are 200+ years long, so once you factor in some stops and starts it takes about 3 months to run one simulation.

      The processing from raw model output to netcdf and then text files to read in R was a bit fiddly because I wanted to strip out small tidy files to upload online one day. I did a day or two at a time, so it might have been a week or so recently but this was adapted from work in 2010.

      The statistical analysis took about three weeks (in between many other things). This is why we want to check more thoroughly before concluding much!

      Resolution spread – what do you mean? Space? Time? Signal-to-noise?

      Subjectivity. Well, we are doing Bayesian statistics. This is quite an important part of what I want to say in this blog so this comment won’t do it justice. The general idea is that subjectivity in science is inevitable – we are constantly making judgements – so instead of trying to eliminate it, and failing, instead we are upfront about it. We make all our choices clear and say that the result is *our* result. If someone disagrees with our choices they can repeat the work with their own and get *their* result.

      Having said that, we do try to ensure the choices are sensible! So the choice about how much to inflate the ensemble spread is ours, but then we check it by comparing the prediction to the reconstruction. We kept making that number bigger until the “diagnostics” (the histograms I mentioned) looked OK. We had to make it bigger than we expected….

      • Lindsay Lee

        “Subjectivity. Well, we are doing Bayesian statistics. This is quite an important part of what I want to say in this blog so this comment won’t do it justice. The general idea is that subjectivity in science is inevitable – we are constantly making judgements – so instead of trying to eliminate it, and failing, instead we are upfront about it. We make all our choices clear and say that the result is *our* result. If someone disagrees with our choices they can repeat the work with their own and get *their* result.”


  7. Joe's World


    There is a great deal that models do NOT consider or include.
    Mechanical processes sacrificed for data gathering that do not include many parameters.
    I mapped the different velocities of our planet which can be applied to any rotating body even orbits.

    So far, I have found our measuring of atmospheric pressure is very inadequate for the mechanical processes that are in place.

  8. BBD

    Well, I’m afire with impatience to know 🙂 Also amused to see Annan & Hargreaves having a go at ‘replicating’ your results 🙂 (Yes, of course I was scouring twitter to see if someone had let the cat out of the bag).

  9. Joe's World


    I have had my chain yanked by many, many scientists who give an answer that they hope does not actually get checked out.
    Yet I want answers, so I follow their claims.
    Grid square method…did you know that there are 129,600 grid squares at one degree latitude/longitude?
    Many do not have ANY data at all. Many have multiple data.

    But not one includes any mechanics, just data to be deciphered.

    • Tamsin Edwards

      Hi thingsbreak – thanks! That was a great interview and I encourage everyone to take a look. It was good to see Nathan (who is the statistician on Schmittner’s study) discuss the strengths and weaknesses in some depth for a general audience.

  10. Alexander Harvey

    Hi Tamsin thanks for this post and good luck with your collective efforts.

    I have some thoughts on the method which are a bit hazy but hopefully close enough to be a basis for improvement.

    The choice of statistical models for the simulator-ensemble-plus-reality problem has a history. This much I know; I am hazy about the details, but some examples are in the Rougier, Goldstein & House paper linked above. I do know that the notion of exchangeability is crucial and debatable. I am no expert, but if I attempt some illustration it may inspire someone to correct it.

    The model seems, to me at least, to propose that a sample of simulators can be drawn from a supposed simulator generator in a way such that the sample is exchangeable in that each produces results that come from a common joint probability distribution.

    I believe that it is not possible in this case to demonstrate exchangeability, but it is possible to infer when it is not present – in this case, whether outliers are unlikely to be exchangeable, which could be attempted by some test or by consulting an expert, e.g. squinting at the results as Tamsin describes. The question of whether an expert could show skill at picking out a non-exchangeable member is important here, but more so when it comes to whether an expert could pick out reality when included in an ensemble of simulators if fairly presented, e.g. something more subtle than the global temperature series.

    Whether one ensemble member is a bit of an outlier on one particular challenge is not, I think, the criterion. Whether one member is persistently an odd-ball, to the degree that its cumulative statistics can be used to infer that it is not drawn fairly from the same simulator distribution as the other members, is a possible criterion.

    If we judge that the membership is exchangeable we can know something of the statistics of not just the ensemble, but the set of simulators from which the ensemble is a sample, and in a way that is more easily justified than if we judge otherwise. This gives a justifiable case for a necessary removal of odd-ball simulators.

    The case for removing simulators that are similar or familial is I think to do with avoiding double counting which would distort our view of the underlying distribution of simulators.

    As I see it the Rougier paper argues that the notion, which has its supporters I think, that reality is exchangeable with members of an ensemble of simulators is neither easy to justify, likely, nor evidentially supported. It is argued that a bar to exchangeability is due to a discrepancy from the ensemble, labelled U, and that U has the properties of a statistical function, e.g. a mean and variance. It is also assumed that the mean E(U) = 0, presumably an assumption until shown otherwise, if that is possible. This I find puzzling. One could see this as inferring that reality is centred on the ensemble mean, but that would be surprising. I think it is more an expression of ignorance about E(U) than a determination of its value. What is argued is that Var(U) is greater than the variance of the ensemble about its mean under the assumption that E(U) = 0.

    As I understand it, the notion of inflation, expanding the variance of the ensemble, could be performed to a degree where the new scale is such that reality would not stick out like a sore thumb: we could reach a situation where we could no longer judge reality to be identifiably non-exchangeable with the ensemble members. I would presume that we could devise statistical tests or hurdles to support that. Bearing in mind that the reality of, say, the average temperature field during the LGM is not a single value, and could be decomposed into a set of orthogonal functions each with a scalar value, as could the results of the ensemble, justifying inflation seems to imply that the ensemble variance for each of these scalars, when inflated by the same amount, would render the values obtained from reality (as made fuzzy by the process of inference from proxies) in some way plausible for a possible member of the inflated ensemble, and perhaps that some particular degree of inflation would be, in that sense, most likely.
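A minimal numerical sketch of this inflation idea (the ensemble values, the "observation", and the plausibility criterion are all made up for illustration; this is not the procedure in the paper):

```python
import numpy as np

def min_inflation(ensemble, observed, z_max=2.0):
    """Smallest factor by which the ensemble variance must be inflated so
    that `observed` lies within z_max inflated standard deviations of the
    ensemble mean.  The criterion is illustrative, not the paper's method."""
    mean = ensemble.mean()
    var = ensemble.var(ddof=1)
    z2 = (observed - mean) ** 2 / var   # squared standardised distance
    return max(1.0, z2 / z_max ** 2)    # inflate until |z| <= z_max

# Hypothetical scalar summaries (e.g. an LGM cooling, degC): six ensemble
# members and one "reality" that sticks out like a sore thumb.
ens = np.array([-4.1, -4.5, -3.8, -4.0, -4.3, -3.9])
obs = -5.5
lam = min_inflation(ens, obs)           # > 1: some inflation is needed
```

A value near 1 would say the ensemble spread already makes the observation plausible; a large value quantifies how far it sticks out.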

    It could be argued, and I expect will be, that this is all nonsense: not just my interpretation, but any attempt along the lines set out in the paper to construct a model that embraces both simulators and reality. The alternative is to find another way, or to determine that the simulators are informative, or at least fail to show otherwise. Obviously the objective here is to render an ensemble of simulators informative in a way that can be clearly described and justified, by which I mean only that a justification can be stated in sufficient detail that it can be debated. The hurdle to the inclusion of simulations in refining our understanding of the temperature field during the LGM, or in similar cases, can be considered quite low: “Are they worse than useless?”. Even if they lack skill in determining a value, any ability to shed light on the accuracy, or lack of it, of another imperfect inference, in this case from proxies, is to be welcomed. I see this as an example of using best efforts to form a best current judgement, which needs to be described, with justification, in sufficient detail to be accepted as fit and proper by those expert in the field, and to allow them to support or dismiss it on such terms.

    This much is true: reading this page, the paper, and the presentation linked to has provided some answers, but not yet the one that most may be awaiting. In my mind it has raised as many questions or more. Hopefully I will not be alone in finding that the approach is of as much or perhaps more importance than this current and hopefully preliminary instance of a result, though I accept it may be a minority view.


    • Jonty Rougier

      Hi Alex,

      You raise several interesting questions in your post, but I just want to focus here on the big picture, of how we do quantitative inference, and what it means.

      The judgement of exchangeability of a sequence is a prior judgement, made on the basis of the operational definition of the sequence but before the values are observed. It asserts that the sequence labels, the i’s on the Xi’s, are uninformative about the sequence values. In our case the sequence labels index the rows of a table with 17 rows and 31 columns, where each column indicates the value of a simulator parameter. So what we are asserting, as our statistical model, is that when we ask ourselves the question “Is, say, sea surface temperature in the Azores higher or lower in run 7 than in run 9?” then, after consulting the table, we still reply “Dunno”. And this “Dunno” holds for all subsets of the 17 runs. Or, to put this another way, when we actually compare 7 and 9 and find that sea surface temperature is higher in the former, we don’t say “Ha, I thought so!”, but merely, “Oh”.

      Now there are both good and bad things to be said about a statistical model based on a prior judgement of ignorance! The good things are that it is mathematically tractable, and that there is a level playing field. I do not particularly buy the second argument, but undoubtedly it has political appeal. The main bad thing is that by adopting ignorance as a starting point, one discards judgements that may have a bearing on the outcome. Discarding such judgements is always a painful business, given the efforts that scientists make to acquire them.

      However, this discarding of judgements has to be seen in the wider context. We acquire our judgements through a long and opaque process of experiment, reflection, and discussion. When we are challenged to quantify our uncertainty, we find that some judgements are very hard to quantify. We often find, in practice, that the demands of the probability calculus, or indeed of any inferential calculus, exceed our ability to quantify. This is a particular challenge for the probability calculus, whose very elegant and tractable structure follows from a very precise quantification; which is why I prefer to quantify uncertainty in a much less intensive way, in terms only of means and variances and covariances. Even this is hard: what is the covariance between sea surface temperature in the Azores, and mean annual temperature over the Rockies?

      So at the point where we quantify our uncertainty and do a calculation, we must acknowledge that the result is only part of the answer. In his very famous book on the foundations of statistics, Jimmy Savage wrote about the ‘large world’ in which choices are made, and the ‘small world’ in which we quantify our uncertainty, the small world being constrained by what we feel able to quantify and analyse. We must not confuse the two. But there is no formula to take us from the small world to the large world. What we do instead is more judgement-forming. We incorporate the results of a small-world statistical quantitative assessment of uncertainty into our large-world assessment. Thus it is not “Climate sensitivity is X” or “Edwards et al say climate sensitivity is X” but “Edwards et al’s small-world assessment of climate sensitivity is X”. And we recollect, if we can, the nature of the small-world statistical model when we assimilate X into all of our judgements about climate sensitivity.

      So, in our small world we have decided to treat the runs of the perturbed parameter ensemble as exchangeable, and to treat climate as respecting exchangeability with the ensemble (but not jointly exchangeable with the ensemble). Then, we are able to derive reconstructions and a prediction for climate sensitivity, both with an assessment of uncertainty, in a straightforward and transparent manner, which it is easy for other people to assess, and to critique. The judgement of exchangeability committed us to using all 17 members of the ensemble, but it would have been bone-headed of us not to check them anyhow! In the paper that Tamsin mentions, with Michael Goldstein and Leanna House, we discarded several members of the CMIP3 ensemble on a priori grounds, and one more on a posteriori grounds.

      So, a statistical model is just that: a model for our reasoning about the underlying system. We should not confuse it with our actual reasoning about the system, any more than we should confuse a climate simulator with actual climate. When we assimilate a statistical prediction into our judgements we must take account of the small-world simplifications we have made. The most important thing is that these simplifications are clearly stated. This is particularly easy for statistical models based on the premise of ignorance, which is why such models are ubiquitous in statistical analysis. But in our case, the complexity of the simulator outputs and the large number of parameters being simultaneously perturbed also make it a plausible model for our prior judgements about the simulator runs.


      • Alexander Harvey

        Hi Jonty,

        Thanks, I think I am currently more confused but hopefully more purposefully confused.

        I think I can see the path though. I will read your paper again, in combination with others by the same authors and also by Simon Shaw. I really need to get exchangeability, second order exchangeability, co-exchangeability, and their equivalent for functions e.g. second order exchangeable functions, sorted out.

        From there I will work on why my intuition steers me towards “truth plus error” as my default model when perhaps it shouldn’t (this you cover and it should be just a matter of comprehension).

        On a more detailed point, I fear that I am not appreciating the effect of the relative dimensionality of the retained output to the number of simulations, e.g. I expect reality necessarily to be meaningfully nearer to one of the simulations than it is to the mean. I will try to quash this.

        What you have written about the big picture is welcome. I can see that I tend to confuse the “model for our reasoning about the underlying system … with our actual reasoning about the system”. However, I am not quite sure how to rein in this tendency. It seems to be a prejudice to impose misplaced meaning on the model rather than on the experiment; hopefully that makes sense.

        I have found much of this has come as rather a shock. It is more indirect than I realised and I seem to have been trying to fight against that. I wonder if I am alone in that.

        I will try to make some further comments when I have something meaningful to add and better know how to express myself. For now, I give my thanks again.


  11. Alexander Harvey

    By the bye,

    One of the slides hints at a particular value being inferred for the inflation factor. If that is what you have found, that alone may tell a tale.


  12. dave souza

    Many thanks for a fascinating glimpse into your progress on this aspect of climate science, and for putting it into lay language so that non-scientists like myself can get a feel for what’s going on. It ties in with other recent reading (a book by someone who will remain nameless in this post) about developing new statistical techniques or adapting statistical methods from other areas, to relate various factors and find estimates for uncertainty. I’m also reminded of the description in Hansen’s book of using Last Glacial Maximum temperatures to estimate climate sensitivity.

    Given that you’ve chosen one well established model and the same reconstructions as Andreas Schmittner and his group, does it follow that this preliminary result is more about testing your methodology than in establishing sensitivity? There will now be an impetus to try out the technique using different models and reconstructions, both refining results and raising new questions.

    Thanks also for the linked material, interesting reading and a lot to take in!

    • Tamsin Edwards

      Thanks dave. Do take a look at our review paper (aimed at non-specialists) if you’re interested in palaeo estimates of climate sensitivity.

      I could say it is about testing *their* methodology 🙂

      It is a well-established model, but not yet used much for perturbed parameter ensemble (PPE) palaeoclimate simulations. I think I’m right in saying we’re the only group to have a two era palaeo-PPE with an atmosphere-ocean circulation model. Even one era PPEs are quite rare and usually use a simplified ocean or quite low resolution.

      • dave souza

        Fair enough, that’s a better way of putting it! The point I was clumsily trying to make is that you’re putting forward a new and improved method with advantages over their approach, and hopefully this will lead on to productive research rather than just being a new answer. The review paper is on my reading list!

        As you’ll perhaps already have seen, the Schmittner group’s result was discussed on RealClimate and given a cautious welcome, with some doubts about the findings. The blog post is worth looking at for the section on “Response and media coverage”, which points to problems that arose with publicity: see http://www.realclimate.org/index.php/archives/2011/11/ice-age-constraints-on-climate-sensitivity/

  13. Sashka

    I’ll have more questions and criticisms later but for now I’m still at work so the most obvious one:

    Unfortunately we don’t know the correct setting for these dials

    Nor do you know that the correct settings exist. E.g. are there correct values for subgrid mixing coefficients?

    How does this reflect on uncertainty or even validity of your results?

    • Lindsay Lee

      “Nor do you know that the correct settings exist. E.g. are there correct values for subgrid mixing coefficients?”

      It’s my understanding that this is precisely the point and so we try to take this into account by the fact that the observation and the ensemble members are not co-exchangeable.

      The discrepancy between the ensemble and the observation takes into account the fact that there is no ‘correct’ parameter/dial setting that will match the observation, and that there are processes that are missing. The work here takes that into account and therefore produces a measure of uncertainty that is an improvement on previous work. It’s also very transparent, and can be tested, changed, and argued over as discussions like this take place and as more knowledge is gained.

      This method does not improve the prediction of climate sensitivity or the models used to predict it per se, but we have a ‘better’ measure of uncertainty around the prediction, all the assumptions made in producing the prediction are stated, and any decisions that might be made from the prediction are better informed.
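One way to picture the role of the discrepancy term, with made-up numbers rather than the project's actual values: the stated uncertainty is the ensemble's own spread plus the discrepancy variance, so acknowledging that no dial setting matches reality widens, never narrows, the error bars.

```python
import numpy as np

# Hypothetical climate sensitivity runs (degC); the discrepancy variance
# Var(U) is a made-up judgement, not a number from the project.
ens = np.array([2.8, 3.1, 3.5, 2.9, 3.3])
ens_var = ens.var(ddof=1)               # spread of the ensemble itself
disc_var = 0.5                          # Var(U): ensemble-reality discrepancy

naive_sd = np.sqrt(ens_var)             # treats the ensemble as the whole story
honest_sd = np.sqrt(ens_var + disc_var) # adds the acknowledged discrepancy

assert honest_sd > naive_sd             # discrepancy can only widen the range
```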

      • Alexander Harvey

        Hi Lindsay,

        Thanks for the link that you gave elsewhere to the GLOMAP paper, fortunately it is very detailed and will answer many things if I read it carefully enough.

        Moving back on topic, I am struggling with these notions of exchangeability and again it seems it may be a matter of reading carefully.

        Here you say that the observation (Z) is NOT assumed to be co-exchangeable with the set of simulator outputs.

        E.g. the joint set {Xi, Z} is NOT co-exchangeable, i.e. it is NOT assumed that Cov(Xi, Z) = some constant matrix C for all i. Have I got this right?

        Yet I think in the Rougier et al paper featured here it is assumed that “truth” (Y) is co-exchangeable with X but not (second-order) exchangeable.

        In the specific case of the Rougier et al paper, Z = HY + W, which I would have thought implied that if Y is co-exchangeable (with X) then so is the observation Z.

        On the broader points, I feel that something I suspect is true is not coming across. As I read it, the adoption of a subjective Bayesian stance with respect to climate analysis is actually leading to weaker (less restrictive, more generalised) assumptions than had been the norm. So in a sense it is proving to be LESS subjective (less ambitious, relying on fewer assumptions), having less subjective information content.

        I believe that the most ambitious and restrictive of current assumptions is that “truth” (Y) is exchangeable with X, which could bear the interpretation that “truth” would be given as an outcome of some undiscovered simulator that is exchangeable with the current simulators.

        Less ambitious is the truth plus error assumption: that “truth” is not exchangeable with the simulators, and that “truth” differs from X by some error term (B).

        Somewhat less ambitious is the co-exchangeable assumption, where truth is not restricted to differ only by an error value, and this is the one chosen in the paper.

        Each of these has its adherents. The most ambitious assumptions are those most taxing to justify. In a sense they are more strongly subjectively informed, e.g. if they impose greater restrictions justified on subjective grounds such as expert opinion.

        Working up that list, each level has the restrictions of those below but imposes additional restrictions. It could be said that this equates to the input of more information of a subjective nature. If each approach used the same data (observational and simulator runs) and the more restrictive assumptions led to a tighter constraint on the variance (more precision) in the result, i.e. the result contained more information (lower entropy), this could surely be viewed as having come from the additional information in the assumptions. This point interests me. Were such an analysis made, the excess precision could be stated as a value (say, a change in entropy) perceived as the information extracted from the additional information content of the assumptions, for I cannot see where else it would have come from.

        Were all three approaches attempted, resulting in three different results, would there be a best result?

        Surely not!

        If it is proposed that there is a true value, the results do not give it; they make statements not about the truth but about our ignorance of the truth. Each result is different because they contain different knowledge about the truth, gleaned from the same observations and simulations but from differing expert judgement. Even if the truth were independently revealed, it would not be a criterion for judging which of the approaches was best. Picking the one that gave the highest probability for the revealed truth is not, I think, sound. All that picks is the approach that found the revealed truth to be the least surprising. Who is to say that, given current evidence and wisdom, the revealed truth would not be surprising?


        • Lindsay Lee

          Hi Alex,

          Firstly, I should say that I don’t work on this project – and haven’t worked on model discrepancy myself, so I don’t know the specifics on the project Tamsin works on or have any experience in the application of this. I’ve only read the Rougier paper.

          I think you are absolutely correct with your first few paragraphs, and I was wrong when I said ‘co-exchangeable’ when I meant ‘exchangeable’ when referring to Z. The observation cannot be considered to be drawn from the same ‘population’ as the samples in the ensemble, since we know that a discrepancy exists.

          I also agree with your view on Bayesian statistics and subjectivity. In my experience declaring the assumptions used makes people ask important questions that otherwise don’t tend to enter the debate.

          The remainder of your comments I’m afraid I’m not going to be able to reply on comprehensively today. I will sit down with the paper and try to form a definitive view of my own on the difference between truth plus error and exchangeability. I admit that when I read the paper I made a note on Equation 1a that said ‘how is this different to the truth plus error model?’. I then read on that the ‘truth plus error’ model is a special case of Equation 1a, and I need to understand this better. I’ll get back to you on this and hopefully have something to add to your points above.

          I will say though that the key difference, as I understand, in the two paradigms is the effect of gaining more models for the ensemble. If the truth plus error model is used then the effect of adding models to the ensemble is to reduce the uncertainty in the estimate of the truth but in the exchangeability model the uncertainty is not reduced but perhaps better understood/quantified (unless the models are known to be improved!). The IPCC good practice guide has some discussion on this and is a good read.



          • Alexander Harvey


            Thanks for the link. I read through and picked up on the statistical models 2a, “truth plus error”, and 2b, “exchangeable”. 2a is indeed not what I had in mind, and represents a tight restriction that the ensemble is centred on “truth”, and not in a way that can vary: the difference between truth and the expectation of an infinite sequence of models is zero, and hence not a random variable. Such a model would, I think, head my list as it is the most restrictive.

            I will come back to this when I have a better grasp.


          • Jonty Rougier

            Hi Alex, hi Lindsay,

            It is a bit confusing. Formally, we choose to treat the ensemble of climate simulator outputs (or perhaps a subset of it) as second-order exchangeable, and actual climate as respecting exchangeability with the ensemble. We refer to this as the “co-exchangeable model” because otherwise it is a bit of a mouthful. As Alex notes, this implies that Cov(Xi, Y) is the same for each i, and, when we add in our model for the measurements, it also implies that Cov(Xi, Z) is the same for each i.

            One special case of this model is that {X1, …, Xm, Y} are jointly exchangeable. This is what James Annan and Julia Hargreaves were aiming for when they tested whether Y was “statistically indistinguishable” from {X1, …, Xm}. This is because joint exchangeability implies that the rank of Y in {X1, …, Xm} is uniform. The difficulty with such tests is that it is very difficult to evaluate their power, and so one is left not knowing what the observed rank of Y indicates. This is a generic issue with hypothesis tests, and particularly acute when the sample is small.
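The rank argument can be sketched numerically (a toy version, not Annan and Hargreaves' actual analysis: the "ensemble" and "reality" below are synthetic draws from one distribution, so joint exchangeability holds by construction and the rank histogram should come out roughly flat):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 17           # ensemble size
n_fields = 200   # hypothetical independent scalar summaries of the fields

# Under joint exchangeability Y behaves as "one more draw", so its rank
# among the m ensemble values is uniform on {0, ..., m}.  Here that holds
# by construction, because Y is drawn from the same distribution.
ranks = []
for _ in range(n_fields):
    ens = rng.normal(0.0, 1.0, m)
    y = rng.normal(0.0, 1.0)
    ranks.append(int((ens < y).sum()))

counts = np.bincount(ranks, minlength=m + 1)
expected = n_fields / (m + 1)
chi2 = ((counts - expected) ** 2 / expected).sum()  # flat histogram -> modest chi2
```

With a single field rather than 200 repetitions, the same statistic says almost nothing, which is the power problem with small samples mentioned above.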

            We would reject that special case a priori, because of our judgement that any pair of climate simulator outputs are more likely to be like each other than any one of them is like actual climate; this seems to be a judgement that is widely shared. It implies (in the context of the model outlined in our paper) that Var(U) > Var(R(X)). James Annan pointed this out to me, in a workshop in Cambridge last year.

            The other special case of our co-exchangeable model is the so-called “truth plus error” model. It is not at all obvious that this is a special case, and it was Michael Goldstein’s perception that revealed it, and also that it imposes a very nasty constraint on our co-exchangeable model that would seem to be hard to justify in practice. This is covered in our paper.

            On the issue of subjectivity, I would hesitate to describe one approach as more or less subjective than another. Subjectivity is the essence of uncertainty, since uncertainty is a property of the mind. But, when one is doing science in the public domain, undoubtedly some approaches are more defensible than others. An approach that requires one to specify means, variances, and covariances is already challenging to defend, since these are not easy quantities to elicit. How much harder, then, to defend an approach that requires a full probability distribution: not just the first two moments, but an infinity of them!

            As part of our defense of our model and our judgements, we can show that they are consistent with the observations. The fact that it is so easy to check this property is a very attractive one for a mean-and-variance approach. You will scan the climate science literature in vain for any attempt to check the higher-moment judgements of fully probabilistic approaches. And in fact they are made for reasons of familiarity and tractability. As I remarked in my previous comment, this should not condemn them, but we should be sensitive to the gap that exists between the statistical model and the scientific inference.


  14. Fred Moolten

    Tamsin – Thanks for a fascinating overview. I have a sense that your latest effort has some features more reminiscent of Holden et al (2010) than Schmittner et al. I’ve described my impressions of both previously elsewhere. They use somewhat similar general concepts but differ in their parametrizations and in their median climate sensitivity estimates, Holden emerging on the higher side of 3C and Schmittner on the lower side. I don’t feel qualified to judge either in depth, but in the case of Schmittner, I had some problems with the very meager LGM cooling estimated from proxy data as well as the difficulty reconciling land and ocean based estimates – the low ECS was influenced primarily by the ocean data. Conversely, Holden et al may have overestimated LGM cooling. I would be interested in your perspective on the comparison between these two papers.

  15. Alexander Harvey

    Others have mentioned the Schmittner paper, so I feel I should too.

    There is something that struck me as very odd about the Schmittner distribution. So odd that I didn’t pay much further attention to it at the time. The distribution looked more like a spectrogram.

    Please note that I have not read the paper but fortunately the SI is freely available and contains a detailed account of the method.

    I think the issue is that the method produced a result capable of resolving fine details. Visual inspection indicates that the method has a resolution of about +/- 0.15-0.2 ºC at 1-sigma for the finer of its multiple peaks. Given that the posterior grid-cell error is at minimum 8 to 10 times that size, the implication is that there are lots of degrees of freedom. I think this would be given by the square of those ratios, i.e. 64-100 dof.
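The degrees-of-freedom arithmetic above, written out (the ratios 8 and 10 are the comment's own visual estimates, not numbers from the Schmittner paper):

```python
# Back-of-envelope: averaging N effectively independent grid cells shrinks
# a standard error by sqrt(N), so an error-to-resolution ratio of r implies
# roughly r**2 effective degrees of freedom.
def implied_dof(ratio):
    return ratio ** 2

low, high = implied_dof(8), implied_dof(10)   # 64 and 100
```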

    That is all fine and dandy, as it probably equates to the realised spatial dofs for the 435 grid cells with reconstructed data, given their assumed correlation length.

    What puzzled me was this inherent resolving power showing up in the final result.

    I think they spotted there was an issue, noted it but let it be. From 7.6 Statistical Assumptions:

    “… Eliminating the nugget implies a larger ECS2xC but produces an extremely narrow uncertainty range, suggesting a mis-specified statistical model.”

    In my terms they exposed the inherent resolving power of the model by removing the nugget (a chunk of uncorrelated error thrown in for good measure).

    It is commonly assumed that climate sensitivity may have a value. The goal in this pursuit is to improve the resolution of our view of that value. Typically this is at around +/- 1ºC at 1-sigma. So most other methodologies have an inherent resolving power several times lower than the Schmittner method.

    It may be natural to assume, as I think they seem to have spotted, that there is a problem in their error covariance matrix. They have a lot less error variance to contend with than is normally the case, by around the factor indicated above, 64-100.

    I think the issue is that there is no representation of model discrepancy in the error matrix above and beyond model noise. There is nothing representing the acknowledged model bias. This in effect declares the model as perfect except for some noise even though it is known not to be able to match both ocean and land temps simultaneously.

    Part of this problem they acknowledge in the same section:

    “Bias: Four experiments assume nonzero bias over land or ocean of ±0.5 K. The ocean bias has a particularly strong influence, altering the ECS2xC estimate by about 1 K.”

    So half a degree of ocean bias produces one degree change in the result.

    Actually, that is not what interests me, as I think it doesn’t do the effect justice. If an effort were made to incorporate model bias, it is likely that the power of resolution would drop back down to where it normally is. The distribution would look much smoother.

    Now perhaps I am missing something here, but it looks to me like they have left out what is commonly considered to be a large, and potentially the dominant, error component. Importantly, it may also be, or is even likely to be, the primary source of error correlated over long spatial distances. If included in the error covariance matrix, it should tend to reduce the spatial dofs considerably. Overall it would likely have the effect of smearing the distribution into a wider and more normal shape. The acknowledged snag is as follows:

    “In the default analysis the bias is assumed to be zero, because it is highly confounded with climate sensitivity, the quantity of interest. Arbitrarily large climate sensitivities (cold LGM temperature anomalies) can be made compatible with the data by introducing a sufficiently large positive model bias, and similarly for arbitrarily small climate sensitivities.”

    Yet they acknowledge the discrepancy between land and ocean when trying to match modelled and reconstructed temperatures.

    One bit I really don’t get: given that there are effectively two distributions, one for land and one for ocean, do you assume that our combined ignorance of the truth is spanned by the largely disjoint distributions, as they seem to (it is as tight or tighter than either), or that our ignorance of the truth should span the possible ignorance if the distributions are as disjoint as they are? I know which I would choose, but then that would give a range from 1 ºC to 5 ºC and I wouldn’t get a star.

    It puzzles me why they seemed to notice but largely ignore the presence of spurious (my word) resolution; it shows up in not one but several of their sensitivity tests, e.g. both ocean bias tests (+/- 0.5 ºC) resolve sharp peaks and not a lot else. I would have thought the odd spectral appearance of the distribution was a hint, but then I did.

    Their treatment has, I think, to be contrasted with the Rougier et al treatment, which has model discrepancy at its core.


    • Alexander Harvey

      By the bye,

      If I am right about this and they had pre-smoothed their reconstructed data, thereby reducing the spatial degrees of freedom down to a handful, as I think others have, they might have lost much or most of the spikiness and not worried that they might have a “mis-specified model”. So I would see that part of their approach as superior, in that it is more telling of potential deficiencies elsewhere. The pre-smoothing approach might still have the same issues, but they may not be so apparent, might not get caveated, and might still affect the final result by making it more certain than can easily be justified. So for that much they are to be thanked.


  16. Joe's World


    A similar model that is full of errors is how scientists state that the “axis has shifted based on our models”.
    That statement is an impossibility as the axis starts from our core through a liquid medium past a hard shell.
    The actual correct statement would be that the shell has shifted over the axis.

  17. Judith Curry

    Hi Tamsin, v. interesting post, I look forward to reading the papers (and of course to seeing your actual sensitivity result!) While dealing with model parameter and initial condition uncertainty are essential, the most profoundly important issue in climate model uncertainty IMO is model structural form.

    Existing climate models generally use essentially the same model structural form, although we are seeing some divergence from this with HAD going nonhydrostatic and introduction of interactive carbon cycle (but as far as I know systematic sensitivity tests re climate sensitivity are still needed esp for hydrostatic vs nonhydrostatic).

    Testing the sensitivity of climate sensitivity to different model structural forms is the next frontier that needs to be tackled, IMO. Systematic differences depending on the numerical solution method should be explored; the NCAR model would be ideal for this since they have different solvers for the atm core that can be used. But the bigger issues IMO are true model structural form issues such as stochastic vs deterministic models, whether we need to go to multi-phase formulation to actually get water vapor feedback correct, etc.

    So my comment is a reminder that model parameter uncertainty is not the only source of model uncertainty in climate model simulations.

    • Sashka

      model parameter uncertainty is not the only source of model uncertainty in climate model simulations.

      More generally: parameterization uncertainty.

  18. Pharos

    ‘Reconstructing past climates is difficult, and it’s even harder to estimate the uncertainty, the error bars. I won’t discuss these difficulties in this particular post, and generalised attacks on you know who will not be tolerated in the comments! We used reconstructions of air temperature based on pollen ** and reconstructions of sea surface temperatures based on numerous bugs and things. Andreas Schmittner and his group used the same.’

    Good stuff. No attacks on you know who - he was gardening, you are digging bedrock. I mean getting to grips with the Last Glacial Maximum in models as a benchmark. The current Holocene (= modern man and civilization) and earlier interglacials are, after all, abnormal for the Quaternary as a whole - the full glacial situation is the normal state. Frustratingly (for old retired bug pickers anyway) the LGM paper is paywalled. I am wondering how much the regressed continental shelf situation caused by the full glacial sea level affects the planktonic and benthonic assemblages, and also how the low-stand alterations to oceanic circulation come into it. The controlling influences dictating the onset and termination of full glacials are, or should be, the primary academic inquiry of climate science, but being non-anthropogenic in nature they are perhaps harder to justify research funding for?

    BTW fascinating new paper by Svensmark re glaciations to throw into the climate crucible.

  19. Disko Troop

    Hi Tamsin. Layman here!

    Do you ever wake at night and hear the screaming of the tortured data as they rattle their error bars in an attempt to escape the gridcells of their confinement? Do you hear the whispers of their confession as the inquisitors twist the 31 screws tighter on the rack; do you really hear their cries from the 17 torture chambers or are their confessions actually drowned out by the noise? [ 🙂 ] Don’t you just love a metaphor?!

    • Joe's World

      It is called sacrificing knowledge for funding and hoping that the theory holds up.
      Creating an ignorant society with illusion by pretty graphs and complex terms to keep the masses confused and in a mind fog.
      Sort of like getting your mind screwed over while the consensus laughs all the way to the bank.
      Do not actually need a single shred of physical evidence, just data that is not actually a physical form.

      • Steve Bloom

        This variety of commenter seems to be becoming more common here, Tamsin. It might be worth considering why that is.

        • Tamsin Edwards

          Sorry, I’ve been side-tracked with a deadline…

          Joe’s World, you are not sticking to the comments policy – I ask that you don’t generalise and accuse, please. If you have specific concerns with the science in my post, feel free to discuss them. But no mudslinging.

          Steve – yes, Joe’s World did breach the policy. But I think this is partly because I have not been as quick to pull him up on this as I was with others. I also don’t wish to “preach to the converted” – I do want people who think climate science is a big cover-up or badly done to come here. I want them to see transparent, honest and (hopefully!) good science being done.

          • tonyb


            You said;
            ‘I do want people who think climate science is a big cover-up or badly done to come here.’

            I am a historical climatologist. I think a sense of historical perspective and context (not a cover up) is missing from the AGW debate. There is a general impression that climatic conditions today are unprecedented which is simply not true, even going back through the Holocene. As for science being badly done, this is the other side of the same contextual coin. Much of the historical record is conjecture, yet this is fed into models and the results lauded as correct to fractions of a degree.

            I was at the Met office library the same day you visited there recently and prior to that was in their archives. I am visiting a Norman Cathedral next week in order to try to discover if there are any weather records there that can be used to validate a 14th Century weather diary. Examining these makes you realise the fragility of the data we have come to rely on. Historic Global SST’s for example should have no place in a computer model as they are largely invented or so theoretical as to be (largely) worthless up until the 1960’s. The land record is riddled with the same problems that were identified over 100 years ago. The sea level data in Chapter five of AR5 relies on conjecture derived from a very limited number of northern Hemisphere tide gauges without the historical context that sea levels have risen and fallen throughout the Holocene.
            Yet still we use all this data as if they have been handed down from an infallible oracle.

  20. Mark M

    I was doing some research on how effective our efforts in CA have been towards our goals (20% RES specifically) and I came across a group of modelers that had a rather nice description of the limitations of the process (i.e. take the model outputs with a grain of salt, or maybe a shot of Scotch), from the “Uncertainties in the Analysis” section of

    “Greenhouse Gases and the Kyoto Protocol” http://www.eia.gov/oiaf/kyoto/pdf/execsum.pdf

    “…Results from any model or analysis are highly uncertain. By their nature, energy models are simplified representations of complex energy markets. The results of any analysis are highly dependent on the specific data, assumptions, behavioral characteristics, methodologies, and model structures included. In addition, many of the factors that influence the future development of energy markets are highly uncertain, including weather, political and economic disruptions, technology development, and policy initiatives…”

  21. Jim Bender

    Too bad that there is so much politics involved with this subject, as you seem to be trying hard to do science in the middle of it all. I have been reading about climate science since things heated up a while back. I am reading your blog because of Dr. Judith Curry mentioning you. I read her blog and others on the subject. While I have a technical background, I don’t have a background that allows me to dig into the mathematics. I am a programmer, a historian (17th Century Dutch naval history), and an author. I am against ad hominem attacks, as they make no attempt to deal with the real issues. I don’t like people, however, who try to use climate model results to achieve political ends. Scientists should stick to doing science. I say, let’s better understand the uncertainties involved so we can better interpret the model results.

    • Mark M

      Hi Jim,

      I’ll bet your understanding of the math is better than mine. My last official program was spec’d out, written and confirmed (design verification, software V&V etc.) in Lotus 123! I had to validate the dang math with an HP calculator, which was better than having to use my slide rule and doing it by hand, the really old-fashioned way. I am a big fan of experimental design (Box, Hunter and Hunter to be specific) and nothing beat using their CCD experimental design to ensure a process (or at least what I was measuring) worked. I love our host’s title by the way!

      I personally think your history background is needed BIG time to put the issues of climate change in perspective. I have come to the conclusion that about all I can do is listen to all sides of the issue and hopefully come up with a question or two that might help with implementation (which I am pretty good at). I personally don’t like the politics of the issue so I just try to define the system requirements, risks and alternatives that could be done to meet the objectives. I don’t know about you, but I am having a bit of an issue with energy efficiency as currently defined. There is a reason the Maytag repairman ads no longer run. The Mean Time To Failure on a lot of our new efficient appliances sucks when you evaluate them over their lifetime.

      Sorry for venting. My recent attempts at understanding what has happened in the past have been limited to trying to understand the concepts, etc. noted in “Paris 1919” and “Albert Einstein: Historical and Cultural Perspectives”. Every once in a while I review “The Environmental Handbook, Prepared for the First National Teach-In”.

      • Anteros

        Mark M – I agree with you that we could all do with a greater historical perspective, and I don’t just mean a better appreciation of life during the LGM!

        Tamsin – a great post and very well explained. I’m glad you avoided mentioning ‘The Number’ (or a range with error bars) because I agree with you that it would completely obscure the process, and the understanding.

  22. Paul Matthews

    “Climate sensitivity is the global warming you would get if you doubled the concentrations of carbon dioxide in the atmosphere. The earth is slow at reacting to change, so you have to wait until the temperature has stopped changing.”

    There is a dangerous, and probably incorrect, implicit assumption here – that if the forcing is constant, ‘the temperature’ will settle down to a constant. Do the models really do this, or are you simplifying here?

    • Paul S

      One way to check this is by looking at a long pre-industrial control run, taken from Climate Explorer. This is a run in which all ‘forcings’ are held constant at a pre-industrial level for the entire duration.

      There is probably a small drift, but generally persistent internal variability doesn’t take surface temperature more than 0.2K from the “normal” state at any point – it’s pretty stable, in other words. Judging by the large interannual variance this particular model may actually be producing too much internal variability.
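
      The kind of stability check described here can be sketched with an entirely synthetic control-run series; all numbers below are made-up stand-ins, not Climate Explorer output:

```python
import numpy as np

# Synthetic stand-in for a pre-industrial control run: red-noise
# internal variability plus a small spin-up drift. Not real model
# output; just illustrates the two diagnostics mentioned above.
rng = np.random.default_rng(0)
years = 500
drift = 0.0002 * np.arange(years)        # 0.1 K drift over 500 years
noise = np.zeros(years)
for t in range(1, years):                # AR(1) 'internal variability'
    noise[t] = 0.7 * noise[t - 1] + rng.normal(0, 0.08)
tas = 13.5 + drift + noise               # global-mean temperature, degC

slope = np.polyfit(np.arange(years), tas, 1)[0] * 100   # K per century
excursion = np.max(np.abs(tas - tas.mean()))            # K from 'normal'
print(f"drift: {slope:+.3f} K/century, max excursion: {excursion:.2f} K")
```

      With values in this range the run would count as "pretty stable" in the sense used above: a small linear drift and excursions well under a few tenths of a kelvin.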

    • Paul S

      Here’s a shorter pre-industrial control run for HadGEM1, which I think was the successor of sorts to the HadCM3 model Tamsin is using. The interannual variability here is much more similar to observations and, likewise, the decadal/multi-decadal variability is smaller, not straying much more than 0.1K from “normal”.

        • BillC


          Both those runs appear to have a statistically significant warming trend. What do I make of that?

          • Paul S

            Yes, I mentioned the apparent drift. As I understand it this is usually caused by different elements in the model not being properly ‘acclimatised’ to each other at the start point. For example, if SSTs are too low there will be an energy imbalance, which can only slowly be recovered. If you look at the IPCC projection chapter it says they correct for this drift by comparing the forced runs with the control run. Some models have warming drifts, some have cooling drifts.

            It’s not really that important for climate sensitivity estimates – the difference is in the region of 0.1°C, which is considerably smaller than the overall uncertainty.
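
            A minimal sketch of that correction, using invented control and forced series; the drift, trend and noise values are assumptions for illustration only, not taken from any model:

```python
import numpy as np

# Drift correction as described above: estimate the spurious trend from
# the control run and subtract it from the forced run. Both series are
# invented (degC anomalies), not model output.
rng = np.random.default_rng(1)
t = np.arange(200)
model_drift = 0.001 * t                                # spurious drift
control = model_drift + rng.normal(0, 0.1, t.size)
forced = model_drift + 0.008 * t + rng.normal(0, 0.1, t.size)

drift_fit = np.polyval(np.polyfit(t, control, 1), t)   # fitted drift
corrected = forced - drift_fit

raw_trend = np.polyfit(t, forced, 1)[0] * 100          # K per century
corr_trend = np.polyfit(t, corrected, 1)[0] * 100
print(f"raw: {raw_trend:.2f} K/century, "
      f"drift-corrected: {corr_trend:.2f} K/century")
```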

        • Paul Matthews

          Good point Bill. So it seems you’d have to average over some timescale to find ‘the temperature’. But what timescale? How do you know it isn’t going to carry on drifting upwards and vary over a very long timescale? In the first of Paul’s runs, the ave over the first 100 yrs is quite different from the ave over the last 100. Makes you realise that the definition of ‘the climate sensitivity’ is not so straightforward.

          • BillC

            Or, a bizarro conclusion that the models warm the planet under preindustrial CO2 concentrations!! 😉

  23. Paul_K

    I wish you well with your paper, but it seems to me as though you may be ignoring a major challenge when you use the HadCM3 model as a basis for comparing LGM behaviour with modern forecasts. HadCM3, like most GCMs, is nonlinear in its flux response to average temperature change. Specifically, if you plot top-of-model net flux against average temperature change for a fixed forcing value, you will see a pronounced curve, not a linear response. Alternatively, this nonlinearity can be characterised as an observed increase in effective climate sensitivity with time and temperature.
    Two possible explanations are offered for the nonlinearity (if we exclude gross error in formulation). The first is a nonlinear dependence of feedbacks on temperature. The second is that the model causes a redistribution of temperature over time, so that a large increase in average surface temperature change results in very little reduction in net flux imbalance. As I am sure you are aware, this latter phenomenon is characterised by Senior and Mitchell as a significant hemispheric imbalance induced in the HadCM3 model.
    Held and Winton 2010 (http://www.gfdl.noaa.gov/bibliography/related_files/ih1001.pdf) carried out some numerical experiments on the GFDL CM2.1 model, which exhibits very similar behaviour to HadCM3 in terms of the nonlinearity of response. This work produced the rather extraordinary observation that the temperature response of the model exhibited strong hysteresis; after the evolution of a positive cumulative forcing, a negative forcing was applied to reduce the total net forcing to zero, but the model’s average temperature showed a recalcitrant increase relative to its initial state; the longer and stronger the initial cumulative forcing, the greater the level of recalcitrant temperature gain observed in the model after it was restored to a zero net forcing. Professor Held seeks to explain this as a “locking-in” of the spatial redistribution of temperature.

    If this explanation is valid – and the data suggest that we must accept it at least as part of the explanation – then it implies that we should not expect symmetry in climate response between a cooling and a warming Earth. If I follow your argument, you are planning on using one to support parameter estimation from the other under a naive assumption of a linear relationship between the endpoints (ECS and forcing), notwithstanding the nonlinearity of transient response and the hysteresis reported by Professor Held.
    No amount of sophistication in the statistical analysis will resolve the problem of starting off with an erroneous assumption set. Have you tested your assumption of symmetry between cooling and warming, or do you have some super-sophisticated way of accommodating the nonlinearity?
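
    The increase in effective sensitivity with time can be illustrated with a toy one-box model whose feedback weakens as it warms. The model and every parameter value below are assumptions for illustration only, not HadCM3 or GFDL CM2.1 behaviour:

```python
import numpy as np

# Toy one-box energy-balance model with a state-dependent feedback,
# giving a curved net-flux-vs-temperature ('Gregory plot') response.
# All numbers are illustrative stand-ins.
F = 3.7                          # 2xCO2 forcing, W m-2
lam0, a = 1.4, 0.1               # feedback lam(T) = lam0 - a*T, W m-2 K-1
C = 8.0                          # effective heat capacity, W yr m-2 K-1
T, Ts, Ns = 0.0, [], []
for year in range(150):
    N = F - (lam0 - a * T) * T   # top-of-model net flux imbalance
    Ts.append(T)
    Ns.append(N)
    T += N / C                   # forward-Euler step, 1-year timestep

def implied_ecs(lo, hi):
    """Regress N on T over a window of years; the x-intercept is the
    implied equilibrium warming (effective climate sensitivity)."""
    slope, intercept = np.polyfit(Ts[lo:hi], Ns[lo:hi], 1)
    return -intercept / slope

print(f"ECS implied by years 0-15:   {implied_ecs(0, 15):.2f} K")
print(f"ECS implied by years 50-150: {implied_ecs(50, 150):.2f} K")
```

    Because the flux-temperature curve is convex, a regression over the early years implies a smaller equilibrium warming than one over the later years, i.e. an effective climate sensitivity that increases with time and temperature.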

    • Sashka

      That’s a very good point.

      I wonder if they ever ran the model for 500-1000 years forward with constant forcing. Perhaps it’s not a recalcitrant component but an inherent thermal drift that would eventually show up no matter what?

    • Paul Matthews

      So to put it concisely, this suggests that the naive idea that Delta T = C Delta F doesn’t really hold.

  24. Paul_K

    All of the models have been run out for long periods with constant forcing – at least for the 2*CO2 experiments. They show a continuously increasing temperature, of course, but this numerical experiment does not give you an answer to the question you are posing. Did you mean to suggest running the models out for a long period with zero forcing to see if there is a long-term drift?

    • Sashka

      No, I meant (for example) with pre-industrial CO2 beginning 1860. Or with current CO2 beginning today. Better both and many times (with different initializations).

      If 2xCO2 never reached equilibrium (which is what I assume you’re saying by using the word “continuously”) then I call it inherent thermal drift which will eventually show up in (almost?) every experiment.

  25. BillC

    I think it would be incredible if running the models for a long time at higher CO2 concentrations amplified a warming drift pattern. Paul S noted that some models have cooling drifts, though, so I’m not sure how that would go.

  26. Paul S

    Your comment echoes critiques of the Schmittner et al. paper from last year (e.g. http://julesandjames.blogspot.co.uk/2011/11/schmittner-on-sensitivity.html , http://julesandjames.blogspot.co.uk/2011/11/more-on-schmittner.html , http://www.realclimate.org/index.php/archives/2011/11/ice-age-constraints-on-climate-sensitivity/).

    I’m not sure these apply to Tamsin’s method though. Note that the main post states: ‘We use the ensemble to simulate the Last Glacial Maximum (LGM), the preindustrial period (as a reference), and a climate the same as the preindustrial but with double the CO2 concentrations (to calculate climate sensitivity).’

    In other words, the match to the LGM is used to determine the model’s relationship with reality (here represented by the LGM reconstruction), but the climate sensitivity of interest is determined by reference to respective 2xCO2 experiments.

    • Paul S

      That should have been a reply to Paul_K’s comment (May 5, 2012 – 7:26 am)

    • Paul_K

      Paul S,
      Thanks for the references – very useful. And a good response in terms of a possible defense of Tamsin’s work. It would be nice to hear her views directly.

      • Paul S

        Now that I think about it, I believe Schmittner et al. used the same approach: get an ensemble of model runs with different parameterisations, check them all against the data, then find the models’ 2xCO2 sensitivity.

        The difference is that their model had a fixed atmosphere, so it was incapable of producing non-linearity.

  27. Albert Stienstra

    Dear Tamsin Edwards,
    your question whether an ensemble of (not so good) climate models is better than a single (so-called) good model, or not, is quite irrelevant to me, because ALL current climate models are still elementary.
    My experience in using very complex models for simulating large computer chips is that NO model is able to predict correctly how the system on the chip operates in the real world.
    The reasons are varied, from the fact that the computers the simulations are running on cannot host all the different physical mechanisms and still converge in real time to useful results, to the fact that even in the silicon chip world we STILL do not completely know all the physical mechanisms that apply. From time to time we find important issues in practical application that cannot be found by simulation. And, if the cause of a specific problem is found, the software developers tell us that the extra physical mechanism that would be required to prevent future problems cannot be implemented because it would fatally slow down the simulations. An example of known physics that should but cannot be included is the complete metal and (poly-)silicon connection mesh with all its physical attributes, to take into account currents and voltages induced in the mesh. Of course I do not have an example of as-yet-unknown physics that plays an important role in predicting the operation of a complex analog/digital chip.
    Looking at the climate scientists with their obsession about models and the current state of them, I believe that paying a lot of attention to models for predicting the climate future is analogous to counting the number of angels on the tip of a pin. Existing instrumental climate data are extremely noisy and collected over far too short a period. Proxies to extend the data collection period into the past are really useless because they increase the noise level by orders of magnitude. Climate scientists have a tunnel view about the physics that should be included in their models.
    Of course some models are useful and it is nowadays impossible to design chips without simulating their operation before production, but every designer is very glad when the completed system really works. And NO chip designer believes that his simulator can be used to really predict the future.
    As for climate science, a lot more study and evaluation of the physical mechanisms involved needs to be done before a really useful set of models can be designed. And even then it is unlikely to do much more than improve prediction of short-term weather events. This is economically quite useful, of course, and any new idea, like the one you propose, should be tried. But at this time the whole set of climate models is inadequate.

    [ Sorry this got stuck in moderation — Tamsin ]

  28. Alexander Harvey

    Hi Jonty and Lindsay,

    I think the Boulder “Good Practice Guidance Paper” has a very different notion of “truth plus error” than the one being discussed in your (Jonty et al) paper “Second-order exchangeability analysis for multi-model ensembles”. Critically, that the MME sample mean would be convergent on “truth” in the Boulder case, i.e. that it is “truth centred”.

    This seems unfortunate, as anyone reading your paper, but not following the equations closely, might conclude that demonstrating that “truth plus error” is a special restricted case of co-exchangeability was far from being a clever insight by Michael.

    The Boulder “truth centred” interpretation would have I think E(U) = Var(U) = 0 and A=I which is to say that Y = M(X) and hence that Var(U) could not equal Var(R(X)) unless Var(R(X)) = 0 (which would make the notion of an ensemble useless).

    If I am right in this, when you say “truth plus error” a whole lot of people may infer “truth centred”, with a real risk of people talking past each other.

    I can think of nothing, beyond a moment of great good fortune, that would lead me to suspect that an MME would be “truth centred”.

    I will quote the two cases considered in the Boulder document “IPCC Expert Meeting on Multi Model Evaluation” (page 4):


    Statistical frameworks in published methods using ensembles to quantify uncertainty may assume (perhaps implicitly):

    a. that each ensemble member is sampled from a distribution centered around the truth (‘truth plus error’ view) (e.g., Tebaldi et al., 2005; Greene et al., 2006; Furrer et al., 2007; Smith et al., 2009). In this case, perfect independent models in an ensemble would be random draws from a distribution centered on observations.

    Alternatively, a method may assume:

    b. that each of the members is considered to be ‘exchangeable’ with the other members and with the real system (e.g., Murphy et al., 2007; Perkins et al., 2007; Jackson et al., 2008; Annan and Hargreaves, 2010). In this case, observations are viewed as a single random draw from an imagined distribution of the space of all possible but equally credible climate models and all possible outcomes of Earth’s chaotic processes. A ‘perfect’ independent model in this case is also a random draw from the same distribution, and so is ‘indistinguishable’ from the observations in the statistical model.


    I say that the first is unbelievable, the second very optimistic, and the pair unjustifiable.
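
    The difference between (a) and (b) can be simulated directly. In this sketch (all numbers illustrative) the ensemble-mean error shrinks with ensemble size under the truth-centred view, but stays at roughly one member's worth of spread under exchangeability:

```python
import numpy as np

# Simulated contrast of the two assumptions quoted above. Under
# 'truth plus error' (a) the ensemble mean converges on truth as the
# ensemble grows; under 'exchangeability' (b) the truth keeps a full
# member's worth of spread. All numbers are illustrative.
rng = np.random.default_rng(42)
n_trials, M, sigma = 5000, 20, 1.0
err_tc = np.empty(n_trials)
err_ex = np.empty(n_trials)

for i in range(n_trials):
    # (a) truth-centred: members = truth + independent errors
    truth = 0.0
    members = truth + rng.normal(0, sigma, M)
    err_tc[i] = members.mean() - truth

    # (b) exchangeable: truth and members are all draws from one
    # distribution with a shared random centre
    centre = rng.normal(0, sigma)
    draws = centre + rng.normal(0, sigma, M + 1)
    truth, members = draws[0], draws[1:]
    err_ex[i] = members.mean() - truth

print(f"sd(ensemble mean - truth): truth-centred {err_tc.std():.2f}, "
      f"exchangeable {err_ex.std():.2f}")
```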

    I haven’t looked up the references given so I don’t know how or where “truth plus error” came to imply “random draws from a distribution centered on observations”.

    These two interpretations do have the advantage that they are much easier to understand than the more general cases discussed in your paper, but at the cost of being much harder to justify.

    If people are holding two very different interpretations of the term “truth plus error”, that is something that needs fixing.


    • Jonty Rougier

      Hi Alex,

      I’ll just make two brief comments. The truth plus error model states that each ensemble member is a noisy measurement of truth, in the usual form of ensemble = truth + error, where the two terms on the RHS are uncorrelated. Truth in this case is meant to be Y itself (ie actual climate), and the modelling of error can be quite interesting, with the possibility of systematic effects across simulators, and within the vector (eg from the past to the future). The Smith et al paper in JASA is a good example of this. Like you, I find the framework hard to defend. But your radical implementation of truth + error is not the only one; the point about the derivation in our paper is that there are more general implementations within the co-exchangeable framework. The trick is to get the covariances right.

      The description of exchangeability in the Boulder document was not well-written, unfortunately. I think we cover this well in our paper so I won’t rehash it here. The general difficulty is that our view about the practical benefits of exchangeability is rather nuanced, but based on a fundamentally subjective view of structured probabilistic reasoning as a tool, to be adopted where its computational dividends outweigh the suppression of relevant knowledge (which then has to find its way back in another way). Most statisticians take a less subjective and more mechanical attitude to the probability calculus.

      Also, to be frank, I don’t think the climate scientists who have got hold of the notion of ‘exchangeability’ really appreciate what it represents. It is astounding to me that symmetry of judgement can have such profound implications, but which remain, after 80 years, very challenging to prove. In fact, it is considered to be a mystery why something so fundamental remains so intractable. By way of contrast, the excellent Benjamini and Hochberg theorem on controlling the False Discovery Rate in multiple testing was published in 1995 (but imperfectly proved there), and by 2006 we had a one-page proof that could be taught to bright undergraduates.


  29. Hamish McDougal


    Just make sure you avoid the Millikan Effect.

    (I’m sure it has been touched on, briefly and obliquely, in comments).