Gianluca Baio's blog: June 2017

Tuesday 20 June 2017

Picky people

Our book on Bayesian cost-effectiveness analysis using BCEA is out (I think as of last week). This has been a long process (I've talked about this here, here and here).

Today I've come back to the office and have open the package with my copies. The book looks nice $-$ I am only a bit disappointed about a couple of formatting things, specifically the way in which computer code got badly formatted in chapter 4.
We had originally used specific font, but for some reason in that chapter all computer code is formatted in Times New Romans. I think we did check in the proofs and I don't recall seeing this (which, to be fair, isn't necessarily to swear that we didn't miss it, while checking...).

Not a biggie. But it bothers me, a bit. Well, OK: a lot. But then again, I am a(n annoyingly) picky person...

Monday 19 June 2017

Homecoming (of sort...)

I spent last week in Florence for our Summer School. Of course, it was home-coming for me and I really enjoyed being back to Florence $-$ although it was really hot. I would say I'm not used to that level of heat anymore, if it wasn't for the fact that I have caught my brother (who still lives there) huffing and complaining about it several times!...

I think it was a very good week $-$ we had capped the number of participants at 27; everybody showed up and I think had a good time. I think I can speak for myself as well as for Chris, Nicky, Mark and Anna and say that we certainly enjoyed being around people who were so committed and interested! We did joke at several points that we didn't even have to ask the questions $-$ they were starting the discussion almost without us prompting it...

The location was also very good and helped make sure everybody was enjoying it. The Centro Studi in Fiesole is an amazing place $-$ not too close to Florence that people always disappears after the lectures, but not too far either. So there was always somebody there even for dinner and a chat in the beautiful garden, although some people would venture down the hill (notably, many did so by walking!). We also went to Florence a couple of times (the picture is one of my favourite spots of the city, which I obviously brought everybody to...).

Friday 9 June 2017

Surprise?

So: for once I woke up this morning feeling slightly quite tired for the late night, but also rather upbeat after an election. The final results of the general election are out and have produced quite some shock.

Throughout yesterday, it looked as though the final polls were returning an improved majority for the Conservative party $-$ this would have been consistent with the "shy Tory" effect. Even Yougov had presented their latest poll suggesting a seven points lead and improved Tory majority. So I guess many people were unprepared for the exit polls, which suggested a very different figure...

First off, I think that the actual results have vindicated Yougov's model (rather than the poll), based on a hierarchical model informed by over 50,000 individual-level data on voting intention as well as several other covariates. They weren't spot on, but quite close.

Also, the exit polls (based on a sample of over 30,000) were remarkably good. To be fair, however, I think that exit polls are different than the pre-election polls, because unlike them they do not ask about "voting intentions", but the actual vote that people have just cast.

And now, time for the post-mortem. My final prediction using all the polls at June 8th was as follows:

mean sd 2.5% median 97.5% OBSERVED
Conservative 346.827 3.411262 339 347 354 318
Labour 224.128 3.414861 218 224 233 261
UKIP 0.000 0.000000 0 0 0 0
Lib Dem 10.833 2.325622 7 11 15 12
SNP 49.085 1.842599 45 49 51 35
Green 0.000 0.000000 0 0 0 1
PCY 1.127 1.013853 0 2 3 4

Not all bad, but not quite spot on either and to be fair, less spot on than Yougov's (as I said, I was hoping they were closer to the truth than my model, so not too many complaints there!...).

I've thought a bit about the discrepancies and I think a couple of issues stand out:

I (together with several other predictions and in fact even Yougov) have overestimated the vote and, more importantly, the number of seats won by the SNP. I think in my case, the main issue had to do with the polls I have used to build my model. As it has happened, the battleground in Scotland has been rather different than the rest of the country, I think. But what was feeding into my model were the data from national polls. I had tried to bump up my prior for the SNP to counter this effect. But most likely this has exaggerated the result, producing an estimate that was too optimistic.
Interestingly, the error for the SNP is 14 seats; 12 of these, I think, have (rather surprisingly) gone to the Tories. So, basically, I've got the Tory vote wrong by (347-318+12)=41 seats $-$ which if you actually allocate to Labour would have brought my prediction to 224+41=265.
Post-hoc adjustements aside, it is obvious that my model had overestimated the result for the Tories, while underestimating Labour's performance. In this case, I think the problem was that the structure I had used was mainly based on the distinction between leave and remain areas at last year's referendum. And of course, these were highly related to the vote that in 2015 had gone to UKIP. Now: like virtually everybody, I have correctly predicted that UKIP would get "zip, nada, zilch" seats. In my case, this was done by combining the poor performance in the polls with a strongly informative prior (which, incidentally, was not strong enough and combined with the polls, I did overestimate UKIP vote share). However, I think that the aggregate data observed in the polls had consistently tended to indicate that in leave areas the Tories would have had massive gains. What actually happened was in fact that the former UKIP vote has split nearly evenly between the two major parties. So, in strong leave areas, the Tories have gained marginally more than Labour, but that was not enough to swing and win the marginal Labour seats. Conversely, in remain areas, Labour has done really well (as the polls were suggesting) and this has in many cases produced a change in colours in some Conservative marginal seats.
I missed the Green's success in Brighton. This was, I think, down to being a bit lazy and not bothering telling the model that in Caroline Lucas' seat the Lib Dem had not fielded a candidate. This in turn meant that the model was predicting a big surge in the vote for the Lib Dems (because Brighton Pavilion is a strong remain area), which would eat into the Green's majority. And so my model was predicting a change to Labour, which never happened (again, I'm quite pleased to have got it wrong here, because I really like Ms Lucas!).
My model had correctly guessed that the Conservatives would regain Richmond Park, but that the Lib Dems had got back Twickenham and Labour would have held Copeland. In comparison to Electoralcalculus's prediction, I've done very well in predicting the number of seats for the Lib Dems. I am not sure about the details of their model, but I am guessing that they had some strong prior to (over)discount the polls, which has lead to a substantial underestimation. In contrast, I think that my prior for the Lib Dems was spot on.
Back to Yougov's model, I think that the main, huge difference, has been the fact that they could rely on a very large number of individual level data. The published polls would only provide aggregated information, which almost invariably would only cross-tabulate one variable at a time (ie voting intention in Leave vs Remain, or in London vs other areas, etc $-$ but not both). To actually be able to analyse the individual level data (combined of course with a sound modelling structure!) has allowed Yougov to get some of the true underlying trends right, which models based on the aggregated polls simply couldn't, I think.

It's been a fun process $-$ and all in all, I'm enjoying the outcome...

Wednesday 7 June 2017

Break

Today I've taken a break from the general election modelling $-$ well, not really... Of course I've checked whether there were new polls available and have updated the model!

But: nothing much changes, so for today, I'll actually concentrate on something else. I was invited to give a talk at the Imperial/King's College Researchers' Society Workshop $-$ I think this is something they organise routinely.

They asked me to talk about "Blogging and Science Communication" and I decided to have some fun with this. My talk is here. I've given examples of weird stuff associated with this blog $-$ not that I had to look very hard to find many of them...

And I did have fun giving the talk! Of course, the posts about the election did feature, so eventually I got to talk about them to...

Tuesday 6 June 2017

The Inbetweeners

When it first was shown, I really liked "The Inbetweeners" $-$ it was at times quite rude and cheap, but it did make me laugh, despite the fact that, as it often happens, all the main characters did look a bit older than the age they were trying to portrait...

Anyway, as is increasingly often the case, this post has very little to do with its title and (surprise!) it's again about the model for the UK general election.

There has been lots of talk (including in Andrew Gelman's blog) in the past few days about Yougov's new model, which is based on Gelman's MRP (Multilevel Regression and Post-stratification). I think the model is quite cool and it obviously is very rigorous $-$ it considers a very big poll (with over 50,000 responses), assumes some form of exchangeability to pool information across different individual respondents' characteristics (including geographical area) and then reproportions the estimated vote shares (in a similar way to what my model does) to produce an overall prediction of the final outcome.

Much of the hype (particularly in the British mainstream media), however, has been related to the fact that Yougov's model produces a result that is very different from most of the other poll analyses, ie a much worse performance for the Tories, who are estimated to gain only 304 seats (with a 95% credible interval of 265-342). That's even less than the last general election. Labour are estimated to get 266 (230-300) seats and so there have been hints of a hung parliament, come Friday.

Electoralcalculus (EC) has a short article in their home page to explain the differences in their assessment, which (more in line with my model) still gives the Tories a majority of 361 (to Labour's 216).

As for my model, the very latest estimate is the following:

mean sd 2.5% median 97.5%
Conservative 347.870 3.2338147 341 347 355.000
Labour 222.620 3.1742205 216 223 230.000
UKIP 0.000 0.0000000 0 0 0.000
Lib Dem 11.709 2.3103369 7 12 16.000
SNP 48.699 2.0781525 44 49 51.000
Green 0.000 0.0000000 0 0 0.000
PCY 1.102 0.9892293 0 1 2.025
Other 0.000 0.0000000 0 0 0.000

so somewhere in between Yougov and EC (very partisan comment: man how I wish Yougov got it right!).

One of the points that EC explicitly models (although I'm not sure exactly how $-$ the details of their model are not immediately evident, I think) is the poll bias against the Tories. They counter this by (I think) arbitrarily redistributing 1.1% of the vote shares from Labour to the Tories. This probably explains why their model is a bit more favourable to the Conservatives, while being driven by the data in the polls, which seem to suggest Labour are catching up.

I think Yougov model is very extensive and possibly does get it right $-$ after all, speaking only for my own model, Brexit is one of the factors and possibly can act as proxy for many others (age, education, etc). But surely there'll be more than that to make people's mind? Only few more days before we find out...

Friday 2 June 2017

The code (and other stuff...)

I've received a couple of emails or comments on one of the General Election posts to ask me to share the code I've used.

In general, I think this is a bit dirty and lots could be done in a more efficient way $-$ effectively, I'm doing this out of my own curiosity and while I think the model is sensible, it's probably not "publication-standard" (in terms of annotation etc).

Anyway, I've created a (rather plain) GitHub repository, which contains the basic files (including R script, R functions, basic data and JAGS model). Given time (which I'm not given...), I'd like to put a lot more description and perhaps also write a Stan version of the model code. I could also write a more precise model description $-$ I'll try to update the material on the GitHub.

On another note, the previous posts have been syndicated in a couple of places (here and here), which was nice. And finally, here's a little update with the latest data. As of today, the model predicts the following seats distribution.

mean sd 2.5% median 97.5%
Conservative 352.124 3.8760350 345 352 359
Labour 216.615 3.8041091 211 217 224
UKIP 0.000 0.0000000 0 0 0
Lib Dem 12.084 1.8752228 8 12 16
SNP 49.844 1.8240041 45 51 52
Green 0.000 0.0000000 0 0 0
PCY 1.333 0.9513233 0 2 3
Other 0.000 0.0000000 0 0 0

Labour are still slowly but surely gaining some ground $-$ I'm not sure the effect of the debate earlier this week (which was deserted by the PM) are visible yet as only a couple of the polls included were conducted after that.

Another interesting thing (following up on this post) is the analysis of the marginal seats that the model predicts to swing from the 2015 Winners. I've updated the plot, which now looks as below.

Now there are 30 constituencies that are predicted to change hand, many still towards the Tories. I am not a political scientists, so I don't really know all the ins and outs of these, but I think a couple of examples are quite interesting and I would venture some comment...

So, the model doesn't know about the recent by-elections of Copeland and Stoke-on-Trent South and so still label these seats as "Labour" (as they were in 2015), although the Tories have actually now control of Copeland.

In the prediction given the polls and the impact of the EU referendum (both were strong Leave areas with with 60% and 70% of the preference, respectively) and the Tories did well in 2015 (36% vs Labour's 42% in Copeland and 33% to Labour's 39% in 2015). So, the model is suggesting that both are likely to switch to the Tories this time around.

In fact, we know that at the time of the by-election, while Copeland (where the contest was mostly Labour v Tories) did go blue, Stoke didn't. But there, the main battle was between the Labour's and the UKIP's candidate (UKIP had got 21% in 2015). And the by-election was fought last February, when the Tories lead was much more robust that it probably is now.

Another interesting area is Twickenham $-$ historically a constituency leaning to the Lib Dems, which was captured by the Conservatives in 2015. But since then, in another by-election the Tories have lost another similar area (Richmond Park,with a massive swing) and the model is suggesting that Twickenham could follow suit, come next Thursday.

Finally, Clapton was the only seat won by UKIP in 2015, but since then, the elected MP (a former Tory-turned-UKIP) has defected the party and is not contesting the seat. This, combined with the poor standing of UKIP in the polls produces the not surprisingly outcome that Clapton is predicted to go blue with basically no uncertainty...

These results look reasonable to me $-$ not sure how life will turn out of course. As many commentators have noted much may depend on the turn out among the younger. Or other factors. And probably there'll be another instance of the "Shy-Tory effect" (I'll think about this if I get some time before the final prediction). But the model does seem to make some sense...

Gianluca Baio's blog