Gianluca Baio's blog: 2017

Friday 22 December 2017

A Bayesian analysis of polls in the Catalan elections

(Invited post by Virgilio Gómez-Rubio, UCLM, Albacete, Spain. Thanks Gianluca for the invitation!!)

I have been involved in the planning and analysis of survey polls almost since I came back to Albacete 9 years ago. The last few months in Spanish politics have been dominated by the 'Catalan referendum' and the call for new elections from the national government via article 155 of the Spanish Constitution (which had never been enforced before). These elections have been different for many reasons, so I decided to do a (last minute) analysis of the available polls to try to predict the allocation of seats in the elections.

The Catalan parliament has 135 seats, split into four electoral districts which correspond to the four provinces in the region, with a different number of seats depending on their population: Barcelona (85 seats), Gerona (17 seats), Lérida (15 seats) and Tarragona (18 seats). Seats are allocated according to the D'Hondt method.
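For readers unfamiliar with the D'Hondt method, here is a minimal Python sketch of the highest-averages rule. The function name `dhondt`, the party labels and the vote counts are all made up for illustration; the sketch ignores any legal vote threshold and breaks ties arbitrarily (by dictionary order), so it is not a faithful reproduction of the official procedure.

```python
def dhondt(votes, seats):
    """Allocate `seats` among parties with the D'Hondt highest-averages rule.

    votes: dict mapping party name -> number (or share) of votes.
    Ties are broken by dictionary order, which is an arbitrary choice.
    """
    allocation = {party: 0 for party in votes}
    for _ in range(seats):
        # A party's current quotient is votes / (seats already won + 1);
        # the next seat goes to the party with the largest quotient.
        winner = max(votes, key=lambda p: votes[p] / (allocation[p] + 1))
        allocation[winner] += 1
    return allocation


# Made-up vote counts for three hypothetical parties in an 8-seat district
example = dhondt({"A": 100_000, "B": 80_000, "C": 30_000}, seats=8)
print(example)
```

Because the rule only compares quotients, it gives the same allocation whether it is fed raw vote counts or vote proportions.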

Several polls have been published in the mass media, and the proportions of votes to parties (as well as sample size, etc.) are reported either at the regional level (which is useless for allocating seats by province) or at the province level. Given that most polls are aggregated at the regional level, it makes sense to combine both types of polls into a single model to provide some insight into the voters' preferences at the province level and so allocate the number of seats.

Bayesian hierarchical models are great at combining information from different sources. The model that I have considered here is very simple. The numbers of votes (reported in the poll) to each party at the regional level are assumed to follow a multinomial distribution with probabilities $P_i, i=1,\ldots, p$, where $p$ is the number of political parties. In this case, we have 7 main parties plus another group for 'other parties'. Probabilities $P_i$ are assigned a vague Dirichlet prior. The numbers of votes at the province level are assumed to follow a multinomial distribution as well, with probabilities $p_{i,j},\ i=1,\ldots,p,\ j=1,\ldots,4$. Both sets of probabilities are linked by assuming that $\log(p_{i,j})$ is proportional to $\log(P_i)$ plus a province-party specific random effect $u_{i,j}$. I have used this model before with good results.
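As a rough sketch of the model just described, and only one possible reading of it, write $Y_i$ for the regional poll counts with total $N$, and $y_{i,j}$ for the province-level counts with total $n_j$. These symbols, the Dirichlet parameter $\alpha$ and the explicit normalisation in the last line are assumptions made here for illustration; the post does not state the prior on $u_{i,j}$, so it is left unspecified.

$$
\begin{aligned}
(Y_1,\ldots,Y_p) &\sim \text{Multinomial}(N;\ P_1,\ldots,P_p), \\
(P_1,\ldots,P_p) &\sim \text{Dirichlet}(\alpha,\ldots,\alpha), \\
(y_{1,j},\ldots,y_{p,j}) &\sim \text{Multinomial}(n_j;\ p_{1,j},\ldots,p_{p,j}), \qquad j=1,\ldots,4, \\
p_{i,j} &= \frac{\exp\{\log(P_i) + u_{i,j}\}}{\sum_{k=1}^{p}\exp\{\log(P_k) + u_{k,j}\}}.
\end{aligned}
$$

The last line is one way to make "proportional to" precise while keeping the $p_{i,j}$ summing to one within each province.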

As simple as it is, this model allows the combination of polls at different aggregation levels. I have used JAGS to fit the model and to allocate the number of seats by exploiting the probabilities from the MCMC output to obtain 10000 draws of the allocation of seats by applying the D'Hondt rule to the proportion of votes to each party at the province level.
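The step from posterior draws of vote shares to draws of seats can be sketched in Python as below. This is an illustration under stated assumptions, not the code used for the analysis: the "posterior draws" are simulated from a made-up Dirichlet distribution as a stand-in for real MCMC output, there are only three hypothetical parties, and the helper names `dhondt` and `dirichlet` are mine.

```python
import random
from collections import Counter


def dhondt(shares, seats):
    """D'Hondt allocation for a list of vote shares; returns seats per party."""
    won = [0] * len(shares)
    for _ in range(seats):
        # The next seat goes to the party with the largest share / (seats won + 1).
        i = max(range(len(shares)), key=lambda k: shares[k] / (won[k] + 1))
        won[i] += 1
    return won


def dirichlet(alpha, rng):
    """One Dirichlet draw, obtained by normalising independent Gamma variates."""
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]


rng = random.Random(2017)

# Stand-in for MCMC output: 10000 draws of the vote shares of three
# hypothetical parties in a 17-seat district (the numbers are made up).
draws = [dirichlet([400, 350, 250], rng) for _ in range(10_000)]

# Apply the D'Hondt rule to every draw to get 10000 draws of the seat allocation.
seat_draws = [dhondt(d, 17) for d in draws]

# Approximate posterior distribution of the seats won by the first party.
seat_dist = Counter(s[0] for s in seat_draws)
for n_seats in sorted(seat_dist):
    print(n_seats, seat_dist[n_seats] / len(seat_draws))
```

Summing the per-province allocations within each draw would give draws of the total number of seats per party in the whole parliament.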

The next plot shows the predicted distribution of seats against the actual distribution of seats:

I'd say the coverage is good for most parties. Polls did not show the loss of voters for CUP and Partido Popular (PP).

Another nice thing about being Bayesian (and using MCMC) is that other probabilities can be computed. For example, the next plot shows the posterior distribution of the number of seats allocated to pro-independence parties so that the probability of them having a majority can be computed (59.86%):
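Computing such a probability from MCMC output amounts to counting the draws in which the event occurs. A minimal Python sketch follows, assuming the draws of the bloc's total seats are already available as a list. The function name `prob_majority` and the ten toy values are made up, so the result is not the 59.86% reported above.

```python
def prob_majority(seat_totals, total_seats=135):
    """Share of draws in which a bloc holds an absolute majority of seats."""
    threshold = total_seats // 2 + 1  # 68 seats out of 135
    return sum(s >= threshold for s in seat_totals) / len(seat_totals)


# Ten made-up draws of the total seats won by a hypothetical bloc of parties
toy_totals = [66, 68, 70, 67, 69, 71, 65, 68, 72, 66]
p = prob_majority(toy_totals)
print(p)
```

With real output the list would hold one value per MCMC draw (10000 in the analysis described here).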

As I promised to have a shot for each seat allocated correctly, I've got some work left to do until the end of the Christmas break... Merry Christmas and Happy New Year!!!

Tuesday 19 December 2017

Does Peppa Pig encourage inappropriate use of primary care resources?

This is a very important contribution to the medical literature, recently published in the BMJ.

I think the sample size is probably not large enough to warrant robust inference. And perhaps it would have been helpful to consider alternative settings, say to consider the wide diversity in the target population of Ben and Holly's Little Kingdom, just to give an example.

But I do applaud the effort of the author!

Monday 18 December 2017

Unpleasantville

Last week, Kristian Lum wrote a blog post to report her experience of inappropriate behaviour by some senior male colleagues at statistical conferences (ISBA and JSM, in particular).

I don't personally know Kristian, although I think I did have lunch with her, a mutual friend and a bunch of other people, at JSM in Montreal in 2013. Anyway, even if I were completely agnostic about the whole thing (and I don't think I am...), it seems to me that her account has been corroborated by some hard facts as well as discussion with other friends/colleagues who actually know her rather well. So while it's important to avoid "courts martial", I think the discussion here isn't really about whether these things happened or not (which at this point I'm pretty sure they did $-$ just to clarify).

I've been left with mixed feelings and a sense of kind-of-having lost my bearings, since I found this out last week. Firstly, I am not surprised to hear that such things can happen at a conference or in academia, in general. What has kind of surprised me is the fact that while I do move more or less in those circles, I wasn't aware of the reputation of the two people who have been named. Some people (for example here) have made the point that these stories were well known and Kristian said so herself in her blog post. As somebody who's involved in ISBA, this is troubling and I kind of feel like we've buried our collective head in the sand, possibly for a very long time. To be fair, ISBA is now coming up with a task-force to create protocols and prevent issues such as these arising again in the future. Still, doesn't feel particularly good...

Secondly, this may be some sort of self-preservation (or maybe denial?) instinct, and maybe there is indeed a much more deeply rooted problem in statistics, and in fact in Bayesian statistics, which I struggle to see because it hurts to think that the environment in which I work is actually flawed in bad ways. But what I mean is that perhaps it's not like there's a couple of areas in which bad guys operate and if only we could get rid of those bad guys in those areas, then society would be idyllic. I think that, unfortunately, there's plenty of examples where people with/in power (statistically more likely to be white men) do behave badly and abuse their power in many ways, including sexually. Maybe our field does represent men disproportionately $-$ and it may well be that this is even truer for Bayesian statistics than for other branches of statistical science. And so, as painful as it is to realise quite clearly that the grass ain't so green after all, it is what it is. But the problem is (much) bigger than that...

Finally, I've particularly liked my friend Julien's Facebook post (I actually see now that he was in fact linking to somebody else's tweet):

Retweeted Carlos Scheidegger (@scheidegger):
We should all read and acknowledge @KLdivergence's and other women's harrowing stories. But I want to try something different here. Do you all know of her amazing work at @hrdag? This, on predictive policing, is so good https://t.co/YDsijFsiT2 https://t.co/GbwgKzSgMb

Dan's post has some lengthy discussion about the use of the term "mediocre" to characterise the two offenders. I think that neither mediocrity (= how poor one is at their work) nor excellence (= how good one is at their work) should be excuses $-$ but I see how this may matter because, arguably, the better and more respected you are in your field, the more power you wield over junior colleagues... But I think it feels right to point out Kristian's work qualities. Somehow, it seems to put things in a better perspective, I think.

Tuesday 21 November 2017

βCEA

Recently, I've been doing a lot of work on the beta version of BCEA (I was after all born in Agrigento $-$ in the picture to the left $-$, which is a Greek city, so a beta version sounds about right...).

The new version is only available as a beta-release from our GitHub repository $-$ the usual way to install it is through the devtools package.

There aren't very many changes from the current CRAN version, although the one thing I did change is kind of big. In fact, I've embedded the web-app functionalities within the package. So, it is now possible to launch the web-app from the current R session using the new function BCEAweb. This takes as arguments three inputs: a matrix e containing $S$ simulations for the measures of effectiveness computed for the $T$ interventions; a matrix c containing the simulations for the measures of costs; and a data frame or matrix containing simulations for the model parameters.

In fact, none of the inputs is required and the user can actually launch an empty web-app, in which the inputs can be uploaded, say, from a spreadsheet (there are in fact other formats available).

I think the web-app facility is not necessary when you've gone through the trouble of actually installing the R package and you're obviously using it from R. But it's helpful, nonetheless, for example in terms of producing some standard output (perhaps even more than the actual package $-$ which I think is more flexible) and of reporting, with the cool facility based on pandoc.

This means there are a few more packages "suggested" on installation and potentially a longer compilation time for the package $-$ but nothing major. The new version is under testing but I may be able to release it on CRAN soon-ish... And there are other cool things we're playing around with (the links here give all the details!).

Monday 20 November 2017

La lotteria dei rigori

Seems like my own country has kind of run out of luck... First we fail to qualify for the World Cup, then lose the right to host the relocated headquarters of the European Medicines Agency, post Brexit. If I were a cynical ex-pat, I'd probably think that the former will be felt as the worse defeat across Italy. Maybe it will.

As I've mentioned here, I'd been talking to Politico, about how the whole process looked like the Eurovision. I think the actual thing did have some elements $-$ earlier today, on the eve of the vote, it appeared like Bratislava was the hot favourite. This kind of reminded me of the days before the final of the Eurovision, when one of the acts is often touted as the sure-thing, often over and above its musical quality. And I do believe that there's an element of "letting people know that we're up for hosting the next one" going on to pimp up the experts' opinions. Although sometimes, as it turns out, the favourites are not so keen in reality $-$ cue their poor performance come the actual thing...

In the event, Bratislava was eliminated in the first round. The contest went all the way to extra time, with Copenhagen dropping out at the semifinals and Amsterdam and Milan contesting the final head-to-head. As the two finalists got the same number of votes (with I think one abstaining), the decision was made on luck $-$ basically on penalties, or as we say in Italian, la lotteria dei rigori.

I guess there must have been some thinking behind the set-up of the voting system that, in case it came down to a tie at the final round, both remaining candidates would be "acceptable" (if not to everybody, at least to the main players) and so they'd be happy for this to go 50:50. And so Amsterdam it is!

Tuesday 14 November 2017

Relocation, relocation, relocation

Earlier today, I was contacted by Politico $-$ they are covering the story about the European Union's process to reassign the two EU agencies currently located in London, the European Medicines Agency (EMA) and the European Banking Authority (EBA), post-Brexit.

I knew of this, but wasn't aware of the actual process, which is kind of complex:

"The vote for each agency will consist of successive voting rounds, with the votes cast by secret ballot. In the first round, each member state will have one vote consisting of six voting points, which should be allocated in order of preference to three offers: three points to the first preference, two to the second and one to the third. If one offer receives three voting points from at least 14 member states, this will be considered the selected offer. Otherwise, the three offers (or more in case of a tie) with the highest number of points will go to a second round of voting. In the second round, each member state will have one voting point, which should be allocated to its preferred option in that round. If one offer receives 14 or more votes, it will be considered the selected offer. Otherwise, a third round will follow among the two offers (or more in case of a tie) with the highest number of votes, again with one voting point per member state. In the event of a tie, the presidency will draw lots between the tied offers."

Cat Contiguglia contacted me to have a chat about this $-$ they had done a couple of pieces likening the process to the Eurovision contest. As I told Cat, however, I think this is more like the way cities get assigned the right to host the Olympic Games, or even how the Palio di Siena works... I guess lots of discussion is already going on among the member states.
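To make the first-round rules quoted above a bit more concrete, here is a small Python sketch of how that round could be tallied. It is my own reading of the quoted text, not an official implementation: the function name `first_round`, the offer labels and the ballots are all made up, and the later rounds (one vote each, then drawing lots on a tie) are not modelled.

```python
from collections import Counter


def first_round(ballots, needed=14):
    """Tally the first round: 3, 2 and 1 points to each state's top three offers.

    ballots: list of (first, second, third) tuples, one per member state.
    An offer is selected outright if at least `needed` states give it 3 points;
    otherwise the three offers with most points (more if tied) go through.
    """
    points = Counter()
    first_choices = Counter()
    for first, second, third in ballots:
        points[first] += 3
        points[second] += 2
        points[third] += 1
        first_choices[first] += 1

    for offer, n in first_choices.items():
        if n >= needed:
            return {"selected": offer, "points": dict(points)}

    # No outright winner: keep every offer whose points reach the third-best total.
    ranked = sorted(points.values(), reverse=True)
    cutoff = ranked[min(2, len(ranked) - 1)]
    through = sorted(o for o, p in points.items() if p >= cutoff)
    return {"selected": None, "second_round": through, "points": dict(points)}


# Made-up ballots from 27 hypothetical member states choosing among offers A-D
ballots = [("A", "B", "C")] * 10 + [("B", "C", "A")] * 9 + [("C", "D", "B")] * 8
result = first_round(ballots)
print(result)
```

In this toy example nobody reaches 14 first preferences, so the three offers with the most points go through to the second round.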

Apparently, Milan and Frankfurt are the favourites to host EMA and EBA, respectively. I think I once heard a story that, originally, EMA was supposed to be located in Rome. Unfortunately, the decision was to be made just as one of the many Italian political scandals was about to be uncovered, pointing to massive corruption in the Italian healthcare system and so Rome was stripped of the title. Perhaps a win for Milan will help Italy get over the World Cup...

Friday 10 November 2017

At the Oscars!

Well, these days being part of the glittering world of show-biz is not necessarily a good thing, but when your life is soooo glamorous that someone feels the unstoppable need to make a biopic of it... well, you really need to embrace your new status as a movie star and enjoy all the perks that life will now throw at you...

I know, I know... This is still about the Eurovision. But, this time they made a short video to tell the story $-$ you may think the still above shows Marta and me, but these are actually two actors, playing us!

I think they've done a very good job at rendering us $-$ particularly me, I think. If you believe the movies:

We (particularly I) are younger than we really are;
We drink a lot (although "Marta"'s drink 25 seconds in looks like a cross between Cranberry juice and the stuff they use to show vampires drinking human blood from hospital blood bags)...
We laugh a lot $-$ I think this is kind of true, though...
I like how 1 min 24 seconds in, "Marta" authoritatively demands a kiss on the cheek and "my" response to that is covered by floating webpages $-$ kind of rated R...
The storyline seems to suggest that we thought about doing this as we wondered whether we should do a Bayesian model $-$ of course that was never in question!...

Anyway, I think I need to thank the guys at Taylor & Francis (Clare Dodd, in particular), who've done an amazing job!

Tuesday 17 October 2017

The Alan Turing's project

The Alan Turing Institute (ATI) has just announced the next round of Doctoral Studentships.

Here's the original blurb with all the relevant information. The guys in the picture are not part of the supervisory teams (but I think I will be...).

We are seeking highly talented and motivated graduates to apply for our fully funded doctoral studentship scheme commencing October 2018 and welcome applications from home/EU and international students.

We are the national institute for data science, created in 2015 in response to a need for greater investment in data science research. Headquartered at the British Library in the heart of London’s vibrant Knowledge Quarter, the Institute was founded by the universities of Cambridge, Edinburgh, Oxford, University College London and Warwick – and the UK Engineering and Physical Sciences Research Council.

The Turing 2018 Doctoral Studentships are an exceptional opportunity for talented individuals looking to embark on a career in the rapidly emerging field of data science.

Why Turing?

Turing students will have access to a wide range of benefits unique to the Institute:

Expert-led training in research disciplines central to data science

Access to a range of events, seminars, reading groups and workshops delivered by leaders in research, government and industry

Opportunities to collaborate on real world projects for societal impact with our current and emerging industry partners

Expert support and guidance through all stages of the studentship delivered by supervisors who are Fellows of the Turing or substantively engaged with us

Access to brilliant minds researching a range of subjects with opportunities to collaborate and join or start interest groups

Networking opportunities through the Institute, university and strategic partners

Supercharge and speed up research with access to cutting edge IT resources; Azure cloud (with credits), Cray Supercomputer, EPSRC Funded Tier 2 HPCs, High Spec local desktops with GPUs and UK Supercomputer Archer

Bespoke HQ designed for optimal study and inter disciplinary collaborations

Studentships include a tax-free stipend of £20,500 per annum (up to 3.5-years), plus home/EU tuition fees and a travel allowance. A limited number of fully-funded overseas studentships are also available.

Additional studentships may be available through our Strategic Partners – HSBC, Intel, Lloyds’ Register Foundation and UK Government – Defence & Security with projects aligned to our strategic priorities.

In line with the Institute’s cross-disciplinary research community, we particularly welcome applications from graduates whose research spans multiple disciplines and applications.

View list of research areas and strategic priorities for the Turing: http://bit.ly/researchareas

Further details: turing.ac.uk/opportunities/studentships

Application deadline: 12:00 GMT Thursday 30 November 2017

Monday 9 October 2017

Summer school in Leuven

Earlier this year, Emmanuel organised the first edition of the Summer School on Advanced Bayesian Methods, in the beautiful Belgian town of Leuven (which is also where we had our Bayes conference a couple of years ago).

For next year, they have planned the second edition, which will run from 24th to 28th September and I'm thrilled that they have invited me to do the second part on... you guessed it: Bayesian Methods in Health Economic Evaluation.

The programme is really interesting and Mike Daniels will do the first three days on Bayesian Parametric and Nonparametric Methods for Missing Data and Causal Inference.

Wednesday 27 September 2017

24. Nearly.

As the academic year is beginning (our courses will officially start next week), this week has seen the arrival of our new students, including those in our MSc Health Economics & Decision Science (I've talked about this here and here).

When we set out the planning, we were a bit nervous because, while everybody at UCL has been very encouraging and supportive, we were also given a rather hard target $-$ get at least 12 students, or else this is not viable. (I don't think we were actually told what would have happened if we had recruited fewer students. But I don't think we cared to ask $-$ the tone seemed scary enough)...

Well, as it happens, we've effectively doubled the target and we now have 22 students starting on the programme $-$ there may be a couple more additions, but even if they fail to turn up, I think Jolene, Marcos and I will count ourselves very happy! I spoke to some of the students yesterday and earlier today and they all seem very enthusiastic, which is obviously very good!

Related to this, we'll soon start our new seminar series, to which all the MSc students are "strongly encouraged" to participate. But I'll post more generally in case they may be of interest to a wider audience...

Friday 8 September 2017

Building the EVSI

Anna and I have just arxived a paper (that we've also submitted to Value in Health), in which we're trying to publicise more widely and in a less technical way the "Moment Matching" method (which we sent to MDM and should be on track and possibly out soon...) to estimate the Expected Value of Sample Information.

The main point of this paper is to showcase the method and highlight its usability $-$ we are also working on computational tools that we'll use to simplify and generalise the analysis. It's an exciting project, I think, and luckily we've got our hands on data and designs for some real studies, so we can play around with them, which is also nice. I'll post more soon.

Anna has suggested the title of the paper with Bob the builder in mind (so "Can we do it? Yes we can"), although perhaps President Obama (simply "Yes we can") may have worked better. Either way, the picture to the left is just perfect for when we turn this into a presentation...

Thursday 7 September 2017

Planes, trains and automobiles

For some reason, Kobi's favourite thing in the world is flying on an airplane, with making paper airplanes a very close second and playing airport, pretending to check (real) suitcases in and setting off through security, a rather close third.

So it's not surprising that he was quite upset when I told him I would go on an airplane not once, not twice, but three times in the space of just a couple of weeks (in fact, I'll fly to Pisa, then Paris, come back on a train, ride a train again to Brussels and back and finally fly to Bologna and back, all to give talks at several places. From Bologna, I'll actually need to hire a car, because my talk is in nearby Parma).

I think for a moment Kobi did consider no longer loving me. But luckily, I think the crisis has been averted and I got him back on good terms when I told him it's not too long until he can fly again...

Yesterday I was in Glasgow to give a talk at the Conference of the Royal Statistical Society, the first leg of my September travels-for-talks. My talk was in a session on missing data in health economic evaluation, with Andrew Briggs and James Carpenter also speaking. I think the session was really interesting and we had a rather good audience, so I was pleased with that.

My talk was basically stealing from Andrea's PhD work $-$ we (this includes also Alexina and Rachael who are co-supervising the project) have been doing some interesting stuff on modelling individual-level cost and benefit data, accounting for correlation between the outcomes; skewness in the distributions; and "structural" values (eg spikes at QALY values of 1, which cannot be modelled directly using a Beta distribution).

Andrea has done some very good work also in programming the relevant functions in BUGS/JAGS (and he's having a stab at Stan too) into a beta-version of what will be our next package (we have called it missingHE) $-$ I'll say more on this when we have a little more established material ready.

The next trip is to Paris on Monday to give a talk at the Department of Biostatistics, in the Institut Gustave Roussy, where I'll speak about (you guessed it...) Bayesian methods in health economics. I'll link to my presentation (that is, when I've finished tweaking it...).

Wednesday 30 August 2017

A couple of things...

Just a couple of interesting things...

1. Petros sends me this advert for a post as Biostatistician at the Hospital for Sick Children in Toronto

The Child Health Evaluative Sciences Program at the Hospital for Sick Children in Toronto is recruiting a PhD Biostatistician to lead the execution of a CIHR funded clinical trial methodology project, and the planning of upcoming trials with a focus on:

improving and using methods of Bayesian Decision analysis and Value of Information in pediatric trial design and analysis;
using patient and caregiver preference elicitation methods (e.g. discrete choice experiments) in pediatrics;
developing of statistical plan and conducting the statistical analysis for pediatric clinical trials.

The Biostatistician will collaborate with the Principal Investigators (PIs) of four trials that are in the design stage, and with two senior biostatisticians and methodologists within the CHES program. The successful candidate will have protected time for independent methods development. A cross appointment with the Dalla Lana School of Public Health at the University of Toronto will be sought.

Here’s What You’ll Get To Do:

In collaboration with the trials’ Principal Investigators (PIs), develop the study protocols;

Contribute in the conceptualization and development of decision analytic models;

Contribute in conducting literature reviews and keep current with study literature;

Assist with design/development and implementation of value of information methods;

Contribute to preparation of reports, presentations, and manuscripts.

Here’s What You’ll Need:

Graduate degree in Statistics, Biostatistics, Health Economics or a related discipline;

Ability to function independently yet collaboratively within a team;

Excellent statistical programming skills predominantly using R software;

Experience with report and manuscript writing;

Effective communication, interpersonal, facilitation and organizational skills;

Meticulous attention to detail.

Employment Type:

Temporary, Full-Time (3 year contract with possibilities for renewal)

Contacts: Dr. Petros Pechlivanoglou and Dr. Martin Offringa

2. And Manuel has an advert for a very interesting short course on Missing Data in health economic evaluations (I will do my bit on Bayesian methods to do this, which is also very much related to the talk I'll give at the RSS conference in Glasgow, later in September $-$ this is part of Andrea's PhD work). I'll post more on this later.

Two-day short course: Methods for addressing missing data in health economic evaluation

Dates: 21-22 September, 2017

Venue: University College London

Overview

Missing data are ubiquitous in health economic evaluation. The major concern that arises with missing data is that individuals with missing information tend to be systematically different from those with complete data. As a result, cost-effectiveness inferences based on complete cases are often misleading. These concerns face health economic evaluation based on a single study, and studies that synthesise data from several sources in decision models. While accessible, appropriate methods for addressing the missing data are available in most software packages, their uptake in health economic evaluation has been limited.

Taught by leading experts in missing data methodology, this course offers an in-depth description of both introductory and advanced methods for addressing missing data in economic evaluation. These will include multiple imputation, hierarchical approaches, sensitivity analysis using pattern mixture models and Bayesian methods. The course will introduce the statistical concepts and underlying assumptions of each method, and provide extensive guidance on the application of the methods in practice. Participants will engage in practical sessions illustrating how to implement each technique with user-friendly software (Stata).

At the end of the course, the participants should be able to develop an entire strategy to address missing data in health economic studies, from describing the problem, to choosing an appropriate statistical approach, to conducting sensitivity analysis to standard missing data assumptions, to interpreting the cost-effectiveness results in light of those assumptions.

Who should apply?

The course is aimed at health economists, statisticians, policy advisors or other analysts with an interest in health economic evaluation, who would like to expand their toolbox. It is anticipated that participants will be interested in undertaking or interpreting cost-effectiveness analyses that use patient-level data, either from clinical trials or observational data.

Course fees: £600 (Commercial/Industry); £450 (Public sector); £200 (Students); payable by the 8th September 2017.

To register for the course or for further information, please see here.

Monday 14 August 2017

When simple becomes complicated...

A while ago, Anna and I published an editorial in Global & Regional Health Technology Assessment. In the paper, we discuss one of my favourite topics $-$ how models for health technology assessment and cost-effectiveness analysis should increasingly move away from using spreadsheets (basically, Excel) and towards proper statistical software.

The main arguments that historically have been used to support spreadsheet-based modelling are those of "simplicity and transparency" $-$ which really grinds my gears. In the paper we also argue that, maybe, as statisticians we should invest in efforts towards designing our models using user-interfaces, or GUIs $-$ the obvious example is web-apps. This would expand and extend work done, eg in SAVI, or BCEAweb or bmetaweb, just to name a few (that I'm more familiar with...).

Friday 28 July 2017

Picky people (2)

I've complained here about the fonts for some parts of the computer code in our book. Eva (our publisher) has picked up on this and has been brilliant and very quick in trying to fix the issue. I think they will update the fonts so that at least in the ebook version all will look nice!

Friday 7 July 2017

Conflict of interest

I am fully aware that this post is seriously affected by a conflict of interest, because what I'm about to discuss (in positive terms!) is work by Anthony, who's doing a very good job on his PhD (which I co-supervise).

But, I thought I'd do like our former PM (BTW: see this; I really liked the series) and sort conflict of interests by effectively ignoring them (to be fair, this seems to be a popular strategy, so let's not be too harsh on Silvio...).

Anyway, Anthony has written an editorial, which has received some traction in the mainstream media (for example here, here or here). Not much that I disagree with in Anthony's piece, except that I am really sceptical of any bake & eat situation $-$ the only exception is when I actually make pizza from scratch...

Tuesday 20 June 2017

Picky people

Our book on Bayesian cost-effectiveness analysis using BCEA is out (I think as of last week). This has been a long process (I've talked about this here, here and here).

Today I've come back to the office and have opened the package with my copies. The book looks nice $-$ I am only a bit disappointed about a couple of formatting things, specifically the way in which computer code got badly formatted in chapter 4.
We had originally used a specific font, but for some reason in that chapter all computer code is formatted in Times New Roman. I think we did check in the proofs and I don't recall seeing this (which, to be fair, isn't necessarily to swear that we didn't miss it, while checking...).

Not a biggie. But it bothers me, a bit. Well, OK: a lot. But then again, I am a(n annoyingly) picky person...

Monday 19 June 2017

Homecoming (of sort...)

I spent last week in Florence for our Summer School. Of course, it was home-coming for me and I really enjoyed being back in Florence $-$ although it was really hot. I would say I'm not used to that level of heat anymore, if it wasn't for the fact that I have caught my brother (who still lives there) huffing and complaining about it several times!...

I think it was a very good week $-$ we had capped the number of participants at 27; everybody showed up and I think had a good time. I think I can speak for myself as well as for Chris, Nicky, Mark and Anna and say that we certainly enjoyed being around people who were so committed and interested! We did joke at several points that we didn't even have to ask the questions $-$ they were starting the discussion almost without us prompting it...

The location was also very good and helped make sure everybody was enjoying it. The Centro Studi in Fiesole is an amazing place $-$ not so close to Florence that people always disappear after the lectures, but not too far either. So there was always somebody there even for dinner and a chat in the beautiful garden, although some people would venture down the hill (notably, many did so by walking!). We also went to Florence a couple of times (the picture is one of my favourite spots of the city, which I obviously brought everybody to...).

Friday 9 June 2017

Surprise?

So: for once I woke up this morning feeling quite tired after the late night, but also rather upbeat after an election. The final results of the general election are out and have produced quite some shock.

Throughout yesterday, it looked as though the final polls were returning an improved majority for the Conservative party $-$ this would have been consistent with the "shy Tory" effect. Even Yougov had presented their latest poll suggesting a seven points lead and improved Tory majority. So I guess many people were unprepared for the exit polls, which suggested a very different figure...

First off, I think that the actual results have vindicated Yougov's model (rather than the poll), based on a hierarchical model informed by over 50,000 individual-level data on voting intention as well as several other covariates. They weren't spot on, but quite close.

Also, the exit polls (based on a sample of over 30,000) were remarkably good. To be fair, however, I think that exit polls are different than the pre-election polls, because unlike them they do not ask about "voting intentions", but the actual vote that people have just cast.

And now, time for the post-mortem. My final prediction using all the polls at June 8th was as follows:

Party            mean        sd  2.5%  median  97.5%  OBSERVED
Conservative  346.827  3.411262   339     347    354       318
Labour        224.128  3.414861   218     224    233       261
UKIP            0.000  0.000000     0       0      0         0
Lib Dem        10.833  2.325622     7      11     15        12
SNP            49.085  1.842599    45      49     51        35
Green           0.000  0.000000     0       0      0         1
PCY             1.127  1.013853     0       2      3         4

Not all bad, but not quite spot on either and to be fair, less spot on than Yougov's (as I said, I was hoping they were closer to the truth than my model, so not too many complaints there!...).
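To make the comparison a bit more concrete, here is a minimal sketch (in Python, not the R/JAGS code I actually used) that takes the figures from the table above and checks, party by party, how far the median prediction was from the observed seats and whether the observed value falls inside the 95% interval. The function name `compare` and the layout of the dictionary are just my assumptions for the illustration.

```python
# Figures copied from the table above: party -> (2.5%, median, 97.5%, observed)
results = {
    "Conservative": (339, 347, 354, 318),
    "Labour": (218, 224, 233, 261),
    "UKIP": (0, 0, 0, 0),
    "Lib Dem": (7, 11, 15, 12),
    "SNP": (45, 49, 51, 35),
    "Green": (0, 0, 0, 1),
    "PCY": (0, 2, 3, 4),
}


def compare(results):
    """Return party -> (median minus observed, observed inside the 95% interval?)."""
    out = {}
    for party, (low, median, high, observed) in results.items():
        out[party] = (median - observed, low <= observed <= high)
    return out


summary = compare(results)
for party, (error, covered) in summary.items():
    print(f"{party:<13} error={error:+4d} covered={covered}")
```

On these figures, only the UKIP and Lib Dem results sit inside the model's 95% intervals, which is a fair summary of "not quite spot on".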

I've thought a bit about the discrepancies and I think a couple of issues stand out:

I (together with several other predictions and in fact even Yougov) have overestimated the vote and, more importantly, the number of seats won by the SNP. I think in my case, the main issue had to do with the polls I have used to build my model. As it has happened, the battleground in Scotland has been rather different than the rest of the country, I think. But what was feeding into my model were the data from national polls. I had tried to bump up my prior for the SNP to counter this effect. But most likely this has exaggerated the result, producing an estimate that was too optimistic.
Interestingly, the error for the SNP is 14 seats; 12 of these, I think, have (rather surprisingly) gone to the Tories. So, basically, I've got the Tory vote wrong by (347-318+12)=41 seats $-$ which if you actually allocate to Labour would have brought my prediction to 224+41=265.
Post-hoc adjustments aside, it is obvious that my model had overestimated the result for the Tories, while underestimating Labour's performance. In this case, I think the problem was that the structure I had used was mainly based on the distinction between leave and remain areas at last year's referendum. And of course, these were highly related to the vote that in 2015 had gone to UKIP. Now: like virtually everybody, I have correctly predicted that UKIP would get "zip, nada, zilch" seats. In my case, this was done by combining the poor performance in the polls with a strongly informative prior (which, incidentally, was not strong enough: even combined with the polls, I still overestimated UKIP's vote share). However, I think that the aggregate data observed in the polls had consistently tended to indicate that in leave areas the Tories would have had massive gains. What actually happened was in fact that the former UKIP vote has split nearly evenly between the two major parties. So, in strong leave areas, the Tories have gained marginally more than Labour, but that was not enough to swing and win the marginal Labour seats. Conversely, in remain areas, Labour has done really well (as the polls were suggesting) and this has in many cases produced a change in colours in some Conservative marginal seats.
I missed the Greens' success in Brighton. This was, I think, down to being a bit lazy and not bothering to tell the model that in Caroline Lucas' seat the Lib Dems had not fielded a candidate. This in turn meant that the model was predicting a big surge in the vote for the Lib Dems (because Brighton Pavilion is a strong remain area), which would eat into the Greens' majority. And so my model was predicting a change to Labour, which never happened (again, I'm quite pleased to have got it wrong here, because I really like Ms Lucas!).
My model had correctly guessed that the Conservatives would regain Richmond Park, but that the Lib Dems would get back Twickenham and Labour would have held Copeland. In comparison to Electoralcalculus's prediction, I've done very well in predicting the number of seats for the Lib Dems. I am not sure about the details of their model, but I am guessing that they had some strong prior to (over)discount the polls, which has led to a substantial underestimation. In contrast, I think that my prior for the Lib Dems was spot on.
Back to Yougov's model, I think that the main, huge difference, has been the fact that they could rely on a very large number of individual level data. The published polls would only provide aggregated information, which almost invariably would only cross-tabulate one variable at a time (ie voting intention in Leave vs Remain, or in London vs other areas, etc $-$ but not both). To actually be able to analyse the individual level data (combined of course with a sound modelling structure!) has allowed Yougov to get some of the true underlying trends right, which models based on the aggregated polls simply couldn't, I think.

It's been a fun process $-$ and all in all, I'm enjoying the outcome...

Wednesday 7 June 2017

Break

Today I've taken a break from the general election modelling $-$ well, not really... Of course I've checked whether there were new polls available and have updated the model!

But: nothing much changes, so for today, I'll actually concentrate on something else. I was invited to give a talk at the Imperial/King's College Researchers' Society Workshop $-$ I think this is something they organise routinely.

They asked me to talk about "Blogging and Science Communication" and I decided to have some fun with this. My talk is here. I've given examples of weird stuff associated with this blog $-$ not that I had to look very hard to find many of them...

And I did have fun giving the talk! Of course, the posts about the election did feature, so eventually I got to talk about them too...

Tuesday 6 June 2017

The Inbetweeners

When it was first shown, I really liked "The Inbetweeners" $-$ it was at times quite rude and cheap, but it did make me laugh, despite the fact that, as often happens, all the main characters did look a bit older than the age they were trying to portray...

Anyway, as is increasingly often the case, this post has very little to do with its title and (surprise!) it's again about the model for the UK general election.

There has been lots of talk (including in Andrew Gelman's blog) in the past few days about Yougov's new model, which is based on Gelman's MRP (Multilevel Regression and Post-stratification). I think the model is quite cool and it obviously is very rigorous $-$ it considers a very big poll (with over 50,000 responses), assumes some form of exchangeability to pool information across different individual respondents' characteristics (including geographical area) and then reproportions the estimated vote shares (in a similar way to what my model does) to produce an overall prediction of the final outcome.
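To give an idea of what the "post-stratification" part means, here is a minimal sketch in Python. It is not Yougov's code (which I haven't seen): the cells, the estimated shares and the population counts are all made up for the illustration, and in the real thing the cell-level shares would come out of the multilevel regression. The only point is that the overall estimate is a population-weighted average of the cell-level estimates.

```python
# Hypothetical post-stratification cells:
# (age group, referendum vote) -> (estimated Labour share, number of voters).
# In MRP the shares would be estimated by the multilevel regression;
# every number here is invented purely for illustration.
cells = {
    ("18-34", "Leave"): (0.50, 1000),
    ("18-34", "Remain"): (0.60, 3000),
    ("35+", "Leave"): (0.30, 4000),
    ("35+", "Remain"): (0.40, 2000),
}


def poststratify(cells):
    """Population-weighted average of the cell-level estimated shares."""
    total = sum(n for _, n in cells.values())
    return sum(share * n for share, n in cells.values()) / total


overall = poststratify(cells)
print(f"Post-stratified share: {overall:.3f}")
```

The design choice worth noticing is that the weights are the population sizes of the cells, not the number of respondents in each cell, which is what lets a non-representative sample produce a sensible overall figure.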

Much of the hype (particularly in the British mainstream media), however, has been related to the fact that Yougov's model produces a result that is very different from most of the other poll analyses, ie a much worse performance for the Tories, who are estimated to win only 304 seats (with a 95% credible interval of 265-342). That's even fewer than at the last general election. Labour are estimated to get 266 (230-300) seats and so there have been hints of a hung parliament, come Friday.

Electoralcalculus (EC) has a short article on their home page to explain the differences in their assessment, which (more in line with my model) still gives the Tories a majority, with 361 seats (to Labour's 216).

As for my model, the very latest estimate is the following:

Party            mean         sd  2.5%  median    97.5%
Conservative  347.870  3.2338147   341     347  355.000
Labour        222.620  3.1742205   216     223  230.000
UKIP            0.000  0.0000000     0       0    0.000
Lib Dem        11.709  2.3103369     7      12   16.000
SNP            48.699  2.0781525    44      49   51.000
Green           0.000  0.0000000     0       0    0.000
PCY             1.102  0.9892293     0       1    2.025
Other           0.000  0.0000000     0       0    0.000

so somewhere in between Yougov and EC (very partisan comment: man how I wish Yougov got it right!).

One of the points that EC explicitly models (although I'm not sure exactly how $-$ the details of their model are not immediately evident, I think) is the poll bias against the Tories. They counter this by (I think) arbitrarily redistributing 1.1% of the vote shares from Labour to the Tories. This probably explains why their model is a bit more favourable to the Conservatives, while being driven by the data in the polls, which seem to suggest Labour are catching up.
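Just to fix ideas, this is the kind of adjustment I have in mind, as a minimal Python sketch. I stress that this is my guess at the mechanics, not EC's actual procedure: I am assuming the 1.1% means 1.1 percentage points, the function `shift_share` is my own invention and the poll shares are made up.

```python
def shift_share(shares, source, target, points):
    """Move `points` percentage points of vote share from `source` to `target`.

    Returns a new dictionary and leaves the input untouched.
    """
    adjusted = dict(shares)
    moved = min(points, adjusted[source])  # never push a share below zero
    adjusted[source] -= moved
    adjusted[target] += moved
    return adjusted


# Hypothetical poll shares (in percent), not taken from any real poll
poll = {"Conservative": 42.0, "Labour": 38.0, "Other": 20.0}
adjusted = shift_share(poll, "Labour", "Conservative", 1.1)
print(adjusted)
```

Because the same amount is removed from one party and added to the other, the shares still add up to 100, while the gap between the two parties widens by twice the shift.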

I think Yougov's model is very extensive and possibly does get it right $-$ after all, speaking only for my own model, Brexit is one of the factors and possibly can act as proxy for many others (age, education, etc). But surely there'll be more than that to make up people's minds? Only a few more days before we find out...

Friday 2 June 2017

The code (and other stuff...)

I've received a couple of emails or comments on one of the General Election posts to ask me to share the code I've used.

In general, I think this is a bit dirty and lots could be done in a more efficient way $-$ effectively, I'm doing this out of my own curiosity and while I think the model is sensible, it's probably not "publication-standard" (in terms of annotation etc).

Anyway, I've created a (rather plain) GitHub repository, which contains the basic files (including R script, R functions, basic data and JAGS model). Given time (which I'm not given...), I'd like to put a lot more description and perhaps also write a Stan version of the model code. I could also write a more precise model description $-$ I'll try to update the material on the GitHub.

On another note, the previous posts have been syndicated in a couple of places (here and here), which was nice. And finally, here's a little update with the latest data. As of today, the model predicts the following seats distribution.

Party            mean         sd  2.5%  median  97.5%
Conservative  352.124  3.8760350   345     352    359
Labour        216.615  3.8041091   211     217    224
UKIP            0.000  0.0000000     0       0      0
Lib Dem        12.084  1.8752228     8      12     16
SNP            49.844  1.8240041    45      51     52
Green           0.000  0.0000000     0       0      0
PCY             1.333  0.9513233     0       2      3
Other           0.000  0.0000000     0       0      0

Labour are still slowly but surely gaining some ground $-$ I'm not sure the effects of the debate earlier this week (which was deserted by the PM) are visible yet, as only a couple of the polls included were conducted after that.

Another interesting thing (following up on this post) is the analysis of the marginal seats that the model predicts to swing from the 2015 Winners. I've updated the plot, which now looks as below.

Now there are 30 constituencies that are predicted to change hands, many still towards the Tories. I am not a political scientist, so I don't really know all the ins and outs of these, but I think a couple of examples are quite interesting and I would venture some comment...

So, the model doesn't know about the recent by-elections of Copeland and Stoke-on-Trent South and so still labels these seats as "Labour" (as they were in 2015), although the Tories now actually control Copeland.

The prediction reflects the polls and the impact of the EU referendum (both were strong Leave areas, with 60% and 70% of the preference, respectively), as well as the fact that the Tories did well in 2015 (36% vs Labour's 42% in Copeland and 33% to Labour's 39% in Stoke-on-Trent South). So, the model is suggesting that both are likely to switch to the Tories this time around.

In fact, we know that at the time of the by-election, while Copeland (where the contest was mostly Labour v Tories) did go blue, Stoke didn't. But there, the main battle was between the Labour and UKIP candidates (UKIP had got 21% in 2015). And the by-election was fought last February, when the Tories' lead was much more robust than it probably is now.

Another interesting area is Twickenham $-$ historically a constituency leaning to the Lib Dems, which was captured by the Conservatives in 2015. But since then, in another by-election the Tories have lost another similar area (Richmond Park, with a massive swing) and the model is suggesting that Twickenham could follow suit, come next Thursday.

Finally, Clacton was the only seat won by UKIP in 2015, but since then, the elected MP (a former Tory-turned-UKIP) has left the party and is not contesting the seat. This, combined with the poor standing of UKIP in the polls, produces the unsurprising outcome that Clacton is predicted to go blue with basically no uncertainty...

These results look reasonable to me $-$ not sure how life will turn out of course. As many commentators have noted, much may depend on the turnout among younger voters. Or other factors. And probably there'll be another instance of the "Shy-Tory effect" (I'll think about this if I get some time before the final prediction). But the model does seem to make some sense...