Tuesday, 25 April 2017

Snap

In the grand tradition of all recent election times, I've decided to have a go at building a model that could predict the results of the upcoming snap general election in the UK. I'm sure there will be many more people having a go at this, from various perspectives and using different modelling approaches. Also, I will try very hard not to spend all of my time on this and so I have set out to develop a fairly simple (although, hopefully, reasonable) model.

First off: the data. I think that since the announcement of the election, the pollsters have intensified the number of surveys; I have already found 5 national polls (two by YouGov, two by ICM and one by Opinium) $-$ there may be more and I'm not claiming a systematic review/meta-analysis of the polls.

Arguably, this election will be mostly about Brexit: there surely will be other factors, but because this comes almost exactly a year after the referendum, it is a fair bet to suggest that how people felt and still feel about its outcome will also massively influence the election. Luckily, all the polls I have found do report data in terms of voting intention, broken down by Remain/Leave. So, I'm considering $P=8$ main political parties: Conservatives, Labour, UKIP, Liberal Democrats, SNP, Green, Plaid Cymru and "Others". Also, for simplicity, I'm considering only England, Scotland and Wales $-$ this shouldn't be a big problem, though, as in Northern Ireland elections are generally a "local affair", with the mainstream parties not playing a significant role.

I also have available data on the results of both the 2015 election (by constituency and again, I'm only considering the $C=632$ constituencies in England, Scotland and Wales $-$ this leaves out the 18 Northern Irish constituencies) and the 2016 EU referendum. I had to do some work to align these two datasets, as the referendum did not consider the usual geographical resolution. I have mapped the voting areas used in 2016 to the constituencies and have recorded the proportion of votes won by the $P$ parties in 2015, as well as the proportion of Remain vote in 2016.

For each observed poll $i=1,\ldots,N_{polls}$, I modelled the observed data among "$L$eavers" as $$y^{L}_{i1},\ldots,y^{L}_{iP} \sim \mbox{Multinomial}\left(\left(\pi^{L}_{1},\ldots,\pi^{L}_{P}\right),n^L_i\right).$$ Similarly, the data observed for "$R$emainers" are modelled as $$y^R_{i1},\ldots,y^R_{iP} \sim \mbox{Multinomial}\left(\left(\pi^R_{1},\ldots,\pi^R_P\right),n^R_i\right).$$
In other words, I'm assuming that within the two groups of voters, there is a vector of underlying probabilities associated with each party  ($\pi^L_p$ and $\pi^R_p$) that are pooled across the polls. $n^L_i$ and $n^R_i$ are the sample sizes of each poll for $L$ and $R$.

I used a fairly standard formulation and modelled $$\pi^L_p=\frac{\phi^L_p}{\sum_{p=1}^P \phi^L_p} \qquad \mbox{and} \qquad \pi^R_p=\frac{\phi^R_p}{\sum_{p=1}^P \phi^R_p} $$ and then $$\log \phi^j_p = \alpha_p + \beta_p j$$ with $j=0,1$ to indicate $L$ and $R$, respectively. Again, using fairly standard modelling, I fix $\alpha_1=\beta_1=0$ to ensure identifiability and then model $\alpha_2,\ldots,\alpha_P \sim \mbox{Normal}(0,\sigma_\alpha)$ and $\beta_2,\ldots,\beta_P \sim \mbox{Normal}(0,\sigma_\beta)$. 

This essentially fixes the "Tory effect" to 0 (if only I could really do that!...) and then models the effect of the other parties with respect to the baseline. Negative values for $\alpha_p$ indicate that party $p\neq 1$ is less likely to grab votes among leavers than the Tories; similarly positive values for $\beta_p$ mean that party $p \neq 1$ is more popular than the Tories among remainers. In particular, I have used some informative priors by defining the standard deviations $\sigma_\alpha=\sigma_\beta=\log(1.5)$, to mean that it is unlikely to observe massive deviations (remember that $\alpha_p$ and $\beta_p$ are defined on the log scale). 
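To make the link between the coefficients and the party probabilities a bit more concrete, here is a minimal sketch in Python (not the code used to fit the actual model). The values of `alpha` and `beta` are made up for three hypothetical parties, purely to show how the $\pi^L_p$ and $\pi^R_p$ are obtained from $\log \phi^j_p = \alpha_p + \beta_p j$.

```python
import math

def party_probabilities(alpha, beta, j):
    """Return the vector pi^j_p; j=0 for Leavers, j=1 for Remainers."""
    # log(phi^j_p) = alpha_p + beta_p * j
    phi = [math.exp(a + b * j) for a, b in zip(alpha, beta)]
    total = sum(phi)
    # pi^j_p = phi^j_p / sum_p phi^j_p
    return [x / total for x in phi]

# Made-up coefficients for three parties (an assumption, not estimates);
# the first party is the baseline, so alpha_1 = beta_1 = 0.
alpha = [0.0, -0.5, -1.0]
beta = [0.0, 0.4, 0.8]

pi_leave = party_probabilities(alpha, beta, j=0)
pi_remain = party_probabilities(alpha, beta, j=1)

# Prior standard deviation used for alpha_p and beta_p (p > 1).
prior_sd = math.log(1.5)

print([round(p, 3) for p in pi_leave])
print([round(p, 3) for p in pi_remain])
```

With these made-up numbers, the third party is less popular than the baseline among leavers but recovers ground among remainers, because its $\beta_p$ is positive.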


I then use the estimated party- and EU result-specific probabilities to compute a "relative risk" with respect to the observed overall vote at the 2015 election $$\rho^j_p = \frac{\pi^j_p}{\pi^{15}_p},$$ which essentially estimates how much better (or worse) the parties are doing in comparison to the last election, among leavers and remainers. The reason I want these relative risks is that I can then distribute the information from the current polls and the EU referendum to each constituency $c=1,\ldots,C$ by estimating the predicted share of votes at the next election as the mixture $$\pi^{17}_{cp} = (1-\gamma_c)\pi^{15}_p\rho^L_p + \gamma_c \pi^{15}_p\rho^R_p,$$ where $\gamma_c$ is the observed proportion of remain voters in constituency $c$.

Finally, I can simulate the next election by ensuring that in each constituency the $\pi^{17}_{cp}$ sum to 1. I do this by drawing the vote shares as $\hat{\pi}^{17}_{cp} \sim \mbox{Dirichlet}(\pi^{17}_{c1},\ldots,\pi^{17}_{cP})$.
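The relative risks, the mixture and the Dirichlet draw can be sketched for a single constituency as follows. This is only an illustration in Python with made-up numbers for three hypothetical parties. It also makes one loud assumption: the 2015 share entering the mixture (`pi15_c`) is taken to be the constituency-level one, while the relative risks are computed against the overall 2015 share (`pi15_overall`). The helper names are hypothetical.

```python
import random

def relative_risk(pi_poll, pi15_overall):
    # rho^j_p = pi^j_p / pi^15_p
    return [p / q for p, q in zip(pi_poll, pi15_overall)]

def predicted_shares(pi15, rho_leave, rho_remain, gamma_c):
    # (1 - gamma_c) * pi15_p * rho^L_p + gamma_c * pi15_p * rho^R_p
    return [(1 - gamma_c) * s * rl + gamma_c * s * rr
            for s, rl, rr in zip(pi15, rho_leave, rho_remain)]

def draw_dirichlet(params, rng):
    # Dirichlet draw via normalised Gamma variates; parties with a zero
    # parameter (no presence in the constituency) keep a zero share.
    draws = [rng.gammavariate(a, 1.0) if a > 0 else 0.0 for a in params]
    total = sum(draws)
    return [d / total for d in draws]

# Made-up inputs (assumptions, not real data).
pi15_overall = [0.40, 0.35, 0.25]  # overall 2015 shares
pi_leave = [0.55, 0.30, 0.15]      # poll-based shares among leavers
pi_remain = [0.35, 0.35, 0.30]     # poll-based shares among remainers
pi15_c = [0.45, 0.40, 0.15]        # 2015 shares in constituency c (assumed)
gamma_c = 0.6                      # Remain proportion in constituency c

rho_leave = relative_risk(pi_leave, pi15_overall)
rho_remain = relative_risk(pi_remain, pi15_overall)
pi17_c = predicted_shares(pi15_c, rho_leave, rho_remain, gamma_c)

rng = random.Random(2017)
draw = draw_dirichlet(pi17_c, rng)
print([round(p, 3) for p in pi17_c], round(sum(pi17_c), 3))
print([round(d, 3) for d in draw])
```

The mixture does not sum exactly to 1 in general, which is why the simulated shares are normalised through the Dirichlet draw. Repeating the draw many times gives a distribution of results for the constituency.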

In the end, for each constituency I have a distribution of election results, which I can use to determine the average outcome, as well as various measures of uncertainty. So in a nutshell, this model is all about i) re-proportioning the 2015 and 2016 votes based on the polls; and ii) propagating uncertainty in the various inputs.

I'll update this model as more polls become available $-$ one extra issue then will be about discounting older polls (something like what Roberto did here and here, but I think I'll keep things easy for this). For now, I've run my model for the 5 polls I mentioned earlier and this is the (rather depressing) result.
From the current data and the modelling assumptions, it looks like the Tories are indeed on course for a landslide victory $-$ my results are also kind of in line with other predictions (eg here). The model here may be flattering to the Lib Dems $-$ the polls seem to indicate almost unanimously that they will be doing very well in areas of a strong Remain persuasion, which means that the model predicts they will gain many seats, particularly where the 2015 election was won by a small margin (and often they leapfrog Labour into first place).

The following table shows the predicted "swings" $-$ who's stealing votes from whom:

                      Conservative Green Labour Lib Dem PCY SNP
  Conservative                 325     0      0       5   0   0
  Green                          0     1      0       0   0   0
  Labour                        64     0    160       6   1   1
  Liberal Democrat               0     0      0       9   0   0
  Plaid Cymru                    0     0      0       0   3   0
  Scottish National Party        1     0      0       5   0  50
  UKIP                           1     0      0       0   0   0

Again, at the moment, it's a bad day at the office for Labour, which fails to win a single new seat, while losing over 60 to the Tories, 6 to the Lib Dems, 1 to Plaid Cymru in Wales and 1 to the SNP (which would mean Labour completely erased from Scotland). UKIP is also predicted to lose its only seat $-$ but again, this seems a likely outcome.


Thursday, 20 April 2017

Post-doc

If you fancy becoming like the crazy, purple minion, we have a Research Associate position at the UCL Institute for Global Health (with whom I've been heavily involved in the past year or so, while organising our new MSc Health Economics & Decision Science).

All the relevant details and the link to the application form are available here. The deadline is 13 May.

Tuesday, 18 April 2017

Hope & Faith

In a remarkable and unpredictable (maybe?) turn of events, the UK Prime Minister has today sort-of-called a general election for this coming June $-$ sort-of, if you don't follow UK politics, because technically a law prevents the PM from calling snap elections unless 66% of Parliament agrees to this, and so there will need to be a discussion and then it will be Parliament that actually calls the election...

Anyway, current polls give the ruling Conservative party way ahead with 43%. The Labour Party (who are supposed to be the main opposition, but have been in a state of chaos for quite a while now) have 25% $-$ this is compared to the results at the 2015 general election where the Tories got 37% and Labour 30%.

As if the situation weren't bleak enough for people of the left-ish persuasion (with Brexit and all), this doesn't seem to be very good news and many commentators (and perhaps even the PM herself) are predicting a very good result for the Tories, maybe even a landslide.

But because of the electoral system, maybe the last word has not been said: the fact is that the UK Parliament is elected on a first-past-the-post basis and so Labour may not actually lose too many seats (as some commentators have suggested).

I went back to the official general election data and looked at the proportion of seats won by the main parties, by the size of the majority $-$ the output is in the graph below.

The story is that while there are some very marginal seats (where Labour won a tiny majority just two years ago), a 5% decrease in the vote may not be as bad as it looks $-$ although the disaffection with Labour is not necessarily uniformly distributed across the country.

More interestingly, I've also linked the data from the 2015 General Election with last year's EU referendum $-$ one of the main arguments following the Brexit outcome was that the Remain camp were not able to win in Labour strongholds, particularly in the North-East of England.

The 5 constituencies in which Labour holds a majority of less than 1% are distributed as follows, in terms of the proportion of the Remain vote:

  • Brentford and Isleworth: 0.5099906
  • City of Chester: 0.4929687
  • Ealing Central and Acton: 0.6031031
  • Wirral West: 0.5166292
  • Ynys Môn: 0.4902312
(I know I have way too many significant figures here, but I thought it'd be interesting to actually see these values). So, apart from Ealing Central and Acton (strong Remain), there may be a good chance that the other four constituencies are made up of people who are fed up with Labour and could vote for some other party.

When you actually consider all the constituencies with a Labour majority of less than 10%, then the situation is like in the following graph.

Indeed, many of these are strong Leavers, which may actually be a problem for Labour. A few, on the other hand, may not be affected so much (because the "Remain" effect may counterbalance the apathy for Labour) $-$ although it may well go the other way and parties on a clear Pro-EU platform (eg the Lib Dems) may gain massively.

At the other end of the spectrum, the corresponding picture for Conservative-held areas with small majorities is shown in the following graph.
For the Tories, the problem may be in Remain areas where they have a small majority $-$ there aren't a massive number of them, but I guess about 40% of these may be fought very hard (because they are relatively close or above 50% in terms of the proportion of Remain)?

Anyway $-$ I'm not sure whether Hope & Faith should be all smiley if you're a Pro-EU migrant. But then again, there is still some hope & faith...

Workshop on The Regression Discontinuity Design

As part of our bid to get an MRC grant (which we managed to do), we promised that, if successful, we'd also have a dissemination workshop at the end of the project. Well, the project on the Regression Discontinuity Design (RDD) has now been finished for a few months, but we're keeping our word and we have actually organised something that, as it happens, has probably turned into something slightly bigger (and better!) than intended...

As I was talking to Marcos (who's a co-director of our MSc programme, which I've mentioned for example here), we realised that the RDD is in fact a common interest of ours and so I jumped on his offer to do something together.

The plan is to have a full day on the 27th June at the Institute for Fiscal Studies in London, with the idea of mixing economists, statisticians and epidemiologists. We have a nice line-up of speakers $-$ the original idea was more to show off the outputs of the project, but I think this works much better!

The registration is now open and free $-$ all the relevant information is here!

Friday, 31 March 2017

The greedy baker with a full barrel and a drunken wife

Earlier this week, I gave a talk at one of the UCL Priment seminars. The session was organised around two presentations aimed at discussing the definition of, and differences between, "confidence" and "credible" intervals $-$ I think more generally, the point was perhaps to explore a little more the two approaches to statistical inference. Priment is a clinical trial unit and the statisticians and clinicians involved there mostly (if not almost always) use frequentist methods, but I think there was some genuine interest in the application of Bayesian inference. I very much enjoyed the event.

For my talk, I prepared a few slides to briefly show what I think are the main differences in the two approaches, mostly in terms of the fact that while in a Bayesian context inference is performed with a "bottom-up" approach to quantify and evaluate $$\Pr(\mbox{parameters / hypotheses} \mid \mbox{observed data})$$ a frequentist setting operates in the opposite direction using a "top-down" approach to quantify and evaluate $$\Pr(\mbox{observed data / even more extreme data} \mid \mbox{hypothesised parameters}).$$
I think that went down well and at the end we got into a lively discussion. In particular, the other speaker (who was championing confidence intervals) suggested that when the two approaches produce results that are numerically equivalent, then in effect even the confidence interval can be interpreted as saying something about where the parameter is.

I actually disagree with that. Consider a very simple example, where a new drug is tested and the main outcome is whether patients are free from symptoms within a certain amount of time. You may have data for this in the form of $y \sim \mbox{Binomial}(\theta,n)$ and say, for the sake of argument, that you've observed $y=16$ and $n=20$. So the 95% confidence interval is obtained (approximately) as $$\frac{y}{n} \pm 2\,\mbox{ese}\!\left(\frac{y}{n}\right),$$ where ese$(\cdot)$ indicates the estimated standard error for the relevant summary statistic.

The probabilistic interpretation of this confidence interval is 
$$\Pr\left(a \leq \frac{y}{n} \leq b \mid \theta = \frac{y}{n} \right) = 0.95$$
and what it means is that if you're able to replicate the experiment under the same conditions, say 100 times, then in about 95 of them the resulting sample proportion $y/n$ would be included between the lower and upper limits $a$ and $b$. With the data at hand, this translates numerically to 
$$\Pr\left(0.6211\leq \frac{y}{n} \leq 0.9789 \mid \theta=0.80\right) = 0.95$$
and the temptation for the greedy frequentist is to interpret this as a statement about the likely location of the parameter.
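For what it's worth, the limits above can be reproduced with a few lines of Python. This is a minimal sketch of the approximate interval $y/n \pm 2\,\mbox{ese}(y/n)$, assuming the usual estimated standard error for a proportion, $\sqrt{\hat\theta(1-\hat\theta)/n}$; the function name is hypothetical.

```python
import math

def approx_interval(y, n, z=2.0):
    """Approximate interval y/n +/- z * ese(y/n) for a binomial proportion."""
    p_hat = y / n
    # Estimated standard error of the sample proportion.
    ese = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * ese, p_hat + z * ese

a, b = approx_interval(16, 20)
print(round(a, 4), round(b, 4))
```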

The trouble with this is that by definition, a confidence interval is a probabilistic statement about sampling variability in the data $-$ not about uncertainty in the parameters as shown in the graph below.
(in this case, in fact 4 out of 100 hypothetical sample means fall outside the confidence interval limits, but this is just random variation).

The "alternative" interpretation rescales the graph above by using the sample mean to produce the one below.

While this is perfectly valid from a mathematical point of view, this interpretation kind of muddies the water, I think, because it obscures the point that this is about sampling variability in the data. And the data you would get to see in replications of the current experiment would be about the observed number of patients free from symptoms, $y^{new} \sim \mbox{Binomial}(\theta=0.8,n=20)$.

Thus, it is a bit confusing, I think, to say that a Bayesian analysis could yield the same numerical results as the confidence interval $-$ perhaps a more appropriate interpretation would be that this is a probabilistic statement in the form 
$$ \Pr\left(na \leq y \leq nb \mid \theta=\frac{y}{n}\right) = 0.95, $$
which in the present case translates to 
$$ \Pr\left(12.42 \leq y \leq 19.58 \mid \theta=0.8\right) = 0.95, $$
which suggests we should expect to see a number of patients free from symptoms between 12 and 20 in replications of the current experiment (which is of course in line with the first graph shown above, on the scale of the observable data).
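Because $y$ can only take integer values, the statement above can also be checked exactly, rather than through the Normal approximation. The sketch below (Python, standard library only) recomputes the limits $na$ and $nb$ and then sums the $\mbox{Binomial}(\theta=0.8, n=20)$ probabilities of the counts that fall between them; the helper name is hypothetical.

```python
import math

def binomial_pmf(k, n, theta):
    # Pr(y = k) for y ~ Binomial(theta, n)
    return math.comb(n, k) * theta**k * (1 - theta)**(n - k)

y_obs, n = 16, 20
theta = y_obs / n
ese = math.sqrt(theta * (1 - theta) / n)
a, b = theta - 2 * ese, theta + 2 * ese

# Integer counts that lie inside [n*a, n*b].
low, high = math.ceil(n * a), math.floor(n * b)

# Exact probability of seeing a count in that range in a replication.
coverage = sum(binomial_pmf(k, n, theta) for k in range(low, high + 1))
print(low, high, round(coverage, 3))
```

The exact probability comes out close to, but not exactly, 0.95, which is what you'd expect given the discreteness of the data and the rough "plus or minus 2 standard errors" rule.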

The Bayesian counterpart to this analysis may assume a very vague prior on $\theta$, for example $\theta \sim \mbox{Beta}(\alpha=0.1,\beta=0.1)$. This would suggest we're encoding our knowledge about the true probability of being symptoms-free in terms of a thought experiment where $\alpha+\beta=0.2$ patients were observed and only $\alpha=0.1$ were symptoms free (ie a prior estimate of 0.5 with very large variance).

Combining this with the data actually observed $y=16, n=20$, produces a posterior distribution $\theta\mid y \sim \mbox{Beta}(\alpha^*=\alpha+y, \beta^*=n+\beta-y)$, or in this case $\theta \mid y \sim \mbox{Beta}(16.1,4.1)$, as depicted below.


A 95% "credible" interval can be computed analytically and as it turns out it is $[0.6015; 0.9372]$, so indeed very close to the numerical limits obtained above when rescaling to the sample mean $y/n$ $-$ incidentally:
1. It is possible to include even less information in the prior and so get even closer to the frequentist analysis;
2. However, even in the presence of large data that probably will overwhelm the prior anyway, is it necessary to be soooo vague? Can't we do a bit better and introduce whatever prior knowledge we have (which probably will separate out the two analyses)?
3. I don't really like (personally) the term "credible" interval. I think it's actually un-helpful and it would suffice to call this an "interval estimate", which would highlight the fact that given the full posterior distribution, then all sorts of summaries could be taken.
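The conjugate update and the interval estimate can be sketched as follows. The interval quoted above is computed analytically; here, as a rough check that needs nothing beyond the Python standard library, I approximate the 2.5% and 97.5% quantiles by simulating from the posterior. The seed and the number of draws are arbitrary choices.

```python
import random

# Prior Beta(0.1, 0.1) and observed data.
alpha_prior, beta_prior = 0.1, 0.1
y, n = 16, 20

# Conjugate update: Beta(alpha + y, beta + n - y).
alpha_post = alpha_prior + y
beta_post = beta_prior + n - y

# Monte Carlo approximation of the central 95% interval.
rng = random.Random(1)
draws = sorted(rng.betavariate(alpha_post, beta_post) for _ in range(100_000))
lower = draws[int(0.025 * len(draws))]
upper = draws[int(0.975 * len(draws))]
print(alpha_post, beta_post, round(lower, 3), round(upper, 3))
```

Since the whole posterior is available as a set of draws, any other summary (mean, median, probability of exceeding a threshold) is just as easy to compute from `draws`.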

The Bayesian analysis tells us something about the unobservable population parameter given the data we've actually observed. We could predict new instances of the experiment too, by computing
$$ p(y^{new}\mid y) = \int p(y^{new}\mid \theta)p(\theta\mid y) \mbox{d}\theta. $$
This means assuming that the new data have the same data generating process as those already observed and that we propagate the revised, current uncertainty in the parameter to the new data collection. In some sense, this is even the "true" objective of a Bayesian analysis, because parameters typically don't "exist" and they are just a convenient mathematical feature we use to model observable data.
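In practice, the integral can be approximated by simulation: draw a value of $\theta$ from the posterior and then draw new data given that value. The sketch below does this in Python for a replication with 20 patients, assuming the posterior is $\mbox{Beta}(\alpha+y, \beta+n-y) = \mbox{Beta}(16.1, 4.1)$ for the prior and data used above; the function name, the seed and the number of simulations are arbitrary.

```python
import random

rng = random.Random(42)
n_new = 20
alpha_post, beta_post = 16.1, 4.1  # Beta(0.1 + 16, 0.1 + 20 - 16)

def draw_y_new(rng):
    # Step 1: propagate current uncertainty in the parameter.
    theta = rng.betavariate(alpha_post, beta_post)
    # Step 2: simulate new data given that value of theta
    # (a Binomial draw as a sum of n_new Bernoulli trials).
    return sum(rng.random() < theta for _ in range(n_new))

y_new = [draw_y_new(rng) for _ in range(20_000)]
mean_y_new = sum(y_new) / len(y_new)
print(round(mean_y_new, 2), min(y_new), max(y_new))
```

The simulated `y_new` are more spread out than draws from a $\mbox{Binomial}(0.8, 20)$ would be, because they also carry the uncertainty about $\theta$.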

The confidence interval can make probabilistic statements about the likely range of observable data in similar experiments that we may get to see in the future. The Bayesian analysis tells us something about the likely range of the population parameter, given the evidence observed and our prior knowledge, formally included in the analysis. 

While I think the construction of the confidence interval has some meaning when interpreted in terms of replications of the experiment (particularly with respect to the actual data!), I believe that the stretch to the interpretation as a probability for the parameter is unwarranted, hence the reference to the greedy baker who wants to have and eat his cake $-$ which in Italian would translate as having a full barrel (presumably of wine) and a drunken wife...

So I guess that, in conclusion, if you enjoy a tipsy partner, you should be Bayesian about it...

Tuesday, 21 February 2017

Coming soon!

We've just received a picture of the cover of the BCEA book, which is really, really close to being finally published!

I did mention this in a few other posts (for example here and here) and it has been in fact a rather long process, so much so that I have made a point of joking in my talks about BCEA that we'd be publishing the book in 2036 $-$ so we're in fact nearly 20 years early...

The official book website is actually already online and I'll prepare another one (I know, I know $-$ it's empty for now!), where we'll put all the relevant material (examples, code, etc).

I think this may be very helpful for practitioners and our ambition is to make the use of BCEA as wide as possible among health economists $-$ even those who do not currently use R. The final chapter of the book also presents our BCEAweb application, which can do (almost!) everything that the actual package can (the nice thing about it is that the computational engine is stored on a remote server and so the user does not even need to have R installed on their machine).

We'll probably have to make this part of the next edition of our summer school...

Thursday, 26 January 2017

Three rooms left...


Last December, Kobi and his classmates did their Christmas play, which was based on a relatively close representation of the Nativity (well $-$ perhaps back then shepherds used to run around with most of their hands up their nose, waving at their parents too...).

Anyway, one of the top acts of the whole thing was something like this, hence the title of the post.

But, more importantly, we're almost running out of single rooms for our summer school, later this year, in Florence (although there are more double rooms)! So book your space soon!

Monday, 23 January 2017

Face value

This is actually a not-so-recent paper, but I've only discovered it now and I think it's very interesting. The underlying issue is about trying to do "causal inference" from observational data $-$ perhaps one could see this in a simpler way by considering the idea of "balancing" observational data, to mimic as far as possible an experimental setting (and so be able to estimate "causal" effects). [There's a lot more on the philosophical aspects behind this problem, which I'm conveniently sweeping under the carpet, here...]

Anyway, one of the most popular ways of dealing with this issue of unbalanced background covariates (or generally, confounding) is to use propensity score matching. But, while I think that the idea is somewhat neat and clearly important, what has always bothered me (among other things) is the fact that the resulting outcome model does assume that the estimate of the propensity score (PS) is "perfect" $-$ known with absolute precision, although the basic assumption is that "the PS model needs to be correct". But of course, there's no way of knowing perfectly that the PS model is correct...

So the idea of joining model selection and propagation of uncertainty through the outcome model is actually very interesting. I've only flipped through the paper and I did have some very preliminary ideas in mind on this, so I really want to have a proper look at this!