Saturday, 19 December 2015

Political Forecasting Machine - The Spanish Edition

This is a (rather long, but I think equally interesting!) guest post by Roberto (he's introducing himself below). We had already done some work on similar models a while back and he got so into this that now he wanted to actually take on the new version of Spanish Inquisition (aka general election). I'll put my own comments on his original text below between square brackets (like [this]).

My name is Roberto Cerina, I am a Masters student under Gianluca's supervision here at UCL, and this is a guest post on work we've been doing together.

A year after "shaking-up" the status-quo of pollsters and political pundits with our fearless forecasting of the US 2014 Senate election, the Greatest Political Forecasting Machine the world has ever seen is back with a vengeance. This time, we take on the Spanish Armada. The juicy results come after the "Model" section, so feel free to jump right there if you can't contain your enthusiasm for this forecast. Apologies for all and any errors in this work, for which I take full responsibility for.
[Actually, I guess I should be taking more responsibility in those cases. But given he's volunteered, I'll let Roberto do this! ;-)]

 Pablo Iglesias, leader of the anti-austerity party "Podemos"

Intro:
The Spanish Election which will take place this coming Sunday (December 20th) is a whole different kind of challenge compared to our experience with the US. To start with, we are talking about a multi-party system, which has seen over 100 different political entities inhabiting the "Congreso de los Diputados" (the lower house of the Spanish Parliament, which is the focus of our forecast); furthermore, the Kingdom has only been a democracy since 1977 and has morphed into its classical form of PP (Partido Popular, centre-right) vs PSOE (Partido Socialista Obrero Español, centre-left) as late as 1989, giving very few past elections to base our results on. To complicate matters further, this 2015 elections is hardly exchangeable with the previous ones, involving two new parties, Ciudadanos (C) and Podemos (P), which are polling at around 20% each. Finally, strong regional identities lead to territorial political parties which do not fall into any specific left or right category, and whilst they perform well within their specific constituency, they are largely irrelevant on the national stage.

In order to tackle this beast, we use data from the Global Election Database, giving us access to constituency-level past election results; economic data from quandl and the Spanish Instituto National de Estadistica; historical government approval rating from the Centro de Investigaciones Sociologicas; seat and vote share polls from the (very, very useful) tailor-made wikipedia page.

The model:
We capture the long-term dynamics of the Spanish election through a Discrete Choice Model based on a Multinomial Logit Regression, estimated in a Bayesian fashion. We recognise this does not solve the "Independence of Irrelevant Alternatives” (a property automatically encoded in the multinomial-logit model which suggests that the entry of a new party in the race leave the relationship between the other parties unchanged). We look forward to modify this historical model, perhaps using Multinomial Probit or Nested Logit, in the next iteration of our forecast.

The parties which we examine throughout are those that compete Nationally, whilst the "Other" label captures regional entities and very low performing national parties. The parties examined are: Partido Popular (PP); Partido Socialista Obrero Español (PSOE); Izquierda Unida (IU) and Others. This Long Run model does not include estimates for the new parties, C and P. This model is at the province level, and it is simple to aggregate the results to get a national estimate of the vote shares. Having a province-level model allows us to keep flexibility with respect to using provincial polls if they become available. The variables which we have used for this particular example are: province-level GDP; National approval rating of government and national incumbency of party. The model is essentially a version of the famous "Time for Change" developed by Alan Abramowitz, which is a good starting point for election dynamics, although it fails to be a good predictor for non-governing parties in multi-party races. We hope to cure this ill by introducing a party-province random effect to account for significant "party strongholds" effect.

The Discrete Choice Model is based on the idea that rational voters gain utility from voting for one party or another. We model their utility as a function of the observed utility (based on the "Time for Change" economic vote model), and unobserved utility distributed through a Gumbel distribution, hence encoding a Multinomial Logit model.
$$U_{ikt} = H_{ikt} + e_{ikt}, \mbox{ with } e_{ikt} \sim\mbox{Gumbel}(\mu,\beta).$$
The probability that an individual voter (or an aggregate of voters in our case) will vote for party i in province k at time t is dependant on whether party i guarantees the individual more utility than the other parties.
$$P_{ikt} = \mbox{Pr}(U_{ikt}>U_{jkt}) = \mbox{Pr}(V_{ikt} + e_{ikt} > V_{jkt}+e_{jkt}) = \mbox{Pr}(e_{jkt}<e_{ikt}+V_{ikt}-V_{jkt}) \forall j\neq i.$$After a short and painless derivation it is possible to see that the probability of interest is equal to the logit of the observed utility:
$$\mbox{logit}(P_{ikt}) =H_{ikt}.$$[I like the idea of a "painless" algebraic derivation $-$ I certainly didn't know any such thing, when I was a student...]

The historical forecast is a hierarchical function of National ($x$) and Provincial ($z$) variables, and all the relevant coefficients are assigned vague priors:
$$H_{ikt}= \alpha_{ik} + \sum_a \beta_{ik} x_{akt} + \sum_b \zeta_ {ik}z_{bkt}$$
After deriving the long run vote share probabilities, we need to produce a forecast for 2015 which includes the new parties running. We want a National forecast, so we aggregate the province forecasts. Then we assign reasonable vague priors to the expected vote shares of C and P, and re-weight our previous estimates to account for the presence of these. Here we assume that C and P steal votes equally from all parties, something that is probably overly simplistic.
$$\mbox{P}_{C,t}, \mbox{P}_{P,t} \stackrel{iid}\sim \mbox{Uniform}(0.1,0.3)$$We then re-weight the national long run estimates, after we assign C and P uniform priors between 10% and 30%, measures determined by understanding that a) a party winning 30% of the vote in this election would essentially win, and neither party looks poised to come first; b) that both parties polled well above 10% as early as February, and hence there would have been no chance of either falling through this bracket.

We use a Dynamic Bayesian model to update the Long Run estimates (for each party) obtained above, with observed opinion polls. The multitude of polls available allows us to follow the campaign for 51 weeks, meaning every week since the beginning of 2015. We connect weeks together through a reverse random walk pinned on election week at the Long Run estimates, which enables us to make inference over weeks were no polls have been released. Every week we pool all opinion polls released during the course of that week and feed them to the model, which then updates its estimates for the predicted vote shares of every party, for every week in the campaign, in a Bayesian fashion.

• $\mbox{Y}_{il} \sim \mbox{Multinomial}(\mbox{N}_{l},\theta_{1l},..., \theta_{nl})$
• $\theta_{il}=\mbox{logit}^{-1} (v_{il} + m_{i}),$
• $v_{iL} \sim \mbox{Norm} \left( \mbox{logit}(\mbox{P}_{iT}), \frac{1}{\tau_{hist}}\right)$
• $v_{il} \mid v_{il} \sim (v_{il},\sigma_v)$
• $m_{i} \sim \mbox{Norm} \left( 0, \frac{1}{\tau_{mi}}\right)$
In the equations above, we have $I=1,\ldots,L$ campaing weeks; $i=1,\ldots,n$ parties; $Y_{ik}$ is the number of voters expressing a preference for party $i$ at week $l$ of the campaign; $N_k$ is the number of voters polled over week $l$; $P_{iT}$ is the predicted vote share for the election year at hand $T$, derived from the Long-Run model; $v_{il}$ is a party-week effect; and $m_i$ is a party effect capturing the remaining unobserved party-specific variability, during the campaign. The precision parameter $\tau_{hist}$ represents the confidence we have in our long run model estimates as it regards this election at hand, and it is to be calibrated through a sensitivity analysis.

We get an initial seat projection by exploiting the historical correlation between votes and seats. We produce two different regressions, one for "Governing Parties", the other for "Non-Governing parties", and use the former to estimate the seats shares for IU, Other, C and P. The latter estimates the seats shares for PP and PSOE. The estimates are then pooled together and re-weighted in order to make sure they add up to the full 350 seats in the chamber of deputies.

The final seat projections come from another Dynamic Bayesian model, which makes use of the available seat projection polls (which are usually not as freely available) since the beginning of 2015. Again, we connect weeks together through a reverse random walk pinned on election week at the Seat Projections derived above (equations omitted as almost identical to the above Dynamic Bayesian Model).

Results:
We first produce our long run estimates, and derive a measure to assess their reliability over time. Our Long-Run dynamics estimates for the national vote shares are as follows (to two decimal places):
How reliable are these predictions? A Root Mean Squared Error of 0.0529 (4dp) over past elections since 1989, suggests that the model misses the correct vote shares of the 5 parties by 5.3% on average. This makes for a fairly reliable model which captures most of the historical dynamics behind the election. The point estimates of the model provide a mixed bag of results with respect to pointing to the correct winner of the election, although this is more a reflection of the competitiveness of Spanish politics (shown by overlapping prediction intervals for governing parties) than a weakness of the model itself.
We then proceed to update these estimates with the available polls. The final vote share estimates are also displayed below. According to these estimates, the PP wins the largest percentage of votes, followed by the PSOE and Podemos and C. Other parties and IU win predictable percentages. The model provides strong confidence in these estimates, with standard deviations lower than half of a percentage point. These results are more or less in line with the long-run model, with the exception that the PSOE is underestimated according to the long-run dynamics. However both history and polls conclude that Mariano Raoy’s Popular Party is set to gain the largest vote share.
The Dynamic Bayesian model allows us to make inference as it relates to the campaign behaviour of voters, which is displayed in the following plots. For this post we only fit the model for election week due to computational constraints, and the last polls embedded in the model are from December 16th (although the official polling deadline wad December 14th, some "Illegal Polls" from credible papers have been published since). We should point out the the validity of inference made from this model, and especially as it pertains to voter behaviour during the campaign, is proportional to the validity of the polls as a tool for monitoring the behaviour of votes. Some pollsters are better than others, and in future iterations of this model we will investigate polling firms further and use a weighted average of polls rather than a simple average.
The dotted line in the plots points to the historical estimate of the vote share. We can see how the PSOE over-performs the historical estimate throughout the campaign, whilst the PP dances around it, outperforms it form week 28 to week 47 of the campaign, and eventually converges to its historical estimate. The vote shares for these governing parties are rather stable throughout the campaign. A very different scenario unfolds when we look at the "young guns" C and P.
The two new parties start from opposite ends of the spectrum: C is a former regional, Catalonia-based anti-separatist party, which was largely unknown by the wider public at the beginning of this campaign. Podemos, on the other hand, came off the back of a successful European Election which helped put it on the map, but also strong international recognition as Spain’s anti-austerity answer to the perceived dictates of the European Union. Furthermore, the electoral victories of Syriza (the Greek "equivalent" of Podemos), and the prominence of the discussion on austerity and the EU, essentially allowed Podemos to monopolise the debate at the early stages of 2015. As the campaign went on we can see how Podemos fell short of its initial goals, and lost as high as 15% of its popular vote share, before picking back up to a more reasonable 20%, in the last few weeks of the campaign. C follows essentially the exact opposite path, increasing its share as Podemos’ falls, and dropping back to around 17% of the total vote share as Podemos pushed back in the last week. This suggests there is a high degree of substitutability between the parties, which is odd considering C is a centre-right party and Podemos is very far left.

More investigation on the reasons for this shared electorate is in order, however a simple explanation could just be that they represent some new ideas in Spanish politics. Given that someone decides to vote for a "New", before the rise of C, he/she could only vote for Podemos, perhaps even if the voters’ ideas were in contrast with those of the party, as a sign of protest. However, the rise of C allowed those disaffected with the current political system but still of the same ideological spectrum as the current (right wing) governing party, to find a new home. This is just speculation for now, but perhaps it could explain at least some of the exchangeability between the two parties. IU and Other parties maintain a fairly invariant percentage of the vote, with the exception of the last few weeks for Other parties. This could be as a result of the of the parties included in this category being correlated with C or Podemos’ variability.

We then go ahead and try to provide Seats Projections for the Congress. The difficulty here lies in that, not having at our disposal regional polling, it is not possible for us to determine the result of every single race.
Furthermore, the relationship between national vote share and seats is not 1:1, since Spain uses the D'Hondt system of seat allocation, which is a complex mechanism meant to strike the right balance between representation and governability. However, this system has its weaknesses, ie that it over-represents large parties, especially in rural areas, whilst penalising smaller ones. This simple conceptual difference is enough for us to justify the use of two different projections for governing and non-governing parties, as explained above. Hence, we pool the seats won (historically) by the parties into these two categories, do the same for the votes and then regress seats on votes for both groups. The results of these projections are then aggregated and re-weighted providing us with the following estimates based solely on the correlation between votes and seats:

We then proceed to update these with seat projection polling: the results of this effort are below, along with the plots showing how potential seats shares have evolved during the campaign. Finally, seats are projected on the actual parliament, and compared to the parliament composition we were left with in 2011, from which we can make inference as to the total change in seats.

Variability in seats is very high generally, but we experience many of the same patterns that we see at it regards to vote share during the campaign. PP and PSOE are relatively stable in their estimates, and especially in the last few weeks of the campaign, they don’t seem to vary much at all. The same can be said for IU, which is even more remarkably stable throughout, whilst Other parties experience a slight U-shape trend in the last few weeks of the campaign, but stay within their overall campaign average interval. C and Podemos are again remarkably variable with Podemos setting itself up to be potentially the second largest party at the beginning of 2015, but then falling dramatically in conjunction with C’s rise. C’s momentum was very strong in the second-half of the campaign, but flattened out in November and December, something that coincided with Podemos’ re-gaining strength.

In conclusion, we expect the congress post-2015 election to look extremely fragmented. No party will outright win a majority and it will be a power-play to see who will manage to form a government. Mariano Rajoy is likely to stay prime minister
[I guess it's: sorry, Spain...?]
but should he find himself incapable of forming a government by brokering deals with some of the other parties, we may see new and unpredictable scenarios unfolding. The powerful show and entry in parliament by Ciudadanos and Podemos raises a question: are we seeing a gradual but definite systematic renewal of Spanish politics or is this predicted strong performance by new parties due to the exceptional circumstances we live in (austerity, unemployment, EU centralisation, etc)? We cannot answer that, as of today, but we can be sure that the answer depends almost entirely on how much "change" these parties are going to be able to bring and, crucially, how quickly. They may find that even a politically adventurous electorate such as the Spaniards will have very little patience, when it comes to broken promises.

[Just to add that another interesting forecast of the Spanish election has been done by Virgilio]