The number of babies called Kian in Ireland increased sharply after the band Westlife became popular. Is this relationship coincidental? Or is there some deeper meaning?

The Central Statistics Office in Ireland provides various data sets on birth rates and names. Pairing this data with Westlife's single release history leads to the interesting pattern below, where it looks like the popularity of the name Kian in Ireland took off shortly after Westlife became the international megastars we know and love today.

We can see this even more clearly if we lag the singles data by two years and use a scatter plot matrix:

While it's tempting to say that Westlife caused Kian to become popular, we have to be careful not to wander into statistical traps. Correlation does not imply causation. However, we can be a little more specific.

## The Granger causality test

The Granger causality test is a statistical test that can be used to determine whether one time series is a good predictor of another. More specifically, if we test whether A is Granger-caused by B and reject the null hypothesis that it is not, we can then state that the past values of A and B together are a better predictor of A than the past values of A alone. This is not as strong a statement as saying that A is caused by B[1], but it does go further than simply stating that A and B are correlated.

The Granger causality test works by building two multiple regression models of the target time series. The first model is constructed from the past values of the target time series alone, while the second is built using the past values of the target series and the past values of another "candidate" series, for which we are testing the Granger causality hypothesis. The null hypothesis is that the lagged values of the candidate series add nothing to the second model, i.e. that their coefficients are all zero. Various hypothesis tests can be performed to check this. Generally, a Granger causal relationship is declared to exist if we reject the null hypothesis at a chosen significance level (e.g. p < 0.05).
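The two-model comparison described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code behind this post: the function names `lagged` and `granger_f` are made up for the example, the data is synthetic, and only the SSR-based F statistic is computed.

```python
import numpy as np

def lagged(series, max_lag):
    """Columns are series[t-1], ..., series[t-max_lag] for t = max_lag..n-1."""
    n = len(series)
    return np.column_stack(
        [series[max_lag - k : n - k] for k in range(1, max_lag + 1)]
    )

def granger_f(target, candidate, max_lag):
    """SSR-based F statistic for 'candidate Granger-causes target'."""
    target = np.asarray(target, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    y = target[max_lag:]
    const = np.ones((len(y), 1))
    # Restricted model: constant + past values of the target only
    X_restricted = np.hstack([const, lagged(target, max_lag)])
    # Unrestricted model: the same, plus past values of the candidate
    X_full = np.hstack([X_restricted, lagged(candidate, max_lag)])

    def ssr(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return float(resid @ resid)

    ssr_r, ssr_u = ssr(X_restricted), ssr(X_full)
    df_num = max_lag                      # number of extra coefficients tested
    df_denom = len(y) - X_full.shape[1]   # residual df of the unrestricted model
    f_stat = ((ssr_r - ssr_u) / df_num) / (ssr_u / df_denom)
    return f_stat, df_num, df_denom

# Synthetic example (not real data): a is driven by last period's b
rng = np.random.default_rng(0)
b = rng.normal(size=200)
a = np.zeros(200)
a[1:] = 0.8 * b[:-1] + 0.1 * rng.normal(size=199)

f_stat, df_num, df_denom = granger_f(a, b, max_lag=1)
print(f_stat, df_num, df_denom)
```

A large F statistic means the candidate's lags reduce the residual error by more than chance would suggest. A p-value can then be read off the F distribution with `df_num` and `df_denom` degrees of freedom, for example with `scipy.stats.f.sf(f_stat, df_num, df_denom)`.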

In cases where there is a structural break in the data, we can either introduce dummy variables to the regression or just check for the presence of a causal relationship on either side of the break separately.

In our data, there is a structural break around 1999, when Westlife first began to release singles. Before this point, the Westlife singles series has all-zero values, so it doesn't make sense to test for Granger causality here (an all-zero vector has no predictive power in a regression), but we can do the check from 1999 onwards, up to a maximum of four lags:

Granger Causality
('number of lags (no zero)', 1)
ssr based F test:         F=5.6379  , p=0.0351  , df_denom=12, df_num=1
ssr based chi2 test:   chi2=7.0474  , p=0.0079  , df=1
likelihood ratio test: chi2=5.7772  , p=0.0162  , df=1
parameter F test:         F=5.6379  , p=0.0351  , df_denom=12, df_num=1

Granger Causality
('number of lags (no zero)', 2)
ssr based F test:         F=5.8490  , p=0.0236  , df_denom=9, df_num=2
ssr based chi2 test:   chi2=18.1970 , p=0.0001  , df=2
likelihood ratio test: chi2=11.6594 , p=0.0029  , df=2
parameter F test:         F=5.8490  , p=0.0236  , df_denom=9, df_num=2

Granger Causality
('number of lags (no zero)', 3)
ssr based F test:         F=1.2522  , p=0.3713  , df_denom=6, df_num=3
ssr based chi2 test:   chi2=8.1395  , p=0.0432  , df=3
likelihood ratio test: chi2=6.3205  , p=0.0970  , df=3
parameter F test:         F=1.2522  , p=0.3713  , df_denom=6, df_num=3

Granger Causality
('number of lags (no zero)', 4)
ssr based F test:         F=0.7840  , p=0.6039  , df_denom=3, df_num=4
ssr based chi2 test:   chi2=12.5447 , p=0.0137  , df=4
likelihood ratio test: chi2=8.5871  , p=0.0723  , df=4
parameter F test:         F=0.7840  , p=0.6039  , df_denom=3, df_num=4


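The test results show that a Granger causal relationship exists for lags of one and two years at a significance level of 0.05: all four tests reject the null hypothesis at those lags. At lags of three and four years the evidence largely disappears, with only the SSR-based chi-squared test staying below 0.05 while the F tests and the likelihood ratio test do not. This effectively confirms our earlier suspicion: that Westlife singles have significant predictive power in forecasting the popularity of the name Kian.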
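The F and chi-squared rows in the output above are closely related, which helps explain why they can disagree at higher lags. Assuming the usual definitions of the SSR-based statistics and a regression that includes a constant (an assumption, though the reported degrees of freedom are consistent with it), the chi-squared statistic is just the F statistic rescaled by the sample size. The helper name `chi2_from_f` is made up for this sketch:

```python
# (lags, F statistic, chi2 statistic, df_denom) copied from the output above
results = [
    (1, 5.6379, 7.0474, 12),
    (2, 5.8490, 18.1970, 9),
    (3, 1.2522, 8.1395, 6),
    (4, 0.7840, 12.5447, 3),
]

def chi2_from_f(lags, f_stat, df_denom):
    # The unrestricted model has two regressors per lag (target and
    # candidate) plus a constant, so nobs = df_denom + 2 * lags + 1.
    nobs = df_denom + 2 * lags + 1
    # F    = ((ssr_r - ssr_u) / ssr_u) * df_denom / lags
    # chi2 = ((ssr_r - ssr_u) / ssr_u) * nobs
    return f_stat * lags * nobs / df_denom, nobs

for lags, f_stat, chi2, df_denom in results:
    rebuilt, nobs = chi2_from_f(lags, f_stat, df_denom)
    print(f"lags={lags} nobs={nobs} chi2 reported={chi2:.4f} rebuilt={rebuilt:.4f}")
```

The rebuilt values match the reported ones to within rounding, and the implied sample sizes are only 12 to 15 yearly observations. The chi-squared version relies on a large-sample approximation, so with this little data it is the more lenient of the two, and the F tests are probably the safer guide at lags of three and four.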

## Predicting popularity

Now that we've confirmed that there's a Granger causal relationship in the data, we can exploit[2] it to predict the popularity of Kian over time:

As you can see, the model predicts the popularity of Kian quite well. However, if we exclude all the past popularity data from the predictor variables, something interesting happens:

While this model is a little less accurate than the first, it doesn't rely on the past popularity of the name Kian when predicting its future popularity. Instead, it accurately predicts the popularity of the name based on the number of Westlife singles released in a given year alone.
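For readers who want to see the shape of such a model, here is a minimal sketch of a regression that uses only lagged singles counts as predictors. The numbers are placeholders invented for illustration, not the CSO or Westlife data, and the toy target is built without noise so that the fit can be checked by eye; the real code is in the repository mentioned in the footnotes.

```python
import numpy as np

# Placeholder yearly singles counts for illustration only (NOT the real data)
singles = np.array([0, 0, 3, 4, 3, 2, 3, 2, 1, 2, 1, 2, 1, 0], dtype=float)

# Design matrix for years t = 2..n-1: constant, singles[t-1], singles[t-2].
# Note that no past values of the target series appear here.
X = np.column_stack([
    np.ones(len(singles) - 2),
    singles[1:-1],   # lag of one year
    singles[:-2],    # lag of two years
])

# Hypothetical "true" relationship used to build a toy target series
true_coefs = np.array([2.0, 5.0, 3.0])  # intercept, lag 1, lag 2
babies = X @ true_coefs

# Ordinary least squares fit, then in-sample predictions
coefs, *_ = np.linalg.lstsq(X, babies, rcond=None)
predicted = X @ coefs
print(coefs)
```

Because the toy target is an exact linear function of the two lags, the fit recovers the coefficients used to build it. With real data the residuals would not be zero, and the fitted coefficients would describe how strongly each year's releases relate to the name's popularity one and two years later.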

## Conclusion

Using the Granger causality test and a multiple linear regression, it's possible to build an accurate model that explains the current popularity of the name Kian using the number of Westlife singles released in the past two years.

The popularity of the names of other Westlife members (Bryan/Brian, Mark, Nicky and Shane) is not as straightforward to model, as they (or their most common variants) were never as unpopular as the name Kian prior to Westlife releasing their first single.

As Westlife last released a single in 2011, the current model will likely fail in 2015 and beyond, although it will be interesting to see whether future deviations from its predictions can be explained by Kian Egan's solo releases, or by taking into account the chart positions of singles, which are a more intuitive indicator of popularity than the raw number of singles released in a given year.

1. For instance, A and B could both be caused by a third factor, C. ↩︎

2. All the code and data for this are available on GitHub. ↩︎