Lawrence R. De Geest

Pulling together: variance tests for panel data

A new way to study cooperation.

Manchester United, once all-conquering in the Premier League, have fallen on relatively hard times, and this season piled on the misery (or comedy, depending on your allegiances). They dropped out of the title pack early on, sacked yet another illustrious manager, and on the last game of the season they lost at home to a relegated team.

I watched the game in a North End cafe, and after the final whistle I chatted with a diehard fan. He said you had to accept that some years are feast and others famine, even with a team of superstars collecting eye-watering salaries. But what frustrated him most was the variation in effort. Week in and week out, some players ran themselves into the ground, chasing the ball or yelling at the referee, while others just seemed to walk around, apparently indifferent to whether they won or lost.

Sport is a lot about tallying things up – wins, goals, hairstyles – but it’s also about seeing people behave according to lofty ideals. This fan pulled an imaginary rope and said it would be easier to swallow a bad season if he could see players who always pulled together. He was not alone: on the television, fans at Old Trafford, the team’s stadium, crowded around the tunnel to shake their fists at the players.

Understanding cooperation through variance

You can learn a thing or two about cooperation watching the Premier League, but you can learn more watching people in experiments designed to study cooperation under different scenarios. By now there is a library of these papers about how people cooperate in teams, or firms, or countries, or whatever. Many follow a basic framework. You put people into groups and they play a version of a prisoner’s dilemma: everybody wins when they pull together, but each individual person wins more when they slack off and let others do the work. Usually, pulling together means taking an endowment of tokens given to you by the experimenter and putting them into a shared pot, and slacking off means keeping your tokens for yourself.

The standard protocol for measuring cooperation is to look at how many tokens, on average, a group puts in the pot. But the average is like the scoreline at the end of a game: maybe the team won, but only thanks to the effort of one player, not because everybody pulled together. Like our diehard fan suggested, another way to study cooperation is to check whether the players really are pulling together – check the variation in behavior. If everybody is on the same page, then they must be doing the same thing, so their behavior will not be statistically different.
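
To see how the average can hide what the variance reveals, here is a tiny made-up example (in Python, since it is handy for quick illustrations): two hypothetical four-person groups contribute the same number of tokens on average, but only one of them is actually pulling together.

```python
import numpy as np

# Hypothetical token contributions from two four-person groups
# (made-up numbers; say each player has an endowment of 20 tokens)
steady   = np.array([10, 11, 9, 10])   # everyone pulls together
lopsided = np.array([20, 18, 1, 1])    # two players carry the team

print(steady.mean(), lopsided.mean())  # both groups average 10.0 tokens
print(steady.var(), lopsided.var())    # but the spreads are wildly different
```

The scoreline (the mean) is identical; only the variance tells the two groups apart.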

Experiments have many groups of people playing these games, and you want to get a sense of the whole – of any systematic patterns when you put all the groups together. So the thing to do is to compare the variation within each group across all the groups.

Variance tests and twins

The canonical test for this problem is Levene’s test for equality of variances. Consider the null hypothesis that all groups have the same variance. That is, $H_{0}:\sigma_{1}^{2}=\sigma_{2}^{2}=…=\sigma_{j}^{2}$, where $j$ indexes groups. Levene’s test calculates the absolute deviation of each individual $i$ from their group mean as

\[z_{ij}=|y_{ij}-\bar{y}_{j}|\]

and then calculates the one-way F-test

\[F(\mathbf{z})=\mathrm{ANOVA}(z_{ij})\]

where

\[F(\mathbf{z})\sim F(j-1,\,n-j),\]

with $j$ the number of groups and $n$ the total number of observations.
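
As a quick sanity check, Levene’s statistic really is just a one-way ANOVA on the absolute deviations. A short Python sketch (with arbitrary simulated groups) computes it by hand and matches `scipy.stats.levene` with `center='mean'`:

```python
import numpy as np
from scipy import stats

# Three arbitrary simulated groups with different spreads
rng = np.random.default_rng(0)
groups = [rng.normal(0, sd, size=30) for sd in (1.0, 1.5, 2.0)]

# By hand: absolute deviations from each group mean, then a one-way F-test
z = [np.abs(g - g.mean()) for g in groups]
F_manual, p_manual = stats.f_oneway(*z)

# Built-in version (center='mean' gives the classic Levene test;
# center='median' would give Brown-Forsythe)
F_scipy, p_scipy = stats.levene(*groups, center='mean')
print(F_manual, F_scipy)  # identical
```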

However, both the Levene and Brown-Forsythe tests (the latter a variant of Levene’s test that uses deviations from the group median) assume observations within a group are independent. This is where things get a bit trickier. In a group of people trying to cooperate, whether one person cooperates will depend on whether another person cooperates, and so on. Even though observations between groups are independent (since each one is like a separate experiment), observations within a group are correlated.

Iachine et al. (2010) and Soave and Sun (2017) consider the problem of variance tests for dependent observations in the context of twin data. A pair of twins share genetic information and usually grow up in similar environments, so continuous measures (e.g. height) are correlated. Twins can be identical (as in genetically identical) or fraternal, and the literature suggests the spread of height within identical pairs is the same as within fraternal pairs. In other words, in a sample of identical and fraternal twins, the variances in height across the two types should be equal. But you can’t test this hypothesis using the Levene and Brown-Forsythe tests. Since observations within each pair are correlated, those tests will produce too many false positives to be reliable.
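
A small simulation makes the problem concrete. In this Python sketch (all parameters made up), two twin “types” have identical variances, but strong within-pair correlation tricks the naive Levene test into rejecting far more often than the nominal 5 percent:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def naive_levene_pvalue(n_pairs=50, rho=0.9):
    """Draw two twin types with equal total variance (1) and within-pair
    correlation rho, then run the independence-assuming Levene test."""
    def draw_type():
        shared = rng.normal(0, np.sqrt(rho), (n_pairs, 1))      # pair-level part
        unique = rng.normal(0, np.sqrt(1 - rho), (n_pairs, 2))  # individual part
        return (shared + unique).ravel()                        # flatten pairs
    return stats.levene(draw_type(), draw_type(), center='mean').pvalue

# Under the null (equal variances) a 5% test should reject ~5% of the time,
# but with correlated pairs the rejection rate is much higher
rejections = np.mean([naive_levene_pvalue() < 0.05 for _ in range(500)])
print(f"False-positive rate at nominal 5%: {rejections:.3f}")
```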

Fortunately, there is a workaround. Instead of comparing raw variances, you compare “residualized” variances – the variance left over after you take out the variance due to the correlation within twins. Both papers show that you can make this comparison by re-framing the variance test as a regression. Specifically, you replace the one-way $ANOVA$ with a two-step regression with clustered standard errors to get the residualized variances, and then carry out the variance test with a Wald-type test.

How I came to know all this is thanks to a comment from Brock Stoddard on a paper I wrote with John Stranlund. We studied how people cooperate when other people can steal their stuff (picture a group of fishermen who share a fishery and must fend off poachers). One interesting thing about the paper is that we had a treatment with outsiders (the ones who could steal stuff) and one treatment without. Long story short, the threat of outsiders had an effect on average insider cooperation (positive or negative depended on the payoff function, another one of our treatments). But we also wanted to look at how the threat affected not just the expected value of cooperation, but also the spread or variance. When outsiders lurked, did insiders pull together or pull apart?

Brock pointed out we couldn’t use the canonical tests because they assume independent observations. But we also couldn’t use the tests described by Iachine et al. (2010) and Soave and Sun (2017). Like most other cooperation experiments, we had individuals in groups making decisions over time. So observations weren’t just correlated within a period, they were also correlated across periods, meaning we had to account for both cross-sectional and serial correlation. Lucky for us it was pretty easy to do.

A modified Levene’s test for panel data

Let $x_{j}$ be an indicator for membership in group $j$ and let $y_{ijt}$ be an observation from individual $i$ in group $j$ in period $t$. (For the sake of simplicity I’ll drop treatment subscripts, but this test can be modified to test for differences in variation between treatments.) The null hypothesis is $H_{0}:\sigma_{1}^{2}=\sigma_{2}^{2}=…=\sigma_{j}^{2}$. The new test proceeds in three steps:

  1. Estimate the random effects GLS model

    \[y_{ijt}=\mathbf{X}'\beta+\nu_{t}+\mu_{i}+\epsilon_{ijt},\]

    where $\mathbf{X}$ is a matrix of group dummies, $\nu_{t}$ are period fixed-effects (more dummies), $\mu_{i}$ is the subject random error and $\epsilon_{ijt}$ is the idiosyncratic error.

  2. Obtain the residuals \(\hat{\epsilon}_{ijt}\) and calculate

    \[z_{ijt}=|\hat{\epsilon}_{ijt}|\]
  3. Finally, estimate the random effects GLS model

    \[z_{ijt}=\mathbf{X}'\beta+\mu_{i}+\epsilon_{ijt}.\]

    This results in a Wald test based on the cluster-robust variance-covariance matrix from the previous step, which spits out a test statistic distributed $\chi_{(j-1)}^{2}$. The usual interpretation applies: if the test statistic is large and its p-value is small, reject the null hypothesis.

Code

All this is easy on a computer.1 In Stata it looks something like:

xtset subject period
* Step 1: regress contributions on group and period dummies, clustering on group
quietly xtreg y i.group i.period, re vce(cluster group)
* Step 2: absolute values of the idiosyncratic residuals
predict residuals, e
gen d = abs(residuals)
* Step 3: regress absolute residuals on group dummies, then read off the Wald test
quietly xtreg d i.group, re vce(cluster group)
display "Chi2 = " e(chi2) "  p-val = " e(p)

Since you only need the Chi-squared test on the last model, you can suppress the regression output with quietly and then just print or display the Chi-squared test to the screen.

The nice thing about this set-up is that it fits nicely into a loop. So, if you have several treatments (stored in a variable called treatment), you can run the test for each one:

xtset subject period
* Collect the distinct treatment values into a local macro
levelsof treatment, local(treatments)
foreach i of local treatments {
	preserve
	keep if treatment == `i'
	quietly xtreg y i.group i.period, re vce(cluster group)
	predict residuals, e
	gen d = abs(residuals)
	quietly xtreg d i.group, re vce(cluster group)
	display "Treatment `i': Chi2 = " e(chi2) "  p-val = " e(p)
	restore
}

Conclusion

Sometimes the mean is meaningful, but sometimes it obscures more than it reveals. So it helps to look at the variance, too. You can use this test whenever you want to look at the spread of cooperation, or any kind of behavior, in groups where people’s decisions (and outcomes) correlate with each other and over time. Maybe it will even help the coaching staff at Manchester United (but I hope not: I’m an Arsenal fan).

References

De Geest, Lawrence R., and John K. Stranlund. “Defending public goods and common-pool resources.” Journal of Behavioral and Experimental Economics 79 (2019): 143-154. [Link]

Gastwirth, Joseph L., Yulia R. Gel, and Weiwen Miao. “The impact of Levene’s test of equality of variances on statistical theory and practice.” Statistical Science 24, no. 3 (2009): 343-360. [Link]

Iachine, Ivan, Hans Chr Petersen, and Kirsten O. Kyvik. “Robust tests for the equality of variances for clustered data.” Journal of Statistical Computation and Simulation 80, no. 4 (2010): 365-377. [Link]

Soave, David, and Lei Sun. “A generalized Levene’s scale test for variance heterogeneity in the presence of sample correlation and group uncertainty.” Biometrics 73, no. 3 (2017): 960-971. [Link]

  1. The full code for our paper is on GitHub, and the part where we run this test is on line 106 of models.do