
- No to inferential thresholds
- Spatial models for demographic trends?
- Graphics software is not a tool that makes your graphs for you. Graphics software is a tool that allows you to make your graphs.
- Tips when conveying your research to policymakers and the news media
- Computing marginal likelihoods in Stan, from Quentin Gronau and E. J. Wagenmakers
- My talk tomorrow (Fri) 10am at Columbia
- No no no no no on “The oldest human lived to 122. Why no person will likely break her record.”
- 3 more articles (by others) on statistical aspects of the replication crisis
- “What is a sandpit?”
- High five: “Now if it is from 2010, I think we can make all sorts of assumptions about the statistical methods without even looking.”
- I hate that “Iron Law” thing
- Fitting multilevel models when predictors and group effects correlate
- What should this student do? His bosses want him to p-hack and they don’t even know it!
- Stan Roundup, 10 November 2017
- Noisy, heterogeneous data scoured from diverse sources make his metanalyses stronger.
- “A mixed economy is not an economic abomination or even a regrettably unavoidable political necessity but a natural absorbing state,” and other notes on “Whither Science?” by Danko Antolovic
- Using D&D to reduce ethnic prejudice
- When people proudly take ridiculous positions
- Using Stan to improve rice yields
- The Statistical Crisis in Science—and How to Move Forward (my talk next Monday 6pm at Columbia)
- Why you can’t simply estimate the hot hand using regression
- Planet of the hominids? We wanna see this exposition.
- Why won’t you cheat with me?
- The Night Riders
- The time reversal heuristic (priming and voting edition)
- Post-publication review succeeds again! (Two-lines edition.)
- Pseudoscience and the left/right whiplash
- StanCon2018 Early Registration ends Nov 10
- More thoughts on that “What percent of Americans would you say are gay or lesbian?” survey
- The king must die
- Looking for data on speed and traffic accidents—and other examples of data that can be fit by nonlinear models
- Statistical Significance and the Dichotomization of Evidence (McShane and Gal’s paper, with discussions by Berry, Briggs, Gelman and Carlin, and Laber and Shedden)
- What I missed on fixed effects (plural).
- “Americans Greatly Overestimate Percent Gay, Lesbian in U.S.”
- Using Mister P to get population estimates from respondent driven sampling
- Whipsaw
- Contour as a verb
- “Quality control” (rather than “hypothesis testing” or “inference” or “discovery”) as a better metaphor for the statistical processes of science
- Advice for science writers!
- My favorite definition of statistical significance
- An alternative to the superplot
- Science funding and political ideology
- Stan Roundup, 27 October 2017
- Quick Money
- If you want to know about basketball, who ya gonna trust, a mountain of p-values . . . or that poseur Phil Jackson??
- In the open-source software world, bug reports are welcome. In the science publication world, bug reports are resisted, opposed, buried.
- This Friday at noon, join this online colloquium on replication and reproducibility, featuring experts in economics, statistics, and psychology!
- I think it’s great to have your work criticized by strangers online.
- My 2 talks in Seattle this Wed and Thurs: “The Statistical Crisis in Science” and “Bayesian Workflow”
- The Publicity Factory: How even serious research gets exaggerated by the process of scientific publication and reporting

Harry Crane points us to this new paper, "Why 'Redefining Statistical Significance' Will Not Improve Reproducibility and Could Make the Replication Crisis Worse," and writes:

Quick summary: Benjamin et al. claim that FPR [the false positive rate] would improve by factors greater than 2 and replication rates would double under their plan. That analysis ignores the existence and impact of "P-hacking" on reproducibility. My analysis accounts for P-hacking and shows that FPR and reproducibility would improve by much smaller margins and quite possibly could decline, depending on some other factors. I am not putting forward a specific counterproposal here. I am instead examining the argument in favor of redefining statistical significance in the original Benjamin et al. paper.

From the concluding section of Crane's paper:

The proposal to redefine statistical significance is severely flawed, presented under false pretenses, supported by a misleading analysis, and should not be adopted. Defenders of the proposal will inevitably criticize this conclusion as “perpetuating the status quo,” as one of them already has [12]. Such a rebuttal is in keeping with the spirit of the original RSS [redefining statistical significance] proposal, which has attained legitimacy not by coherent reasoning or compelling evidence but rather by appealing to the authority and number of its 72 authors. The RSS proposal is just the latest in a long line of recommendations aimed at resolving the crisis while perpetuating the cult of statistical significance [22] and propping up the flailing and failing scientific establishment under which the crisis has thrived.

I like Crane's style. I can't say that I tried to follow the details, because his paper is all about false positive rates, and I think that whole false-positive framing is inappropriate in most science and engineering contexts that I've seen, as I've written many times (see, for example, here and here). I think the original sin of all these methods is the attempt to get certainty or near-certainty from noisy data. These thresholds are bad news--and, as Hal Stern and I wrote a while ago, it's not just because of the 0.049-or-0.051 thing. Remember this: a z-score of 3 gives you a (two-sided) p-value of 0.003, and a z-score of 1 gives you a p-value of 0.32. One of these is super significant--"p less than 0.005"! Wow!--and the other is the ultimate statistical nothingburger. But if you have two different studies, and one gives p=0.003 and the other gives p=0.32, the difference between them is not at all remarkable. You could easily get both results from the exact same underlying effect, based on nothing but sampling variation, or measurement error, or whatever. So, scientists and statisticians: All that thresholding you're doing? It's not doing what you think it's doing. It's just a magnification of noise. So I'm not really inclined to follow the details of Crane's argument regarding false positive rates etc., but I'm supportive of his general attitude that thresholds are a joke.

POST-PUBLICATION REVIEW, NOT "EVER EXPANDING REGULATION"

Crane's article also includes this bit:

While I am sympathetic to the sentiment prompting the various responses to RSS [1, 11, 15, 20], I am not optimistic that the problem can be addressed by ever expanding scientific regulation in the form of proposals and counterproposals advocating for pre-registered studies, banned methods, better study design, or generic ‘calls to action’. Those calling for bigger and better scientific regulations ought not forget that another regulation--the 5% significance level--lies at the heart of the crisis.

As a coauthor of one of the cited papers ([15], to be precise), let me clarify that we are _not_ "calling for bigger and better scientific regulations," nor are we advocating for pre-registered studies (although we do believe such studies have their place), nor are we proposing to "ban" anything, nor are we offering any "generic calls to action." Of all the things on that list, the only thing we're suggesting is "better study design"--and our suggestions for better study design are in no way a call for "ever expanding scientific regulation." The post No to inferential thresholds appeared first on Statistical Modeling, Causal Inference, and Social Science.
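The z-score comparison above (a point Hal Stern and I have written about) can be checked in a few lines. A minimal sketch, assuming both studies report estimates with standard error 1:

```python
import math

def two_sided_p(z):
    """Two-sided normal p-value via the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))

z1, z2 = 3.0, 1.0
p1 = two_sided_p(z1)               # ~0.003: "super significant"
p2 = two_sided_p(z2)               # ~0.32: the nothingburger
# The difference of two unit-SE estimates has standard error sqrt(2),
# so the z-score for the *difference* between the studies is only:
z_diff = (z1 - z2) / math.sqrt(2)  # ~1.41
p_diff = two_sided_p(z_diff)       # ~0.16: not at all remarkable
```

The two studies look wildly different once thresholded, yet the difference between them is the kind of thing you'd expect from sampling variation alone.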

Jon Minton writes:

You may be interested in a commentary piece I wrote early this year, which was published recently in the International Journal of Epidemiology, where I discuss your work on identifying an aggregation bias in one of the key figures in Case & Deaton's (in)famous 2015 paper on rising morbidity and mortality in middle-aged White non-Hispanics in the US. Colour versions of the figures are available in the 'supplementary data' link in the above. (The long delay between writing, submitting, and the publication of the piece in IJE in some ways supports the arguments I make in the commentary, that timeliness is key, and blogs--and arxiv--allow for a much faster pace of research and analysis.)

An example of the more general approach I try to promote to looking at outcomes which vary by age and year is provided below, where I used data from the Human Mortality Database to produce a 3D printed 'data cube' of log mortality by age and year, whose features I then discuss. [See here and here.] Seeing the data arranged in this way also makes it possible to see when the data quality improves, for example, as you can see the texture of the surface change from smooth (imputed within 5/10 year intervals) to rough.

I agree with your willingness to explore data visually to establish ground truths which your statistical models then express and explore more formally. (For example, in your identification of cohort effects in US voting preferences.) To this end I continue to find heat maps and contour plots of outcomes arranged by year and age a simple but powerful approach to pattern-finding, which I am now using as a starting point for statistical model specification. The arrangement of data by year and age conceptually involves thinking about a continuous 'data surface' much like a spatial surface.
Given this, what are your thoughts on using spatial models which account for spatial autocorrelation, such as in R's CARBayes package, to model demographic data as well?

My reply: I agree that visualization is important. Regarding your question about a continuous surface: yes, this makes sense. But my instinct is that we'd want something tailored to the problem; I doubt that a CAR model makes sense in your example. Those models are rotationally symmetric, which doesn't seem like a property you'd want here. If you _do_ want to fit Bayesian CAR models, I suggest you do it in Stan. Minton responded:

I agree that additional structure and different assumptions to those made by CAR would be needed. I'm thinking more about the general principle of modeling continuous age-year-rate surfaces. In the case of fertility modeling, for example, I was able to follow enough of this paper (my background is as an engineer rather than statistician) to get a sense that it formalises the way I intuit the data. In the case of fertility, I also agree with using cohort and age as the surface's axes rather than year and age. I produced the figure in this poster, where I munged Human Fertility Database and (less quality assured but more comprehensive) Human Fertility Collection data together and re-arranged year-age fertility rates by cohort to produce slightly crude estimates of cumulative cohort fertility levels. The thick solid line shows the age at which different cohorts 'achieve' replacement fertility levels (2.05), which for most countries veers off into infinity if not achieved by around the age of 43. The USA is unusual in regaining replacement fertility levels after losing them, which I assume is a secondary effect of high migration, with migrant cohorts bringing a different fertility schedule than non-migrants. The tiles are arranged from most to least fertile in the last recorded year, but the trends show these ranks will change over time, and the USA may move to top place.

The post Spatial models for demographic trends? appeared first on Statistical Modeling, Causal Inference, and Social Science.
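For readers wondering what the symmetry objection in my reply refers to: here is a minimal sketch of the precision matrix of an intrinsic CAR prior on a small age-by-year grid, with made-up dimensions and simple 4-nearest-neighbour adjacency (this is not Minton's model, just an illustration). The prior penalizes an age-step and a year-step identically, which is exactly the kind of built-in symmetry one might not want for a demographic surface:

```python
import numpy as np

n_age, n_year = 4, 5
n = n_age * n_year
idx = lambda a, t: a * n_year + t   # flatten (age, year) -> cell index

# Adjacency: cells are neighbours if they differ by one age OR one year step
W = np.zeros((n, n))
for a in range(n_age):
    for t in range(n_year):
        for da, dt in [(-1, 0), (1, 0), (0, -1), (0, 1)]:
            aa, tt = a + da, t + dt
            if 0 <= aa < n_age and 0 <= tt < n_year:
                W[idx(a, t), idx(aa, tt)] = 1

D = np.diag(W.sum(axis=1))
Q = D - W   # intrinsic CAR precision matrix (an improper prior)
# Q is the graph Laplacian: singular, with rank n - 1, because the
# overall level of the surface is left unidentified by the prior.
```

Note that `Q` treats the age and year directions interchangeably; tailoring the model to the problem would mean, at minimum, allowing different smoothing in the two directions (or working on cohort-age axes, as Minton suggests for fertility).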

I had an email exchange with someone the other day. He had a paper with some graphs that I found hard to read, and he replied by telling me about the software he used to make the graphs. It was fine software, but the graphs were, nonetheless, unreadable.
Which made me realize that people are thinking about graphics software the wrong way. People are thinking that the software makes the graph for you. But that's not quite right. The software allows you to make a graph for yourself.
Think of graphics software like a hammer. A hammer won't drive in a nail for you. But if you have a nail and you know where to put it, you can use the hammer to drive in the nail yourself.
This is what I told my correspondent:
Writing takes thought. You can't just plug your results into a computer program and hope to have readable, useful paragraphs.
Similarly, graphics takes thought. You can't just plug your results into a graphics program and hope to have readable, useful graphs.
The post Graphics software is not a tool that makes your graphs for you. Graphics software is a tool that allows you to make your graphs. appeared first on Statistical Modeling, Causal Inference, and Social Science.

Following up on a conversation regarding publicizing scientific research, Jim Savage wrote:

Here's a report that we produced a few years ago on prioritising potential policy levers to address the structural budget deficit in Australia. In the report we hid all the statistical analysis, aiming at an audience that would feel comfortable reading a broadsheet newspaper. In terms of impact, the report really hit the mark--front page of every national newspaper, and was the centre of political discourse for weeks. Longer-term, our big proposals were more or less adopted by both sides of politics. Some strategies that we used that I think paid off (I can't claim credit for these--my old boss John was a master at the dark arts): - _A surprise to no insiders._ We spent about a year on the report, talking to policymakers and those who'd be hostile to our ideas (lobby groups, mainly) throughout. By the time it was released, the insiders knew what to say about it, and we had good arguments against the detractors. - _Prioritising in terms of political cost_ (as well as potential budget gains, economic costs) was well received. - The _"supporting analysis" deck_ was a hit with political staffers and journalists. We provided Excel files containing all the plots to any media outfit that asked. Anything that makes journalists' jobs easier, sadly, will get more media time. - _Briefing, briefing, briefing._ In the two weeks before release, we took a 2-page summary (only charts) around to any journalist/politician who'd listen. That gave them time to write their pieces well in advance. Apparently this is PR 101, but it was completely new to me. And I think the approach gave the paper a great run among those we wanted to influence.These are interesting ideas that we can all think about when we have some policy-relevant results to convey from our research. This seemed worth blogging, on the theory that our blog readers are, on average, doing good things and so we should spread these useful public relations tips. Positive-sum advice, I hope. 
The post Tips when conveying your research to policymakers and the news media appeared first on Statistical Modeling, Causal Inference, and Social Science.

Gronau and Wagenmakers write:

The bridgesampling package facilitates the computation of the marginal likelihood for a wide range of different statistical models. For models implemented in Stan (such that the constants are retained), executing the code bridge_sampler(stanfit) automatically produces an estimate of the marginal likelihood.

Full story is at the link. The post Computing marginal likelihoods in Stan, from Quentin Gronau and E. J. Wagenmakers appeared first on Statistical Modeling, Causal Inference, and Social Science.
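Bridge sampling itself is more sophisticated than this, but as a sketch of the quantity being estimated, here's a conjugate normal model (made-up data, known data standard deviation) where the marginal likelihood has a closed form, checked against a naive Monte Carlo average of the likelihood over prior draws:

```python
import numpy as np

rng = np.random.default_rng(0)
y = np.array([0.3, -0.2, 0.5, 0.1, -0.4])   # made-up data
sigma, tau = 1.0, 1.0                        # known data SD, prior SD on theta
n = len(y)

# Model: y_i ~ N(theta, sigma^2), theta ~ N(0, tau^2).
# Marginalizing theta: y ~ N(0, sigma^2 I + tau^2 11'), so the
# marginal likelihood is an ordinary multivariate normal density.
Sigma = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
sign, logdet = np.linalg.slogdet(Sigma)
log_ml_exact = -0.5 * (n * np.log(2 * np.pi) + logdet
                       + y @ np.linalg.solve(Sigma, y))

# Naive Monte Carlo estimate: average the likelihood over prior draws.
theta = rng.normal(0, tau, size=200_000)
loglik = (-0.5 * n * np.log(2 * np.pi * sigma**2)
          - 0.5 * ((y[None, :] - theta[:, None])**2).sum(axis=1) / sigma**2)
log_ml_mc = np.log(np.mean(np.exp(loglik)))
```

The naive prior-averaging estimator works here only because the model is tiny; in higher dimensions its variance blows up, which is the problem bridge sampling is designed to address.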

I'm speaking for the statistics undergraduates tomorrow (Fri 17 Nov), 10am in room 312 Mathematics Bldg. I'm not quite sure what I'll talk about: maybe I'll give my talk on statistics and sports again, maybe I'll speak on the statistical crisis in science. Anyone can come; in particular, we'd like to attract undergraduates--not just statistics majors--to learn more about our field.
The post My talk tomorrow (Fri) 10am at Columbia appeared first on Statistical Modeling, Causal Inference, and Social Science.

I came across this news article by Brian Resnick entitled:

The oldest human lived to 122. Why no person will likely break her record. Even with better medicine, living past 120 years will be extremely unlikely.

I was skeptical, and I really didn't buy it after reading the research article, "Evidence for a limit to human lifespan," by Xiao Dong, Brandon Milholland and Jan Vijg, that appeared in Nature. As I wrote in an email to Resnick: "No no no no no on 'The oldest human lived to 122. Why no person will likely break her record.'" So much of it seems ridiculous to me. The news article says, "In all, they determined the probability that someone will reach age 125 in any given year 'is less than 1 in 10,000.' Or put another way: A 125-year-old human is a once-in-10,000-year occurrence." But the headline refers to someone living to 122 or 123, not to 125. And that's already happened once, right?

tl;dr: _If someone has a mathematical model claiming that something that actually did happen is extremely unlikely to happen, this to me is evidence that the model is flawed._

I can see how Nature--which is a bit of a "tabloid"--would publish such a thing, but I was unhappy to see a neutral journalist falling for this. I recommend a bit of skepticism. The news article concludes: "Calment, meanwhile, should rest easy in her grave that her record will be around for a long, long time." I wouldn't be so sure. I clicked through, and the paper has various weird things. For example, they report that the maximum reported age of death has been decreasing in recent years, but if you look carefully these estimates have huge uncertainties (that's what it means when they say P=0.27 and P=0.70). Their curves look pretty but are basically overfitting; that is, they're correct when they write that one "could explain these results simply as fluctuations." They write, "we modelled the MRAD as a Poisson distribution; we found that the probability of an MRAD exceeding 125 in any given year is less than 1 in 10,000."
But there's no reason that this model should make sense at all. To summarize: There's nothing wrong with them rooting around in the data and looking for patterns; we can learn a lot that way. But it's a mistake to present such speculations as anything more than speculation. I don't think statements such as "In fact, the human race is not very likely to break that record, ever" are doing anyone any favors. To put it another way: If you saw such extreme claims from a political advocacy group, you'd be skeptical, right? I recommend the same skepticism when you see something in a scientific publication. Please please please don't think that, just cos something's published in Nature, that's a guarantee that it's sound science. You really have to look carefully at the paper. And this one isn't so hard to look at; they're not doing anything really technical here. I also sent this to science journalist Ed Yong, who was quoted in the news article. Yong replied:

So Vijg was clear to me in our interview that the change after the mid-90s shouldn't be seen as a decrease since it's non-significant. He billed it as a plateau; it's more that the significant increase before that point no longer continues. I did ask him about things like outliers and the choice of 1995 as a breakpoint. He said that the results are the same even if you take out Calment as the most obvious outlier, and whichever year you pick as the breakpoint. From him:

There simply is no significant increase from the early 1990s onwards. I am sure that some people will argue that the upward trend may continue soon enough. While we agree that the data are noisy, which is to be expected, the statistics are clear. Fortunately, all databases are public so everyone who wishes can do the math and disagree with us.

To which I responded: Let me put it this way, then: My problem is in going from "A linear regression with a small number of data points has a trend coefficient which, when fit to the past twenty years, is not statistically significantly different from zero" to "they determined the probability that someone will reach age 125 in any given year 'is less than 1 in 10,000'" and "In fact, the human race is not very likely to break that record, ever." Also, I think it's a bit strange for them to say both "the data are noisy" and "the statistics are clear." Their Poisson distribution seems to come out of nowhere. There was also this quote from Vijg: "When Calment died at 122, everyone said it’ll only be a matter of time before we have someone who’s 125 or 130." That also seems a bit misleading in that there's a big difference between 122 and 125, and a really big difference between 125 and 130! Each year becomes harder to achieve (at least, until there's some medical breakthrough). From a news perspective, this is not serious science, it's just a fun feature story. I think Vijg is misunderstanding the difference between interpolation and extrapolation, but, hey, that's how he got published in Nature!

The post No no no no no on "The oldest human lived to 122. Why no person will likely break her record." appeared first on Statistical Modeling, Causal Inference, and Social Science.
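The "non-significant trend over twenty noisy years" problem is easy to simulate. A sketch with made-up numbers (these are not the Nature data): even when the true trend in yearly maximum reported age at death is genuinely upward, twenty noisy observations usually fail to reach significance, so a non-significant slope is weak evidence of a plateau:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.arange(20)                 # 20 "years" of data
true_slope, noise_sd = 0.05, 2.0  # illustrative values only
xc = x - x.mean()
sxx = (xc**2).sum()

n_sig = 0
for _ in range(5_000):
    y = 115 + true_slope * x + rng.normal(0, noise_sd, size=20)
    slope = (xc * y).sum() / sxx              # OLS slope
    resid = y - y.mean() - slope * xc
    se = np.sqrt(resid @ resid / 18 / sxx)    # slope standard error
    n_sig += abs(slope / se) > 2.101          # two-sided t test, df = 18

frac_sig = n_sig / 5_000   # roughly 0.1: the real trend is rarely "significant"
```

In other words, "no significant increase" from a regression this underpowered is entirely compatible with a continuing increase, which is the gap between the fitted model and the 1-in-10,000 extrapolation.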

A bunch of items came in today, all related to the replication crisis:
- Valentin Amrhein points us to this fifty-authored paper, "Manipulating the alpha level cannot cure significance testing – comments on Redefine statistical significance," by Trafimow, Amrhein, et al., who make some points similar to those made by Blake McShane et al. here.
- Torbjørn Skardhamar points us to this paper, "The power of bias in economics research," by Ioannidis, Stanley, and Doucouliagos, which is all about type M errors, but for a different audience (economics instead of psychology and statistics), so that's a good thing.
- Jonathan Falk points us to this paper, "Consistency without Inference: Instrumental Variables in Practical Application," by Alwyn Young, which argues, convincingly, that instrumental variables estimates are typically too noisy to be useful. Here's the link to the replication crisis: If IV estimates are so noisy, how is it that people thought they were ok for so long? Because researchers had so many unrecognized degrees of freedom that they were able to routinely obtain statistical significance from IV estimates--and, traditionally, once you have statistical significance, you just assume, retrospectively, that your design had sufficient precision.
It's good to see such a flood of articles of this sort. When it's one or two at a time, the defenders of the status quo can try to ignore, dodge, or parry the criticism. But when it's coming in from all directions, this perhaps will lead us to a new, healthy consensus.
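The type M (magnitude) error idea running through these papers can be sketched in a few lines, in the spirit of the retrodesign calculation in my paper with Carlin. The effect size and standard error below are made up to represent a low-power design of the kind Young describes for IV:

```python
import numpy as np

rng = np.random.default_rng(2)
true_effect, se = 0.1, 0.3   # hypothetical low-power design: true effect << SE

# Sampling distribution of the estimate across many replications
est = rng.normal(true_effect, se, size=200_000)
sig = est[np.abs(est / se) > 1.96]   # the estimates that reach p < 0.05

power = len(sig) / len(est)                        # ~0.06: barely above alpha
exaggeration = np.mean(np.abs(sig)) / true_effect  # type M: ~7x too large
sign_error = np.mean(sig < 0)                      # type S: wrong sign sometimes
```

Under these assumptions, the estimates that clear the significance threshold overstate the true effect several-fold, and a non-trivial fraction have the wrong sign, which is exactly why "significant" noisy estimates kept passing for precision.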
The post 3 more articles (by others) on statistical aspects of the replication crisis appeared first on Statistical Modeling, Causal Inference, and Social Science.

From Private Eye 1399, in Pseuds Corner:

What is a sandpit? Sandpits are residential interactive workshops over five days involving 20-30 participants; the director, a team of expert mentors, and a number of independent stakeholders. Sandpits have a highly multidisciplinary mix of participants, some active researchers and others potential users of research outcomes, to drive lateral thinking and radical approaches to address research challenges. [continues for three pages]

Here's the webpage, from the Engineering and Physical Sciences Research Council (U.K.). That's right, social scientists aren't the only ones who have to put up with this sort of b.s. And get this:

Due to group dynamics and continual evaluation it is not possible to 'dip in and out' of the process. Participants must stay for the whole duration of the event.

I just hope they let the participants go into town for the occasional meal, and they don't stick them with cafeteria food for five straight days. Lateral thinking, indeed. The post "What is a sandpit?" appeared first on Statistical Modeling, Causal Inference, and Social Science.

Eric Tassone writes:

Have you seen this? "Suns Tracking High Fives to Measure Team Camaraderie." Key passage:

Although this might make basketball analytic experts scoff, there is actually some science behind the theory. Dacher Keltner, Professor of Psychology at UC Berkeley, in 2015 took one game of every NBA team at the start of the year and coded all of the fist bumps, embraces and high fives. “Controlling for how much money they’re making, the expectations that they would do well during that season, how well they were doing in that game,” Keltner said. “Not only did they win more games but there’s really nice basketball statistics of how selfless the play is.” Keltner found that the teams that made more contact with each other were helping out more on defense, setting more screens, and overall playing more efficiently and cooperatively.

The Suns tracking of high fives and the like has been in the news this week. I tried to find recent publications involving Keltner on this topic, and maybe I missed them, as so far I've only found this from 2010 and some press coverage from roughly 2010 and 2011. Google Scholar also doesn't seem to know of any recent NBA-related publications by Keltner on the topic. But there is a two-minute YouTube video (the "Do High Fives Help Sports Teams Win?" from the subject line of this email) from Sept. 2015, on what appears to be a YouTube channel affiliated with the University of California, Berkeley. (YouTube suggested this--"NBA's Top 10 Missed High Fives"--as the next video for me to watch! (link) Boom!)

I sent this along to Josh Miller, who wrote:

Sure is interesting. I'd bet the perfunctory low-fives after a missed free throw don't predict much. The thing I'd worry about here is reverse causality; do they address that?

Tassone:

I didn't read the paper (yet), but my take on the short YouTube video is that it doesn't do much in this regard other than mostly elide the causal issue, though it gives some hints that it's 'high-fives cause winning/good play,' as opposed to 'winning/good play causing high-fives,' or some other alternative. But maybe I should watch it again. (It's not unlike the admittedly complex case in baseball, where you might have an argument that winning causes high payrolls, as opposed to the conventional wisdom that it's the other way around.)

Miller:

Just took a quick look: It says he studied only one game at "the start of the year" as a predictor for an *entire* season, in which case reverse causality shouldn't be too much of an issue. It also says he controlled for salary, team expectations (not sure how, betting odds?), and how well they were doing in that particular game. Now the crucial issue is that it was only 2015 data and there are only 30 teams. Somehow I am skeptical. I bet he has some degrees of freedom in his controls. Even so, let's say it is robust and true every year, given his controls. If this measure of camaraderie has additional predictive power beyond his controls, is it camaraderie per se, or is it excitement about insider information that betting markets don't yet have? How good are his controls for expectations and performance that game? I just googled, and it looks like the paper is from 2010, not 2015. And Eric did provide the link to a version of the paper. We can probably assume the published version isn't much different. Now if it is from 2010, I think we can make all sorts of assumptions about the statistical methods without even looking. I browsed quickly. Andrew--you would have a field day with this paper! Higher paid players touch each other more, because, you know, status! But seriously, it's not in Science, Nature, or PPNAS, so it's a bit too easy to tear this apart on your blog, no? The only way I see it working is if you do a self-aware post with a picture of some fish swimming around in a barrel. I spent just 2-5 min reading how they coded the data and analyzed it. They control for some obvious confounds, but one at a time, and not all at once, and then pile on one significant p-value after another, in an accumulating-evidence sort of way, even though each test has an obvious confound. At the end they do perform an all-at-once regression, but there are bigger issues, aside from the fact that the estimates are not precise--remember, we have just 30 teams; chasing noise, anyone?

The big problem (among many) is measurement error: e.g., they control for expected team performance during the season using a binary variable: 1 = some executive thinks they will make the playoffs or something, and -1 = they don't. No mention of using betting odds, no mention that some of these games are already 2 months into the season and camaraderie could reflect the team chemistry and performance up to that point (they control for team performance only in a single game), so yes, reverse causality is an issue. There is no way to replicate their analysis without their data because they don't say which games they coded and they don't give precise details about their coding procedure.

Eric:

I also noticed the researcher in question, Dacher Keltner, apparently had something to do with the Oscar-winning film "Inside Out" (!!!).

The post High five: "Now if it is from 2010, I think we can make all sorts of assumptions about the statistical methods without even looking." appeared first on Statistical Modeling, Causal Inference, and Social Science.
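Miller's "degrees of freedom in his controls" worry is easy to simulate. In the sketch below, everything is pure noise (no true touching-winning effect, simulated data, hypothetical control variables), yet trying several control specifications one at a time inflates the chance that *some* specification clears p < 0.05 with only 30 teams:

```python
import numpy as np

rng = np.random.default_rng(3)

def t_last(y, X):
    """t-statistic for the last column of X in an OLS fit with intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    cov = s2 * np.linalg.inv(X.T @ X)
    return beta[-1] / np.sqrt(cov[-1, -1])

n_teams, n_sims, crit = 30, 2_000, 2.05   # crit ~ two-sided 5% for df ~ 27
base_sig = any_sig = 0
for _ in range(n_sims):
    wins = rng.normal(size=n_teams)             # all noise: no true effect
    touch = rng.normal(size=n_teams)
    controls = rng.normal(size=(n_teams, 5))    # "salary", "expectations", ...
    ts = [abs(t_last(wins, touch[:, None]))]    # spec with no controls
    for j in range(5):                          # add one control at a time
        ts.append(abs(t_last(wins, np.column_stack([controls[:, j], touch]))))
    base_sig += ts[0] > crit
    any_sig += max(ts) > crit

base_rate = base_sig / n_sims   # ~0.05 by construction
any_rate = any_sig / n_sims     # larger: the garden of forking paths
```

One fixed specification has the nominal 5% false-positive rate; the freedom to report whichever one-control-at-a-time regression "works" pushes it well above that, all from noise.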

Dahyeon Jeong wrote:

While I was reading your post today, "Some people are so easy to contact and some people aren’t," I've come across your older posts, including "Edlin’s rule for routinely scaling down published estimates." In this post you write:

Also, yeah, that Iron Law thing sounds horribly misleading. I’d not heard that particular term before, but I was aware of the misconception. I’ll wait on posting more about this now, as a colleague and I are already in the middle of writing a paper on the topic.

I was especially curious about this, so I've searched your blog and CV, but I didn’t find a relevant follow-up post/article on this topic. If there’s indeed no post on this, I would really look forward to reading it at some point in future.

Jeong's email was in 2016, and my quote above is from 2014. In the meantime, Eric Loken and I finally wrote that paper: it came out early this year. Here's our article, and here and here are some relevant blog posts. So we do make progress. Slowly. The post I hate that "Iron Law" thing appeared first on Statistical Modeling, Causal Inference, and Social Science.

Ryan Bain writes:

I came across your 'Fitting Multilevel Models When Predictors and Group Effects Correlate' paper that you co-authored with Dr. Bafumi and read it with great interest. I am a current postgraduate student at the University of Glasgow writing a dissertation examining explanations of Euroscepticism at the individual and country level since the onset of the economic crisis. I employ multilevel modeling with two levels: individuals within states. As I am examining predictors of Euroscepticism at the country level, I employ random effects as individuals are clustered within countries. My supervisor pointed me in the direction of your paper as a means for controlling for omitted variable bias by ensuring that my country-level predictors are not correlated with my random effect parameter. I recently discovered an article by Jonathan Kelley, M. D. R. Evans, Jennifer Lowman and Valerie Lykes: 'Group-mean-centering independent variables in multi-level models is dangerous'. After working through a series of examples, the paper suggests that the practice be abandoned. The authors demonstrate, after group mean centering individual-level independent variables, that group mean centering country-level variables in regression models results in incorrect estimations of the coefficients for country-level (and individual-level) predictors being produced. The authors summarise their doubts about the method on pg.15 in the '5 Summary' section. However, all of their criticisms about the use of the method and the adverse consequences that group mean centering has on estimates of country-level predictors are based on models that also have the individual-level predictors group mean centered. The authors of the article only briefly reference the purpose of group mean centering as a means of controlling for omitted variable bias at the contextual level, on pg.3 stating: "Raudenbush and Bryk (2002) also posit that group-mean centering can reduce bias in random component variance estimates". 
That passing reference is all that the authors make in regards to the use of group mean centering for this purpose. They also cite other authors who criticise the method but, again, all of their issues with the method relate to models in which individual level predictors are centered. In 'Centering Predictor Variables in Cross-Sectional Multilevel Models: A New Look at an Old Issue' by Craig K. Enders and Davood Tofighi (2007), for example, the authors state on pg.121 that: "the centering of Level 2 (e.g., organizational level) variables is far less complex than the centering decisions required at Level 1, as it is only necessary to choose between the raw metric and CGM [centered at the grand mean]; CWC [centering within cluster, which is what the authors call group mean centering] is not an option because each member of a given cluster shares the same value on the Level 2 predictor. Centering decisions at Level 2 generally mimic prescribed practice from the OLS regression literature (Aiken & West, 1991), so the focus of this article is on centering at Level 1. Throughout the remainder of the article, we assume that all Level 2 predictors are centered at their grand mean." Could you please provide any guidance on this matter? The Kelley et al. (2016) article has made me doubt the use of group mean centering for controlling for omitted variable bias, yet I am not sure if that was its intention for models in which only the country-level predictors were group mean centered.

My reply: Yes, rather than thinking about centering the group means, I prefer to think about it as adding new predictors at the group level. In sociology they sometimes talk about individual and contextual effects, but more generally we can just speak predictively and say that the individual predictor and its group-level average can both be predictive of the outcome. Bain adds:

What I believe has happened with this paper is that the authors assert that the group mean centered individual level coefficients are inappropriate because the within effect introduces additional level 2 error. But the authors do not mean the within effect (they stay clear of this terminology but it is what their argument is referring to). They are actually discussing the difference between the within and between effect. Throughout their article the authors examined the mean of the correlated random effects (cre) model in their analysis, which represents the between-within difference. Essentially, because the authors have examined the effects of the mean of the cre model, they've compared and contrasted the coefficient of the mean of the individual-level variable of interest in the cre model with the original coefficient in a random effects model. With their focus on the mean - the difference between the within and between effect - they believed that this was the coefficient which represented the within effect, hence why they've (incorrectly) argued that the within effect is confounded with the level 2 error (because the mean is what they focused on, which obviously is confounded with the level 2 error in the cre model).

The post Fitting multilevel models when predictors and group effects correlate appeared first on Statistical Modeling, Causal Inference, and Social Science.
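To make the "add new predictors at the group level" idea concrete, here is a toy simulation (all numbers invented; plain least squares stands in for a full multilevel fit): regressing y on the raw individual-level predictor x plus its group mean recovers the within-group slope as the coefficient on x, and the between-minus-within difference as the coefficient on the group mean.

```python
import numpy as np

rng = np.random.default_rng(0)
J, n = 50, 40                   # 50 groups ("countries"), 40 individuals each
b_within, b_between = 1.0, 3.0  # true within- and between-group slopes

g = np.repeat(np.arange(J), n)
mu = rng.normal(0, 1, J)                     # group-level predictor means
x = mu[g] + rng.normal(0, 1, J * n)          # individual-level predictor
y = b_within * (x - mu[g]) + b_between * mu[g] + rng.normal(0, 1, J * n)

# Group means of x, computed from the data themselves
xbar = np.array([x[g == j].mean() for j in range(J)])[g]

# Regress y on x and its group mean: the coefficient on x estimates the
# within effect (about 1.0); the coefficient on xbar estimates the
# between-minus-within difference (about 2.0)
X = np.column_stack([np.ones_like(x), x, xbar])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef[1], coef[2])
```

Group-mean centering x instead (using x − xbar and xbar as the two predictors) is an equivalent reparameterization of the same regression, with the coefficient on xbar then estimating the between effect (3.0) directly.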

Someone writes:

I'm currently a PhD student in the social sciences department of a university. I recently got involved with a group of professors working on a project which involved some costly data-collection. None of them have any real statistical prowess, so they came to me to perform their analyses, which I was happy to do. The problem? They want me to p-hack it, and they don't even know it. The project reads like one of your blog posts. The professors want to send this to a high-impact journal (they said Science, Nature, and The Lancet were their first three). There is no research question, and very little underlying theory. They essentially dumped the data on me and told me to email them when "you find something significant." The worst part is, there is no malicious intent here and I don't think they even know that they're just fishing for p <.05. These are genuinely good, smart people who just want to do a cool study and get some recognition. I don't know if you have any advice on handling this sort of situation.

My recommendation is to do the best analysis you can, given your time constraints. If there are many potential things to look at, you might want to fit a multilevel model. In any case, write up what you did, make graphs of data and fitted model, give the manuscript to the professors and let them decide where to submit it. You'll have a lot more control over the project if you write up your findings as a real paper, with a title, abstract, paragraphs, data and methods section, results, conclusions, and graphs. Don't just send them a bunch of printouts as if you're some kind of cog in the machine. Write something up. My guess is that your colleagues/supervisors will appreciate this: Writing up results is a lot of work, and a student who can write is valuable. Here are some tips on writing research articles. It's fine if these profs want to change your paper, or rewrite it, or incorporate what you wrote into their own paper (as long as they give you appropriate coauthorship).
If in all this manipulation they want to submit something you don't like, for example if they start pulling out p-values and telling bogus stories, then tell them you're not happy with this! Explain your problems forthrightly. Ultimately it might come to a breakup, but give these colleagues of yours a chance to do things right, and give yourself a chance to make a contribution. And if it doesn't work out, walk away: at least you got some practice with data analysis and writing.

The post What should this student do? His bosses want him to p-hack and they don't even know it! appeared first on Statistical Modeling, Causal Inference, and Social Science.
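For anyone who doubts how easy it is to "find something significant" in a theory-free fishing expedition, here is a small simulation (all numbers invented): with 20 pure-noise outcomes per study, most studies will hand you at least one p < .05.

```python
import math

import numpy as np

rng = np.random.default_rng(42)

def p_value(x):
    # Two-sided z-test that the mean of x is zero (known variance = 1)
    z = x.mean() * math.sqrt(len(x))
    return math.erfc(abs(z) / math.sqrt(2))

n_sims, n_outcomes, n = 2000, 20, 50
hits = 0
for _ in range(n_sims):
    # All 20 outcomes are pure noise: the null is true everywhere
    ps = [p_value(rng.standard_normal(n)) for _ in range(n_outcomes)]
    hits += min(ps) < 0.05

print(hits / n_sims)  # roughly 1 - 0.95**20, i.e. about 0.64
```

With 20 independent tests of true nulls, the chance of at least one "discovery" is 1 − 0.95^20 ≈ 64%, which is exactly why a multilevel model (or at least a full, honest write-up) beats cherry-picking the smallest p-value.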

We're in the heart of the academic season and there's a lot going on.
* JAMES RAMSEY reported a critical performance regression bug in Stan 2.17 (this affects the latest CmdStan and PyStan, not the latest RStan). SEAN TALTS and DANIEL LEE diagnosed the underlying problem as being with the change from char* to std::string arguments--you can't pass char* and rely on the implicit std::string constructor without the penalty of memory allocation and copying. The reversion goes back to how things were before with const char* arguments. BEN GOODRICH is working with SEAN TALTS to cherry-pick into Stan the fix for the performance regression that made the 2.17 release very slow for the other interfaces. RStan 2.17 should be out soon, and it will be the last pre-C++11 release. We've already opened the C++11 floodgates on our development branches (yoo-hoo!).
* QUENTIN F. GRONAU, HENRIK SINGMANN, E. J. WAGENMAKERS released the bridgesampling package in R. Check out the arXiv paper. It runs with output from Stan and JAGS.
* ANDREW GELMAN and BOB CARPENTER's proposal was approved by Coursera for a four-course introductory concentration on Bayesian statistics with Stan: 1. Bayesian Data Analysis (Andrew), 2. Markov Chain Monte Carlo (Bob), 3. Stan (Bob), 4. Multilevel Regression (Andrew). The plan is to finish the first two by late spring and the second two by the end of the summer in time for Fall 2018. ADVAIT RAJAGOPAL, an economics Ph.D. student at the New School, is going to be leading the exercise writing, managing the Coursera platform, and will also TA the first few iterations. We've left open the option for us or others to add a prequel and sequel, 0. Probability Theory, and 5. Advanced Modeling in Stan.
* DAN SIMPSON is in town and dropped a casual hint that order statistics would clean up the discretization and binning issues that SEAN TALTS and crew were having with the simulation-based algorithm testing framework (aka the Cook-Gelman-Rubin diagnostics). Lo and behold, it works. MICHAEL BETANCOURT worked through all the math on our (chalk!) board and I think they are now ready to proceed with the paper and recommendations for coding in Stan. As I've commented before, one of my favorite parts of working on Stan is watching the progress on this kind of thing from the next desk.
* MICHAEL BETANCOURT tweeted about using ANDREI KASCHA's javascript-based vector field visualization tool for visualizing Hamiltonian trajectories and, with multiple trajectories, the Hamiltonian flow. RICHARD MCELREATH provides a link to visualizations of the fields for light, normal, and heavy-tailed distributions. The Cauchy's particularly hypnotic, especially with many fewer particles and velocity highlighting.
* KRZYSZTOF SAKREJDA finished the fixes for standalone function generation in C++. This lets you generate a double- and int-only version of a Stan function for inclusion in R (or elsewhere). This will go into RStan 2.18.
* SEBASTIAN WEBER reports that the _Annals of Applied Statistics_ paper, Bayesian aggregation of average data: An application in drug development, was finally formally accepted after two years in process. I think Michael Betancourt, Aki Vehtari, Daniel Lee, and Andrew Gelman are co-authors.
* AKI VEHTARI posted a case study for review on extreme-value analysis and user-defined functions in Stan [forum link -- please comment there].
* AKI VEHTARI, ANDREW GELMAN and JONAH GABRY have made a major revision of the Pareto smoothed importance sampling paper, with an improved algorithm, new Monte Carlo error and convergence rate results, and new experiments with varying sample sizes and different functions. The next loo package release will use the new version.
* BOB CARPENTER (it's weird writing about myself in the third person) posted a case study for review on Lotka-Volterra predator-prey population dynamics [forum link -- please comment there].
* SEBASTIAN and SEAN TALTS led us through the MPI design decisions about whether to go with our own MPI map-reduce abstraction or just build the parallel map function we're going to implement in the Stan language. Pending further review from someone with more MPI experience, the plan's to implement the function directly, then worry about generalizing when we have more than one function to implement.
* MATT HOFFMAN (inventor of the original NUTS algorithm and co-founder of Stan) dropped in on the Stan meeting this week and let us know he's got an upcoming paper generalizing Hamiltonian Monte Carlo sampling and that his team at Google's working on probabilistic modeling for Tensorflow.
* MITZI MORRIS, BEN GOODRICH, SEAN TALTS and I sat down and hammered out the services spec for running the generated quantities block of a Stan program over the draws from a previous sample. This will decouple the model fitting process and the posterior predictive inference process (because the generated quantities block generates a ỹ according to p(ỹ | θ), where ỹ is a vector of predictive quantities and θ is the vector of model parameters). Mitzi then finished the coding and testing and it should be merged soon. She and BEN BALES are working on getting it into CmdStan and BEN GOODRICH doesn't think it'll be hard to add to RStan.
* MITZI MORRIS extended the spatial case study with leave-one-out cross-validation and WAIC comparisons of the simple Poisson model, a heterogeneous random effects model, a spatial random effects model, and a combined heterogeneous and spatial model with two different prior configurations. I'm not sure if she posted the updated version yet (no, because Aki is also in town and suggested checking Pareto khats, which said no).
* SEAN TALTS split out some of the longer tests for less frequent application to get distribution testing time down to 1.5 hours to improve flow of pull requests.
* SEAN TALTS is taking another one for the team by leading the charge to auto-format the C++ code base and then proceed with pre-commit autoformat hooks. I think we're almost there after a spirited discussion of readability and our ability to assess it.
* SEAN TALTS also added precompiled headers to our unit and integration tests. This is a worthwhile speedup when running lots of tests and part of the order of magnitude speedup Sean's eked out.
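The order-statistics idea behind the simulation-based testing framework (the Cook-Gelman-Rubin diagnostics above) boils down to a rank check: draw a parameter from the prior, simulate data, draw from the posterior, and record the rank of the true parameter among the posterior draws; if the sampler is correct, the ranks are uniform. A minimal sketch with a conjugate normal toy model (all settings invented; exact posterior draws stand in for MCMC):

```python
import numpy as np

rng = np.random.default_rng(1)

# Conjugate toy model: theta ~ N(0,1), y | theta ~ N(theta,1),
# so the posterior is N(y/2, 1/2) and we can draw exact posterior samples.
M, L = 1000, 99          # M simulations, L posterior draws each
ranks = []
for _ in range(M):
    theta = rng.normal(0, 1)
    y = rng.normal(theta, 1)
    draws = rng.normal(y / 2, np.sqrt(0.5), L)
    ranks.append(int((draws < theta).sum()))  # rank of theta among the draws

# If the posterior draws are correct, ranks are uniform on {0, ..., L}
counts, _ = np.histogram(ranks, bins=10, range=(0, L + 1))
print(counts)  # each bin should hold about M / 10 = 100 simulations
```

A broken sampler (say, one with the wrong posterior variance) would pile ranks up in the middle or at the edges of the histogram, which is exactly what the diagnostic looks for.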
ps. some edits made by Aki
The post Stan Roundup, 10 November 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.
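As a concrete footnote to the generated-quantities item in the roundup above: the decoupled pass amounts to looping over saved posterior draws and simulating ỹ ~ p(ỹ | θ) for each draw. A minimal sketch with invented numbers (synthetic normal draws standing in for a fitted model's output):

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for saved posterior draws of theta from an earlier fit
# (here: posterior N(1.0, 0.2) for the mean of a normal outcome)
theta_draws = rng.normal(1.0, 0.2, 4000)

# "Generated quantities" pass, decoupled from fitting: for each saved
# draw of theta, simulate y_tilde ~ p(y_tilde | theta)
y_tilde = rng.normal(theta_draws, 1.0)

# Posterior predictive summaries
print(y_tilde.mean(), y_tilde.std())  # about 1.0 and sqrt(1 + 0.2**2)
```

In Stan itself this is what running only the generated quantities block over stored draws will do; the sketch just shows the shape of the computation, with posterior uncertainty in θ propagating into the predictive draws.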

Kyle MacDonald writes:

I wondered if you'd heard of Purvesh Khatri's work in computational immunology, profiled in this Q&A with Esther Landhuis at Quanta yesterday. Elevator pitch is that he believes noisy, heterogeneous data scoured from diverse sources make his metanalyses stronger. The thing that gave me the woollies was this line:

“We start with dirty data,” he says. “If a signal sticks around despite the heterogeneity of the samples, you can bet you’ve actually found something.”

On the one hand, that seems like an almost verbatim restatement of your "what doesn't kill my statistical significance makes it stronger" fallacy. On the other hand, he seems to use his methods purely to look for things to test empirically, rather than to draw conclusions based on the analysis, which is good, and might mean that the fallacy doesn't apply.

I also like his desire to look for connections that isolated groups might miss:

I realized that heart transplant surgeons, kidney transplant surgeons and lung transplant surgeons don't really talk to each other!

I'd be interested in hearing your thoughts: worth the noise if he's finding connections that no one would have thought to test?

My response: I haven't read Khatri's research articles and I know next to nothing about this field of research so I can't really say. Based on the above-quoted news article, the work looks great. Regarding your question: On one hand, yes, it seems mistaken to have more confidence in one's findings because the data were noisier. On the other hand, it's not clear that by "dirty data," he means "noisy data." It seems that he just means "diverse data" from different settings. And there I agree that it should be better to include and model the variation (multilevel modeling!) than to study some narrow scenario. It also looks like good news that he uses training and holdout sets. That's something we can't always do in social science but should be possible in genetics where data are so plentiful.

The post Noisy, heterogeneous data scoured from diverse sources make his metanalyses stronger. appeared first on Statistical Modeling, Causal Inference, and Social Science.
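To see why the training/holdout habit matters so much in this kind of high-dimensional screening, here is a toy simulation (not Khatri's actual pipeline; all numbers invented): pick the strongest of 1000 pure-noise "genes" on a training half, then watch the effect shrink on the holdout half.

```python
import numpy as np

rng = np.random.default_rng(3)

# 1000 pure-noise "genes" measured on 200 samples with a binary label:
# there is no real signal anywhere
n, p = 200, 1000
X = rng.standard_normal((n, p))
label = rng.integers(0, 2, n).astype(bool)

train, test = np.arange(0, 100), np.arange(100, 200)

# Pick the gene with the biggest group difference on the training half...
diff_train = X[train][label[train]].mean(0) - X[train][~label[train]].mean(0)
best = np.abs(diff_train).argmax()
print(abs(diff_train[best]))  # looks impressive on the data that chose it

# ...then check it on the holdout half: the "signal" evaporates
diff_test = X[test][label[test]].mean(0) - X[test][~label[test]].mean(0)
print(abs(diff_test[best]))   # back near zero
```

The selected difference is large on the training half purely because it was the maximum of 1000 noise draws; the holdout half gives an honest estimate, which is why a signal that survives a true holdout check is worth taking seriously.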

So. I got this email one day, promoting a book that came with the following blurb:

Whither Science?, by Danko Antolovic, is a series of essays that explore some of the questions facing modern science. A short read at only 41 pages, Whither Science? looks into the fundamental questions about the purposes, practices and future of science. As a global endeavor, which influences all of contemporary life, science is still a human creation with historical origins and intellectual foundations. And like all things human, it has its faults, which must be accounted for.

It sounded like this guy might be a crank, but they sent me a free copy so I took a look. I read the book, and I liked it. It's written in an unusual style, kinda like what you might expect from someone with a physics/chemistry background writing about social science and philosophy. But that's ok. Antolovic deserves to be recognized as the next Nassim Taleb--by which I mean a plain-speaking yet deep revealer of true structures, a philosophical autodidact with a unique combination of views. The book is worth reading. p.6, "Today, the practitioner of science is almost without exception an employee of a larger corporate entity (a university or a company) or of a national government. He is hemmed in by the tangible constraints of his terms of employment and funding, and by the less tangible ones of departmental, institutional and funding politics. He labors in a crowded field, in which there are increasingly fewer stones left unturned, and he climbs the ladder of corporate seniority until he retires." p.7, "After the Second World War, science went from being the province of the few to becoming the career path of many." "Since scientific development is fundamentally important to the well-being of modern societies, it is easy to see the benefits of exalting this decidedly un-adventurous walk of life with the help of a heroic foundation story.
In the eyes of the supporting public, and in those of prospective practitioners, present-day science is the heir and descendant of the heroic achievements that dispelled the darkness of superstition, changed our image of the universe, and wonder-worked what we today know as the industrial world. And so it is, but we should examine the heir on his own merits." Well put. I have nothing to add. "Market economy is usually held up as the paragon of a robust and efficient mechanism by which to produce and distribute wealth. For it to function, it must have a sufficiently large number of economic “players” (individuals and companies), and a pool of as of yet unowned resources – energy or raw materials – that are available for the taking. Players invest their labor, and their already owned wealth, to appropriate the resources; they work the raw resources into things that they and others consider valuable, and they trade with each other in the quest for greater wealth." "We must point out that the pool of unowned resources is an essential factor for the competitive market to exist: that is what the market players compete for, either directly, by extracting the resources themselves, or indirectly, by trading with others in the wealth derived from these resources." Compare to the hypothetical desert island whose inhabitants survive economically by taking in each others' laundry. Or various poor countries, or poor regions of countries, that just don't have enough unowned resources to go around. What economist Tyler Cowen calls Zero Marginal Product zones. Just as fishing technology has allowed humans to grab all the fish, and oil drilling and coal mining technology threaten to remove that pool of unowned fossil fuel resources, so does economic development threaten to kill the golden goose etc. I'll have to think about this one. If it's really true that economic exchange relies on that pool of unowned resources, then the market economy is self-defeating. 
Cultural contradictions of capitalism but in a different way. This is interesting because economists often recommend solving problems of unowned resources by giving them owners. Rhinos, fish, the disaster that was post-communist Russia. But to put it in Antolovic's terms, "If the resources are intentionally distributed among the players, again by political means, we have a form of planned economy." So in that way a mixed economy is not an economic abomination or even a regrettably unavoidable political necessity but a natural absorbing state. I wonder what Jeff Sachs would think of this. p.8, "The wealth of the participants is not tokenized by money, but by a less rigidly defined currency, which we will refer to as prestige. . . . Participants use their existing prestige to appropriate the funding resources, which they convert into further prestige via the process of performing scientific research. Direct trading in research results is proscribed as unethical, since the results must nominally be original and attributable to a researcher. However, scientific results are merely ancillary to the accumulation of prestige, and prestige is freely traded for labor and further prestige: this is the politics of who collaborates with whom, who is hired in which department or research laboratory etc. Typically, those with less prestige offer their labor to those with more, with the objective of increasing their own prestige and share of resources by association." Yup. That has the "anthropologist on Mars" ring of truth. It describes what goes on, what I and others do. It's important to be clear-eyed without being cynical. Prestige is the currency of science, but that does not mean that prestige is the reason we do things, nor does it mean that science is all about prestige. We have many goals in doing science, including discovery, serving societal goals, teamwork, and the joys of the scientific endeavor itself. 
Given that people will do crossword puzzles for diversion, it's not such a stretch to think that science can be fun too. One can draw an analogy to acting, where one could say the currency is fame or reputation; or professional team sports, where players are motivated both to win and to improve their personal statistics. To recognize certain goals should not be taken as to deny the existence of others. To get back to science and its coins of prestige, I take Antolovic's point to be, not that scientists are hypocrites to claim to seek discovery when they are nothing but careerists, or that scientists think themselves rational but are actually ruled by the same instincts, urges, and motivations that drive a society of bonobos, but rather that the accumulation and trading in of prestige is at this point a necessity for most scientists; it is baked into the scientific economy. Consider my own case. I know myself well enough to recognize that I have an innate desire for prestige and acclaim. As a child I enjoyed being praised, and for decades now I've been thrilled when people come up to me and say they loved my talk, or that they've learned so much from my books. OK, fine. But that's not why I do what I do. It's more of a pleasant byproduct. I don't choose what to work on based on what will give me more praise or happy feedback, except to the extent that I want my work to be useful to others--I _am_ a statistician, after all!--in which case the beneficiaries of my labors might well choose to thank me, which is fine. But--and here's where Antolovic's argument comes in--I do seek prestige, not so much for its own sake but because of what it can buy. Again, the prestige-as-money argument. I know some people for whom accumulation of money is a major goal in itself, but most of us want money for what it can buy, and for the security it can provide.
Similarly, I seek the prestige and publications which will allow me to attract top collaborators and do the best work I can, and to get the funding to hire the programmers that can allow Stan to realize its destiny, thus advancing science and technology in ways that I would like. Prestige is the coin. It is true that my collaborators and I accumulate prestige, which we convert into grant funding and then into research results. We play the game because we want to do science. Prestige is not, by and large, the goal in itself. Antolovic writes, "Infantile gratification of personal vanity cannot remain the primary motivation for doing science." But I think he's missing the point here. Prestige buys us money, and money amplifies our research efforts, so we go for prestige for sensible instrumental reasons. Maybe also infantile gratification, but that's not the primary motivation. Any more than the primary motivation of businessmen is the infantile desire to hold shiny coins and green pieces of paper. One striking feature of the current crisis in science is the panic of people such as that embodied-cognition guy who'd built up great stores of the stuff--thousands and thousands of citations!--only to see science moving away from the germanium standard, as it were. (I don't enjoy the dilution of my own prestige, of course--my list of journals I've published in is looking more and more like a collection of vinyl records--but there's nothing much I can do about it.) The economic analogy works well. The realization that one can easily print more money leads to inflation, then a need for more money, then hyperinflation. Just look at the C.V.'s of recent computer science Ph.D.'s: there's a pressure to publish dozens of conference papers a year. The field of statistics is more bimetallic, or multimetallic, with publications in various different sorts of journals.
And, perhaps unsurprisingly, economics itself has, relatively speaking, remained a bastion of hard money, with the top five or so journals keeping much of their gold-standard status. (Which leads to troubles of its own, as in the career of Bruno Frey, and the recent brouhaha involving alleged insider favors in the American Economic Review.) But I digress. What I want to say here is that I appreciate Antolovic's insightful application of economic ideas to scientific research, and I hope that readers can get the point without getting lost in cynicism. Moving Stan forward costs a lot of money. Programmers need to be paid, and that means that I end up spending a lot of my time asking people for money. To draw yet another analogy, the currency of baking is not flour or yeast but, by and large, money. A successful baker can raise the funds to buy higher-quality ingredients, to expand the bakery, to try out new recipes, and so forth, allowing more money to be raised, etc. Or he or she can run a small shop with no grander goals but will still need to make enough money to live on. But the goal of just about everyone involved (setting aside the pure hacks) is to make bread. The system must ultimately be evaluated based on the quality and quantity of bread produced (along with related concerns such as variety and sustainability). p.12, "It is our thesis that the past half century or so has proven the bazaar-like approach to science a failure. This period has filled libraries with scientific publications to the point of bursting, while offering disappointingly little toward what has always been the underlying premise of the techno-scientific endeavor: betterment of the human condition. The great killing diseases of our time, cardiovascular disease and cancer, have remained with us through this period, and no fundamental approach to curing them is in sight. 
As the average population age creeps up, degenerative diseases of body and mind are becoming an ever greater economic drain, yet progress in that area moves at glacial pace. Even new infectious pathogens, such as HIV and the Ebola virus, seem to be more than what contemporary science can readily counter, despite very considerable advances in molecular and cell biology." Rather than argue the details of this, I want to remark on how refreshing this perspective is, to criticize the "bazaar-like approach" to anything. In a famous internet document from 1997, The Cathedral and the Bazaar, Eric Raymond contrasted the top-down and bottom-up or self-organizing approaches to construction and argued strongly and persuasively in favor of the latter. The cathedral is central planning, bureaucracy, and projects that take centuries to complete, at which point the original goals have become irrelevant. The bazaar is evolution, it's competition, it's small groups working together when they need to, and going their own way when appropriate. In the context of scientific research, the cathedral is big research labs and PNAS; the bazaar is Arxiv and internet comment sections. Or is it the other way around? Big research looks like a cathedral only from a distance; close-up it's thousands of competing research groups. Meanwhile, Arxiv is run by a small group, and much of the discussion on the internet has been absorbed within the walls of Facebook. Anyway, I don't plan any cathedral/bazaar manifesto myself, I just wanted to register my interest in Antolovic's refusal to hold a reflexive pro-bazaar position. Instead, he recommends scientific management at the national level, aimed at particular goals, rather than the current loose system where goals are stated but then money is given to research teams with little outside direction or management. I don't know how well this will work, but the possibility seems worth looking into.
p.18, "Perhaps it is understandable that the supernatural has greater emotional traction in the human mind than the natural. The supernatural is the product of the mind itself, a story told to both stir and assuage the anxieties of a social animal: supernatural causes are always personal, they are somebody, good or evil. Empirical explanation, on the other hand, endeavors to discover causes that are unfamiliar, emotionally indifferent and invariably impersonal; there is, at the core of it, certain disappointing banality to every factual explanation." Well put. p.19, "Religion, specifically Roman Christianity, is of course the arch-villain of the foundational narrative of science, but from the perspective of the empiricist, the conflict, or at least the intellectual part of the conflict, is entirely avoidable: insofar that religion asserts that certain doctrines are factually true without presenting factual evidence, that assertion is intellectually worthless. Any theological speculations that do not make factual claims are open to consideration, discussion or disregard, as one may wish, but science has no inherent conflict with them." "Intellectually worthless" is a bit too strong: if I come up with an assertion without presenting factual evidence, I still may be making a contribution if my assertion is taken as a hypothesis or if it inspires others to useful thought. Just as one can argue, for example, that Jules Verne could've made a useful intellectual contribution to undersea exploration, even had he decided to insist on the factual existence of Captain Nemo. p.21, "Objections raised by romantic movements are substantive and conscientious, and they speak from the authority of their historical present. 
They do not represent a reflexive "opposition to progress," but rather they are a legitimate effort of the human mind to come to terms with the full implications of the changing image of the world, emotional as much as rational; we regard the romantic periods as an integral part of the story of empiricism." p.22, "A new scientific theory must account for those facts that were understood under the old one before it ventures to offer new explanations." Not quite! Sometimes science can make progress, working around well-known anomalies that resist clean explanation in any existing framework. Indeed, it could well be that certain aspects of the real world will never be explainable by human theories. 1/137, anyone? p.25, "Putting it in a straightforward way, secular ethics asks: How should I treat others? Should I “do unto others” as I would wish to be treated (or at least give them decent consideration), or should I do unto others whatever it takes to attain my own goals, goals which, in the absence of a credible supernatural authority, I am free to set however I please?" Well put. Here's my definition, from my first ethics column in Chance: "An ethics problem arises when you are considering an action that (a) benefits you or some cause you support, (b) hurts or reduces benefits to others, and (c) violates some rule." Antolovic continues, "Empirical observation convinces me that societies in which the golden rule is generally followed are happier, free of strife, and productive; reason tells me that I can live a good life in such a society, and it can guide me in contributing to its welfare, if I so choose. But reason also tells me that, under right circumstances and with right effort, I can acquire much more for myself by manipulating, destroying, robbing and enslaving others; the same reason will help me accomplish that objective also." 
p.29, "Since its 16th century beginnings, science has reached far and wide into the world of phenomena, and for perhaps a century now, it has continuously exploited its proven methods of investigation, making available an ever greater power over that phenomenal world. However, only a small fraction of its effort has been expended toward understanding the one thing which is both the source of scientific inquiry and the recipient of its fruits: the human mind." Not anymore, right? Neuroscience is a big deal these days. And psychology's been a big deal for a while. Even more "external" social sciences have been turning inward; consider, for example, the claim by economists that theirs is the science of human behavior. And then there's computer science, machine learning, artificial intelligence. But this: "We accept, and always have accepted, that procreative aggression of young males – the bellicosity of the rut – will be harnessed for state’s purposes, making them into cannon fodder for whatever cause is being fought about at the moment. We accept that civilized peoples can and will be coaxed back into the depths of pre-civilized horde loyalty and set against some conveniently chosen outside group as the 'enemy.' We observe public words and actions of decision makers of nuclear-armed nations, and we recognize in them thinly disguised impulses of the dominant animal in a primate horde – and we accept that as natural. We allow the fruits of technological progress to be used for vertiginous enrichment of individuals who are devoid of all but a boundless drive for acquisition, and we do not see this drive as a pathology, a personality disorder: rather, we see it as a trait to be envied and lived out vicariously through admiration." Ouch. As a human, I feel the shock of recognition. p.32, "First truly scientific insights into the mind came with the work of Sigmund Freud.
Freud's method of investigation was not empirical observation, but rather introspection, but he used introspection as if it were empirical observation of external phenomena. He regarded the patients' introspective monologues as authentic and reliable observables of the mind, although he did not treat them as literal reports, but as material to be analyzed. . . . Freud's work has in it much that is speculative, and it does not (yet) exhibit the rigor of a developed scientific discipline . . ."

He's no Freud-worshipper: "Proponents of psychoanalysis in their turn believe that dark subconscious impulses and conflicts can be resolved by reason, once they have been brought into the light of consciousness by analysis. In reality, psychoanalytical approach has been shown to have limited success even in its original role as a clinical therapy for neuroses, and it is entirely impractical to think that the 'talking cure' could be employed to lead the broader mankind out of instinctual darkness."

But: "The contribution of [Freud's] work lies in having proposed both a methodology and a set of working hypotheses in an area of science which is still deficient in that respect today."

p.33, "Human governance throughout history has amounted mostly to murderous rule by individuals whose only claim to power was that they wanted it badly enough to fight for it; this accompanied by equally murderous sycophancy of the ruled, usually directed against the heretic, the infidel, the traitor to the cause, the 'other.' In modern times, unfettered overconsumption is practiced by most of the western populations, accompanied by equally grotesque over-accumulation of wealth and economic power by a few individuals. All of these behaviors can be readily recognized as driven by primitive instincts that were unilaterally freed from their natural constraints, their effects amplified by human power over nature."

The rest of Antolovic's book is interesting too.
The post "A mixed economy is not an economic abomination or even a regrettably unavoidable political necessity but a natural absorbing state," and other notes on "Whither Science?" by Danko Antolovic appeared first on Statistical Modeling, Causal Inference, and Social Science.

OK, not quite D&D--I just wrote that to get Bob's attention. It is a role-playing game, though!
Here's the paper, "Seeing the World Through the Other's Eye: An Online Intervention Reducing Ethnic Prejudice," by Gabor Simonovits, Gabor Kezdi, and Peter Kardos:

We report the results of an intervention that targeted anti-Roma sentiment in Hungary using an online perspective-taking game. We evaluated the impact of this intervention using a randomized experiment in which a sample of young adults played this perspective-taking game, or an unrelated online game. Participation in the perspective-taking game markedly reduced prejudice, with an effect-size equivalent to half the difference between voters of the far-right and the center-right party. The effects persisted for at least a month, and, as a byproduct, the intervention also reduced antipathy toward refugees, another stigmatized group in Hungary, and decreased vote intentions for Hungary's overtly racist, far-right party by 10%. Our study offers a proof-of-concept for a general class of interventions that could be adapted to different settings and implemented at low costs.

Simonovits wrote:

The paper is similar to some existing social psychology studies on perspective taking but we made an effort to improve on the credibility of the analysis by (1) using a relatively large sample (2) registering and following a pre-analysis plan (3) using pre-treatment measures to explore differential attrition and (4) estimating long term effects of the treatment. It got desk-rejected from PNAS and Psych Science but was just accepted for publication in APSR.

I agree that: (1) a large sample can't hurt, (2) preregistration makes this sort of result much more believable, (3) using pre-treatment variables can be crucial in getting enough precision to estimate what you care about, and (4) richer outcome measures can help a lot. The post Using D&D to reduce ethnic prejudice appeared first on Statistical Modeling, Causal Inference, and Social Science.
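Point (3) is worth a quick demonstration. Here is a purely illustrative simulation (my own sketch, not the authors' data or design): in a randomized experiment whose outcome correlates strongly with a pre-treatment measure, adjusting for that measure shrinks the standard error of the treatment-effect estimate substantially.

```python
# Hypothetical simulation: why pre-treatment covariates buy precision in a
# randomized experiment. All numbers here are made up for illustration.
import numpy as np

rng = np.random.default_rng(42)
n = 1000
baseline = rng.normal(0, 1, n)        # hypothetical pre-treatment prejudice score
treat = rng.integers(0, 2, n)         # randomized assignment
# outcome depends strongly on the baseline measure, weakly on treatment
y = 0.8 * baseline - 0.2 * treat + rng.normal(0, 0.6, n)

def last_coef_se(X, y):
    """Standard error of the last coefficient in an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    return np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[-1, -1])

se_unadj = last_coef_se(np.column_stack([np.ones(n), treat]), y)
se_adj = last_coef_se(np.column_stack([np.ones(n), baseline, treat]), y)
print(se_unadj, se_adj)  # the adjusted analysis is markedly more precise
```

The gain is mechanical: adjustment removes the part of the outcome variance that the baseline measure explains, so the residual standard deviation, and hence the standard error, drops.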

Tom Wolfe on evolution:

I think it's misleading to say that human beings evolved from animals. I mean, actually, nobody knows whether they did or not.

This is just sad. Does Wolfe really think this? My guess is he's trying to do a solid for his political allies. Jerry Coyne writes:

Somewhere on his mission to tear down the famous, elevate the neglected outsider and hit the exclamation-point key as often as possible, Wolfe has forgotten how to think.

Well put. But I think Wolfe _does_ know how to think. You know what they say, right? "Any prosecutor can convict a guilty man. It takes a great prosecutor to convict an innocent man." Similarly, I think Wolfe takes it as a point of pride that, as a great writer, he can make the case for something as ridiculous as anti-Darwinism. And, after all, who goes to Tom Wolfe to learn about science? The man's an entertainer. This is not to defend Wolfe's statement, which is flat-out ridiculous, comparable to that of Kenneth Ludmerer, a professor of history and medicine at Washington University in St. Louis who testified that he had "no opinion" on whether cigarette smoking contributes to the development of lung cancer in human beings--and he said that in 2002, that's right, 38 years _after_ the Surgeon General's report. I just think we should take it in context: Wolfe doesn't give a damn about science but he cares a lot about politics, so he probably thinks it's charming to say something ridiculous with a straight face, his way to give a poke in the eye to those pesky experts who know more than he does about something. That's right. Tom Wolfe is a low-rent G. K. Chesterton (or, to put it in modern terms, a witty, intelligent, socially conscious version of Michael Kinsley). The post When people proudly take ridiculous positions appeared first on Statistical Modeling, Causal Inference, and Social Science.

Matt Espe writes:

Here is a new paper citing Stan and the rstanarm package. Yield gap analysis of US rice production systems shows opportunities for improvement. Matthew B. Espe, Kenneth G. Cassman, Haishun Yang, Nicolas Guilpart, Patricio Grassini, Justin Van Wart, Merle Anders, Donn Beighley, Dustin Harrell, Steve Linscombe, Kent McKenzie, Randall Mutters, Lloyd T. Wilson, Bruce A. Linquist. Field Crops Research. Volume 196, September 2016, Pages 276–283. Many thanks to everyone on the development team for some excellent tools!

I've not read the paper, but, hey, if Stan can improve U.S. rice yields by a factor of 1.5, that's cool. Then all our research will have been worth it. The post Using Stan to improve rice yields appeared first on Statistical Modeling, Causal Inference, and Social Science.

I'm speaking Mon 13 Nov, 6pm, at the Low Library Rotunda at Columbia. Online reservation is ~~required; follow the link~~ currently full and closed. This will be a talk for a general audience.
The post The Statistical Crisis in Science--and How to Move Forward (my talk next Monday 6pm at Columbia) appeared first on Statistical Modeling, Causal Inference, and Social Science.

The Statistical Crisis in Science--and How to Move Forward: Using examples ranging from elections to birthdays to policy analysis, Professor Andrew Gelman will discuss ways in which statistical methods have failed, leading to a replication crisis in much of science, as well as directions for improvements through statistical methods that make use of more information.

Jacob Schumaker writes:

Reformed political scientist, now software engineer here. Re: the hot hand fallacy fallacy from Miller and Sanjurjo, has anyone discussed why a basic regression doesn't solve this? If they have, I haven't seen it. The idea is just that there are other ways of measuring the hot hand. When I think of it, it's the difference in the probability of making a shot between someone who just made a shot and someone who didn't. In that case, your estimate is unbiased right? The fallacy identified by Miller and Sanjurjo only matters if you analyze the data in a certain way, right?

My quick answer: (a) hotness is not just about the last shot you made or missed, so yours will be a very noisy measure, and (b) with finite sequences, your approach will have the same bias as in Gilovich et al.'s estimate. To put it another way, the regression you can do on the data is not the regression you want to do; it's a regression with measurement error in x, and that gives you a biased estimate; also there's the selection issue that Miller and Sanjurjo discussed. There's really no easy way out. The post Why you can't simply estimate the hot hand using regression appeared first on Statistical Modeling, Causal Inference, and Social Science.
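Point (b) is easy to see by simulation. This is my own sketch (the sequence length and shooting probability are made up for illustration): a 50% shooter with no hot hand at all, observed over short sequences. The naive estimate P(hit | previous hit) - P(hit | previous miss) averages well below zero, which is exactly the finite-sample selection bias Miller and Sanjurjo identified.

```python
# Simulate i.i.d. shooting sequences (true "hot hand" effect = 0) and compute
# the naive conditional-frequency difference within each sequence.
import numpy as np

rng = np.random.default_rng(0)
n_shots, n_sims = 10, 100_000
diffs = []
for _ in range(n_sims):
    shots = rng.random(n_shots) < 0.5
    after_hit = shots[1:][shots[:-1]]       # outcomes following a hit
    after_miss = shots[1:][~shots[:-1]]     # outcomes following a miss
    if len(after_hit) and len(after_miss):  # need both conditions to occur
        diffs.append(after_hit.mean() - after_miss.mean())

print(np.mean(diffs))  # clearly negative, despite a true difference of 0
```

The bias shrinks as sequences get longer, but for realistic game-length sequences it never vanishes, which is why the regression-style estimate inherits the same problem as the original Gilovich et al. analysis.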

It would be interesting if someone were to make a museum exhibit showing the timeline of humans and hominids, and, below that, children's toys and literature illustrating how these guys were represented in popular media. It probably already exists, right?
P.S. I feel kinda bad that this bumped Dan's more important, statistically-related post. So go back and read Dan's post again, hear?
The post Planet of the hominids? We wanna see this exposition. appeared first on Statistical Modeling, Causal Inference, and Social Science.

_But I got some ground rules I've found to be sound rules and you're not the one I'm exempting. Nonetheless, I confess it's tempting. – Jenny Toomey sings Franklin Bruno_
It turns out that I did something a little controversial in last week's post. As these things always go, it wasn't the thing I was expecting to get push back from, but rather what I thought was a fairly innocuous scaling of the prior. One commenter (and a few other people on other communication channels) pointed out that the dependence of the prior on the design didn't seem kosher. Of course, we (Andrew, Mike and I) wrote a paper that was sort of about this a few months ago, but it's one of those really interesting topics that we can probably all deal with thinking more about.
So in this post, I'm going to go into a couple of situations where it makes sense to scale the prior based on fixed information about the experiment. (The emerging theme for these posts is "things I think are interesting and useful but are probably not publishable" interspersed with "weird digressions into musical theatre / the personal mythology of Patti LuPone".)
If you haven't clicked yet, this particular post is going to be drier than Eve Arden in Mildred Pierce. If you'd rather be entertained, I'd recommend Tempting: Jenny Toomey sings the songs of Franklin Bruno. (Franklin Bruno is today's stand in for Patti, because I'm still sad that War Paint closed. I only got to see it twice.)
(Toomey was one of the most exciting American indie musicians in the 90s both through her bands [Tsunami was the notable one, but there were others] and her work with Simple Machines, the label she co-founded. These days she's working in musician advocacy and hasn't released an album since the early 2000s. Bruno's current band is called The Human Hearts. He has had a long solo career and was also in an excellent powerpop band called Nothing Painted Blue, who had an album called The Monte Carlo Method. And, now that I live in Toronto, I should say that that album has a fabulous cover of Mark Szabo's I Should Be With You. To be honest, the only reason I work with Andrew and the Stan crew is that I figure if I'm in New York often enough I'll eventually coincide with a Human Hearts concert.)
SPARSITY
_Why won't you cheat with me? You and I both know you've done it before. – Jenny Toomey sings Franklin Bruno_
The first object of our affliction is priors that promote sparsity in high-dimensional models. There has been _a lot_ of work on this topic, but the cheater's guide is basically this:

While spike-and-slab models can exactly represent sparsity and have excellent theoretical properties, they are basically useless from a computational point of view. So we use scale-mixture-of-normal priors (also known as local-global priors) to achieve approximate sparsity, and then use some sort of decision rule to take our approximately sparse signal and make it exactly sparse.

What is a scale-mixture of normals? Well, it has the general form $\beta_j \mid \lambda_j, \tau \sim N(0, \tau^2\lambda_j^2)$, where $\tau$ is a global standard deviation parameter, controlling how large the parameters are in general, while the local standard deviation parameters $\lambda_j$ control how big each parameter is allowed to be locally. The priors for $\tau$ and the $\lambda_j$ are typically set to be independent. A lot of theoretical work just treats $\tau$ as fixed (or as otherwise less important than the local parameters), but this is wrong.

_Interpretation note:_ This is a crappy parameterisation. A better one would constrain the $\lambda_j$ to lie on a simplex. This would then give us the interpretation that $\tau$ is the overall standard deviation if the covariates are properly scaled, and the local parameters control how the individual parameters contribute to this variability. The standard parameterisation leads to some confounding between the scales of the local and global parameters, which can lead to both interpretational and computational problems. Interestingly, Bhattacharya _et al._ showed that in some specific cases you can go from a model where the local parameters are constrained to the simplex to the unconstrained case (although they parameterised with the variance rather than the standard deviation).

_Pedant's corner:_ Andrew likes to define mathematical statisticians as those who use _x_ for their data rather than _y_. I prefer to characterise them as those who think it's a good idea to put a prior on the variance (an unelicitable quantity) rather than the standard deviation (which is easy to have opinions about). Please, people, _just stop doing this. You're not helping yourselves!_

Actually, maybe that last point isn't for Pedant's Corner after all. Because if you parameterise by standard deviation it's pretty easy to work out what the marginal prior on $\beta_j$ (with $\tau$ fixed) is. This is quite useful because, with the notable exception of the "Bayesian" "Lasso" which-does-not-work-but-will-never-die-because-it-was-inexplicably-published-in-the-leading-stats-journal-by-prominent-statisticians-and-has-the-word-Lasso-in-the-title-even-though-a-back-of-the-envelope-calculation-or-I-don't-know-a-fairly-straightforward-simulation-by-the-reviewers-should-have-nixed-it (to use its married name), we can't compute the marginal prior for most scale-mixtures of normals in closed form. The following result, which was killed by reviewers at some point during the PC prior paper's long review process, but lives forever in the V1 paper on arXiv, tells you everything you need to know. It's a picture because frankly I've had a glass of wine and I'm not bloody typing it all again. For those of you who don't want to click through, it basically says the following:

* If the density of the prior on the standard deviation is finite at zero, then the implied prior on $\beta_j$ has a logarithmic spike at zero.
* If the density of the prior on the standard deviation has a polynomial tail, then the implied prior on $\beta_j$ has the same polynomial tail.

(Not in the result, but computed at the time: if the prior on the standard deviation is exponential, the prior on $\beta_j$ still has Gaussian-ish tails. I couldn't work out what happened in the hinterland between exponential tails and polynomial tails, but I suspect at some point the tail on the standard deviation does eventually get heavy enough to be seen in the marginal, but I can't tell you when.)
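The spike-at-zero part of that result is easy to check numerically. This is my own sketch, not code from the paper: put a half-normal prior on the standard deviation (its density is finite at zero), integrate the scale out, and watch the marginal density of the coefficient keep growing as the coefficient approaches zero.

```python
# Numerical check: a prior on the sd with finite density at zero induces a
# marginal prior on beta whose density blows up (logarithmically) at zero.
import numpy as np
from scipy import integrate, stats

def marginal_density(beta, sd_prior_pdf):
    """p(beta) = int_0^inf N(beta | 0, sigma^2) p(sigma) dsigma,
    integrated over log(sigma) for numerical stability near sigma = 0."""
    g = lambda u: (stats.norm.pdf(beta, scale=np.exp(u))
                   * sd_prior_pdf(np.exp(u)) * np.exp(u))
    val, _ = integrate.quad(g, -30.0, 10.0, limit=200)
    return val

# half-normal prior on the standard deviation: finite density at zero
dens = [marginal_density(b, stats.halfnorm.pdf) for b in (0.1, 0.01, 0.001)]
print(dens)  # strictly increasing as beta -> 0
```

The growth is slow (each factor-of-10 step toward zero adds roughly a constant to the density), which is the logarithmic spike the result describes rather than a polynomial blow-up.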
With this sort of information, you can compute the equivalent of the bounds that I did on the Laplace prior for the general case (or, actually, for the case that will have at least a little bit of a chance, which is the monotonically decreasing priors on the standard deviation).

Anyway, that's a long way around to saying that you get similar things for all computationally useful models of sparsity. Why? Well, basically it's because these models are a dirty hack. They don't allow us to represent exactly sparse signals, so we need to deal with that somehow. The somehow is through some sort of decision process that can tell a zero from a non-zero. Unfortunately, this decision is going to depend on the precision of the measurement process, which strongly indicates that it will need to know about things like the sample size, the noise level, and the design. One way to represent this is through an interaction with the prior.

_You'd look better if your shadow didn't follow you around, but it looks as though you're tethered to the ground, just like every pound of flesh I've ever found. – Franklin Bruno in a sourceless light._

For a very simple decision process (the deterministic threshold process described in the previous post), you can work out exactly how the threshold needs to interact with the prior. In particular, consider trying to detect a true signal that is exactly zero (no components are active). An exactly zero signal is not possible under these scale-mixture models, but we can require that in this case all of the components fall below the detection threshold. The calculation in the previous post shows that if we want this sort of almost-zero signal to have any mass at all under the prior, we need to scale the prior using information about the design. Of course, this is a very, very simple decision process. I have absolutely no idea how to repeat these arguments for actually good decision processes, like the predictive loss minimization favoured by Aki.
But I'd still expect that we'd need to make sure there was _a priori_ enough mass in the areas of the parameter space where the decision process is firmly one way or another (as well as mass in the indeterminate region). I doubt that the Bayesian Lasso would magically start to work under these more complex losses.

MODELS SPECIFIED THROUGH THEIR FULL CONDITIONALS

_Why won't you cheat with me? You and I both know that he's done the same. – Jenny Toomey sings Franklin Bruno_

So we can view the design dependence of sparsity priors as preparation for the forthcoming decision process. (Those of you who just mentally broke into Prepare Ye The Way Of The Lord from Godspell, please come to the front of the class. You are my people.) Now let's talk about a case where this isn't true. To do this, we need to cast our minds back to a time when people really did have the original cast recording of Godspell on their minds. In particular, we need to think about Julian Besag (who I'm sure was really into musicals about Jesus. I have no information to the contrary, so I'm just going to assume it's true), who wrote a series of important papers, one in 1974 and one in 1975 (and several before and after, but I can't be arsed linking to them all. We all have google), about specifying models through conditional independence relations.

These models have a special place in time series modelling (where we all know about discrete-time Markovian processes) and in spatial statistics. In particular, generalisations of Besag's (Gaussian) conditional autoregressive (CAR) models are widely used in spatial epidemiology. Mathematically, Gaussian CAR models (and more generally Gaussian Markov models on graphs) are defined through their _precision_ matrix, that is, the inverse of the covariance matrix, as $x \sim N(0, (\tau R)^{-1})$. For simple models, such as the popular CAR model, we assume $R$ is fixed, known, and sparse (i.e. it has a lot of zeros), and we typically interpret $\tau$ as the inverse of the variance of $x$.
This interpretation of $\tau$ could not be more wrong. Why? Well, let's look at the marginal variances. To interpret $\tau$ as the inverse variance, we need the diagonal elements of $R^{-1}$ to all be around 1. _This is never the case._

A simple, mathematically tractable example is the first-order random walk on a one-dimensional lattice, which can be written in terms of the increment process as $x_{i+1} - x_i \sim N(0, \tau^{-1})$. Conditioned on a particular starting point, this process looks a lot like a discrete version of Brownian motion as you move the lattice points closer together. This is a useful model for rough non-linear random effects, such as the baseline hazard rate in a Cox proportional hazards model. A long and detailed (and quite general) discussion of these models can be found in Rue and Held's book.

I am bringing this case up because you can actually work out the size of the diagonal of $R^{-1}$. Sørbye and Rue talk about this in detail, but for this model maybe the easiest way to understand it is as follows. Suppose we had a fixed lattice with $n$ points and we'd carefully worked out a sensible prior for $\tau$. Now imagine that we've gotten some new data, and instead of only $n$ points in the lattice, we got information at a finer scale, so now the same interval is covered by $kn$ equally spaced nodes. We model this with a new first-order random walk prior with precision $\tau'$. It turns out that we can relate the inverse variances of these two increment processes as $\tau' = k \tau$. This _strongly_ suggests that we should not use the same prior for $\tau'$ as for $\tau$, but that the prior should actually know about how many nodes there are on the lattice. Concrete suggestions are in the Sørbye and Rue paper linked above.

_Not to coin a phrase, but play it as it lays – Franklin Bruno in Nothing Painted Blue_

This type of design dependence is a general problem for multivariate Gaussian models specified through their precision (so-called Gaussian Markov random fields).
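The lattice-refinement argument boils down to a one-line calculation, sketched here in my own notation: for a RW1 started at zero with increment precision tau, the marginal variance of the final node is (n - 1)/tau, so refining the lattice by a factor k while keeping tau fixed inflates the scale of the effect by roughly k, and rescaling to tau' = k*tau restores it.

```python
# Variance of the last node of a RW1: it is a sum of n - 1 independent
# increments, each with variance 1/tau.

def rw1_marginal_var(n_nodes: int, tau: float) -> float:
    return (n_nodes - 1) / tau

n, k, tau = 50, 4, 2.0
coarse = rw1_marginal_var(n, tau)                 # prior scale on the coarse lattice
fine_same_tau = rw1_marginal_var(k * n, tau)      # same tau, finer lattice: ~k times bigger
fine_rescaled = rw1_marginal_var(k * n, k * tau)  # tau' = k * tau: scale restored
print(coarse, fine_same_tau, fine_rescaled)
```

Same prior on tau, different lattice, different prior on the thing you actually care about (the scale of the random effect): that is the whole problem in miniature.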
The critical thing here is that, unlike the sparsity case, the design dependence does not come from some type of decision process. It comes from the _gap_ between the _parameterisation_ (in terms of $\tau$ and $R$) and the _elicitable quantity_ (the scale of the random effect).

GAUSSIAN PROCESS MODELS

_And it's not like we're tearing down a house of more than gingerbread. It's not like we're calling down the wrath of heaven on our heads. – Jenny Toomey sings Franklin Bruno_

So the design dependence doesn't necessarily come in preparation for some kind of decision; it can also arise because we have constructed (and therefore parameterised) our process in an inconvenient way. Let's see if we can knock out another one before my bottle of wine dies.

Gaussian processes, the least exciting tool in the machine learner's toolbox, are another example where your priors _need_ to be design dependent. It will probably surprise you not a single sausage that in this case the need for design dependence comes from a completely different place. For simplicity, let's consider a Gaussian process in one dimension with an isotropic covariance function from the commonly encountered Whittle-Matérn family. The distinguished members are the exponential covariance function, corresponding to smoothness $\nu = 1/2$, and the squared exponential covariance function, which is the limit as $\nu \to \infty$.

One of the inconvenient features of Matérn models in 1-3 dimensions is that it is impossible to consistently recover all of the parameters by simply observing more and more of the random effect on a fixed interval. You need to see new replicates in order to properly pin these down. So one might expect that this non-identifiability would be the source of some problems. One would be wrong. The squared exponential covariance function does not have this pathology, but it's still very, very hard to fit. Why? Well, the problem is that you can interpret $\kappa$ as an inverse-range parameter.
Roughly, the interpretation is that if two locations are separated by much more than $1/\kappa$, then the value of the process at one location is approximately independent of the value at the other. This means that a fixed data set provides _no information_ about $\kappa$ in large parts of the parameter space. In particular, if $1/\kappa$ is bigger than the range of the measurement locations, then the data has no information about the parameter. Similarly, if $1/\kappa$ is smaller than the smallest distance between two data points (or, for irregular data, this should be something like "smaller than some low quantile of the set of distances between points"), then the data will have nothing to say about the parameter.

Of these two scenarios, it turns out that the inference is much less sensitive to the prior on small values of $\kappa$ (i.e. ranges longer than the data) than it is on large values of $\kappa$ (i.e. ranges shorter than the data). Currently, we have two recommendations: one based around PC priors and one based around inverse gamma priors (the second link is to the Stan manual). But both of these require you to specify the design-dependent quantity of a "minimum length scale we expect this data set to be informative about". Betancourt has some lovely Stan case studies on this that I assume will migrate to the mc-stan.org/ website eventually.

_I'm a disaster, you're a disaster, we're a disaster area. – Franklin Bruno in The Human Hearts (featuring alto extraordinaire and cabaret god Ms Molly Pope)_

So in this final example we hit our ultimate goal: a case where design-dependent priors are needed not because of a hacky decision process, or an awkward specification, but due to the limits of the data. In this case, priors that do not recognise the limitation of the design of the experiment will lead to poorly behaving posteriors. Here, it manifests as the Gaussian process severely over-fitting the data.
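The "no information beyond the data window" point can be seen directly in the marginal likelihood. This is my own sketch (not from the linked case studies; all parameter values are made up): simulate noisy observations on [0, 1] from a squared-exponential GP, then compare marginal log-likelihoods across length-scales. Length-scales far beyond the data window (20 vs 200) are nearly indistinguishable, while length-scales inside the window (0.02 vs 0.2) are not.

```python
# Compare GP marginal log-likelihoods at length-scales inside and far beyond
# the observation window.
import numpy as np

def gp_loglik(y, x, length_scale, sigma_f=1.0, sigma_n=0.3):
    """Marginal log-likelihood of y under a zero-mean SE-kernel GP plus noise."""
    d = x[:, None] - x[None, :]
    K = sigma_f**2 * np.exp(-0.5 * (d / length_scale)**2) \
        + sigma_n**2 * np.eye(len(x))
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + len(x) * np.log(2 * np.pi))

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
d = x[:, None] - x[None, :]
K_true = np.exp(-0.5 * (d / 0.2)**2) + 1e-8 * np.eye(30)  # true length-scale 0.2
y = np.linalg.cholesky(K_true) @ rng.normal(size=30) + 0.3 * rng.normal(size=30)

ll = {ls: gp_loglik(y, x, ls) for ls in (0.02, 0.2, 20.0, 200.0)}
print(abs(ll[20.0] - ll[200.0]), abs(ll[0.02] - ll[0.2]))
```

Because the likelihood is essentially flat over all length-scales longer than the window, whatever posterior mass ends up there is dictated by the prior alone, which is exactly why the prior needs to know the design.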
This is the ultimate expression of the point that we tried to make in the Entropy paper: _The prior can often only be understood in the context of the likelihood._

PRINCIPLES CAN ONLY GET YOU SO FAR

_I'm making scenes, you're constructing dioramas – Franklin Bruno in Nothing Painted Blue_

Just to round this off, I guess I should mention that the strong likelihood principle really does suggest that certain details of the design are not relevant to a fully Bayesian analysis. In particular, if the design only pops up in the normalising constant of the likelihood, it should not be relevant to a Bayesian. This seems at odds with everything I've said so far. But it's not. In each of these cases, the design was only invoked in order to deal with some external information. For sparsity, design was needed to properly infer a sparse signal and came in through the structure of the decision process. For Gaussian processes, the same thing happened: the implicit decision criterion was that we wanted to make _good predictions_. The design told us which parts of the parameter space obstructed this goal, and a well specified prior removed the problem. For the CAR models, the external information was that the elicitable quantity was the marginal standard deviation, which was a complicated function of the design and the standard parameter.

There are also any number of cases in real practice where the decision at hand is stochastically dependent on the data gathering mechanism. This is why things like MRP exist. I guess this is the tl;dr version of this post (because apparently I'm too wordy for some people. I suggest they read other things. Of course, suggesting this in the final paragraph of such a wordy post is very me.): _Design matters even if you're Bayesian. Especially if you want to do something with your posterior that's more exciting than just sitting on it._

The post Why won't you cheat with me? appeared first on Statistical Modeling, Causal Inference, and Social Science.

Retraction Watch linked to this paper, "Publication bias and the canonization of false facts," by Silas Nissen, Tali Magidson, Kevin Gross, and Carl Bergstrom, posted in the Physics and Society section of arXiv, which is kind of odd since it has nothing whatsoever to do with physics. Nissen et al. write:

In the process of scientific inquiry, certain claims accumulate enough support to be established as facts. Unfortunately, not every claim accorded the status of fact turns out to be true. In this paper, we model the dynamic process by which claims are canonized as fact through repeated experimental confirmation. . . . In our model, publication bias--in which positive results are published preferentially over negative ones--influences the distribution of published results.

I don't really have any comments on the paper itself--I'm never sure when these mathematical models are adding to our understanding of social processes, and when they're just adding confusion--but there was a side point to be made, and this is that there's selection bias within, as well as between, publications. Indeed, I suspect the "within" bias is larger than the "between." Consider two scenarios:

1. Within-publication bias: A researcher studies topic X, gathers data, and does what it takes to get a publication. There's a bias toward finding statistical significance, a bias toward overestimating effect sizes, a bias toward overconfidence in conclusions, and a bias toward finding something ideologically appealing to the researcher.

2. Between-publication bias: A researcher, or set of researchers, works on a series of projects. Some are successful and some are not. The successful studies get published. This results in the same biases as before.

Both scenarios happen, but I suspect that scenario 1, within-publication bias, is more important. I don't think researchers have that many papers in their file drawers, and I also think that they have lots of degrees of freedom allowing them to find success in their data. I wrote the above post because I worry that when people talk about publication bias, they're thinking too much about the publication/nonpublication decision, and not enough about all the bias that goes into what people decide to report at all.
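Scenario 1 is easy to simulate. This toy model is my own sketch, not the Nissen et al. model: every researcher measures 20 outcomes whose true effect is exactly zero, then reports only the analysis with the largest |z|. The published effect sizes are badly inflated even though nothing was left in any file drawer.

```python
# Toy model of within-publication bias: pick the "best" of 20 null analyses.
import numpy as np

rng = np.random.default_rng(7)
n_studies, n_outcomes, n_per_arm = 1000, 20, 50
published = []
for _ in range(n_studies):
    # difference in means between two arms, for each candidate outcome
    treat = rng.normal(0.0, 1.0, (n_outcomes, n_per_arm)).mean(axis=1)
    control = rng.normal(0.0, 1.0, (n_outcomes, n_per_arm)).mean(axis=1)
    diffs = treat - control                         # true effect is 0 for all
    z = diffs / np.sqrt(2.0 / n_per_arm)
    published.append(abs(diffs[np.argmax(np.abs(z))]))  # report the "best" one

print(np.mean(published))  # far from the true effect of zero
```

No publication/nonpublication filter is applied here at all: every study "publishes," and the inflation comes entirely from the within-study choice of which analysis to report.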
The post The Night Riders appeared first on Statistical Modeling, Causal Inference, and Social Science.

Ed Yong writes:

Over the past decade, social psychologists have dazzled us with studies showing that huge social problems can seemingly be rectified through simple tricks. A small grammatical tweak in a survey delivered to people the day before an election greatly increases voter turnout. A 15-minute writing exercise narrows the achievement gap between black and white students—and the benefits last for years. “Each statement may sound outlandish--more science fiction than science,” wrote Gregory Walton from Stanford University in 2014. But they reflect the science of what he calls “wise interventions” . . . They seem to work, if the stream of papers in high-profile scientific journals is to be believed. But as with many branches of psychology, wise interventions are taking a battering. A new wave of studies that attempted to replicate the promising experiments have found discouraging results. At worst, they suggest that the original successes were mirages. At best, they reveal that pulling off these tricks at a large scale will be more challenging than commonly believed.

Well put. Yong gives an example:

Consider a recent study by Christopher Bryan (then at Stanford, now at University of Chicago), along with Walton and others. During the 2008 U.S. presidential election, they sent a survey to 133 Californian voters. Some were asked: “How important is it to you to vote in the upcoming election?” Others received the same question but with a slight tweak: “How important is it to you to be a voter in the upcoming election?” Once the ballots were cast, the team checked the official state records. They found that 96 percent of those who read the “be a voter” question showed up to vote, compared to just 82 percent of those who read the “to vote” version. A tiny linguistic tweak led to a huge 14 percentage point increase in turnout. The team repeated their experiment with 214 New Jersey voters in the 2009 gubernatorial elections, and found the same large effect: changing “vote” to “be a voter” raised turnout levels from 79 percent to 90 percent.

Wow! Sounds pretty impressive. But should we really trust it? Yong continues:

When Alan Gerber heard about the results, he was surprised. As a political scientist at Yale University, he knew that previous experiments involving thousands of people had never mobilized voters to that degree. Mail-outs, for example, typically increase turnout by 0.5 percentage points, or 2.3 if especially persuasive. And yet changing a few words apparently did so by 11 to 14 percentage points. . . . So he repeated Bryan’s experiment. His team delivered the same survey to 4,400 voters in days leading up to the 2014 primary elections in Michigan, Missouri, and Tennessee. And they found that using the noun version instead of the verb one had no effect on voter turnout. None. Their much larger study, with 20 to 33 times the participants of Bryan’s two experiments, completely failed to replicate the original effects. Melissa Michelson, a political scientist at Menlo College, isn’t surprised. She was never quite convinced about how robust Bryan’s results were, or how useful they would be. . . . Jan Leighley from American University agrees. The small sample size of the original study “would have tanked the paper from consideration in a serious political science journal,” she says.

That last bit is funny because the paper in question appeared in . . . you guessed it, PNAS! It's tough being a social scientist: work that's not strong enough to appear in their own journals gets into Science or Nature or PNAS, where it gets more publicity than anything in APSR, AJPS, etc. I looked at the Bryan et al. paper and it does have some issues. First, the estimated effect sizes are huge and strain plausibility, which implies high rates of type M and type S errors. Second, there are some forking paths. For example:

Because the distribution of reported interest in registering to vote was negatively skewed (Z = −2.78, P = 0.005), the variable was reflected and then square-root transformed, which reduced skew to nonsignificance (Z = −1.76, P = 0.078). A t test on the transformed variable yielded a significant condition difference [t(32) = 2.10, P = 0.044]. Analysis of the untransformed variable also yields a significant result [(Mnoun = 4.44; Mverb = 3.39; 1 = “not at all interested,” 5 = “extremely interested”), t(32) = 2.23, P = 0.033]. A significant Levene’s test indicated that there was less variance in the noun condition than in the verb condition [F(1,32) = 6.02, P = 0.020]. This appeared to be the case because of a ceiling effect in the noun condition, where 62.5% of participants were at the highest point on the scale (compared with 38.9% in the verb condition). Adjusting for this, the significance level of the condition effect strengthened slightly [t(29.40) = 2.15, P = 0.040]. In addition, a separate χ2 analysis, which does not rely on the assumption of the equality of variance, found that more participants indicated that they were “very” or “extremely” interested in registering to vote (as opposed to “not at all,” “a little,” or “somewhat” interested in registering to vote) in the noun condition (87.5%) than in the verb condition (55.6%) [χ2(1, n = 34) = 4.16, P = 0.041].

From one direction, this looks pretty good: they tried the analysis all sorts of ways and always got statistical significance! From another perspective, though, we see all these researcher degrees of freedom lurking around. For example, the significance level "strengthened" at one point: this was a meaningless change from 0.044 to 0.040. Actually, even p-values of 0.1 and 0.01 are not statistically significantly different from each other! The point is that there are so many different ways to slice this cake and they keep reporting those p-values of 0.03 or 0.04 or whatever.
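That claim about p-values deserves a quick check. Following the Gelman and Stern "the difference between significant and not significant is not itself statistically significant" argument, here is a back-of-the-envelope calculation (my own sketch, assuming two independent estimates with equal standard errors) showing that estimates with p = 0.1 and p = 0.01 are not significantly different from each other:

```python
from scipy import stats

# z-scores corresponding to two-sided p-values of 0.1 and 0.01:
z1 = stats.norm.isf(0.1 / 2)    # about 1.64
z2 = stats.norm.isf(0.01 / 2)   # about 2.58

# Treating these as two independent estimates with unit standard errors,
# the standard error of their difference is sqrt(2):
z_diff = (z2 - z1) / 2 ** 0.5
p_diff = 2 * stats.norm.sf(z_diff)
print(round(p_diff, 2))  # about 0.51 -- nowhere near significant
```

So a comparison between a "significant" and a "nonsignificant" result can itself be consistent with pure noise.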
There's nothing wrong with data exploration--I'm the last person to insist on preregistration--and I find this paper more interesting than the ESP paper, or the ovulation-and-voting paper, or the fat arms paper, or the beauty-and-sex-ratio paper, or lots of other silly stuff we've posted on in the past. But I can't take these p-values and effect-size estimates seriously. Yong discusses the Bryan et al. and Gerber et al. papers further, and summarizes:

And no matter who is right, it is clear that these wise interventions are not the simple tricks they’re made out to be.

Indeed. And remember the time-reversal heuristic.

The post The time reversal heuristic (priming and voting edition) appeared first on Statistical Modeling, Causal Inference, and Social Science.

A couple months ago, Uri Simonsohn posted online a suggested statistical method for detecting nonmonotonicity in data. He called it: "Two-lines: The First Valid Test of U-Shaped Relationships."
With a title like that, I guess you're asking for it. And, indeed, a while later I received an email from Yair Heller identifying some problems with Uri's method. After checking with Yair, I forwarded his message to Simonsohn, who found the problem and fixed it. Uri's update is here.
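For readers who haven't seen it, the two-lines idea is roughly: split the predictor at a breakpoint, fit a separate regression on each side, and declare a U-shape only if the two slopes are statistically significant with opposite signs. Here is a toy sketch of that logic (my own illustration on synthetic data, not Simonsohn's code, and ignoring his data-driven choice of breakpoint):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic U-shaped data: y is quadratic in x plus noise.
x = rng.uniform(-1, 1, 500)
y = x ** 2 + rng.normal(0, 0.1, 500)

def slope_and_se(x, y):
    """OLS slope and its standard error for the model y = a + b*x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    s2 = resid @ resid / (len(x) - 2)
    cov = s2 * np.linalg.inv(X.T @ X)
    return coef[1], np.sqrt(cov[1, 1])

split = 0.0  # the real method chooses the breakpoint from the data
b_lo, se_lo = slope_and_se(x[x < split], y[x < split])
b_hi, se_hi = slope_and_se(x[x >= split], y[x >= split])

# Declare a U-shape only if both slopes are individually "significant"
# with opposite signs:
u_shaped = (b_lo / se_lo < -2) and (b_hi / se_hi > 2)
print(u_shaped)  # True
```

The subtleties Yair raised live in exactly the parts this sketch glosses over: how the breakpoint is selected and what that does to the error rates.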
Now, I don't actually agree with Uri _or_ Yair on this one: I don't really buy the hypothesis-testing, type-1-error framework that they're using. But that's ok: it's not my job to vet their methods. If these ideas are useful to others, great.
My real point here is that post-publication review really worked. Uri posted something, Yair had a criticism, Uri responded. And if Yair's criticism had been fatal, I'm pretty sure Uri would've acknowledged it. Cos that's how we roll. Open discussion, open data, open code. It's what science is all about.
The post Post-publication review succeeds again! (Two-lines edition.) appeared first on Statistical Modeling, Causal Inference, and Social Science.

I came across this post by blogger Echidne slamming psychology professor Roy Baumeister. I'd first heard about Baumeister in the context of his seeming inability to handle scientific criticism. I hadn't realized that Baumeister had a sideline in pseudoscientific anti-political-correctness.
One aspect of all this that interests me is the way that Baumeister, and other scholars like him, seem to take some of the worst of the traditional left and right. From the 60's-style left, you get a kind of mystical attitude that reality isn't important, a mind-over-matter perspective exemplified by his view that "flair" and "intuition" are more important than boring number crunching. From the right (or, I guess we say now, the "alt-right"), there are science-style justifications of traditional sex roles, racial inequality, and a general feeling that rich people deserve to keep what they have.
Or, to move to political science, the claims that elections are decided by shark attacks and college football games can be given a leftist spin--We're not really living in a democracy! Voters are being manipulated!--or can be taken to support rightist positions: If voters are really so easily distracted, maybe the scope of democracy should be restricted to people such as business owners who have a real stake in the system.
In discussing the straddling of left/right themes of business school professor and plagiarist Karl Weick, here's what Thomas Basbøll and I had to say:

Unmoored to its original source, the story gets altered by the tellers so that it can be used to make any point that people want to make from it. We conclude with some comments on political ideology. Storytelling has been championed by a wide range of scholars who would like to escape the confines of rigor. On the academic left, storytelling is sometimes viewed as a humane alternative to the impersonal number crunching of economists, while the academic right uses stories to connect to worldly business executives who have neither the time nor patience for dry scholasticism. Karl Weick seems to us to express an unstable mix of these attitudes, championing the creative humanism of story-based social reasoning while offering his theories as useful truths for the business world. And indeed he may be correct in both these views . . .

Maybe it's nothing special; it's just the usual story that people will use the tropes available to them. After all, why should someone have to be on the political left to feel like a 60's-style rebel? Cultural commentators ranging from P. J. O'Rourke (on the right) to Thomas Frank (on the left) have been talking about this sort of thing for decades. So I'm not sure where this leads.

The post Pseudoscience and the left/right whiplash appeared first on Statistical Modeling, Causal Inference, and Social Science.

StanCon is happening at the beautiful Asilomar conference facility at the beach in Monterey California for three days starting January 10, 2018. We have space for 200 souls and this will sell out.
If you don't already know, Stan is the rising star of probabilistic modeling with Bayesian analysis. If you do statistics, machine learning or data science then you need to know about Stan.
StanCon offers a full schedule of invited talks, submitted papers, and tutorials unavailable in any other format. Balancing the intellectual intensity of cutting-edge statistical modeling are fun after-dinner activities like indoor R/C airplane building/flying/designing and non-snobby blind wine tasting. We will have the first ever "wear your poster" reception; see the call for posters below. And no parallel sessions: you get the entire StanCon2018, not a slice.
Go to http://mc-stan.org/events/stancon2018 and register.
INVITED TALKS
* Andrew Gelman
Department of Statistics and Political Science, Columbia University
* Susan Holmes
Department of Statistics, Stanford University
* Frank Harrell, Jr.
School of Medicine and Department of Biostatistics, Vanderbilt University
* Sophia Rabe-Hesketh
Educational Statistics and Biostatistics, University of California, Berkeley
* Sean Taylor and Ben Letham
Facebook Core Data Science
* Manuel Rivas
Department of Biomedical Data Science, Stanford University
* Talia Weiss
Department of Physics, Massachusetts Institute of Technology
These rock stars have agreed to leave their entourages, groupies and bad habits at home and will start their ~~shows~~ talks on time and leave you wanting more.
SUBMITTED TALKS:
We have 18 accepted talks ranging from public policy viewed through Bayesian analysis to painful theory papers. And we have Facebook, and space people from NASA. Talks are self-contained knitr or Jupyter notebooks that will be made publicly available after the conference.
TUTORIALS
We have tutorials that start at the crack of 8am for those desiring further edification beyond the awesome program. Total time ranges from 6 hours to 1 hour depending on topic—these will be parallel but don’t conflict with the main conference.
* Introduction to Stan
Know how to program? Know basic statistics? Curious about Bayesian analysis and Stan? This is the course for you. Hands on, focused and an excellent way to get started working in Stan. 2 hours every morning 8am to 10am.
* Executive decision making the Bayesian way
This is for nontechnical managers to learn the core of decision making under uncertainty and how to interpret the talks that they will be attending the rest of the day. 1 hour/day every day.
* Advanced Modeling in Stan
The hard stuff led by the best of the best. Very interactive, very intense. Varying topics, every day 1-2 hours.
POSTER CALL FOR PARTICIPATION
We will take poster submissions on a rolling basis until December 5th. One page exclusive of references is the desired format but anything that gives us enough information to make a decision is fine. We will accept/reject within 48 hours. Send to stancon2018@mc-stan.org.
The only somewhat odd requirement is that your poster must be "wearable" to the 5pm reception, where you will be a walking presentation. It's a great way to network. Signboard supplies will be available, so you need only bring sheets of paper, which can be attached to signboard material, which, coincidentally, will be the source airframe material for the R/C airplane activities following dinner.
FUN STUFF
Learning is fun but we anticipate that blowing off a little steam will be called for.
* R/C Airplanes
After dinner on day 1 we will provide designs and building materials to create your own R/C airplane. The core design can be scratch built in 90 minutes or less, at which point, weather permitting, we will learn to fly our planes indoors or outdoors. See http://brooklynaerodrome.com for an idea of the style of airplane. You can also create your own designs, and we will have night illumination gear.
* Snob-free Blind Wine Tasting
By day 2 you will have gotten to know your fellow attendees so some social adventure is called for. This activity has proved wildly successful at DARPA conferences and they invented the internet so it can't be all bad. Participants taste wines without knowing what they are.
That's it! StanCon2018 is going to be a pressure cooker of learning and fun. Don't miss it.
EARLY REGISTRATION
Early bird registration ends 10 NOVEMBER 2017.
Go to http://mc-stan.org/events/stancon2018 and register.
StanCon Organizing Committee
The post StanCon2018 Early Registration ends Nov 10 appeared first on Statistical Modeling, Causal Inference, and Social Science.

We had some discussion yesterday about this Gallup poll that asked respondents to guess the percentage of Americans who are gay. The average response was 23%--and this stunningly high number was not just driven by outliers: more than half the respondents estimated the proportion gay as 20% or more.
All this is in stark contrast to direct estimates from surveys that 3 or 4% of Americans are gay.
One thing that came up in comments is that survey respondents with minority sexual orientations might not want to admit it. So maybe that 3-4% is an underestimate. Here's an informative news article by Samantha Allen which suggests that the traditionally-cited 10% number might not be so far off.
But even if the real rate is 10% (including lots of closeted people), that's still much less than the 23% from the survey.
Or, to look at it another way, it's no surprise that respondents grossly overestimate the rate of gays, given that they also grossly overestimate the rate of African Americans, Hispanic Americans, Asian Americans, and immigrants in this and other countries. Or similar overestimates if you ask people what fraction of the Federal budget goes to various well-known but relatively tiny programs.
From this perspective, the problem doesn't seem to have anything to do with gays but rather a more general lack of numeracy, a lack of understanding about small proportions.
Suppose we wanted to ask the survey question in a way to elicit more accurate responses. One way would be to break up the majority category into subgroups. For example, what percentage of American adults are: Straight and married; Straight, unmarried, in a committed relationship; Straight, never married, and single; Straight and divorced; Straight and widowed; Gay. I'm sure these categories could be phrased better; also, you have to figure out what to do with various intermediate categories. The point is that it could make a difference if you set up enough alternative categories.
Similarly for the ethnicities. What if, instead of asking black/Hispanic/Asian/white/other, you ask something like this: black, Hispanic, Asian, American Indian, English, Irish, German, Italian, Polish, Russian, etc.? I don't know how this would go, but I could see it making a difference.
On the other hand, you shouldn't _have_ to do it that way to get a reasonable answer.
Another way would be to take a more direct approach: Ask the survey respondent about his or her 10 or 20 closest family members and friends: How many are gay, etc.? (You'd want them to identify the 10 or 20 people first, before saying why you're asking; otherwise it would be too easy to recall one or more gay people.) Then you could ask about the rest of the population: Do you think you know many more gay people, more gay people, about the same, fewer gay people, or many fewer gay people, compared to the average American?
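The arithmetic behind this direct-experience approach is easy to sketch in a simulation. The numbers below are made up for illustration, and the sketch ignores the real complications (recall error, closeted contacts, and the fact that friendship networks are not random samples of the population):

```python
import numpy as np

rng = np.random.default_rng(1)

p_true = 0.04          # assumed true population proportion
n_respondents = 2000
k_contacts = 20        # each respondent reports on 20 close contacts

# Each respondent's count of gay people among their 20 contacts,
# pretending contacts are a random sample of the population:
counts = rng.binomial(k_contacts, p_true, size=n_respondents)

# Aggregating the reports recovers the small proportion directly,
# with no need for respondents to reason about percentages:
estimate = counts.mean() / k_contacts
print(round(estimate, 3))
```

The point of the exercise is that each respondent only has to count people they know, and the population-level proportion falls out of the aggregation.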
The point here is not to trick respondents into giving more accurate answers but rather to connect population questions of interest to respondents' direct experience.
I wonder what Gerd Gigerenzer thinks about all this.
P.S. David Landy writes:

I saw your post today on overestimations of LGBT individuals, and its supposed relationship to innumeracy. Funnily enough, I’ve been meaning to write you on that specific topic (which you’ve posted on before)! I think your interpretation, which is the common one, is either wrong or simply too vague: My colleagues and I published a paper on this topic this year, and have one more in the works*. The short version is that we argue that these overestimations are a direct result of individual-level psychophysical response curves: they are a part of how people estimate proportions of any kind, and have little or nothing to do with people’s perceptions of particular subgroups.

* Here’s a dropbox link to a PDF version of the individual-level data and analyses, that confirm the pattern of responses within single individuals, for a wider variety of demographic items, including LGBT populations. It’s a lot of slides (but just a 20 minute talk)—but the most important/convincing are the data and models on pages 58-62.

Second *: Also in a rank ordering task we conducted, Mechanical Turkers actually seem to underestimate the LGBT population, specifically. This is important because rank ordering distinguishes between general psychophysical biases and specific population-specific misconceptions. Again, we’re prepping for publication, but here’s the key figure:

The post More thoughts on that "What percent of Americans would you say are gay or lesbian?" survey appeared first on Statistical Modeling, Causal Inference, and Social Science.

"_And then there was Yodeling Elaine, the Queen of the Air. She had a dollar sign medallion about as big as a dinner plate around her neck and a tiny bubble of spittle around her nostril and a little rusty tear, for she had lassoed and lost another tipsy sailor_" -- Tom Waits

It turns out I turned thirty two and became unbearable. Some of you may feel, with an increasing sense of temporal dissonance, that I was already unbearable. (Fair point) Others will wonder how I can look so good at my age. (Answer: Black Metal) None of that matters to me because all I want to do is talk about the evils of marketing like the 90s were a vaguely good idea. (Narrator: "They were not. The concept of authenticity is just another way for the dominant culture to suppress more interesting ones.") The thing is, I worry that the real problem in academic statistics in 2017 is not a reproducibility crisis, so much as that so many of our methods just don't work. And to be honest, I don't really know what to do about that, other than suggest that we tighten our standards and insist that people proposing new methods, models, and algorithms work harder to sketch out the boundaries of their creations. (What a suggestion. Really. Concrete proposals for concrete change. But it's a blog. If ever there was a medium to be half-arsed in it's this one. It's like twitter for people who aren't pithy.)

Berätta För Mig Om Det är Sant Att Din Hud är Doppad I Honung

So what is the object of my impotent ire today? Well, nothing less storied than the Bayesian Lasso. It should be the least controversial thing in this, the year of our lord two thousand and seventeen, to point out that this method bears no practical resemblance to the Lasso. Or, in the words of Law and Order: SVU, "The [Bayesian Lasso] is fictional and does not depict any actual person or event".

Who Do You Think You Are?
The Bayesian Lasso is a good example of what's commonly known as the _Lupita Nyong'o fallacy_, which goes something like this: Lupita Nyong'o had a break out role in Twelve Years a Slave; she also had a heavily disguised role in one of the Star Wars films (the specific Star Wars film is not important. I haven't seen it and I don't care). Hence Twelve Years a Slave exists in the extended Star Wars universe. The key point is that the (classical) Lasso plays a small part within the Bayesian Lasso (it's the MAP estimate) in the same way that Lupita Nyong'o played a small role in that Star Wars film. But just as the presence of Ms Nyong'o does not turn Star Wars into Twelve Years a Slave, the fact that the classical Lasso can be recovered as the MAP estimate of the Bayesian Lasso does not make the Bayesian Lasso useful. And yet people still ask if they can be fit in Stan. In that case, Andrew answered the question that was asked, which is typically the best way to deal with software enquiries. (It's usually a fool's game to try to guess why people are asking particular questions. It probably wouldn't be hard for someone to catalogue the number of times I've not followed my own advice on this, but in life as in statistics, consistency is really only a concern if everything else is going well.) But I am brave and was not asked for my opinion, so I'm going to talk about why the Bayesian Lasso doesn't work.

Hiding All Away

So why would anyone not know that the Bayesian Lasso doesn't work? Well, I don't really know. But I will point out that all of the results that I've seen in this direction (not that I've been looking hard) have been published in prestigious but obtuse places like the Annals of Statistics, the journal we publish in when we either don't want people without a graduate degree in mathematical statistics to understand us or when we want to get tenure.
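The MAP-versus-posterior point can be made concrete in one dimension. The numbers below are my own toy illustration, not anything from the post: with a single Gaussian observation and a Laplace prior, the MAP estimate is the classical lasso solution (soft thresholding) and is exactly zero, while the posterior mean, computed by brute-force numerical integration, is not sparse at all:

```python
import numpy as np

# One observation y ~ N(theta, 1) with a Laplace(lambda = 1) prior on theta.
y, lam = 0.5, 1.0

theta = np.linspace(-10, 10, 200001)
dt = theta[1] - theta[0]
log_post = -0.5 * (y - theta) ** 2 - lam * np.abs(theta)
post = np.exp(log_post - log_post.max())
post /= post.sum() * dt  # normalize on the grid

# MAP estimate = soft thresholding = the classical lasso solution:
theta_map = float(np.sign(y) * max(abs(y) - lam, 0.0))

# The posterior mean is what much of the posterior mass actually supports:
theta_mean = float((theta * post).sum() * dt)
print(theta_map, round(theta_mean, 3))  # MAP is exactly 0.0; mean is about 0.24
```

The posterior as a whole simply does not behave like its mode, which is the Lupita Nyong'o fallacy in miniature.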
By contrast, the original paper is very readable and published in JASA, where we put papers when we are ok with people who do not have a graduate degree in mathematical statistics being able to read them, or when we want to get tenure. To be fair to Park and Casella, they never really say that the Bayesian Lasso should be used for sparsity. Except for one sentence in the introduction where they say the median gives approximately sparse estimators, and the title, which links it to the most prominent and popular method for estimating a sparse signal. Marketing eh. (See, I'm Canadian now.)

The Devil Has Designed My Death And Is Waiting To Be Sure

So what is the Bayesian Lasso? (And why did I spend 600 words harping on about something before defining it? The answer will shock you. Actually the answer will not shock you: it's because it's kinda hard to do equations on this thing.) For data observed with Gaussian error, the Bayesian Lasso takes the form y ~ N(Xβ, σ²I), where, instead of putting a normal prior on β, we put independent Laplace priors on its components: p(β_j) = (λ/2) exp(−λ|β_j|). Here the tuning parameter is λ = τλ₀(n, p, s), where p is the number of covariates, n is the number of observations, s is the number of "true" non-zero elements of β, σ is known, and τ is an unknown scaling parameter that should be O(1).

_Important side note:_ This isn't the exact same model as Park and Casella used, as they didn't use the transformation λ = τλ₀ but rather just dealt with λ as the parameter. From a Bayesian viewpoint, it's a much better idea to put a prior on τ than on λ directly. Why? Because a prior on λ needs to depend on n, p, σ, and X, and hence needs to be changed for each problem, while a prior on τ can be used for many problems. One possible option is λ₀ = √(n log p), which is a rate-optimal parameter for the (non-Bayesian) Lasso. Later, we'll do a back-of-the-envelope calculation that suggests we probably don't need the square root around the logarithmic term.

EDIT: I've had some questions about this scaling, so I'm going to try to explain it a little better.
The idea here is that the Bayesian Lasso uses the i.i.d. Laplace priors with scaling parameter λ on β to express the _substantive belief_ that the signal is approximately sparse. The reason for scaling the prior is that not every value of λ is consistent with this belief. For example, a λ that stays fixed as n and p grow will not give an approximately sparse signal. While we could just use a prior for λ that has a very heavy right tail (something like an inverse gamma), this is at odds with the good-practice principle of making sure all of the parameters in your models are properly scaled to make them order 1. Why do we do this? Because it makes it much, much easier to set sensible priors.

_Other important side note:_ Some of you may have noticed that the scaling can depend on the unknown sparsity s. This seems like cheating. People who do asymptotic theory call this sort of value for λ an _oracle_ value, mainly because people studying Bayesian asymptotics are really _really_ into databasing software. The idea is that this is the value of λ that gives the model the best chance of working. When maths-ing, you work out the properties of the posterior with the oracle value of λ and then you use some sort of smoothness argument to show that the _actual_ method that is being used to select (or average over) the parameter gives almost the same answer.

Only Once In Sheboygan. Only Once.

So what's wrong with the Bayesian Lasso? Well, the short version is that the Laplace prior doesn't have enough mass near zero relative to the mass in the tails to allow for a posterior that has a lot of entries that are almost zero and some entries that are emphatically not zero. Because the Bayesian Lasso prior does not have a spike at zero, none of the entries will be _a priori_ exactly zero, so we need some sort of rule to separate the "zero" entries from the "non-zero" entries. The way that we're going to do this is to choose a cutoff u, where we assume that if |β_j| < u, then β_j ≈ 0.
So how do we know that the Lasso prior doesn't put enough mass in important parts of the parameter space? Well, there are two ways. I learnt it during the exciting process of writing a paper that the reviewers insisted should have an extended section about sparsity (although this was at best tangential to the rest of the paper), so I suddenly needed to know about Bayesian models of sparsity. So I read those Annals of Stats papers. (That's why I know I should be scaling λ!) What are the key references? Well, all the knowledge that you seek is here and here.

But a much easier way to work out that the Bayesian Lasso is bad is to do some simple maths. Because the β_j are _a priori_ independent, we get a prior on the effective sparsity s_eff(u) = #{j : |β_j| > u}. For the Bayesian Lasso, the relevant probability can be computed as Pr(|β_j| > u) = exp(−λu). Ideally, the distribution of this effective sparsity would be centred on the true sparsity. That is, we'd like to choose λ so that the prior expectation p·exp(−λu) ≈ s. A quick re-arrangement suggests that λ ≈ u⁻¹(log p − log s). Now, we are interested in signals with s ≪ p, i.e. where only a very small number of the β_j are non-zero. This suggests we can safely ignore the second term. Some other theory suggests that we need to take the cutoff u to depend on n in such a way that it goes to zero as n gets large. (EDIT: Oops, the paper linked above used a specific choice of cutoff; for the general case, their reasoning leads to u of order n^(−1/2).) This means that we need to take λ ≈ √n log p in order to ensure that we have our prior centred on sparse vectors (in the sense that the prior mean for the number of non-zero components is always much less than p).

Show Some Emotion

So for the Bayesian Lasso, a sensible parameter is λ = √n log p, which will usually have a large number of components less than the threshold and a small number that are larger. But this is still not any good. To see this, let's consider the prior probability of seeing a β_j larger than one: Pr(|β_j| > 1) = exp(−λ) = exp(−√n log p) = p^(−√n). This is the problem with the Bayesian Lasso: _in order to have a lot of zeros in the signal, you are also forcing the non-zero elements to be very small_.
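Plugging in numbers makes the trade-off concrete. This back-of-the-envelope check (problem sizes are made up for illustration) tunes the Laplace prior so that it expects the right effective sparsity, then asks how likely a signal of size one is under that same prior:

```python
import numpy as np

# Hypothetical sizes: n observations, p covariates, s true non-zeros.
n, p, s = 100, 1000, 5
u = 1 / np.sqrt(n)   # threshold below which a coefficient counts as "zero"

# Choose lambda so the prior expects s of the p coefficients to exceed u:
lam = np.log(p / s) / u

expected_nonzero = p * np.exp(-lam * u)  # prior mean of the effective sparsity
big_signal_prob = np.exp(-lam * 1.0)     # Pr(|beta_j| > 1) under the same prior

print(expected_nonzero)  # 5: the prior is centred on the right sparsity...
print(big_signal_prob)   # ...but a signal of size 1 is essentially impossible
```

With these (modest) sizes the tail probability is already around 1e-23, which is the "can't support small and large signals simultaneously" problem in one number.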
A plot of this function is above, and it's clear that even for very small values of n this probability is _infinitesimally small_. Basically, the Bayesian Lasso can't give enough mass to both small and large signals simultaneously. Other Bayesian models (such as the horseshoe and the Finnish horseshoe) can support both simultaneously, and this type of calculation can show it (although it's harder; see Theorem 6 here).

_Side note:_ The scaling that I derived in the previous section is a little different to the standard Lasso scaling of √(n log p), but the same result holds: for large n the probability of seeing a large signal is vanishingly small.

Maybe I Was Mean, But I Really Don't Think So

Now, obviously this is not what we see with real data when we fit the Bayesian Lasso in Stan with an unknown λ. What happens is the model tries to strike a balance between the big signals and the small signals, shrinking the former too much and letting the latter be too far from zero. You will see this in the event that you try to fit the Bayesian Lasso in Stan.

UPDATE: I've put some extra notes in the above text to make it hopefully a little clearer in one case and a little more correct in the other. Please accept this version of the title track as way of an apology. (Also this stunning version, done wearing even more makeup.)

The post The king must die appeared first on Statistical Modeling, Causal Inference, and Social Science.

[cat picture]
For the chapter in Regression and Other Stories that includes nonlinear regression, I'd like a couple homework problems where the kids have to construct and fit models to real data. So I need some examples. We already have the success of golf putts as a function of distance from the hole, and I'd like some others. One thing that came to mind today, because I happened to see a safety warning poster on the bus reminding people not to drive too fast, is data on speed and traffic accidents.
But I'm interested in other examples too. Just about anything interesting with data on x and y where there's no simple linear relation, and where log and logit transformations don't do the trick either. The outcome can be discrete or continuous, either way.
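To give a sense of the kind of homework exercise intended, here is a minimal sketch on entirely synthetic data (not one of the requested real examples) of fitting a saturating curve, the sort of relationship that neither a log nor a logit transformation will linearize:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(2)

# A hypothetical saturating relationship: y rises with x and levels off at a.
def f(x, a, b):
    return a * (1 - np.exp(-b * x))

x = np.linspace(0.5, 10, 80)
y = f(x, 2.0, 0.7) + rng.normal(0, 0.05, x.size)

# Nonlinear least squares recovers the curve's parameters:
(a_hat, b_hat), _ = curve_fit(f, x, y, p0=[1.0, 1.0])
print(round(a_hat, 2), round(b_hat, 2))
```

With real data (speed vs. accidents, golf putts vs. distance) the interesting part of the exercise is choosing the functional form, not just fitting it.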
There's gotta be lots and lots of good examples but for some reason I'm drawing a blank. So I could use your help. I need not just the examples but also the data.
Thanks! Anyone who comes up with a good suggestion will be rewarded with a free Stan sticker.
The post Looking for data on speed and traffic accidents--and other examples of data that can be fit by nonlinear models appeared first on Statistical Modeling, Causal Inference, and Social Science.

Blake McShane sent along this paper by himself and David Gal, which begins:

In light of recent concerns about reproducibility and replicability, the ASA issued a Statement on Statistical Significance and p-values aimed at those who are not primarily statisticians. While the ASA Statement notes that statistical significance and p-values are “commonly misused and misinterpreted,” it does not discuss and document broader implications of these errors for the interpretation of evidence. In this article, we review research on how applied researchers who are not primarily statisticians misuse and misinterpret p-values in practice and how this can lead to errors in the interpretation of evidence. We also present new data showing, perhaps surprisingly, that researchers who are primarily statisticians are also prone to misuse and misinterpret p-values thus resulting in similar errors. In particular, we show that statisticians tend to interpret evidence dichotomously based on whether or not a p-value crosses the conventional 0.05 threshold for statistical significance. We discuss implications and offer recommendations.

The article is published in the Journal of the American Statistical Association along with discussions:

* A p-Value to Die For, by Don Berry
* The Substitute for p-Values, by William Briggs
* Some Natural Solutions to the p-Value Communication Problem--and Why They Won’t Work, by Andrew Gelman and John Carlin
* Statistical Significance and the Dichotomization of Evidence: The Relevance of the ASA Statement on Statistical Significance and p-Values for Statisticians, by Eric Laber and Kerby Shedden

and a rejoinder by McShane and Gal. Good stuff. Read the whole thing.
For some earlier blog discussions of the McShane and Gal paper and related work, see here (More evidence that even top researchers routinely misinterpret p-values), here (Some natural solutions to the p-value communication problem--and why they won’t work), here (When considering proposals for redefining or abandoning statistical significance, remember that their effects on science will only be indirect!), and here (Abandon statistical significance). The post Statistical Significance and the Dichotomization of Evidence (McShane and Gal's paper, with discussions by Berry, Briggs, Gelman and Carlin, and Laber and Shedden) appeared first on Statistical Modeling, Causal Inference, and Social Science.

In my [Keith] previous post, which criticised a published paper, the first author commented that they wanted some time to respond, and I agreed. I also suggested that if the response came in after most readers had moved on, I would re-post their response as a new post pointing back to the previous one. So here we are.
Now there has been a lot of discussion on this blog about public versus private criticism and their cost-benefit trade-offs. One change I am making is to refer to the first, second or third author rather than names. Here I should also clarify that I have previously worked with the first and second authors (so they are not strangers) and that the first author posted the paper in a comment on my blog post (that is how it came to my attention). Now, my three main points in that original post were: 1. failing to distinguish what something is versus what to make of it, 2. ignoring the ensemble of similar studies (completed, ongoing and future) and 3. neglecting important non-random errors. So when the first author brought the paper to my attention, I thought it was going to be an example of not neglecting those three things. But when I read the paper, I felt it pretty much did neglect 1 and 3.
One of the main points in the response by all three authors was clarification of the goal of the paper (all other responses were just by the first author). They claimed the goal was simply to clarify that the fixed effects estimate is an estimate of _some_ population's average, though not necessarily one that would be of interest (depending on the context). Quoting the response: "Given the didactic goal of the paper, the issue is not so much whether such a population is of interest, but just the realization that the analysis is informing us about such a population." I fully agree with that (with a caveat below). In my experience, getting an adequate sense of that population and generalising from it to a population of interest is a very big stretch. The first author responded that his experience is different, and especially given the epidemiologists he works with, it is often doable. Fair enough -- experiences and research team expertise differ. Now my reading of the paper was that it suggested more than just that clarification and gave the impression that fixed effects should be used much more, and that they are often more scientifically relevant. But those are just my interpretations, and those can vary, as do our apparent views on what is meant by scientifically relevant, if not also a population. So I was expecting the paper not to fail to distinguish what something is versus what to make of it, but apparently the authors never intended to.
One of my points the first author chose not to respond to was my review of Rubin's conceptual ideas. Again, fair enough. However, that is where I believe there is a serious technical disagreement. This became clearer in the first author's response, e.g. "[Keith referring to] varying study quality… [first author] This is a misconception. The paper goes to considerable lengths to allow for underlying effects to differ". That is the caveat I referred to above: the fixed effects estimate is an estimate of _some_ population's average only if the between-study variation is not importantly driven by design (AKA study quality or methodological) variation. This kind of variation is usually/mostly the result of haphazard biases, and it has different implications for what is to be made of the variation and the expectation. Briefly, the variation needs to be included in the uncertainty quantification, and the expectation is no longer of direct interest (more below). Science-driven (AKA clinical or biological) variation, on the other hand, can be excluded from the uncertainty quantification, and the expectation is of direct interest as the average of some population. These are very different situations.
Fisher was one of the first to bring attention to this issue (that is why I gave the reference), the Rubin references discuss it, and it is even in the Cochrane Handbook (section 9.5.1), which was edited by the second author (I believe I wrote the first draft of that section and the second author revised it). There was also a full discussion of this issue at an RSS meeting in 2002. A simple example may make the issue clear. If a fixed object is measured with three measuring instruments with differing (haphazard) biases, there will be variation in the measurements that is not from the object, and the average won't be estimating the object but rather the object plus the average bias (whatever that is). With three fixed objects each measured with the same unbiased measuring instrument, there will be variation in the measurements that is "real," but the average of the three objects is fixed. Here, the average measurement will be estimating the average of the three objects, and the measurement variation (above the instrument's variance) need not be included in the uncertainty of the estimated average. And this would be true for any population involving various proportions of the three objects - a fixed population average that a properly weighted average would estimate. Now, one could argue that the expectation with varying bias is of indirect interest, being the population's true average plus some weighted combination of the biases. But then one should clearly warn of the presence of such biases, even in the absence of a way to address them. So here the paper is neglecting important non-random errors. The authors, if they agree, may wish to consider adding a note to their paper about this issue.
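A small simulation (with made-up numbers) may help make the two cases in the measurement example concrete:

```python
import random
import statistics

random.seed(1)

# Case 1: one fixed object (true value 10) measured by three instruments
# with differing haphazard biases. The spread across instruments is not
# "real" variation in the object, and the average estimates the object
# plus the average bias, not the object itself.
true_value = 10.0
biases = [-2.0, 0.5, 3.0]  # hypothetical instrument biases
case1 = [true_value + b + random.gauss(0, 0.1) for b in biases]
print(statistics.mean(case1))  # near true_value + mean(biases), not true_value

# Case 2: three fixed objects measured by one unbiased instrument.
# The spread is "real," and the average measurement estimates the
# average of the three objects, for any mix of them in a population.
objects = [8.0, 10.0, 12.0]
case2 = [obj + random.gauss(0, 0.1) for obj in objects]
print(statistics.mean(case2))  # near mean(objects)
```

The two averages look similar numerically, but they estimate very different things, which is exactly why the bias-driven variation must enter the uncertainty quantification while the object-driven variation need not.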
One of my other points the first author chose not to respond to was working through the likelihood mechanics involved. Again, fair enough. But they referred to a paper by Danyu Lin and a co-author, claimed it was pivotal, and suggested that the issue may not be as straightforward as I thought. When I read the paper I saw exactly the likelihood mechanics being worked out as I expected, but there was something I had not seen before. The paper worked through the properties of a mis-specified model, which is what a fixed effects model implemented with the full raw data actually is. You consider the effects to vary by study, but in the model implemented with all the raw data you purposely (mis)specify the parameter as being exactly the same (give it the exact same symbol) for all studies. Now, when I first started doing meta-analysis in the 1980s, the outcomes were usually binary. I had just taken a generalised linear models course, so I implemented meta-analysis using logistic regression with a common odds ratio parameter and differing control rate parameters by study. To formally test for heterogeneity (which we explicitly argued should not be depended upon), the common odds ratio parameter would be replaced with differing odds ratio parameters by study. (See Model 4: Partial Pooling (Log Odds) in Bob's extensive tutorial for a full Bayesian approach to this.) But I knew that with binary outcomes, whether I coded the data by individual patient (0 or 1) or as numbers of failures and successes, the answers would be exactly the same (given the same parameter specifications and the magic of sufficiency), so this summary data versus raw data meta-analysis question (given you had both) seemed completely moot to me.
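The "magic of sufficiency" here is easy to check numerically: the per-patient (Bernoulli) log-likelihood and the summary-count (binomial) log-likelihood differ only by a term that does not involve the parameter, so every likelihood-based answer (MLE, likelihood ratios) is identical under either coding. A minimal sketch with made-up counts for a single arm:

```python
import math

# Hypothetical single arm: 7 events among 10 patients.
raw = [1] * 7 + [0] * 3        # per-patient coding
k, n = sum(raw), len(raw)      # summary-count coding

def bernoulli_loglik(p):
    """Log-likelihood from the individual 0/1 outcomes."""
    return sum(math.log(p if y else 1 - p) for y in raw)

def binomial_loglik(p):
    """Log-likelihood from the summary counts k out of n."""
    return math.log(math.comb(n, k)) + k * math.log(p) + (n - k) * math.log(1 - p)

# The difference is log C(n, k) at every value of p, a constant,
# so both codings yield the same inferences.
diffs = [binomial_loglik(p) - bernoulli_loglik(p) for p in (0.2, 0.5, 0.7)]
```

Since the difference is constant in p, maximizing either function gives the same estimate, which is the sense in which the summary-versus-raw-data question is moot for binary outcomes.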
So I learned something. The fixed effects model implemented with the raw data actually is mis-specified - that is, wrong - but it estimates the correct average for some population (this depends on sufficiency and other technicalities, but it often does). That must have been puzzling to many, and it needed to be sorted out. No doubt there are also asymptotic issues that needed to be sorted out, but Charlie Geyer has convinced me that such considerations are not a good use of my time.
The post What I missed on fixed effects (plural). appeared first on Statistical Modeling, Causal Inference, and Social Science.

This sort of thing is not new but it's still amusing. From a Gallup report by Frank Newport:

The American public estimates on average that 23% of Americans are gay or lesbian, little changed from Americans' 25% estimate in 2011, and only slightly higher than separate 2002 estimates of the gay and lesbian population. These estimates are many times higher than the 3.8% of the adult population who identified themselves as lesbian, gay, bisexual or transgender in Gallup Daily tracking in the first four months of this year.

Newport provides some context:

Part of the explanation for the inaccurate estimates of the gay and lesbian population rests with Americans' general unfamiliarity with numbers and demography. Previous research has shown that Americans estimate that a third of the U.S. population is black, and believe almost three in 10 are Hispanic, more than twice what the actual percentages were as measured by the census at the time of the research. Americans with the highest levels of education make the lowest estimates of the gay and lesbian population, underscoring the assumption that part of the reason for the overestimate is a lack of exposure to demographic data.

But there's a lot of innumeracy even among educated Americans:

Still, the average estimate among those with postgraduate education is 15%, four times the actual rate.

Newport summarizes:

The estimates of gay and lesbian percentages have been relatively stable compared with those measured in 2011 and 2002, even though attitudes about gays and lesbians have changed dramatically over that time.

Math class is tough. The post "Americans Greatly Overestimate Percent Gay, Lesbian in U.S." appeared first on Statistical Modeling, Causal Inference, and Social Science.

From one of our exams:

A researcher at Columbia University’s School of Social Work wanted to estimate the prevalence of drug abuse problems among American Indians (Native Americans) living in New York City. From the Census, it was estimated that about 30,000 Indians live in the city, and the researcher had a budget to interview 400. She did not have a list of Indians in the city, and she obtained her sample as follows. She started with a list of 300 members of a local American Indian community organization, and took a random sample of 100 from this list. She interviewed these 100 persons and asked each of these to give her the names of other Indians in the city whom they knew. She asked each respondent to characterize him/herself and also the people on the list on a 1-10 scale, where 10 is "strongly Indian-identified," 5 is "moderately Indian-identified," and 0 is "not at all Indian identified." Most of the original 100 people sampled characterized themselves near 10 on the scale, which makes sense because they all belong to an Indian community organization. The researcher then took a random sample of 100 people from the combined lists of all the people referred to by the first group, and repeated this process. She repeated the process twice more to obtain 400 people in her sample. Describe how you would use the data from these 400 people to estimate (and get a standard error for your estimate of) the prevalence of drug abuse problems among American Indians living in New York City. You must account for the bias and dependence of the nonrandom sampling method.

There are different ways to attack this problem, but my preferred solution is to use Mister P:
1. Fit a regression model to estimate p(y|X)--in this case, y represents some measure of drug abuse problems at the individual level, and X includes demographic predictors and also a measure of Indian identification (necessary because the survey design oversamples people who are strongly Indian identified) and a measure of gregariousness (necessary because the referral design oversamples people with more friends and acquaintances);
2. Estimate the distribution of X in the population (in this case, all American Indian adults living in New York City); and
3. Take the estimates from step 1, and average these over the distribution in step 2, to estimate the distribution of y over the entire population or any subpopulations of interest.
The hard part here is step 2, as I'm not aware of many published examples of such things. You have to build a model, and in that model you must account for the sampling bias. It can be done, though; indeed I'd like to do some examples of this to make these ideas more accessible to survey practitioners. There's some literature on this survey design--it's called "respondent driven sampling"--but I don't think the recommended analysis strategies are very good. MRP should be better, but, again, I should be able to say this with more confidence and authority once I've actually done such an analysis for this sort of survey. Right now, I'm just a big talker. The post Using Mister P to get population estimates from respondent driven sampling appeared first on Statistical Modeling, Causal Inference, and Social Science.
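The averaging in step 3 of the Mister P recipe above is, in the end, a population-weighted mean of cell-level estimates. A schematic sketch with entirely made-up cells and numbers (the real versions of steps 1 and 2 are model-based):

```python
# Hypothetical poststratification cells: (label, estimated Pr(drug abuse
# problem | cell) from the step-1 regression, estimated population count
# from step 2). The referral design oversamples strongly Indian-identified
# and gregarious people, so those variables define cells here.
cells = [
    ("high identification, many ties", 0.10, 4000),
    ("high identification, few ties",  0.08, 6000),
    ("low identification, many ties",  0.06, 8000),
    ("low identification, few ties",   0.05, 12000),
]

total = sum(count for _, _, count in cells)
mrp_estimate = sum(p * count for _, p, count in cells) / total

# A raw sample mean would instead weight cells by how often the referral
# chain happens to reach them, which is the source of the sampling bias.
print(mrp_estimate)
```

The point of the sketch is only the weighting logic: the sample composition drops out, and the population cell counts from step 2 do the work.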

Kevin Lewis points to a research article by Lawton Swan, John Chambers, Martin Heesacker, and Sondre Nero, "How should we measure Americans’ perceptions of socio-economic mobility," which reports effects of question wording in surveys on an important topic in economics. They replicated two studies:

Each (independent) research team had prompted similar groups of respondents to estimate the percentage of Americans born into the bottom of the income distribution who improved their socio-economic standing by adulthood, yet the two teams reached ostensibly irreconcilable conclusions: that Americans tend to underestimate (Chambers et al.) and overestimate (Davidai & Gilovich) the true rate of upward social mobility in the U.S.

There are a few challenges here, and I think the biggest is that the questions being asked of survey respondents are so abstract. We're talking about people who might not be able to name their own representative in Congress--hey, actually I might not be able to do that particular task myself!--and who are misinformed about basic demographics, and then asking them tricky questions about the distribution of income of children of parents in different income groups. Consider the survey question pictured above, from Chambers, Swan, and Heesacker (2015). First off, the diagram could be misleading in that the ladder kinda makes it look like they're talking about people in the middle of the "Bottom 3rd" category, but they're asking about the average for this group. Also, they ask, "what percentage of them do you think stayed in the bottom third of the income distribution (i.e., lower class), like their parents," but doesn't that presuppose that the parents stayed in this category? Lots of grad students are in the bottom third of U.S. income, and some of them have kids, but I assume that most of these parents, as well as their kids, will end up in the middle or even upper third once they graduate and get jobs. It's also not clear, when they ask about "the time those children have grown up to be young adults, in their mid-20’s," whether they are talking about income terciles compared to the entire U.S., or just compared to people in their 20s. Also, is it really true that the upper third of income is "upper class"? I thought that in the U.S. context you had to do better than the top third to be upper class. Upper class people make more than $100,000 a year, right? And that's something like the 90th percentile. I'm not trying to be picky here, and I'm not trying to diss Swan et al., who are raising some important questions about survey data that regularly get reported uncritically in the news media (see, for example, here and here). I just think this whole way of getting at people's understanding may be close to hopeless, as the questions are so ill-defined and the truth so hard to know. Attitudes on inequality, social mobility, redistribution, taxation, etc., are important, and maybe there's a more direct way to study people's thoughts in this area. The post Whipsaw appeared first on Statistical Modeling, Causal Inference, and Social Science.

_Our love is like the border between Greece and Albania – The Mountain Goats_
(In which I am uncharacteristically brief)
Andrew's answer to a recent post reminded me of one of my favourite questions: how do you visualise uncertainty in spatial maps? An interesting subspecies of this question is exactly how you can plot a contour map for a spatial estimate. The obvious idea is to take a point estimate (like your mean or median spatial field) and draw a contour map on that.
But this is problematic because it does not take into account the uncertainty in your estimate. A contour on a map indicates a line that separates two levels of a field, but if you do not know the value of the field exactly, you cannot separate it precisely. Bolin and Lindgren have constructed a neat method for dealing with this problem by having an intermediate area where you don't know which side of the chosen level you are on. This replaces thin contour lines with thick contour bands that better reflect our uncertainty.
Interestingly, using these contour bands requires us to reflect on just how certain our estimates are when selecting the number of contours we wish to plot (or else there would be nothing left in the space other than bands).
There is a broader principle reflected in Bolin and Lindgren's work: _when you are visualising multiple aspects of an uncertain quantity, you need to allow for an indeterminate region_. This is the same idea that is reflected in Andrew's "thirds" rule.
David and Finn also wrote a very nice R package that implements their method for computing contours (as well as for computing joint "excursion regions", i.e. areas in space where the random field simultaneously exceeds a fixed level with a given probability).
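The idea behind contour bands can be sketched without the package: given posterior draws of the field at each location, call a location "above" or "below" a contour level only when the posterior probability is decisive, and otherwise leave it in an uncertain band. (This is only a pointwise schematic; Bolin and Lindgren's method handles the joint, spatially simultaneous probabilities, which is the hard part.)

```python
import random

random.seed(2)

def classify(draws, level, prob=0.95):
    """Pointwise classification of one location against a contour level."""
    p_above = sum(d > level for d in draws) / len(draws)
    if p_above >= prob:
        return "above"
    if p_above <= 1 - prob:
        return "below"
    return "band"  # too uncertain to place on either side of the contour

# Hypothetical posterior draws at three locations of a spatial field,
# classified against the level 0.
clearly_above = [random.gauss(2.0, 0.5) for _ in range(1000)]
clearly_below = [random.gauss(-2.0, 0.5) for _ in range(1000)]
near_level    = [random.gauss(0.1, 0.5) for _ in range(1000)]

labels = [classify(d, 0.0) for d in (clearly_above, clearly_below, near_level)]
```

Locations near the level end up in the band, which is why asking for too many contours leaves nothing but bands: each extra level claims its own indeterminate region.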
The post Contour as a verb appeared first on Statistical Modeling, Causal Inference, and Social Science.

I've been thinking for a while that the default ways in which statisticians think about science--and in which scientists think about statistics--are seriously flawed, sometimes even crippling scientific inquiry in some subfields, in the way that bad philosophy can do.
Here's what I think are some of the default modes of thought:
- _Hypothesis testing_, in which the purpose of data collection and analysis is to rule out a null hypothesis (typically, zero effect and zero systematic error) that nobody believes in the first place;
- _Inference_, which can work in the context of some well-defined problems (for example, studying trends in public opinion or estimating parameters within an agreed-upon model in pharmacology), but which doesn't capture the idea of learning from the unexpected;
- _Discovery_, which sounds great but which runs aground when thinking about science as a routine process: can every subfield of science really be having thousands of "discoveries" a year? Even to ask this question seems to cheapen the idea of discovery.
A more appropriate framework, I think, is _quality control_, an old idea in statistics (dating at least to the 1920s; maybe Steve Stigler can trace the idea back further), but a framework that, for whatever reason, doesn't appear much in academic statistical writing or in textbooks outside of the subfield of industrial statistics or quality engineering. (For example, I don't know that quality control has come up even once in my own articles and books on statistical methods and applications.)
Why does quality control have such a small place at the statistical table? That's a topic for another day. Right now I want to draw the connections between quality control and scientific inquiry.
Consider some thread or sub-subfield of science, for example the incumbency advantage (to take a political science example) or embodied cognition (to take a much-discussed example from psychology). Different research groups will publish papers in an area, and each paper is presented as some mix of hypothesis testing, inference, and discovery, with the mix among the three having to do with some combination of researchers' tastes, journal publication policies, and conventions within the field.
The "replication crisis" (which has been severe with embodied cognition, not so much with incumbency advantage, in part because to replicate an election study you have to wait a few years until sufficient new data have accumulated) can be summarized as:
- Hypotheses that seemed soundly rejected in published papers cannot be rejected in new, preregistered and purportedly high-power studies;
- Inferences from different published papers appear to be inconsistent with each other, casting doubt on the entire enterprise;
- Seeming discoveries do not appear in new data, and different published discoveries can even contradict each other.
In a "quality control" framework, we'd think of different studies in a sub-subfield as having many sources of variation. One of the key principles of quality control is to avoid getting faked out by variation--to avoid naive rules such as reward the winner and discard the loser--and instead to analyze and then work to reduce uncontrollable variation.
Applying the ideas of quality control to threads of scientific research, the goal would be to get better measurement, and stronger links between measurement and theory--rather than to give prominence to surprising results and to chase noise. From a quality control perspective, our current system of scientific publication and publicity is perverse: it yields misleading claims, is inefficient, and it rewards sloppy work.
The "rewards sloppy work" thing is clear from a simple decision analysis. Suppose you do a study of some effect theta, and your study's estimate will be centered around theta but with some variance. A good study will have low variance, of course. A bad study will have high variance. But what are the rewards? What gets published is not theta but the estimate. The higher the estimate (or, more generally, the more dramatic the finding), the higher the reward! Of course if you have a noisy study with high variance, your theta estimate can also be low or even negative--but you don't need to publish these results, instead you can look in your data for something else. The result is an incentive to have noise.
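The incentive can be seen in a toy simulation (all numbers hypothetical): two labs study the same small effect, one with a precise design and one with a noisy one, and each writes up a result only when its estimate clears a dramatic-looking threshold. The noisy lab publishes more often, and what it publishes is far more inflated:

```python
import random
import statistics

random.seed(3)

theta = 0.1       # true effect
threshold = 0.5   # an estimate must look this dramatic to be written up
n_studies = 20000

def published_estimates(se):
    """Estimates centered on theta with standard error se, filtered by size."""
    estimates = [random.gauss(theta, se) for _ in range(n_studies)]
    return [est for est in estimates if est > threshold]

careful = published_estimates(se=0.15)  # low-variance design
sloppy = published_estimates(se=1.0)    # high-variance design
```

The careful lab rarely clears the bar, while the sloppy lab clears it frequently and with estimates several times the true effect: selection on dramatic findings turns noise into a reward.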
The above decision analysis is unrealistically crude--for one thing, your measurements can't be obviously bad or your paper probably won't get published, and you are required to present some token such as a p-value to demonstrate that your findings are stable. Unfortunately those tokens can be too cheap to be informative, so a lot of effort has to be taken to make research projects _look_ scientific.
But all this is operating under the paradigms of hypothesis testing, inference, and discovery, which I've argued is not a good model for the scientific process.
Move now to quality control, where each paper is part of a process, and the existence of too much variation is a sign of trouble. In a quality-control framework, we're _not_ looking for so-called failed or successful replications; we're looking at a sequence of published results--or, better still, a sequence of data--in context.
I was discussing some of this with Ron Kenett and he sent me two papers on quality control:
Joseph M. Juran, a Perspective on Past Contributions and Future Impact, by A. Blanton Godfrey and Ron Kenett
The Quality Trilogy: A Universal Approach to Managing for Quality, by Joseph Juran
I've not read these papers in detail but I suspect that a better understanding of these ideas could help us in all sorts of areas of statistics.
The post "Quality control" (rather than "hypothesis testing" or "inference" or "discovery") as a better metaphor for the statistical processes of science appeared first on Statistical Modeling, Causal Inference, and Social Science.

I spoke today at a meeting of science journalists, in a session organized by Betsy Mason, also featuring Kristin Sainani, Christie Aschwanden, and Tom Siegfried.
My talk was on statistical paradoxes of science and science journalism, and I mentioned the Ted Talk paradox, Who watches the watchmen, the Eureka bias, the "What does not kill my statistical significance makes it stronger" fallacy, the unbiasedness fallacy, selection bias in what gets reported, the Australia hypothesis, and how we can do better.
Sainani gave some examples illustrating that journalists with no particular statistical or subject-matter expertise should be able to see through some of the claims made in published papers, where scientists misinterpret their own data or go far beyond what was implied by their data. Aschwanden and Siegfried talked about the confusions surrounding p-values and recommended that reporters pretty much forget about those magic numbers and instead focus on the substantive claims being made in any study.
After the session there was time for a few questions, and one person stood up and said he worked for a university, he wanted to avoid writing up too many stories that were wrong, but he was too busy to do statistical investigations on his own. What should he do?
Mason replied that he should contact the authors of the studies and push them to explain their results without jargon, answering questions as necessary to make the studies clear. She said that if an author refuses to answer such questions, or seems to be deflecting rather than addressing criticism, that this itself is a bad sign.
I expressed agreement with Mason and said that, in my experience, university researchers are willing and eager to talk with reporters and public relations specialists, and we'll explain our research at interminable length to anyone who will listen.
So I recommended to the reporter that, when he sees a report of an interesting study, that he contact the authors and push them with hard questions: not just "Can you elaborate on the importance of this result?" but also "How might this result be criticized?", "What's the shakiest thing you're claiming?", "Who are the people who won't be convinced by this paper?", etc. Ask these questions in a polite way, not in any attempt to shoot the study down--your job, after all, is to promote this sort of work--but rather in the spirit of fuller understanding of the study.
The post Advice for science writers! appeared first on Statistical Modeling, Causal Inference, and Social Science.

From my 2009 paper with Weakliem:

Throughout, we use the term statistically significant in the conventional way, to mean that an estimate is at least two standard errors away from some “null hypothesis” or prespecified value that would indicate no effect present. An estimate is statistically insignificant if the observed value could reasonably be explained by simple chance variation, much in the way that a sequence of 20 coin tosses might happen to come up 8 heads and 12 tails; we would say that this result is not statistically significantly different from chance. More precisely, the observed proportion of heads is 40 percent but with a standard error of 11 percent--thus, the data are less than two standard errors away from the null hypothesis of 50 percent, and the outcome could clearly have occurred by chance. Standard error is a measure of the variation in an estimate and gets smaller as a sample size gets larger, converging on zero as the sample increases in size.

I like that. I like that we get right into statistical significance, we don't waste any time with p-values, we give a clean coin-flipping example, and we directly tie it into standard error and sample size.

P.S. Some questions were raised in discussion, so just to clarify: I'm not saying the above (which was published in a magazine, not a technical journal) is a comprehensive or precise definition; I just think it gets the point across in a reasonable way for general audiences. The post My favorite definition of statistical significance appeared first on Statistical Modeling, Causal Inference, and Social Science.
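A quick check of the coin-example arithmetic, using the standard error of a proportion, sqrt(p(1-p)/n):

```python
import math

# Coin example from the excerpt: 8 heads in 20 tosses.
heads, tosses = 8, 20
p_hat = heads / tosses                         # observed proportion, 0.40
se = math.sqrt(p_hat * (1 - p_hat) / tosses)   # standard error, about 0.11
z = (p_hat - 0.5) / se                         # distance from the null in SEs

# |z| is well under 2 (in fact under 1): not statistically significantly
# different from chance, exactly as the passage says.
```

The same formula shows the sample-size point: with n in the denominator, the standard error shrinks toward zero as the sample grows.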

Kevin Brown writes:

I came across the lexicon link to your ‘super plots’ posting today. In it, you plot the association between individual income (X) and Republican voting (Y) for 3 states: one assumed to be poor, one middle income, and one wealthy. An alternative way of plotting this, what I call a 'herd effects plot' (based on the vaccine effects literature), would be to put mean state income on the X axis, categorize the individual income into 2 (low-high) income groups, and then plot. This would create a plot with 50 x 2 points (two for each state). It would likely show the convergence of the voting tendencies of the poor and wealthy within high income states. Some of the advantages of this alternative way of plotting are that you could see the association across all 50 states without the graph appearing to be ‘busy’. Also, aliens wouldn’t have to keep in mind that Mississippi is a poor state, because that information would be explicitly described on the X axis. You could also plot the marginal association with state income as a 3rd line. Here's an example from a recent paper [Importation, Antibiotics, and Clostridium difficile Infection in Veteran Long-Term Care: A Multilevel Case–Control Study, by Kevin Brown, Makoto Jones, Nick Daneman, Frederick Adler, Vanessa Stevens, Kevin Nechodom, Matthew Goetz, Matthew Samore, and Jeanmarie Mayer, published in the Annals of Internal Medicine] looking at herd effects of antibiotic use and a hospital-acquired infection that's precipitated by antibiotic use. Each pair of X-aligned dots represents antibiotic users and non-users within one long-term care facility with a given overall level of antibiotic use.

My reply: Rather than compare high to low, I would compare the upper third to the lower third. By discarding the middle third you reduce noise; see this article for explanation. The post An alternative to the superplot appeared first on Statistical Modeling, Causal Inference, and Social Science.

Mark Palko points to this news article by Jeffrey Mervis entitled, "Rand Paul takes a poke at U.S. peer-review panels":

Paul made his case for the bill yesterday as chairperson of a Senate panel with oversight over federal spending. The hearing, titled “Broken Beakers: Federal Support for Research,” was a platform for Paul’s claim that there’s a lot of “silly research” the government has no business funding. . . . Two of the witnesses--Brian Nosek of the University of Virginia in Charlottesville and Rebecca Cunningham of the University of Michigan in Ann Arbor--were generally supportive of the status quo, although Nosek emphasized the importance of replicating findings to maximize federal investments. The third witness, Terence Kealey of the Cato Institute in Washington, D.C., asserted that there’s no evidence that publicly funded research makes any contribution to economic development.

Palko places this in the context of headline-grabbing politicians such as William Proxmire (Democrat) and John McCain (Republican), egged on by crowd-pleasing journalists such as Maureen Dowd and Howard Kurtz:

Of course, we have no way of knowing how effective these programs are, but questions of effectiveness are notably absent from McCain/Dowd's piece. Instead it functions solely on the level of mocking the stated purposes of the projects, which brings us to one of the most interesting and, for me, damning aspects of the list: the preponderance of agricultural research. You could make a damned good case for agricultural research having had a bigger impact on the world and its economy over the past fifty years than research in any other field. That research continues to pay extraordinary dividends both in new production and in the control of pests and diseases. It also helps us address the substantial environmental issues that have come with industrial agriculture. As I said before, this earmark coverage with an emphasis on agriculture is a recurring event. I remember Howard Kurtz getting all giggly over earmarks for research on dealing with waste from pig farms about ten years ago, and I've lost count of the examples since then. . . .

But this new effort by Sen. Paul and others seems to be coming from a different direction. Part of the story, I think, is that a lot of the research funding goes directly to scientists, who are disproportionately liberal Democrats, compared to the general population. So I could see how a conservative Republican could say: Hey, why are we giving money to these people? As a scientist who does a lot of government-funded research that is put to use by business, I think liberals, moderates, and conservatives should all support government science funding without worrying about the private political views of its recipients. Yes, this view is in my interest, but it's also what I feel: Science is a public good in so many ways. But the point is that, for some conservatives, there's a real tradeoff here in that money is going for useful things but it's going to people with political views they don't like.
I guess one could draw an analogy to liberals' perspectives on military funding. If you're a liberal Democrat and you support military spending, you have to accept that a lot of this money is going to conservative Republicans. I say all this not in any attempt to discredit the proposals of Sen. Paul or others--these ideas should be evaluated on their merits--but just to look at these debates from a somewhat different political perspective here. When I talk about scientists being disproportionately liberal Democrats, I'm not talking about postmodernists in the Department of Literary Theory; I'm talking about chemists, physicists, biologists, etc. The post Science funding and political ideology appeared first on Statistical Modeling, Causal Inference, and Social Science.

I missed two weeks and haven't had time to create a dedicated blog for Stan yet, so we're still here. This is just the update for this week. From now on, I'm going to try to concentrate on things that are done, not just in progress, so you can get a better feel for the pace of things getting done.
NOT ONE, BUT TWO NEW DEVS!
This is my favorite news to post, hence the exclamation.
* MATTHIJS VÁKÁR from the University of Oxford joined the dev team. Matthijs's first major commit is a set of GLM functions: negative binomial with log link (2-6 times speedup), normal linear regression with identity link (4-5 times), Poisson with log link (a factor of 7), and Bernoulli with logit link (9 times). Wow! And he didn't just implement the straight-line case--this is a fully vectorized implementation as a density, so we'll be able to use them this way:
int y[N];         // observations
matrix[N, K] x;   // "data" matrix
vector[K] beta;   // slope coefficients
real alpha;       // intercept coefficient
y ~ bernoulli_logit_glm(x, beta, alpha);
These stand in for what is now written as
y ~ bernoulli_logit(x * beta + alpha);
and before that was written
y ~ bernoulli(inv_logit(x * beta + alpha));
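For readers outside Stan, the equivalence of the logit-parameterized and probability-parameterized forms is easy to check numerically. Here's a minimal Python sketch (the function names are mine, not Stan's; this isn't Stan's actual implementation, just the math):

```python
import numpy as np

def log_lik_bernoulli_logit(y, x, beta, alpha):
    # log p(y | inv_logit(x*beta + alpha)), computed directly from the logits:
    # y*eta - log(1 + exp(eta)), which is how the _logit forms avoid round-off
    eta = x @ beta + alpha
    return np.sum(y * eta - np.log1p(np.exp(eta)))

def log_lik_bernoulli(y, x, beta, alpha):
    # the same likelihood via an explicit inverse-logit, as in the oldest form
    p = 1.0 / (1.0 + np.exp(-(x @ beta + alpha)))
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 3))
beta = np.array([0.5, -1.0, 0.25])
alpha = 0.3
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(x @ beta + alpha))))

# both parameterizations give the same log likelihood
assert np.isclose(log_lik_bernoulli_logit(y, x, beta, alpha),
                  log_lik_bernoulli(y, x, beta, alpha))
```

The point of the `_glm` version isn't the math, which is identical, but that fusing the matrix-vector product into the density lets the autodiff compute analytic gradients in one pass.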
Matthijs also successfully defended his Ph.D. thesis--welcome to the union, Dr. Vákár.
* ANDREW JOHNSON from Curtin University also joined. In his first bold move, he literally refactored the entire math test suite to bring it up to cpplint standard. He's also been patching doc and other issues.
VISITORS
* KENTARO MATSURA, author of _Bayesian Statistical Modeling Using Stan and R_ (in Japanese) visited and we talked about what he's been working on and how we'll handle the syntax for tuples in Stan.
* SHUONAN CHEN visited the Stan meeting, then met with Michael (and me a little bit) to talk bioinformatics--specifically about single-cell PCR data and modeling covariance due to pathways. She had a well-annotated copy of Kentaro's book!
OTHER NEWS
* BILL GILLESPIE presented a Stan and Torsten tutorial at ACoP.
* CHARLES MARGOSSIAN had a poster at ACoP on mixed solving (analytic solutions with forcing functions); his StanCon submission on steady state solutions with the algebraic solver was accepted.
* KRZYSZTOF SAKREJDA nailed down the last bit of the standalone function compilation, so we should be rid of regexp based C++ generation in RStan 2.17 (coming soon).
* BEN GOODRICH has been cleaning up a bunch of edge cases in the math lib (hard things like the Bessel functions) and also added a chol2inv() function that inverts the matrix corresponding to a Cholesky factor (naming from LAPACK under review--suggestions welcome).
* BOB CARPENTER and MITZI MORRIS taught a one-day Stan class in Halifax at Dalhousie University. Lots of fun seeing Stan users show up! MIKE LAWRENCE, of Stan tutorial YouTube fame, helped people with debugging and installs--nice to finally meet him in person.
* BEN BALES got the metric initialization into CmdStan, so we'll finally be able to restart (the metric used to be called the mass matrix--it's just the inverse of a regularized estimate of the global posterior covariance computed during warmup).
* MICHAEL BETANCOURT just returned from SACNAS (diversity in STEM conference attended by thousands).
* MICHAEL also revised his history of MCMC paper, which has been conditionally accepted for publication. Read it on arXiv first.
* AKI VEHTARI was awarded a two-year postdoc for a joint project working on Stan algorithms and models jointly supervised with ANDREW GELMAN; it'll also be joint between Helsinki and New York. Sounds like fun!
* BRECK BALDWIN and SEAN TALTS headed out to Austin for the NumFOCUS summit, where they spent two intensive days talking largely about project governance and sustainability.
* IMAD ALI is leaving Columbia to work for the NBA league office (he'll still be in NYC) as a statistical analyst! That's one way to get access to the data!
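A side note on what a `chol2inv()`-style function computes, since the name is opaque if you don't know LAPACK: given a Cholesky factor L with A = L L', it returns the inverse of A. A rough NumPy sketch of the idea (not Stan's actual implementation, which would use triangular solves rather than a general inverse):

```python
import numpy as np

def chol2inv(L):
    # inverse of A given its lower-triangular Cholesky factor L (A = L @ L.T):
    # A^{-1} = (L L')^{-1} = L'^{-1} L^{-1}
    L_inv = np.linalg.inv(L)  # in real code: a cheaper triangular solve
    return L_inv.T @ L_inv

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])
L = np.linalg.cholesky(A)
assert np.allclose(chol2inv(L), np.linalg.inv(A))
```

This matters in Stan because covariance matrices are often carried around as Cholesky factors anyway, so inverting from the factor skips a redundant factorization.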
The post Stan Roundup, 27 October 2017 appeared first on Statistical Modeling, Causal Inference, and Social Science.

I happened to come across this Los Angeles Times article from last year:

Labor and business leaders declared victory Tuesday night over a bitterly contested ballot measure that would have imposed new restrictions on building apartment towers, shops and offices in Los Angeles. As of midnight, returns showed Measure S going down to defeat by a 2-1 margin, with more than half of precincts reporting. . . . More than $13 million was poured into the fight, funding billboards, television ads and an avalanche of campaign mailers.

OK, fine so far. Referenda are inherently unpredictable, so both sides can throw money into a race without any clear sense of who's gonna win. But then this:

The Yes on S campaign raised more than $5 million — about 99% of which came from the nonprofit AIDS Healthcare Foundation — to promote the ballot measure.

Huh? The AIDS Healthcare Foundation is spending millions of dollars on a referendum on development in L.A.? That's weird. Maybe somewhere else in the city there's a Low Density Housing Foundation that just spent 5 million bucks on AIDS. The post Quick Money appeared first on Statistical Modeling, Causal Inference, and Social Science.

Someone points me with amusement to this published article from 2012:

Beliefs About the “Hot Hand” in Basketball Across the Adult Life Span
Alan Castel, Aimee Drolet Rossi, and Shannon McGillivray
University of California, Los Angeles

Many people believe in streaks. In basketball, belief in the “hot hand” occurs when people think a player is more likely to make a shot if they have made previous shots. However, research has shown that players’ successive shots are independent events. To determine how age would impact belief in the hot hand, we examined this effect across the adult life span. Older adults were more likely to believe in the hot hand, relative to younger and middle-aged adults, suggesting that older adults use heuristics and potentially adaptive processing based on highly accessible information to predict future events.

My correspondent writes: "This paper is funny, I didn't realize how strongly the psych community bought the null hypothesis." To be fair, back in 2012, I didn't think the hot hand was real either . . . But what really makes it work is this quote from the first paragraph of the above-linked paper:

Anecdotally, many fans, and even coaches, profess belief in the hot hand. For example, Phil Jackson, one of the most successful coaches in the history of the National Basketball Association (NBA), once said of Kobe Bryant, in Game 5 of the 2010 NBA Finals: “He’s the kind of guy (where) you ride the hot hand, that’s for sure, we were waiting for him to do that . . . . He went out there and found a rhythm.”

I mean, if you want to know about basketball, who ya gonna trust, a mountain of p-values . . . or that poseur Phil Jackson?? Have you seen how bad the Knicks were last year? Zen master, my ass. The post If you want to know about basketball, who ya gonna trust, a mountain of p-values . . . or that poseur Phil Jackson?? appeared first on Statistical Modeling, Causal Inference, and Social Science.

Mark Tuttle writes:

If/when the spirit moves you, you should contrast the success of the open software movement with the challenge of published research. In the former case, discovery of bugs, or of better ways of doing things, is almost always WELCOMED. In some cases, submitters of bug reports, patches, suggestions, etc. get “merit badges” or other public recognition. You could relate your experience with Stan here. In contrast, as you observe often, a bug report or suggestion regarding a published paper is treated as a hostile interaction. This is one thing that happens to me during more open review processes, whether anonymous or not. The first time it happened I was surprised. Silly me, I expected to be thanked for contributing to the quality of the result (or so I thought). Thus, a simple-to-state challenge is to make publication of research, especially research based on data, more like open software. As you know, sometimes open software gets to be really good because so many eyes have reviewed it carefully.

I just posted something related yesterday, so this is as good a time as any to respond to Tuttle's point, which I think is a good one. We've actually spent some time thinking about how to better reward people in the Stan community who help out with development and user advice. Regarding resistance to "bug reports" in science, here's what I wrote last year:

We learn from our mistakes, but only if we recognize that they are mistakes. Debugging is a collaborative process. If you approve some code and I find a bug in it, I’m not an adversary, I’m a collaborator. If you try to paint me as an “adversary” in order to avoid having to correct the bug, that’s your problem.

It's a point worth making over and over. Getting back to the comparison with "bug reports," I guess the issue is that developers want their software to work. Bugs are the enemy! In contrast, many researchers just want to _believe_ (and have others believe) that their ideas are correct. For them, errors are not the enemy; rather, the enemy is any admission of defeat. Their response to criticism is to paper over any cracks in their argument and hope that nobody will notice or care. With software it's harder to do that, because your system will keep breaking, or giving the wrong answer. (With hardware, I suppose it's even more difficult to close your eyes to problems.) The post In the open-source software world, bug reports are welcome. In the science publication world, bug reports are resisted, opposed, buried. appeared first on Statistical Modeling, Causal Inference, and Social Science.

Justin Esarey writes:

This Friday, October 27th at noon Eastern time, the International Methods Colloquium will host a roundtable discussion on the reproducibility crisis in social sciences and a recent proposal to impose a stricter threshold for statistical significance. The discussion is motivated by a paper, "Redefine statistical significance," recently published in _Nature Human Behavior_ (and available at https://www.nature.com/articles/s41562-017-0189-z).

Our panelists are:

* Daniel Benjamin, Associate Research Professor of Economics at the University of Southern California and a primary co-author of the paper in _Nature Human Behavior_ as well as many other articles on inference and hypothesis testing in the social sciences.
* Daniel Lakens, Assistant Professor in Applied Cognitive Psychology at Eindhoven University of Technology and an author or co-author on many articles on statistical inference in the social sciences, including the Open Science Collaboration's recent _Science_ publication "Estimating the reproducibility of psychological science" (available at https://dx.doi.org/10.1126/science.aac4716).
* Blake McShane, Associate Professor of Marketing at Northwestern University and a co-author of the recent paper "Abandon Statistical Significance" as well as many other articles on statistical inference and replicability.
* Jennifer Tackett, Associate Professor of Psychology at Northwestern University and a co-author of the recent paper "Abandon Statistical Significance" who specializes in childhood and adolescent psychopathology.
* E.J. Wagenmakers, Professor at the Methodology Unit of the Department of Psychology at the University of Amsterdam, a co-author of the paper in _Nature Human Behavior_ and author or co-author of many other articles concerning statistical inference in the social sciences, including a meta-analysis of the "power pose" effect (available at http://www.tandfonline.com/doi/abs/10.1080/23743603.2017.1326760).

To tune in to the presentation and participate in the discussion after the talk, visit http://www.methods-colloquium.com/ and click "Watch Now!" on the day of the talk. To register for the talk in advance, click here:

The IMC uses Zoom, which is free to use for listeners and works on PCs, Macs, and iOS and Android tablets and phones. You can be a part of the talk from anywhere around the world with access to the Internet. The presentation and Q&A will last for a total of one hour.

This sounds great.

The post This Friday at noon, join this online colloquium on replication and reproducibility, featuring experts in economics, statistics, and psychology! appeared first on Statistical Modeling, Causal Inference, and Social Science.
Brian Resnick writes:

I'm hoping you could help me out with a Vox.com story I'm looking into. I've been reading about the debate over how past work should be criticized and in what forums. (I'm thinking of the Susan Fiske op-ed against using social media to "bully" authors of papers that are not replicating. But then, others say the social web is an essential vehicle to issue course corrections in science.) This is what I'm thinking: It can't feel great to have your work criticized by strangers online. That can be true regardless of the intentions of the critics (who, as far as I can tell, are doing this because they too love science and want to see it thrive). And it can be true even if the critics are ultimately correct. (My pet theory is that this "crisis" is actually confirming a lot of psychological phenomena--namely, motivated reasoning.) Anyway: I am interested in hearing some stories about dealing with replication failure during this "crisis." (Or perhaps some stories about being criticized for being a critic.) How did these instances change the way you thought about yourself as a scientist? Could you really separate your intellectual reaction from your emotional one? This isn't about infighting and gossip: I think there's an important story to be told about what it means to be a scientist in the age of the social internet. Or maybe the story is about how this period is changing (or reaffirming) your thoughts of what it means to be a scientist. Let me know if you have any thoughts or stories you'd like to share on this topic! Or perhaps you think I'm going about this the wrong way. That's fine too.

My reply: You write, "It can't feel great to have your work criticized by strangers online." Actually, I _love_ getting my work criticized, by friends or by strangers, online or offline. When criticism gets personal, it can be painful, but it is by criticism that we learn, and the challenge is to pull out the useful content. I have benefited many many times from criticism.
Here's an example from several years ago. In March 2009 I posted some maps based on the Pew pre-election polls to estimate how Obama and McCain did among different income groups, for all voters and for non-Hispanic whites alone. The next day the blogger and political activist Kos posted some criticisms. The criticisms were online, non-peer-reviewed, by a stranger, and actually kinda rude. So what, who cares! Not all of Kos's criticisms were correct but some of them were right on the mark, and they motivated me to spend a couple of months with my colleague Yair Ghitza improving my model; the story is here. Yair and I continued with the work and a few years later published a paper in the American Journal of Political Science. A few years after _that_, Yair and I, with Rayleigh Lei, published a followup in which we uncovered problems with our earlier published work. So, yeah, I think criticism is great. If people don't want their work criticized by strangers, I recommend they not publish or post their work for strangers to see. P.S. This post happens to be appearing shortly after a discussion on replicability and scientific criticism. Just a coincidence. I wrote the post several months ago (see here for the full list). The post I think it's great to have your work criticized by strangers online. appeared first on Statistical Modeling, Causal Inference, and Social Science.

For the Data Science Seminar, Wed 25 Oct, 3:30pm in Physics and Astronomy Auditorium – A102:

The Statistical Crisis in Science

Top journals routinely publish ridiculous, scientifically implausible claims, justified based on “p < 0.05.” And this in turn calls into question all sorts of more plausible, but not necessarily true, claims, that are supported by this same sort of evidence. To put it another way: we can all laugh at studies of ESP, or ovulation and voting, but what about MRI studies of political attitudes, or stereotype threat, or, for that matter, the latest potential cancer cure? If we can’t trust p-values, does experimental science involving human variation just have to start over? And what do we do in fields such as political science and economics, where preregistered replication can be difficult or impossible? Can Bayesian inference supply a solution? Maybe. These are not easy problems, but they’re important problems.

For the Department of Biostatistics, Thurs 26 Oct, 3:30pm in Room T-639 Health Sciences:

Bayesian Workflow

Bayesian inference is typically explained in terms of fitting a particular model to a particular dataset. But this sort of model fitting is only a small part of real-world data analysis. In this talk we consider several aspects of workflow that have not been well served by traditional Bayesian theory, including scaling of parameters, weakly informative priors, predictive model evaluation, variable selection, model averaging, checking of approximate algorithms, and frequency evaluations of Bayesian inferences. We discuss the application of these ideas in various applications in social science and public health.

P.S. It appears I'll have some time available on Wed morning so if anyone has anything they want to discuss, just stop by; I'll be at the eScience Institute on the 6th floor of the Physics/Astronomy Tower. The post My 2 talks in Seattle this Wed and Thurs: "The Statistical Crisis in Science" and "Bayesian Workflow" appeared first on Statistical Modeling, Causal Inference, and Social Science.

The starting point is that we’ve seen a lot of talk about frivolous science, headline-bait such as the study that said that married women are more likely to vote for Mitt Romney when ovulating, or the study that said that girl-named hurricanes are more deadly than boy-named hurricanes, and at this point some of these studies are almost pre-debunked. Reporters are starting to realize that publication in Science or Nature or PNAS is not only no guarantee of correctness but also no guarantee that a study is even reasonable.
But what I want to say here is that even serious research is subject to exaggeration and distortion, partly through the public relations machine and partly because of basic statistics. The push to find and publicize so-called statistically significant results leads to overestimation of effect sizes (type M errors), and crude default statistical models lead to broad claims of general effects based on data obtained from poor measurements and nonrepresentative samples.
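The type M (magnitude) error is easy to see in simulation: condition on statistical significance, and the surviving estimates systematically overstate the true effect. A quick sketch with made-up numbers (a true effect of 0.2 measured with standard error 1, i.e., a badly underpowered design; these values are illustrative, not from any particular study):

```python
import numpy as np

rng = np.random.default_rng(2)
true_effect, se = 0.2, 1.0  # illustrative: small effect, noisy measurement
est = rng.normal(true_effect, se, size=100_000)  # many replications

# keep only the "statistically significant" replications
significant = np.abs(est) > 1.96 * se
power = significant.mean()  # only a few percent of replications get through

# average significant estimate, relative to the truth: the exaggeration ratio
exaggeration = np.abs(est[significant]).mean() / true_effect
```

Because an estimate can't clear the 1.96-standard-error bar without being at least 1.96 in magnitude, any published significant result from this design overstates the 0.2 truth by nearly a factor of ten, which is the statistical mechanism behind the exaggeration described above.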
One example we've discussed a lot is that claim of the effectiveness of early childhood intervention, based on a small-sample study from Jamaica. This study is _not_ "junk science." It's a serious research project with real-world implications. But the results still got exaggerated. My point here is not to pick on those researchers. No, it's the opposite: _even top researchers exaggerate in this way_ so we should be concerned in general.
What to do here? I think we need to proceed on three tracks:
1. Think more carefully about data collection when designing these studies. Traditionally, design is all about sample size, not enough about measurement.
2. In the analysis, use Bayesian inference and multilevel modeling to partially pool estimated effect sizes, thus giving more stable and reasonable output.
3. When looking at the published literature, use some sort of Edlin factor to interpret the claims being made based on biased analyses.
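Point 2 can be illustrated with a toy normal-normal shrinkage computation, the simplest special case of the multilevel partial pooling being recommended (the raw estimates, the within-group standard error `se`, and the between-group scale `tau` are all invented for illustration):

```python
import numpy as np

# noisy per-group effect estimates, each with standard error se,
# drawn from a population of true effects with between-group sd tau
raw = np.array([2.8, -1.5, 0.9, 4.1, -0.3])
se, tau = 2.0, 1.0

# weight on each group's own estimate; the rest goes to the grand mean
w = tau**2 / (tau**2 + se**2)
grand = raw.mean()
pooled = grand + w * (raw - grand)

# partial pooling pulls extreme estimates toward the mean,
# so the pooled estimates are less spread out than the raw ones
assert pooled.std() < raw.std()
```

When the data are noisy relative to the true between-group variation (se large, tau small), w is small and the extreme raw estimates get pulled in hard, which is exactly the stabilization the recommendation is after; a full multilevel model additionally estimates tau from the data.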
The above remarks are general; indeed, this post was inspired by a discussion we had a few months ago about the design and analysis of psychology experiments, as I think there's some misunderstanding in which people don't see where assumptions are coming into various statistical analyses (see for example this comment).
The post The Publicity Factory: How even serious research gets exaggerated by the process of scientific publication and reporting appeared first on Statistical Modeling, Causal Inference, and Social Science.