in the same field. Honestly, I see nothing wrong with p-hacking. You collected a generous amount of data for an experiment. Let's say you're looking at over 15 constructs. What is wrong with finding out which influences which? Just like in society, there is so much data around us to tell us the information we need. Why not explore our options and tell a story we might otherwise miss, just because we decided to re-run an experiment when we already have the data?
P-hacking can mean a lot of different things and the term denotes something negative by default, so you must be trying to say that you don't think a certain process constitutes p-hacking.
The main issue with all forms of p-hacking is that they destroy the validity of the statistical models that give the p-value any meaning in the first place. Obvious things like dropping outliers you don't like, or manipulating which correlating factors you tested for, are straight-up scientific malpractice. Yet they're done all the time. And there are less obvious forms of p-hacking that also kill statistical validity.
Let's say you run one experiment with 100 subjects and calculate a certain p-value for your results. Now let's say someone else runs 15 experiments with 100 subjects each and only reports the one group of 100 that gave them the p-value they wanted. Those p-values are NOT telling you the same thing. Calculating a p-value for a single 100-subject experiment and cherry-picking your best p-value out of 15 different experiments are two completely different realities. But if the experimenter isn't honest about how many times they ran the experiment, the person reading the paper has no way of knowing the difference.
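If you want to put a number on that difference, here's a minimal simulation (Python). Everything in it is made up for illustration: both groups are drawn from the same distribution, so any "significant" result is a false positive by construction, and the counts just mirror the example above.

```python
# Hypothetical simulation: honest single experiment vs. "best p-value of 15 tries".
# The null hypothesis is true in every run, so all hits below are false positives.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N_SUBJECTS = 100        # subjects per group, as in the example above
N_EXPERIMENTS = 15      # how many times the cherry-picker re-runs the study
N_SIMULATIONS = 2_000   # how many simulated researchers of each kind
ALPHA = 0.05

def one_experiment() -> float:
    """One two-group comparison where there is genuinely no effect."""
    a = rng.normal(0.0, 1.0, N_SUBJECTS)
    b = rng.normal(0.0, 1.0, N_SUBJECTS)
    return stats.ttest_ind(a, b).pvalue

honest_hits = sum(one_experiment() < ALPHA for _ in range(N_SIMULATIONS))
cherry_hits = sum(
    min(one_experiment() for _ in range(N_EXPERIMENTS)) < ALPHA
    for _ in range(N_SIMULATIONS)
)

print(f"Honest single experiment: {honest_hits / N_SIMULATIONS:.1%} false positives")
print(f"Best p-value of 15 tries: {cherry_hits / N_SIMULATIONS:.1%} false positives")
# Expect roughly 5% vs. 1 - 0.95**15, which is about 54%.
```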
Until they try to replicate it, of course. If you're trying to replicate results someone got the very first time they ran the experiment, you'll have a decent chance of getting the same thing. But if you're trying to replicate results that were just the best of 15 different tries, fat chance. You're going to get a "not statistically significant" result, just like the original researcher did the first 14 times they ran the experiment.
There are lots of variations like that. Let's say you design for 100 samples but don't get the p-value you want, so you keep adding samples until you hit it. Perhaps that was just a random fluctuation, and if you had kept adding samples you would have dropped back to "not statistically significant" again. So that's p-hacking too.
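Here's a rough sketch of that "peek and keep adding subjects" strategy, again purely hypothetical: the null is true in every run, and the starting size, batch size, and cap are arbitrary numbers chosen for illustration.

```python
# Hypothetical optional-stopping simulation: test, peek at the p-value,
# add more subjects if it isn't "significant" yet, and stop as soon as it is.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
ALPHA = 0.05
START_N = 100      # planned sample size per group
STEP = 10          # extra subjects added per group at each peek
MAX_N = 300        # give up after this many subjects per group
N_SIMULATIONS = 2_000

def run_with_peeking() -> bool:
    """True if this (null-effect) study ever reaches p < ALPHA before giving up."""
    a = list(rng.normal(0.0, 1.0, START_N))
    b = list(rng.normal(0.0, 1.0, START_N))
    while len(a) <= MAX_N:
        if stats.ttest_ind(a, b).pvalue < ALPHA:
            return True                      # declare victory and stop collecting
        a.extend(rng.normal(0.0, 1.0, STEP))
        b.extend(rng.normal(0.0, 1.0, STEP))
    return False

false_positives = sum(run_with_peeking() for _ in range(N_SIMULATIONS))
print(f"False positive rate with peeking: {false_positives / N_SIMULATIONS:.1%}")
# Well above the nominal 5%, even though there is no real effect anywhere.
```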
Or suppose you went into the experiment claiming to look for one relationship, but you also test for correlations across 60 other relationships. By random chance it's likely that at least ONE of those correlations will be statistically significant in your sample. If you treat that first run as hypothesis finding and check for that relationship again in a completely different sample group, you're in the clear. But if you pull out the single correlation you found among the 60 you tested on your original sample, without applying a correction that accounts for the fact that you looked at 60 different relationships, then your p-value is illegitimate. P-values are based on the assumption that you are testing one set of subjects for one relationship; if you start mining for a hundred different things at once, you'll catch a random statistical outlier whether there's a valid relationship or not. In that case you will ONLY know the relationship is valid if you then independently test it against different data sets to see if the same thing happens (there are statistical ways to do it within the same data set, but they aren't quite as convincing, and many researchers fail to do those as well).
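To see how likely that "one hit out of 60" is, here's a toy example under the same assumptions: 60 predictors that are pure noise, tested against one outcome in a 100-subject sample. Because the tests share an outcome they aren't fully independent, so the closed-form 1 - 0.95^60 figure is only approximate, but it makes the point.

```python
# Hypothetical fishing expedition: 60 predictors, none actually related to the outcome.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
N_SUBJECTS = 100
N_RELATIONSHIPS = 60
ALPHA = 0.05

outcome = rng.normal(0.0, 1.0, N_SUBJECTS)
hits = 0
for _ in range(N_RELATIONSHIPS):
    predictor = rng.normal(0.0, 1.0, N_SUBJECTS)   # pure noise, unrelated to the outcome
    _, p = stats.pearsonr(predictor, outcome)
    hits += p < ALPHA

print(f"{hits} of {N_RELATIONSHIPS} correlations came out 'significant' from pure noise")
print(f"Approximate chance of at least one: {1 - (1 - ALPHA) ** N_RELATIONSHIPS:.1%}")
# Around 3 false positives expected, and roughly a 95% chance of getting at least one.
```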
And the proof is in the pudding: MOST published experiments in MOST fields aren't reproducible. There are sometimes valid reasons for an experiment not reproducing the second time around, but generally that's an embarrassing fact and evidence of real issues in the field. It's especially problematic when you consider that most experimental results are never retested, since it's a lot less sexy to reproduce someone else's findings than to look for something completely new yourself. A good 3/4 of the researchers in the world should just be devoted full-time to trying to reproduce results from other scientists. Honestly, that's all a lot of them are capable of doing competently anyway.
This is a good article on how to avoid p-hacking.
Here are three action items that you can take to the bench with you to prevent p-hacking and enhance the reliability of your conclusions:
- Decide your statistical parameters early, and report any changes. This means deciding your tests ahead of time (e.g., performing a two-tailed, equal-variance Student's t-test for outcome X between groups A and B, but not C). If something legitimate comes up, such as the variances being decidedly unequal, you can change this parameter, but you should report the rationale behind the change.
- Decide when to stop collecting data and what constitutes an outlier beforehand. Decide how many replicates will be performed (e.g., each sample will be run exactly three times) and at what level (e.g., more than 2.5 times the SD) a sample/replicate will be excluded. This prevents stopping early because you already have the result you want, and it prevents repeating until the result is closer to what you want.
- Correct for multiple comparisons, and replicate your own result. If you investigate multiple outcomes, be sure your statistics reflect that (see the sketch after this list). If you came across something interesting, but not in the most savory way (i.e., exploratory fishing), test the new hypothesis again under pre-determined experimental conditions to get a true p-value.
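As a concrete (and entirely hypothetical) illustration of "be sure your statistics reflect that", here's a minimal Holm-Bonferroni adjustment applied to a made-up batch of 60 p-values; statsmodels.stats.multitest.multipletests does the same job if you'd rather use a library.

```python
# Minimal sketch of a multiple-comparison correction (Holm-Bonferroni).
# The p-values below are fabricated for illustration only.
import numpy as np

def holm_adjust(pvals):
    """Return Holm-Bonferroni adjusted p-values for a batch of tests."""
    p = np.asarray(pvals, dtype=float)
    m = len(p)
    order = np.argsort(p)
    adjusted = np.empty(m)
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Scale the (rank+1)-th smallest p-value by (m - rank), then keep the
        # adjusted sequence monotone and capped at 1.
        running_max = max(running_max, (m - rank) * p[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted

# Hypothetical raw p-values: one genuinely tiny result, 59 ordinary noise results.
rng = np.random.default_rng(3)
raw = np.append(rng.uniform(0.01, 1.0, 59), 0.0004)
adj = holm_adjust(raw)
print(f"Raw p-values below 0.05:      {(raw < 0.05).sum()}")
print(f"Adjusted p-values below 0.05: {(adj < 0.05).sum()}")
# After correction, only the genuinely tiny p-value should survive.
```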
If you're failing to do any of those things, your statistical calculations are not legitimate and you're p-hacking.
Above my pay grade.
