# 38. Data dredging
## 38.1. Definition
Data dredging involves trawling a dataset for apparent relationships that are in fact misleading. It is the equivalent of looking for an answer (any answer) before having phrased the question: a misuse of data-mining techniques and statistical analyses, such as regression analysis, with manipulative intent.
Relationships found by dredging data might appear valid within the dataset examined, but they have no statistical significance in the wider population. The practice has become widespread since the advent of very large databases and relational database technology.
We should note that data dredging can sometimes be a valid way of finding a possible hypothesis. But such a hypothesis must then be tested with data not in the original dredged dataset. It is misused when a hypothesis is stated as a fact without further validation and is only tested using data that actually originated the hypothesis in the first place.
Data dredging occurs when researchers browse data looking for relationships rather than forming a hypothesis before examining the data. Another form is the deliberate selection of subsets of the data so as to create the illusion of significant patterns in an artificially narrowed sample.
In data dredging, large compilations of data are examined to find a correlation, without any pre-defined hypothesis to be tested. The threshold for declaring a relationship between two parameters statistically significant is usually set at the 5% level (meaning that a relationship at least this strong would be seen only 5% of the time by pure chance if no real relationship existed). There is thus roughly a 5% chance of finding an apparently significant correlation between any two sets of completely random variables.
Because data dredging exercises typically examine large datasets with many variables, it is almost certain that apparently statistically significant results will be found somewhere in the data, even though they are entirely spurious and coincidental.
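The arithmetic can be illustrated with a short simulation. The sketch below (assuming NumPy and SciPy are available; the dataset is purely synthetic) generates a set of completely independent random variables and runs a standard correlation test on every pair: roughly 5% of the pairs come out "significant" at the conventional threshold, despite containing no real relationships at all.

```python
# Sketch: pairwise correlation tests on purely random, independent variables.
# At the 5% threshold, about 1 test in 20 comes out "significant" by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)
n_rows, n_vars = 200, 50                      # 50 completely independent variables
data = rng.normal(size=(n_rows, n_vars))

n_tests = 0
false_positives = 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        _, p = stats.pearsonr(data[:, i], data[:, j])
        n_tests += 1
        if p < 0.05:                          # conventional 5% significance level
            false_positives += 1

print(f"{n_tests} tests, {false_positives} spurious 'significant' correlations "
      f"({false_positives / n_tests:.1%})")   # expect roughly 5%
```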
This technique can be used in any field, but it is most often used in medical and other scientific research and the financial environment where interested parties fish the data for apparently interesting correlations and relationships.
For example, suppose that observers note what appears to be a cluster of cancers in a particular town, but lack a firm hypothesis as to why. The researchers have access to a large amount of demographic data about the town and the surrounding area, covering hundreds of different, mostly uncorrelated variables. Even if all of these variables are independent of the cancer incidence rate, it is highly likely that at least one will correlate significantly with it. While this may suggest a hypothesis, further testing using the same variables, but with data from different locations, is needed to confirm it.
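As an illustrative sketch of this scenario (using simulated data rather than real demographic figures, and assuming NumPy and SciPy), one can screen a few hundred independent variables against an outcome, keep the strongest correlate, and then re-test that same variable on data for a different set of areas. The dredged "winner" will usually fail to replicate.

```python
# Sketch: dredge 300 independent "demographic" variables for the one that best
# correlates with an outcome, then re-test that variable on fresh data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)
n_areas, n_vars = 100, 300

# Original town: outcome and candidate variables are independent by construction.
outcome_a = rng.normal(size=n_areas)
vars_a = rng.normal(size=(n_areas, n_vars))

# Dredging step: keep whichever variable gives the smallest p-value.
p_values = [stats.pearsonr(vars_a[:, k], outcome_a)[1] for k in range(n_vars)]
best = int(np.argmin(p_values))
print(f"Dredged variable {best}: p = {p_values[best]:.4f} on the original data")

# Validation step: the same variable, measured for a different set of areas.
outcome_b = rng.normal(size=n_areas)
vars_b = rng.normal(size=(n_areas, n_vars))
_, p_new = stats.pearsonr(vars_b[:, best], outcome_b)
print(f"Same variable on fresh data: p = {p_new:.4f} (usually not significant)")
```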
### 38.1.1. Traditional scientific methodology
For the lay reader it is important to understand the methodology used in conventional scientific research and how this provides safeguards against manipulative interference.
Conventional scientific method calls for a researcher to formulate a hypothesis, collect relevant data, use a suitable method of statistical analysis to look for a correlation, and then carry out a statistical significance test to see whether the results could plausibly be explained by chance alone (the so-called null hypothesis). The results are then used to support or reject the hypothesis.
A vital issue in proper statistical analysis is to test a hypothesis with data that was not used in constructing the hypothesis. This is central to the integrity of a scientific process, because every data set contains some patterns which are due entirely to chance.
If a hypothesis is not tested with a dataset different from the original study population, it is impossible to tell whether the patterns found are chance patterns or whether they have some real significance. If we toss a coin 11 times and get heads 5 times and tails 6 times, we might form the hypothesis that the coin favours tails, landing tails with probability 6/11. Testing this hypothesis on the same data will, of course, confirm it, but such confirmation is meaningless. Its statistical significance needs to be tested on a completely fresh dataset: a new set of coin-tossing results.
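A minimal sketch of the coin example (hypothetical tosses, assuming NumPy and SciPy): the data that suggested the hypothesis will agree with it by construction, so only a fresh run of tosses is informative, and for a genuinely fair coin that fresh run will typically lend the hypothesis no support.

```python
# Sketch: the 11 tosses that suggested "the coin favours tails" cannot also
# test that hypothesis; only a new run of tosses can.
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=7)

# Original data: 11 tosses, 6 tails, which suggested P(tails) = 6/11.
tails, n = 6, 11
print(f"Tail frequency in the original data: {tails / n:.3f} (agrees by construction)")

# Proper test: a completely new set of tosses from the same (actually fair) coin.
fresh = rng.integers(0, 2, size=1000)         # 1 = tails
k = int(fresh.sum())
result = stats.binomtest(k, 1000, p=0.5, alternative="greater")
print(f"Tail frequency in fresh data: {k / 1000:.3f}, "
      f"one-sided p-value against a fair coin: {result.pvalue:.3f}")
```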
It is important to realise that a finding of statistical significance means little when the hypothesis was arrived at through an improper procedure such as data dredging and then tested on the very data that suggested it; significance tests on their own therefore do not protect against data dredging. The researcher has deviated fundamentally from sound and objective scientific method.
### 38.1.2. Traditional Data Mining and testing hypotheses
The process of data mining involves automatically testing huge numbers of hypotheses about a single data set by exhaustively searching for combinations of variables that show a reasonable correlation. Any apparently correlating sets are then tested for statistical significance.
But when enough hypotheses are tested, it is virtually certain that some will falsely appear to be statistically significant, because every data set with any degree of randomness contains some spurious correlations. Researchers using data-mining techniques can easily be misled by these apparently significant results, even though they are normal properties of random variation.
In addition, researchers often examine subsets of the data, which requires the tests for significance and error to be adjusted accordingly; without such adjustment, misinformed conclusions follow. So even in a traditional research environment, using appropriate analysis methodologies, there are many ways of coming to the wrong conclusion. Using illicit methods like data dredging is more dangerous still, and correspondingly attractive to a manipulator.
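One standard safeguard, not prescribed by the text above but widely used, is to tighten the significance threshold in proportion to the number of hypotheses tested. The sketch below (assuming NumPy and SciPy, with purely synthetic data) screens many irrelevant candidate variables against an outcome and shows how a simple Bonferroni-style correction removes almost all of the spurious "discoveries" that the naive 5% threshold lets through.

```python
# Sketch: 40 irrelevant candidate variables tested against one outcome.
# The naive 5% threshold typically flags a few of them; dividing the threshold
# by the number of tests (Bonferroni) removes almost all false "discoveries".
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
n_rows, n_vars = 150, 40
outcome = rng.normal(size=n_rows)
candidates = rng.normal(size=(n_rows, n_vars))    # all independent of the outcome

p_values = np.array([stats.pearsonr(candidates[:, k], outcome)[1]
                     for k in range(n_vars)])

naive_hits = int((p_values < 0.05).sum())
corrected_hits = int((p_values < 0.05 / n_vars).sum())
print(f"'Significant' at the naive 5% level:       {naive_hits}")
print(f"'Significant' after Bonferroni correction: {corrected_hits}")
```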
### 38.1.3. The Dangers of Dredging
Circumventing the traditional scientific approach by conducting an experiment without a hypothesis can lead to premature conclusions. Data mining can be used negatively to wring more information from a data set than it actually contains. Failing to adjust existing statistical models when applying them to new datasets can also throw up apparent patterns between attributes that would otherwise not have appeared.
## 38.2. Persistence
Short to Long.
## 38.3. Accessibility
Low. This is definitely a technique for a manipulative but well-informed statistician with access to large amounts of data and the data-mining skills and dredging software needed to examine it.
## 38.4. Conditions/Opportunity/Effectiveness
Provided you have the data, the knowledge of statistical manipulation and the necessary database technology, this is an easily used technique which can produce stunningly convincing misconceptions.
## 38.5. Methodology/Refinements/Sub-species
None known.
## 38.6. Avoidance and Counteraction
The only thing a victim can do is to demand access to the datasets and the sources of data, an explanation of the methodology, and the results of tests for statistical significance. This is a tricky form of technical manipulation that can be very hard to counter, given the poor understanding of scientific method and statistics in the general population.