# 44. Statistical manipulation - Size Matters
## 44.1. Definition
Sample size manipulation is a common abuse of statistical method in which the manipulator generalises a correlation from a sample that is too small to support statistically significant conclusions. It is often used to present as statistical "truth" a correlation which cannot honestly be generalised to the larger population because it rests on too small a sample.
As we have already mentioned, it is normal practice to look for correlations in just a sample of a huge population. It is impractical to examine entire populations to prove statistical links. So it is common to select a random sample of a set of data and investigate whether correlations exist between various causes and a particular effect, i.e. between a group of independent variables and a single dependent variable.
For instance, if we wanted to establish whether a relationship exists between bronchitis and smoking in the general population, we would first need to establish the approximate incidence of the disease and the incidence of smoking, and then use these values to determine a statistically usable and reliable sample size. This sample size determination is an important step in statistical analysis, and if it is done incorrectly, it can give rise to wildly misleading results.
So, for example, if the incidence of bronchitis is only approximately 1 in 10,000 in the general population, then a sample group of 50 people is obviously not going to be useful, since it is statistically unlikely to contain even one case of the disease. A much larger group is clearly required.
On the other hand, if the incidence of bronchitis is 1 in 5, then a group of 50 subjects may well yield some usable data. In general, the bigger the sample size, the more accurate the results of our statistical analysis will be.
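To make this concrete, the chance of a sample containing at least one case can be estimated as 1 - (1 - p)^n, assuming cases occur independently with incidence p. The Python sketch below uses the illustrative incidence figures from above, not real data:

```python
def prob_at_least_one_case(incidence: float, sample_size: int) -> float:
    """Probability that a random sample contains at least one case,
    assuming cases occur independently with the given incidence."""
    return 1 - (1 - incidence) ** sample_size

# Rare condition: 1 in 10,000 incidence, sample of 50 people
print(prob_at_least_one_case(1 / 10_000, 50))  # ~0.005 -> almost certainly zero cases

# Common condition: 1 in 5 incidence, sample of 50 people
print(prob_at_least_one_case(1 / 5, 50))       # ~0.99999 -> roughly 10 cases expected
```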
Several mathematical techniques allow us to calculate an appropriate sample size to achieve a particular confidence level and margin of error. In legitimate studies, before conducting a statistical analysis such as Regression Analysis, we would need to calculate and define an appropriate sample size for our data set, so that we can draw conclusions with a particular level of confidence (normally 95%) and a particular expected margin of error (such as 5%).
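As an illustrative sketch (one common approach, not the method prescribed by any particular study), the normal-approximation formula for estimating a proportion is n = z² p(1 - p) / e², where z is the critical value for the chosen confidence level, p is the expected proportion (0.5 is the conservative worst case), and e is the desired margin of error:

```python
import math

def required_sample_size(confidence_z: float = 1.96,
                         expected_proportion: float = 0.5,
                         margin_of_error: float = 0.05) -> int:
    """Minimum sample size for estimating a proportion, using the
    normal-approximation formula n = z^2 * p * (1 - p) / e^2."""
    p = expected_proportion
    n = (confidence_z ** 2) * p * (1 - p) / (margin_of_error ** 2)
    return math.ceil(n)

# 95% confidence (z ~ 1.96), 5% margin of error, worst-case p = 0.5
print(required_sample_size())  # 385
```

With the usual 95% confidence level and 5% margin of error, even the worst case requires several hundred subjects, which puts a sample of 3 or 50 into perspective.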
Samples whose size has not been properly calculated can still give rise to very strong apparent correlations, and this fact is often used by manipulators to deliver a deliberately invalid conclusion to a victim. The fact that the correlations are completely dubious does not stop the manipulator from using their conclusions to dupe a gullible public.
For instance, a headline reads, "33% of drinking water test results showed E. coli contamination". This may be based on just 3 results, one of which showed contamination. In a larger sample, it could well be that only 1 in 1,000 samples is contaminated, but the headline may be correct within the scope of the "study" of a very tiny sample. However, it is disingenuous and fraudulent to extrapolate a conclusion of 33% contamination when the generalisation was based on just 3 samples.
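The weakness of that headline figure is easy to demonstrate: a 95% confidence interval around 1 contaminated result out of 3 is enormous. The sketch below uses the Wilson score interval, one common choice assumed here purely for illustration:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score confidence interval for a proportion (95% by default)."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# The headline's "33% contaminated": 1 positive result out of 3 samples
low, high = wilson_interval(1, 3)
print(f"{low:.0%} to {high:.0%}")  # roughly 6% to 79% -- the 33% figure tells us almost nothing
```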
## 44.2. Persistence
Low-Medium. Surely someone will see through a cheap trick like this? Well, maybe, but the headline has already delivered its message. It is much harder to erase the headline from public perception than it was to make it in the first place.
## 44.3. Accessibility
High. With access to the appropriate data, assertions like these can be delivered by almost anyone. Few members of the public can authoritatively challenge the validity of such statements, so they can easily gain currency.
## 44.4. Conditions/Opportunity/Effectiveness
Messing around with sample size to get a different conclusion is a cheap trick, bordering on fraud. For a statistician it's a "no-brainer". Amazingly, however, the press has easily sold all kinds of statistically dodgy stories built on absurdly manipulated sample sizes, which then give rise to headlines. Even more amazing is that such stories persist. Some even persist after they have been challenged and shown to be disingenuous. It is hard to dislodge a good headline from the public perception, especially when the alternative is a complicated argument about confidence levels and so on. Most lay-people want simple-to-understand messages, not nuanced statistical statements. And newspapers want catchy headlines that grab the readers' attention. Why let the truth get in the way of a good story?
## 44.5. Methodology/Refinements/Sub-species
None known.
## 44.6. Avoidance and Counteraction
Avoidance is simple. Just demand to see the sample size and the sample size calculation; ask for the confidence level and the margin of error. If these basic calculations are not available, the headline conclusions are worthless.
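As a quick sanity check (a sketch under the same normal-approximation assumptions as above, not a substitute for the study's own calculation), you can back out the margin of error implied by a headline's sample size:

```python
import math

def implied_margin_of_error(proportion: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a reported proportion,
    using the normal approximation z * sqrt(p * (1 - p) / n)."""
    return z * math.sqrt(proportion * (1 - proportion) / n)

# The "33% contaminated" headline based on 3 samples
print(f"+/- {implied_margin_of_error(1 / 3, 3):.0%}")    # about +/- 53% -- worthless
# The same proportion from 1,000 samples would be far tighter
print(f"+/- {implied_margin_of_error(1 / 3, 1000):.0%}") # about +/- 3%
```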