Numbers would never lie, would they? Yes, they would! The GIGO principle and statistics.
Disclaimer: This article is written for non-IT people who are interested in this topic, and it may contain generalizations for the sake of easier understanding.
Have you ever seen an episode of Sesame Street? There is one character I particularly like to compare to data scientists: Oscar the Grouch. That is because both Oscar and the data scientist sing “I love trash”. But what do I want to tell you with this highly polemic comparison? Well, data scientists love data. The more, the better. But that is exactly the point: most of the time, more data does not make things better. It can lead you astray, it can suggest flat-out wrong conclusions, and most people just blindly believe numbers. Please don’t do that.
There are some major issues with data quality. Let us start with the obvious one: the GIGO principle. GIGO is an acronym for “garbage in, garbage out”. It means that even a perfect algorithmic model can only provide garbage insights if the data fed into it is garbage. Let us look at an example: you want to evaluate how good your product is, so you ask every customer to rate their satisfaction with it on a scale from 1 to 10 right after buying it. That seems reasonable. Looking at the data, you see that everyone is happy. Yet a year from now, you will ask yourself why none of your customers returned to your store. What happened? All the customers were happy…
The thing is: your data was not worth the paper it was written on. How happy a customer really is with a product only shows some weeks after the purchase. Your data is garbage because it does not measure your customer’s true satisfaction with the product, just a momentary and much-distorted snapshot of it. So coming up with a good measure of customer satisfaction is a big task in itself. And now imagine a whole pile of company data being poured over your poor head to analyse. Would you seriously consider every variable’s validity? You wouldn’t. And that is why you may build a beautiful model whose outputs are still useless. Garbage in, garbage out.
Secondly, there are a bunch of statistical tripwires in data analysis. I want to give you a glimpse of some of them, simply to raise your general doubt about the insights derived from all the data out there. One of my favourites is the idea of significance: most data that is collected and analysed is checked for statistical significance, and if a result is significant, it is declared true and relevant. But a significance test at the usual 5% level only tells you that a pattern this strong would arise from pure randomness about one time in twenty. In other words, one out of twenty completely random patterns will still pass the test, even though the numbers tell you otherwise. After all, you are dealing with random factors, and the craziest things happen. This becomes important once you do a lot of significance testing: run 10,000 tests on pure noise, and you can expect around 500 of them to come out significant by chance alone. And now guess what is happening in Big Data software: thousands of tests are conducted. You can imagine the rest.
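If you want to see this effect for yourself, here is a small sketch in Python (the numbers are made up for illustration). It relies on one statistical fact: when there is no real effect, the p-value of a proper test is uniformly distributed between 0 and 1, so we can simulate 10,000 tests on pure noise by drawing 10,000 uniform random numbers and counting how many fall below the 5% threshold.

```python
import random

random.seed(0)

ALPHA = 0.05        # the usual 5% significance level
NUM_TESTS = 10_000  # how many tests our imaginary Big Data software runs

# Under the null hypothesis (no real effect), p-values are uniform
# on [0, 1], so drawing uniform numbers simulates tests on pure noise.
p_values = [random.random() for _ in range(NUM_TESTS)]

# Every "significant" result here is a false positive by construction:
# there was never any real effect to find.
false_positives = sum(p < ALPHA for p in p_values)
print(f"{false_positives} of {NUM_TESTS} tests on pure noise came out significant")
# With ALPHA = 0.05, roughly 500 false positives are expected on average.
```

Every single one of those “discoveries” is an illusion, because the data was random noise by construction.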
The next thing on our list is overfitting. When building statistical models to predict something (also known as regression), adding more factors does not automatically make the model more accurate. In fact, the model’s accuracy on new data can drastically decrease if you factor in too many things. So more data does not necessarily make your model better. There is also another face of overfitting: a machine learning algorithm may attribute too much of its pattern recognition to arbitrary features of its training data. A famous (possibly apocryphal) example: an algorithm was designed to identify tanks in pictures, but the training photos happened to show tanks only in good weather. The model therefore learned to recognise sunshine instead of tanks, and only “saw” tanks when the sun was out. That did not work out as planned. The model was overfitted.
Although there are ways to overcome these statistical limitations, you should always keep these tripwires in mind before blindly feeding thousands of data points into your Big Data software.
To conclude all of this, you should keep one thing in mind: Don’t trust everything you see. Data is not automatically true just because it is written in numbers.