When performing data analysis, it's easy to fall into a few traps and end up making mistakes. Diligence is essential, and it's wise to keep an eye out for the following seven potential mistakes: sampling bias, cherry-picking, disclosing metrics, overfitting, focusing only on the numbers, solution bias, and communicating poorly.
Let's take a look at why each one can be problematic and how you might be able to avoid these issues.
Sampling bias occurs when a non-representative sample is used. For example, a political campaign might sample 1,300 voters only to find out that one political party's members are dramatically overrepresented in the pool. Sampling bias should be avoided because it can skew the analysis too far in one particular direction.
Cherry-picking happens when data is selected to support a particular hypothesis. It's one of the more intentional problems on this list because there's always a temptation to give the analysis a nudge in the "right" direction. Not only is cherry-picking unethical, but it may have more serious consequences in fields like public policy, engineering, and health.
Disclosing metrics is a problem because a metric loses much of its usefulness once the people being measured know what it is and start optimizing for it. This creates problems like the habit in the education field of teaching to what's on standardized tests. A similar problem occurred in the early days of internet search, when websites started flooding their content with keywords to game the way pages were ranked.
Overfitting tends to happen during the analysis process. Someone might have a model, for example, and the curve produced by the model seems to be predictive. Unfortunately, the curve looks predictive only because the model has been fitted so closely to the existing data. The failure of the model may only become apparent when it is compared with future observations that it doesn't fit nearly as well.
Focusing only on the numbers is worrisome because it can have adverse real-world consequences. For example, existing social biases can be fed into models. A company handling lending might produce a model that introduces geographic bias by using data derived from biased sources. The numbers may look clean and neat, but the underlying biases can be socially and economically damaging.
Solution bias can be thought of as the gentler cousin of cherry-picking. With solution bias, a solution might be so cool, interesting, or elegant that it's hard not to fall in love with it. Unfortunately, the solution might be wrong, and appropriate levels of scientific and mathematical rigor might not be applied because refuting the solution would feel disheartening.
Communicating poorly is more problematic than you might expect. Producing analysis is one thing, but conveying findings in an accessible manner to people who didn't participate in the project is critical. Data scientists need to be comfortable producing clear and engaging dashboards, charts, and other work products to ensure their findings are well communicated.
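To make the overfitting trap a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made up for illustration: the synthetic data (a straight line plus random noise), the helper names, and the choice of a "memorizer" (a one-nearest-neighbor lookup) as the overfitted model. The idea is simply to compare each model's error on the data it was fitted to with its error on held-out data it has never seen.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def make_data(xs):
    # Hypothetical ground truth for this sketch: y = 2x plus Gaussian noise.
    return [(x, 2.0 * x + random.gauss(0.0, 1.0)) for x in xs]

train = make_data([i * 0.5 for i in range(40)])
test = make_data([i * 0.5 + 0.25 for i in range(40)])  # points not seen in training

def fit_line(data):
    """Ordinary least squares with a single feature."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in data)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept

def fit_memorizer(data):
    """One-nearest-neighbor: return the y of the closest training x."""
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

def mse(model, data):
    """Mean squared error of a model over (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

results = {}
for name, fit in [("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train)
    results[name] = (mse(model, train), mse(model, test))
    train_err, test_err = results[name]
    print(f"{name:10s} train MSE = {train_err:.3f}  held-out MSE = {test_err:.3f}")
```

The memorizer reproduces the training data perfectly, so its training error is zero, yet that says nothing about how it handles new observations. A large gap between training error and held-out error is the warning sign to look for. In a real project you would likely use an established library's train/test split or cross-validation tools rather than hand-rolled helpers like these.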
Process and diligence are your primary weapons in combating mistakes in data analysis. First, you must have a process in place that emphasizes the importance of getting things right. When you're creating a data science experiment, there need to be checks in place that will force you to stop and consider things like:
Diligence is also essential. You should be looking at concerns about whether:
Tackling a data science project requires ample planning. You also have to consider ways to refine your work and to keep improving your processes over time. It takes commitment, but a group with the right culture can do a better job of steering clear of avoidable mistakes.