NEW: Heap for mobile. Track every interaction, on every platform.

Learn more
skip to content
Loading...
    • The Digital Insights Platform Transform your digital experience
    • How Heap Works A video guide
    • How Heap Compares Heap vs. competitors
    • The Future of Insights A comic book guide
  • Data Insights

    • Session Replay Complete context with a single click
    • Illuminate Data science that pinpoints unknown friction
    • Journeys Visual maps of all user flows

    Data Analysis

    • Segments User cohorts for actionable insights
    • Dashboards Share insights on critical metrics
    • Charts Analyze everything about your users
    • Playbooks Plug-and-play templates and analyses

    Data Foundation

    • Capture Automatic event tracking and apis
    • Mobile Track and analyze your users across devices
    • Enrichment Add context to your data
    • Integrations Connect bi-directionally to other tools

    Data Management

    • Governance Keep data clean and trusted
    • Security & Privacy Security and compliance made simple
    • Infrastructure How we build for scale
    • Heap Connect Send Heap data directly to your warehouse
  • Solutions

    • Funnel Optimization Improve conversion in user flows
    • Product Adoption Maximize adoption across your site
    • User Behavior Understand what your users do
    • Product Led Growth Manage PLG with data

    Industries

    • SaaS Easily improve acquisition, retention, and expansion
    • eCommerce Increase purchases and order value
    • Financial Services Raise share of wallet and LTV

    Heap For Teams

    • Product Teams Optimize product activation, conversion and retention
    • Marketing Teams Optimize acquisition performance and costs
    • Data Teams Optimize behavioral data without code
  • Pricing
  • Support

    • Heap University Video Tutorials
    • Help Center How to use Heap
    • Heap Plays Tactical how-to guides
    • Heap Updates
    • Professional Services

    Resources

    • Blog A community for digital builders
    • Content Library Ebooks, whitepapers, videos, guides
    • Press News from and about Heap
    • Webinars & Events Virtual and live events
    • Careers Join us

    Ecosystem

    • Customer Community Join the conversation
    • Partners Technology and Solutions Partners
    • Developers
    • Customers Over 8,000 successful companies
  • Free TrialRequest Demo
  • Log In
  • Free Trial
  • Request Demo
  • Log In

All Blogs

Data Stories

Garbage In, Garbage Out: How Anomalies Can Wreck Your Data

Ravi Parikh
May 7, 20144 min read
  • Facebook
  • Twitter
  • LinkedIn
Garbage In, Garbage Out: How Anomalies Can Wreck Your Data Banner

A 1960 census reported a shocking fact: several teenage women across the United States had 12 or more children each. These were women aged 19 or younger who already had massive families.

As it turns out, this never happened. The truth was that the data had been fabricated.A later study showed that between 3 – 5% of government census-takers engaged in some fabrication, sometimes yielding nonsensical results when they got sloppy.

Flawed census data is used every year to build scientific models, do in-depth analysis, and even make large-scale policy decisions. If the data backing up a model is wildly inaccurate, then our model is useless. That is: “garbage in, garbage out.”

This incident is an example of a wider issue in data analysis: anomalous data, or data that contains errors. Let’s look at a couple more examples, and how data visualization can catch these errors.

Outliers vs Anomalies

The census example above illustrates the distinction between outliers and anomalies in your data. An outlier is a legitimate data point that’s far away from the mean or median in a distribution [1]. It may be unusual, like a 9.6-second 100 meter dash, but still within the realm of reality. An anomaly is an illegitimate data point that’s generated by a different process than whatever generated the rest of the data. Let’s take a look at another example of anomalous data.

Researchers from the University of Vienna looked at several recent national elections for evidence of fraud. They used a dataset of district-by-district election results in many countries.

Looking at the raw data to find fraud is rather difficult. There are thousands of electoral districts in each of these countries, and tampering with the results of just a few might be enough to swing an election. But when they visualized the data, anomalies were easy to spot. They plotted voter turnout vs the percentage of votes that went to the winner:

Election Fraud Graph

In most countries with fair elections, we see a neat cluster centered around a certain voter turnout and winning vote level. Canada shows two different clusters, which is explained by a sharp divide between Québécois and English Canada.

But the really interesting elections were in Uganda and Russia. In these elections we see a small but distinctive cluster in the upper right corner. This corresponds to almost 100% voter turnout and 100% of the district’s votes going to the winner. While that fact alone is already quite implausible, the visualization demonstrates that these districts are anomalous as well. It’s a pretty strong case for election fraud.

This is a great example of anomalous data. While the majority of the dataset was produced by one process (legitimate voting data), a small subset was produced by a different process (fraud).

Less Obvious Anomalies

Not all anomalous data sits far outside the rest of the distribution. Every year, Polish high school students take a standardized “Matura” exam that’s similar to the SAT in the United States. Like the SAT, there are language and math components. Here’s the distribution of scores for the math portion of the 2010 test:

Graph of Polish Math Anomalies

It looks pretty even. There doesn’t appear to be anything unusual going on. But the language score distribution looks different:

Graph of Polish Language Anomalies

There’s a massive spike of students who got exactly 21 – 22, and sharp dropoff at 18 – 20. As it turns out, 21 was the passing score that year. The reason we see such an odd distribution is due to the grading process.

Each exam is graded based on a strict key. But after each exam has been graded, there’s a committee that reviews the exams a final time. The committee often finds ways (consciously or not) to give borderline students a few extra points. This is possible on the somewhat subjective language exam, but impossible on the math portion.

With the election data, we might be able to toss out the fraudulent districts and use the rest of the distribution to get a true picture of the elections. But here it’s not so simple. A lot of the 21 and 22 scores were legitimate, and due to the subjective nature of the exam it could be argued that they all were. Still, to do useful analysis on this data, we can look at the overall distribution and get a good idea of what the score distribution at 18 – 22 “should” be.

Lessons Learned

  • Understand your data before you use it to drive decisions.

     

    Data scientists spend a lot of time thinking through the methods they use to analyze their data, but often don’t spend enough time auditing the data itself.

  • Visualization is a powerful tool for detecting problems with data.

     

    If we’d just looked at the raw data, or summary statistics like mean or median, then we might have missed data quality problems in the examples above (especially the last one). Visualizing the data in the right ways made problems immediately apparent.

  • In isolation, a single anomalous data point can look reasonable.

     

    Some anomalies only show up on an aggregate basis. In the Polish exam data, every anomalous data point looked identical to a valid one, until we looked at the entire distribution.

Have other examples of anomalous data? Let us know!

Notes

[1] Outliers don’t have to be far away from the mean or median of a distribution. It’s possible for a multidimensional dataset to have multivariate outliers, or outliers that only appear strange when looking at multiple dimensions. These aren’t necessarily far away from the mean / median of any single dimension.

Ravi Parikh

Was this helpful?
PreviousNext

Related Stories

See All

  • Google Analytics 4

    Product Insights

    Google Analytics 4: What it promises, and what that really means

    April 28, 2022

  • Heap.io

    How to

    The 3 key first steps to improving CRO

    March 29, 2023

  • Heap.io

    Data Stories

    Celebrating H&R Block as the inaugural winner of the Digital Innovator Award

    March 22, 2023

Subscribe

Sign up to stay on top of the latest posts.

Better insights. Faster.

Request Demo
  • Platform
  • Capture
  • Enrichment
  • Integrations
  • Governance
  • Security & Privacy
  • Infrastructure
  • Illuminate
  • Segments
  • Charts
  • Dashboards
  • Playbooks
  • Use Cases
  • Funnel Optimization
  • Product Adoption
  • User Behavior
  • Product Led Growth
  • Customer 360
  • SaaS
  • eCommerce
  • Financial Services
  • Why Heap
  • The Digital Insights Platform
  • How Heap Works
  • How Heap Compares
  • The Future of Insights
  • Resources
  • Blog
  • Content Library
  • Events
  • Topics
  • Heap University
  • Community
  • Professional Services
  • Company
  • About
  • Partners
  • Press
  • Careers
  • Customers
  • Support
  • Request Demo
  • Help Center
  • Contact Us
  • Pricing
  • Social
  • Twitter
  • Facebook
  • LinkedIn
  • YouTube

© 2023 Heap Inc. All Rights Reserved.

  • Legal
  • Privacy Policy
  • Status
  • Trust