Anscombe's Quartet, and Why Summary Statistics Don't Tell the Whole Story

Ravi Parikh
April 21, 2014

Let’s say we’re looking at a spreadsheet of our customers. We have data about how many times they’ve logged in, how much revenue we’ve earned from them, etc. We can immediately calculate several compelling summary statistics: what’s the average number of logins per customer? What’s the average revenue? What’s the correlation between number of logins and revenue?

Summary statistics allow us to describe a vast, complex dataset using just a few key numbers. This gives us something easy to optimize against and use as a barometer for our business.

But there’s a danger in relying only on summary statistics and ignoring the overall distribution. We took a look at this earlier as it relates to average revenue per user. In this article, we’re going to dive deeper into how summary statistics can be misleading. Calculating summary statistics, while useful, should only be one piece of your data analysis pipeline.

Anscombe’s Quartet

Perhaps the most elegant demonstration of the dangers of summary statistics is Anscombe’s Quartet. It’s a group of four datasets that appear to be similar when using typical summary statistics, yet tell four different stories when graphed. Each dataset consists of eleven (x,y) pairs as follows:

       I               II              III              IV
   x      y        x      y        x      y        x      y
 10.0    8.04    10.0    9.14    10.0    7.46     8.0    6.58
  8.0    6.95     8.0    8.14     8.0    6.77     8.0    5.76
 13.0    7.58    13.0    8.74    13.0   12.74     8.0    7.71
  9.0    8.81     9.0    8.77     9.0    7.11     8.0    8.84
 11.0    8.33    11.0    9.26    11.0    7.81     8.0    8.47
 14.0    9.96    14.0    8.10    14.0    8.84     8.0    7.04
  6.0    7.24     6.0    6.13     6.0    6.08     8.0    5.25
  4.0    4.26     4.0    3.10     4.0    5.39    19.0   12.50
 12.0   10.84    12.0    9.13    12.0    8.15     8.0    5.56
  7.0    4.82     7.0    7.26     7.0    6.42     8.0    7.91
  5.0    5.68     5.0    4.74     5.0    5.73     8.0    6.89

All the summary statistics you’d think to compute are close to identical:

  • The average x value is 9 for each dataset

  • The average y value is 7.50 for each dataset

  • The sample variance of x is 11 and the sample variance of y is between 4.12 and 4.13 for each dataset

  • The correlation between x and y is 0.816 for each dataset

  • A linear regression (line of best fit) for each dataset follows the equation y = 0.5x + 3

So far these four datasets appear to be pretty similar. But when we plot them on an x/y coordinate plane, we get the following results:

[Figure: scatter plots of the four Anscombe’s Quartet datasets]

Now we see the real relationships in the datasets start to emerge. Dataset I consists of a set of points that appear to follow a rough linear relationship with some variance. Dataset II fits a neat curve but doesn’t follow a linear relationship (maybe it’s quadratic?). Dataset III looks like a tight linear relationship between x and y, except for one large outlier. Dataset IV looks like x remains constant, except for one outlier as well.

Computing summary statistics or staring at the data wouldn’t have told us any of these stories. Instead, it’s important to visualize the data to get a clear picture of what’s going on.
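Even a crude plot is enough to tell the four datasets apart. The following is a minimal sketch that draws each dataset as a text scatter plot with no plotting library; the function name ascii_scatter, the grid size, and the axis ranges are our own choices for illustration, and in practice you would likely reach for a real charting tool.

```python
# Anscombe's Quartet. Datasets I-III share the same x values.
X_SHARED = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X_SHARED, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X_SHARED, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X_SHARED, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def ascii_scatter(xs, ys, width=40, height=12, x_range=(0, 20), y_range=(0, 14)):
    """Render points as '*' on a character grid and return it as one string."""
    grid = [[" "] * width for _ in range(height)]
    x_lo, x_hi = x_range
    y_lo, y_hi = y_range
    for x, y in zip(xs, ys):
        # Scale each coordinate to a grid cell; nearby points may share a cell.
        col = round((x - x_lo) / (x_hi - x_lo) * (width - 1))
        row = round((y - y_lo) / (y_hi - y_lo) * (height - 1))
        # Row 0 is the top of the printed grid, so flip the y axis.
        grid[height - 1 - row][col] = "*"
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width


for name, (xs, ys) in QUARTET.items():
    print(f"Dataset {name}")
    print(ascii_scatter(xs, ys))
    print()
```

All four plots use the same axis ranges so they can be compared directly. In the output for Dataset IV, every point but one lands in a single column.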

A Real-World Example

Let’s look at a real dataset that shows exactly how summary statistics can be dangerous.

A great example is the distribution of starting salaries for new law graduates. The National Association for Law Placement (NALP) reports that in 2012, lawyers made $80,798 on average in starting salary. However, a look at the salary distribution shows what law salaries really look like:

[Figure: distribution of starting salaries for new law graduates]

It turns out that law graduates usually fall into one of two groups. The majority of new lawyers make somewhere between $35,000 and $75,000 per year, and a sizable minority earns $160,000 per year. What we have here is a bimodal distribution: there are two peaks that arise from two distinct distributions happening within the same dataset. The $80,798 figure reported as the average falls into the trough between the two peaks, and few lawyers have salaries near that number. A much more accurate statement would be that most law graduates make around $50,000 on average, and those who go to one of the top law schools make $160,000 on average.
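To see how an average can land where almost nobody is, here is a small sketch using a synthetic sample. The numbers are invented for illustration and are not NALP data: we assume 70 salaries spread evenly between $35,000 and $75,000 and 30 salaries at exactly $160,000.

```python
from collections import Counter
from statistics import mean, median

# SYNTHETIC sample for illustration only; these are not NALP figures.
# 70 salaries spread evenly over $35,000-$75,000, plus 30 at exactly $160,000.
low_group = [35_000 + i * 40_000 // 69 for i in range(70)]
high_group = [160_000] * 30
salaries = low_group + high_group

avg = mean(salaries)
mid = median(salaries)

# How many people actually earn within $10,000 of the "average" salary?
near_avg = [s for s in salaries if abs(s - avg) <= 10_000]

print(f"mean:   ${avg:,.0f}")
print(f"median: ${mid:,.0f}")
print(f"salaries within $10,000 of the mean: {len(near_avg)} of {len(salaries)}")

# Crude text histogram in $25,000 buckets; the two peaks and the gap are visible.
buckets = Counter(s // 25_000 * 25_000 for s in salaries)
for start in range(25_000, 175_000, 25_000):
    print(f"${start:>7,} to ${start + 24_999:>7,}  " + "#" * buckets[start])
```

With this made-up mixture, the mean falls in the empty stretch between the two groups, so it describes none of the individuals in the sample. The median sits inside the lower group, which is closer to what a typical member of the sample earns, but it still hides the second peak.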

There’s also something else happening here that we wouldn’t have observed if we hadn’t plotted the data. There’s a giant spike at exactly $160,000 in starting salary, rather than a peak with some variance. Why is $160,000 such a popular number for law salaries? As it turns out, this data isn’t based on actual legal salaries, but based on what law schools report to the NALP as their students’ median starting salaries. There’s a lot of skepticism about the $160,000 figure, and third-party data shows that the distribution might not be so skewed.

Visualizing the data helped in two ways. It gave us a better picture of what realistic starting law salaries look like, and also allowed us to ask a follow-up question that exposed a potential flaw in our data.

When should you use summary statistics?

This isn’t to say that summary statistics are useless. They’re just misleading on their own. It’s important to use these as just one tool in a larger data analysis process.

Visualizing our data allows us to revisit our summary statistics and recontextualize them as needed. For example, Dataset II from Anscombe’s Quartet demonstrates a strong relationship between x and y; it just doesn’t appear to be linear. So a linear regression was the wrong tool to use there, and we can try other regressions. Eventually, we’ll be able to revise this into a model that does a great job of describing our data and has a high degree of predictive power for future observations.
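As one way to try another regression on Dataset II, the sketch below fits both a straight line and a quadratic by ordinary least squares and compares how much of the variation in y each explains (R squared). It is written in plain Python for self-containment; the helper names solve, polyfit, and r_squared are ours, and choosing a quadratic is an assumption suggested by the shape of the plot, not something the data guarantees.

```python
# Dataset II from Anscombe's Quartet.
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def solve(a, b):
    """Solve the square linear system a @ x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        # Swap in the row with the largest entry in this column for stability.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x


def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations; lowest power first."""
    size = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(size)] for i in range(size)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(size)]
    return solve(a, b)


def r_squared(xs, ys, coeffs):
    """Fraction of the variance in ys explained by the fitted polynomial."""
    mean_y = sum(ys) / len(ys)
    predicted = [sum(c * x ** p for p, c in enumerate(coeffs)) for x in xs]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


linear = polyfit(X, Y, 1)
quadratic = polyfit(X, Y, 2)
print("linear coefficients:   ", [round(c, 4) for c in linear])
print("linear R^2:            ", round(r_squared(X, Y, linear), 4))
print("quadratic coefficients:", [round(c, 4) for c in quadratic])
print("quadratic R^2:         ", round(r_squared(X, Y, quadratic), 4))
```

The straight line explains roughly two thirds of the variation in y, while the quadratic fits these eleven points almost exactly. A near-perfect fit on so few points is not proof of the true relationship, so a model like this should still be checked against new observations.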
