NEW: Heap for mobile. Track every interaction, on every platform.

Learn more
skip to content
Loading...
    • The Digital Insights Platform Transform your digital experience
    • How Heap Works A video guide
    • How Heap Compares Heap vs. competitors
    • The Future of Insights A comic book guide
  • Data Insights

    • Session Replay Complete context with a single click
    • Illuminate Data science that pinpoints unknown friction
    • Journeys Visual maps of all user flows

    Data Analysis

    • Segments User cohorts for actionable insights
    • Dashboards Share insights on critical metrics
    • Charts Analyze everything about your users
    • Playbooks Plug-and-play templates and analyses

    Data Foundation

    • Capture Automatic event tracking and apis
    • Mobile Track and analyze your users across devices
    • Enrichment Add context to your data
    • Integrations Connect bi-directionally to other tools

    Data Management

    • Governance Keep data clean and trusted
    • Security & Privacy Security and compliance made simple
    • Infrastructure How we build for scale
    • Heap Connect Send Heap data directly to your warehouse
  • Solutions

    • Funnel Optimization Improve conversion in user flows
    • Product Adoption Maximize adoption across your site
    • User Behavior Understand what your users do
    • Product Led Growth Manage PLG with data

    Industries

    • SaaS Easily improve acquisition, retention, and expansion
    • eCommerce Increase purchases and order value
    • Financial Services Raise share of wallet and LTV

    Heap For Teams

    • Product Teams Optimize product activation, conversion and retention
    • Marketing Teams Optimize acquisition performance and costs
    • Data Teams Optimize behavioral data without code
  • Pricing
  • Support

    • Heap University Video Tutorials
    • Help Center How to use Heap
    • Heap Plays Tactical how-to guides
    • Heap Updates
    • Professional Services

    Resources

    • Blog A community for digital builders
    • Content Library Ebooks, whitepapers, videos, guides
    • Press News from and about Heap
    • Webinars & Events Virtual and live events
    • Careers Join us

    Ecosystem

    • Customer Community Join the conversation
    • Partners Technology and Solutions Partners
    • Developers
    • Customers Over 8,000 successful companies
  • Free TrialRequest Demo
  • Log In
  • Free Trial
  • Request Demo
  • Log In

All Blogs

Engineering

Speeding Up PostgreSQL With Partial Indexes

Dan Robinson
December 10, 20145 min read
  • Facebook
  • Twitter
  • LinkedIn

Did you know PostgreSQL supports indexing a subset of your table? This enables very fast reads from that subset with almost no index overhead. It’s often the best way to index your data if you want to repeatedly analyze rows that match a given WHERE clause. This makes PostgreSQL a great fit for workflows that involve pre-aggregation with additional ad hoc analysis. In this post, I’ll walk through an example query optimization in which partial indexes are well suited.

Consider a table of events with the following schema:

Each event is associated with a user and has an ID, a timestamp, and a JSON representation of the event. The JSON includes the page path, the event type (e.g. click, page view, form submission), and any other properties that describe the event.

We can use this table to store many different kinds of events, and let’s assume we have some automatic event tracking in place that logs every click, page view, and form submission so that we can analyze it later. We might want to have an internal dashboard that displays some high-value metrics, such as the number of weekly signups or the amount of revenue we are collecting per day. The events that are relevant to this dashboard make up a small fraction of this table — a very small percentage of clicks on your website are to finalize purchases! But they are mixed in with the rest of the table, so our signal-to-noise ratio is low.

We would like to index our data to make these dashboard queries fast. [*1] Let’s start with a signup event, which we define to be a form submission on our /signup/ page. Getting the number of such signups in the first week of September translates to:

On a synthetic dataset with 10 million events, of which 3000 are signups, and without any indexes, this query takes 45 seconds.

Full Indexes On Single Columns: A Mixed Bag

A naive way to improve this performance is by creating single-column indexes for each of the relevant event features: (data->>'type'), (data->>'path'), and time. We can use a bitmap join between results from three indexed scans, which should be fast if the query is selective and the relevant index portions are in memory. Indeed, with these indexes in place, this query takes 200 ms initially, and 20 ms in subsequent runs on our synthetic dataset — a significant improvement over the 45 seconds required by a sequential scan.

But this indexing strategy has a few significant downsides:

  • Write overhead. We need to write to all three indexes on every INSERT / UPDATE / DELETE run against the table. [*2] For a write-heavy workload like the one in this example, that might be too expensive.

  • Limited query set. This strategy restricts our ability to define types of high-value events. It won’t work if we need something more complicated than a range query on one of those JSON fields. What if we want to match a regex, or all pageview paths that start with /signup/ but can include anything thereafter?

  • Disk use. The table in our test dataset takes up 6660 mb, and the three indexes take up a combined 1026 mb, a substantial increase in the amount of hard drive space we need to use to support this table. [*3]

Enter Partial Indexes

We’re only analyzing the 0.03% of the table that constitutes signup events, but this strategy indexes all the rows. We want to be able to run efficient queries over a tiny fraction of a table. In a case like this, a partial index is going to give us the best results.

If we index an unrelated column and restrict our index to events that match our signup definition, PostgreSQL can easily determine where the signup event rows are, and we can query over them even more efficiently than with three complete indexes on the relevant fields. In particular, consider indexing the time field, but only on the rows that match the filter for signup events. That is:

CREATE INDEX event\_signups ON event (time) WHERE (data->>'type') = 'submit' AND (data->>'path') = '/signup/' Using this index, our test query runs in 200ms initially and 2ms on subsequent runs, so there’s a performance improvement to be had if we are running this query often. More importantly, the partial index addresses all three disadvantages of the triple-index approach described above.

  • It only takes up 96 kb, a 10000x improvement over the 1026 mb required to fully index all three fields.

  • We only need to write to the partial index when new rows match our filter for signup events. Since those are 0.03% of our events, this constitutes a significant improvement in write performance: we essentially get this index for free.

  • The partial join allows for filtering with the full expressiveness of PostgreSQL. Our index’s WHERE clause can use any row-filtering expression that we could otherwise use in a query. We can use this approach to match more complicated event definitions, such as regexes, function results, or the prefix-matching example mentioned above.

Avoid Indexing Predicate Results As Booleans

Another approach I’ve seen is to attempt to index the boolean expression

(data->>'type') = 'submit' AND (data->>'path') = '/signup/' directly, with a second term for time. I.e., something like:

CREATE INDEX event\_signup\_time ON event (((data->>'type') = 'submit' AND (data->>'path') = '/signup/'), time) This is worse than either of the approaches above, as it turns out PostgreSQL’s query planner won’t figure out that our example query restricts our result to rows for which the first term in the index is true. That is, the planner doesn’t know that the WHERE clause:

WHERE (data->>'type') = 'submit' AND (data->>'path') = '/signup/' is equivalent to saying that the

((data->>'type') = 'submit' AND (data->>'path') = '/signup/') field of our index must be true. So it uses this index to restrict the time range of events in our scan, but it reads events for which the boolean expression is true as well as the events for which it is false, and then checks that condition for each row after loading it. [*4]

So, we’re reading a lot more rows from disk than necessary, and also executing a nontrivial check for each one. On our synthetic dataset, this query takes 25 seconds for the first run and 8 seconds on successive executions. Under the hood, this is slightly worse than just indexing on the time field, and it performs comparably to doing just that.

Partial indexes are a powerful way to precompute the subset of a table that matches a predicate. Judging by traffic in the #postgresql IRC, they are considerably underused. Compared to full indexes, they allow a greater range of predicates. They can be dramatically more lightweight, requiring fewer writes and less disk space, especially for highly selective filters. Partial indexes should be your default strategy if you’re repeatedly querying a table for a small slice of rows.

Love PostgreSQL? Know of a feature the world should know more about? Ping me @danlovesproofs.

If you find this kind of work interesting, we’re hiring! Check out our engineering team and open roles.

[*1] We might deal with this kind of a situation with table partitioning, in which high-value events and low-value events can live in separate child tables, but this approach is going to be unwieldy if there are many different kinds of high-value events, and we would need to repartition the table every time we wanted to add a new type of high-value event.

[*2] We may get some of the updates for free via the heap-only tuples optimization, but, at least, every INSERT and DELETE will require writing to all three indexes.

[*3] We could index all three fields in one multi-column index, e.g. on ((data->>'type'), (data->>'path'), time). That would take up 755 mb, a 26% savings, but this is still large, and the other drawbacks still apply. What’s more, this index would be less applicable to other queries we might want to run on this data, so, if we’re supporting a few different kinds of high-value events, this might not end up saving us any space.

[*4] The relevant query plan:

Additional resources:

  • What is product analytics?

  • Google analytics alternatives

Dan Robinson

Was this helpful?
PreviousNext

Related Stories

See All

  • Heap.io

    Data Stories

    Celebrating H&R Block as the inaugural winner of the Digital Innovator Award

    March 22, 2023

  • Heap.io

    Product Updates

    Introducing Heap for mobile: see Everything, Everywhere all at once

    March 14, 2023

  • Heap.io

    Data Stories

    How I shipped a mobile app without tracking and bad things™ happened

    March 15, 2023

Subscribe

Sign up to stay on top of the latest posts.

Better insights. Faster.

Request Demo
  • Platform
  • Capture
  • Enrichment
  • Integrations
  • Governance
  • Security & Privacy
  • Infrastructure
  • Illuminate
  • Segments
  • Charts
  • Dashboards
  • Playbooks
  • Use Cases
  • Funnel Optimization
  • Product Adoption
  • User Behavior
  • Product Led Growth
  • Customer 360
  • SaaS
  • eCommerce
  • Financial Services
  • Why Heap
  • The Digital Insights Platform
  • How Heap Works
  • How Heap Compares
  • The Future of Insights
  • Resources
  • Blog
  • Content Library
  • Events
  • Topics
  • Heap University
  • Community
  • Professional Services
  • Company
  • About
  • Partners
  • Press
  • Careers
  • Customers
  • Support
  • Request Demo
  • Help Center
  • Contact Us
  • Pricing
  • Social
  • Twitter
  • Facebook
  • LinkedIn
  • YouTube

© 2023 Heap Inc. All Rights Reserved.

  • Legal
  • Privacy Policy
  • Status
  • Trust