Using PostgreSQL Arrays The Right Way

Dan Robinson
February 24, 2014 · 2 min read

At Heap, we lean on PostgreSQL for most of the backend heavy lifting.[1] We store each event as an hstore blob, and we keep a PostgreSQL array of events done by each user we track, sorted by time. Hstore lets us attach properties to events in a flexible way, and arrays of events give us great performance, especially for funnel queries, in which we compute the drop-off between different steps in a conversion funnel.
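For concreteness, here's a hypothetical sketch of that layout (the table and column names are illustrative, not Heap's actual schema):

```sql
CREATE EXTENSION IF NOT EXISTS hstore;

-- One row per tracked user; that user's events live in a single
-- time-ordered array of hstore values.
CREATE TABLE users (
  user_id BIGINT PRIMARY KEY,
  events  HSTORE[]  -- e.g. {"event_id=>1,data=>foo", "event_id=>2,data=>bar"}
);
```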

In this post, we’ll take a look at a PostgreSQL function that unexpectedly hung on large inputs and rewrite it in an efficient, idiomatic way.

Your first instinct might be to treat arrays in PostgreSQL like their analogues in C-based languages. You might be used to manipulating data via array positions or slices. Be careful not to think this way in PostgreSQL, especially if your array type is variable length, e.g. json, text, or hstore. If you’re accessing a PostgreSQL array via its positions, you’re in for an unexpected performance blowup.

This came up a few weeks ago at Heap. We keep an array of events for each user tracked by Heap, in which we represent each event with an hstore datum. We have an import pipeline that appends new events to the right arrays. In order to make this import pipeline idempotent, we give each event an event_id entry, and we run our event arrays through a utility function that squashes duplicates. If we want to update the properties attached to an event, we just dump a new event into the pipeline with the same event_id.

So, we need a utility function that takes an array of hstores and, if two events have the same event_id, keeps the one that occurs later in the array. An initial attempt to write this function looked like this:
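A minimal sketch of that first attempt, assuming the position-based generate_subscripts approach analyzed below (the name dedupe_events_1 matches the timing chart):

```sql
-- Sketch: dedupe an hstore[] by event_id, keeping the later occurrence.
-- Positional access (events[sub]) is the pattern this post warns against.
CREATE OR REPLACE FUNCTION dedupe_events_1(events HSTORE[]) RETURNS HSTORE[] AS $$
  SELECT array_agg(event ORDER BY sub)
  FROM (
    -- Rank elements sharing an event_id by array position, descending,
    -- so the last occurrence of each event_id gets rank 1.
    SELECT events[sub] AS event, sub,
           rank() OVER (PARTITION BY (events[sub] -> 'event_id')::BIGINT
                        ORDER BY sub DESC) AS rnk
    FROM generate_subscripts(events, 1) AS sub
  ) deduped
  WHERE rnk = 1;  -- keep only the last occurrence of each event_id
$$ LANGUAGE SQL IMMUTABLE;
```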

This works, but it blows up for large inputs. It’s quadratic, and it takes almost 40 seconds for an input array of 100k elements!

[Chart: execution times for dedupe_events_1, growing quadratically with input size]

Test queries were timed on a MacBook Pro with a 2.4GHz i7 and 16GB of RAM, and generated with this script: https://gist.github.com/drob/9180760.
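For a rough local reproduction, a hypothetical harness like the following works (the linked gist is the post's actual script):

```sql
-- Build a 100k-element array in which every event_id appears twice,
-- then run the dedupe. Use psql's \timing on to see the blowup.
WITH input AS (
  SELECT array_agg(hstore('event_id', (i % 50000)::TEXT)) AS events
  FROM generate_series(1, 100000) AS i
)
SELECT array_length(dedupe_events_1(events), 1) FROM input;
```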

What’s going on here? The issue is that PostgreSQL stores an array of hstores as an array of values, not an array of pointers to values. That is, an array of three hstores looks something like

{"event_id=>1,data=>foo", "event_id=>2,data=>bar", "event_id=>3,data=>baz"}

under the hood, as opposed to

{[pointer], [pointer], [pointer]}

For types that are variable length, e.g. hstores, json blobs, varchars, or text fields, PostgreSQL has to scan the array to find the Nth element. That is, to evaluate events[2], PostgreSQL parses events from the left until it hits the second entry. Then, for events[3], it re-scans from the first index all over again until it hits the third entry! So, evaluating events[sub] is O(sub), and evaluating events[sub] for each index in the array is O(N²), where N is the length of the array.

PostgreSQL could be smarter about caching intermediate parse results, or it could parse the array once in a context like this. The real answer is for arrays of variable-length elements to be implemented with pointers to values, so that we can always evaluate events[i] in constant time.

Even so, we shouldn’t rely on PostgreSQL handling this well, since this is not an idiomatic query. Instead of generate_subscripts, we can use unnest, which parses an array and returns a set of entries. This way, we never have to explicitly index into the array.
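A sketch of the rewrite along those lines, using WITH ORDINALITY (PostgreSQL 9.4+) to recover each element's position; on older versions, row_number() OVER () can stand in:

```sql
-- Sketch: the same dedupe, but unnest parses the array exactly once.
CREATE OR REPLACE FUNCTION dedupe_events_2(events HSTORE[]) RETURNS HSTORE[] AS $$
  SELECT array_agg(event ORDER BY sub)
  FROM (
    -- WITH ORDINALITY tags each unnested element with its original position.
    SELECT event, sub,
           rank() OVER (PARTITION BY (event -> 'event_id')::BIGINT
                        ORDER BY sub DESC) AS rnk
    FROM unnest(events) WITH ORDINALITY AS u(event, sub)
  ) deduped
  WHERE rnk = 1;  -- keep only the last occurrence of each event_id
$$ LANGUAGE SQL IMMUTABLE;
```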

This is efficient, and it takes an amount of time that's linear in the size of the input array. It takes about half a second for an input of 100k elements, compared to 40 seconds with the previous implementation.

This does what we want:

  • Parse the array once, with unnest.

  • Partition by event_id.

  • Take the last occurrence for each event_id.

  • Sort by input index.

Lesson learned: if you find yourself accessing positions in a PostgreSQL array, consider unnest instead.

We like to avoid stubbing our toes. Have any feedback or other PostgreSQL tips? Shoot us a note @heap.

[1] In particular, we use a lovely tool called Citus Data. More on that in another blog post!
