How Postgres Audit Tables Saved Us From Taking Down Production

Matt Dupree
October 26, 2021 · 4 min read

Audit tables record changes that occur to rows in another table. They’re like commit logs for database tables, and they're typically used to figure out who made what changes when. But surprisingly, we've found them useful for keeping our distributed Postgres cluster stable.

To convince you of the stability-related value of audit tables, we'll cover how audit tables helped us avoid a serious incident. Then, we'll briefly cover how you can create audit tables in your own Postgres databases.

The almost incident

Heap is a product analytics tool that automatically captures web and mobile behavior like page views, clicks, and mobile touches. We're operating at a scale of billions of events per day, which we store across a distributed Postgres cluster.

The cluster has a coordinator node, which contains metadata about what data lives on which worker nodes. All our data is either a user or an event, so our metadata looks like this:
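Concretely, you can picture it as something like the following, where the catalog layout is assumed to be Citus-style (the post later refers to pg_dist_shard) and the shard ids and worker names are invented:

```sql
-- Illustrative only: assumes Citus's pg_dist_shard and pg_dist_shard_placement
-- catalogs; the shard ids and worker names below are made up.
SELECT logicalrelid, shardid, nodename
FROM pg_dist_shard
JOIN pg_dist_shard_placement USING (shardid)
ORDER BY shardid;

--  logicalrelid | shardid | nodename
-- --------------+---------+----------
--  users        |  102008 | worker-1
--  events       |  102009 | worker-1
--  users        |  102010 | worker-2
--  events       |  102011 | worker-2
```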

Our write path uses this metadata to determine where incoming users and events need to be written.

Notice that each node is home to a users table and a corresponding events table. Each users table is co-located with the events table that contains the events performed by those users. It's crucial that these two tables live on the same machine, so that we can join them during queries without sending data over the network. The Kafka consumers that write to these workers check that this invariant holds before writing anything to them.

This makes changing the coordinator metadata a potentially dangerous affair. If we accidentally violate the invariant, the consumers won't ingest any more data until the metadata is fixed. This is a problem because we want our users to be able to answer questions about how their product is being used in real time; they shouldn't have to wait hours for their event data to show up in Heap analyses. We call this delay "ingestion latency," and we strive to keep it on the order of a minute or two.

Although changing the coordinator metadata is potentially dangerous, it's unavoidable. Sometimes we need to re-balance data across the cluster, or, in our case, perform a schema migration across many shards in the cluster. It was during a schema migration that we nearly caused some severe ingestion latency.

We wanted to perform our schema migration on our distributed events table with no down-time and no ingestion latency. Since you can’t write to a table while changing its schema, we decided that for each shard, we should:

  1. Copy data to a new shard with the new schema on the same worker

  2. Add the new shard id to the coordinator metadata table

  3. Drop the old shard from the coordinator metadata table and the worker
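As a rough sketch, steps 2 and 3 amount to something like the following against a Citus-style pg_dist_shard catalog. The real migration was driven by Node code, and the shard ids and offset here are invented:

```sql
-- Hypothetical sketch of steps 2 and 3; shard ids and the offset are invented.
BEGIN;

-- 2. Register the new shard (same data, new schema) by deriving its metadata
--    from the old shard's row. Placement metadata needs the same treatment.
INSERT INTO pg_dist_shard (logicalrelid, shardid, shardstorage, shardminvalue, shardmaxvalue)
SELECT logicalrelid, shardid + 100000, shardstorage, shardminvalue, shardmaxvalue
FROM pg_dist_shard
WHERE shardid = 102008;

-- 3. Drop the old shard's metadata (the old shard table itself is dropped on the worker).
DELETE FROM pg_dist_shard WHERE shardid = 102008;

COMMIT;
```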

Unfortunately, the Node code we wrote to do this had a bug: it treated a string as if it were a number. After migrating a shard to the new schema, we created the new shard id by adding a fixed offset to the old shardid, and then dropped the entry for the old shard from the coordinator. However, because the shardid was a string, adding the offset concatenated the two values (in JavaScript, '12345' + 100 yields '12345100', not 12445), which in turn produced a new shardid that the coordinator rejected because it was too large to fit into a bigint.

Although we use TypeScript to mitigate this class of bug, when you're working with external systems it's easy to end up in a situation where the types are lying to you. Fortunately, our code didn't delete any shard data unless metadata for the migrated shard had been added to the coordinator. Unfortunately, because of this bug, the coordinator metadata still wound up in a state where it was missing event shard metadata for a particular worker.

A few minutes later, the on-call engineer was paged with a link to our ingestion latency metrics:

[Chart: ingestion latency spiking]

Ingestion latency was spiking, but after we discovered the bug in our code, we were able to quickly reverse the removal of these shard metadata entries using our audit tables. The SQL to restore the deleted metadata looked a little like this:
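A hedged approximation, assuming an audit table on pg_dist_shard like the one described in the next section (the table name, columns, and time window are ours, not Heap's exact SQL):

```sql
-- Approximate restore: re-insert the coordinator metadata rows that the buggy
-- migration deleted, using the rows captured in the audit table.
INSERT INTO pg_dist_shard (logicalrelid, shardid, shardstorage, shardminvalue, shardmaxvalue)
SELECT logicalrelid, shardid, shardstorage, shardminvalue, shardmaxvalue
FROM pg_dist_shard_audit
WHERE operation = 'DELETE'
  AND changed_at > now() - interval '1 hour';
```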

Without the audit table, this minor incident would have been far more severe: it would have been much harder to track down which shard entries were deleted and to quickly add them back. Logs would have been our only way to fix this, and we can all imagine how painful it would have been to reconstruct the correct table from logs.

How to create audit tables

To make an audit table, start by creating an ordinary table with:

  1. Columns that match the columns of the table you want to audit

  2. Some additional columns that describe the changes made to the audited table.

For example, an audit table for the above pg_dist_shard table would look like this:
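Something along these lines. The audited columns mirror Citus's pg_dist_shard; the bookkeeping columns and names are one reasonable choice, not necessarily Heap's exact schema:

```sql
-- A minimal audit table for pg_dist_shard. The first five columns mirror the
-- audited table; the last three describe the change itself.
CREATE TABLE pg_dist_shard_audit (
    -- columns copied from pg_dist_shard
    logicalrelid  regclass NOT NULL,
    shardid       bigint   NOT NULL,
    shardstorage  "char"   NOT NULL,
    shardminvalue text,
    shardmaxvalue text,
    -- columns describing the change
    operation     text        NOT NULL,                      -- 'INSERT', 'UPDATE', or 'DELETE'
    changed_by    text        NOT NULL DEFAULT current_user,
    changed_at    timestamptz NOT NULL DEFAULT now()
);
```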

The trigger procedure would look like this:
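Here is a sketch in PL/pgSQL; the function name matches the example above, and the real procedure may differ in detail:

```sql
-- On every change to pg_dist_shard, copy the affected row into the audit table.
CREATE OR REPLACE FUNCTION audit_pg_dist_shard() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        -- For deletes, the interesting row is the one being removed.
        INSERT INTO pg_dist_shard_audit
            (logicalrelid, shardid, shardstorage, shardminvalue, shardmaxvalue, operation)
        VALUES
            (OLD.logicalrelid, OLD.shardid, OLD.shardstorage, OLD.shardminvalue, OLD.shardmaxvalue, TG_OP);
        RETURN OLD;
    ELSE
        -- For inserts and updates, record the row as it will exist after the change.
        INSERT INTO pg_dist_shard_audit
            (logicalrelid, shardid, shardstorage, shardminvalue, shardmaxvalue, operation)
        VALUES
            (NEW.logicalrelid, NEW.shardid, NEW.shardstorage, NEW.shardminvalue, NEW.shardmaxvalue, TG_OP);
        RETURN NEW;
    END IF;
END;
$$ LANGUAGE plpgsql;
```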

This code uses a few special variables that are automatically defined for triggers:
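  • TG_OP: a string naming the operation that fired the trigger: 'INSERT', 'UPDATE', or 'DELETE'.
  • OLD: the row as it existed before the change (available for UPDATE and DELETE).
  • NEW: the row as it will exist after the change (available for INSERT and UPDATE).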

After defining a trigger like the one above, we can wire it up with the following SQL:
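Something like this, where the trigger name is illustrative (EXECUTE FUNCTION requires Postgres 11 or later; older versions use EXECUTE PROCEDURE):

```sql
-- Fire the audit function after every row-level change to pg_dist_shard.
CREATE TRIGGER pg_dist_shard_audit_trigger
    AFTER INSERT OR UPDATE OR DELETE ON pg_dist_shard
    FOR EACH ROW EXECUTE FUNCTION audit_pg_dist_shard();
```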

Conclusion

So, audit tables aren’t just useful for seeing who changed what in your database. They also make it easy to undo and/or redo changes that may cause issues downstream in your system. If that sounds appealing and you’re okay with the overhead of introducing a few triggers to the table you’re auditing, consider creating audit tables in your own system. They’re a great fit for metadata tables like these, which are small and contain key information.

If you like nerding out about underrated Postgres features, @-me on Twitter @philosohacker, and if you like working on large data systems, we’re hiring! Check out our team and open roles.

I’d like to thank John Krauss and Daniel Robinson for their essential feedback on earlier drafts of this post.
