Did you know PostgreSQL supports indexing a subset of your table? This enables very fast reads from that subset with almost no index overhead. It’s often the best way to index your data if you want to repeatedly analyze rows that match a given
WHERE clause. This makes PostgreSQL a great fit for workflows that involve pre-aggregation with additional ad hoc analysis. In this post, I’ll walk through an example query optimization in which partial indexes are well suited.
Consider a table of events with the following schema:
Each event is associated with a user and has an ID, a timestamp, and a JSON representation of the event. The JSON includes the page path, the event type (e.g. click, page view, form submission), and any other properties that describe the event.
We can use this table to store many different kinds of events, and let’s assume we have some automatic event tracking in place that logs every click, page view, and form submission so that we can analyze it later. We might want to have an internal dashboard that displays some high-value metrics, such as the number of weekly signups or the amount of revenue we are collecting per day. The events that are relevant to this dashboard make up a small fraction of this table — a very small percentage of clicks on your website are to finalize purchases! But they are mixed in with the rest of the table, so our signal-to-noise ratio is low.
We would like to index our data to make these dashboard queries fast. [*1] Let’s start with a signup event, which we define to be a form submission on our /signup/ page. Getting the number of such signups in the first week of September translates to:
On a synthetic dataset with 10 million events, of which 3000 are signups, and without any indexes, this query takes 45 seconds.
Full Indexes On Single Columns: A Mixed Bag
A naive way to improve this performance is by creating single-column indexes for each of the relevant event features:
(data->>'path'), and time. We can use a bitmap join between results from three indexed scans, which should be fast if the query is selective and the relevant index portions are in memory. Indeed, with these indexes in place, this query takes 200 ms initially, and 20 ms in subsequent runs on our synthetic dataset — a significant improvement over the 45 seconds required by a sequential scan.
But this indexing strategy has a few significant downsides:
Write overhead. We need to write to all three indexes on every
DELETErun against the table. [*2] For a write-heavy workload like the one in this example, that might be too expensive.
Limited query set. This strategy restricts our ability to define types of high-value events. It won’t work if we need something more complicated than a range query on one of those JSON fields. What if we want to match a regex, or all pageview paths that start with
/signup/but can include anything thereafter?
Disk use. The table in our test dataset takes up 6660 mb, and the three indexes take up a combined 1026 mb, a substantial increase in the amount of hard drive space we need to use to support this table. [*3]
Enter Partial Indexes
We’re only analyzing the 0.03% of the table that constitutes signup events, but this strategy indexes all the rows. We want to be able to run efficient queries over a tiny fraction of a table. In a case like this, a partial index is going to give us the best results.
If we index an unrelated column and restrict our index to events that match our signup definition, PostgreSQL can easily determine where the signup event rows are, and we can query over them even more efficiently than with three complete indexes on the relevant fields. In particular, consider indexing the time field, but only on the rows that match the filter for signup events. That is:
CREATE INDEX event\_signups ON event (time) WHERE (data->>'type') = 'submit' AND (data->>'path') = '/signup/' Using this index, our test query runs in 200ms initially and 2ms on subsequent runs, so there’s a performance improvement to be had if we are running this query often. More importantly, the partial index addresses all three disadvantages of the triple-index approach described above.
It only takes up 96 kb, a 10000x improvement over the 1026 mb required to fully index all three fields.
We only need to write to the partial index when new rows match our filter for signup events. Since those are 0.03% of our events, this constitutes a significant improvement in write performance: we essentially get this index for free.
The partial join allows for filtering with the full expressiveness of PostgreSQL. Our index’s
WHEREclause can use any row-filtering expression that we could otherwise use in a query. We can use this approach to match more complicated event definitions, such as regexes, function results, or the prefix-matching example mentioned above.
Avoid Indexing Predicate Results As Booleans
Another approach I’ve seen is to attempt to index the boolean expression
(data->>'type') = 'submit' AND (data->>'path') = '/signup/' directly, with a second term for time. I.e., something like:
CREATE INDEX event\_signup\_time ON event (((data->>'type') = 'submit' AND (data->>'path') = '/signup/'), time) This is worse than either of the approaches above, as it turns out PostgreSQL’s query planner won’t figure out that our example query restricts our result to rows for which the first term in the index is
true. That is, the planner doesn’t know that the
WHERE (data->>'type') = 'submit' AND (data->>'path') = '/signup/' is equivalent to saying that the
((data->>'type') = 'submit' AND (data->>'path') = '/signup/') field of our index must be
true. So it uses this index to restrict the time range of events in our scan, but it reads events for which the boolean expression is
true as well as the events for which it is
false, and then checks that condition for each row after loading it. [*4]
So, we’re reading a lot more rows from disk than necessary, and also executing a nontrivial check for each one. On our synthetic dataset, this query takes 25 seconds for the first run and 8 seconds on successive executions. Under the hood, this is slightly worse than just indexing on the time field, and it performs comparably to doing just that.
Partial indexes are a powerful way to precompute the subset of a table that matches a predicate. Judging by traffic in the #postgresql IRC, they are considerably underused. Compared to full indexes, they allow a greater range of predicates. They can be dramatically more lightweight, requiring fewer writes and less disk space, especially for highly selective filters. Partial indexes should be your default strategy if you’re repeatedly querying a table for a small slice of rows.
Love PostgreSQL? Know of a feature the world should know more about? Ping me @danlovesproofs.
[*1] We might deal with this kind of a situation with table partitioning, in which high-value events and low-value events can live in separate child tables, but this approach is going to be unwieldy if there are many different kinds of high-value events, and we would need to repartition the table every time we wanted to add a new type of high-value event.
[*2] We may get some of the updates for free via the heap-only tuples optimization, but, at least, every
DELETE will require writing to all three indexes.
[*3] We could index all three fields in one multi-column index, e.g. on
((data->>'type'), (data->>'path'), time). That would take up 755 mb, a 26% savings, but this is still large, and the other drawbacks still apply. What’s more, this index would be less applicable to other queries we might want to run on this data, so, if we’re supporting a few different kinds of high-value events, this might not end up saving us any space.
[*4] The relevant query plan: