Heap is a web and iOS analytics tool that automatically captures every user interaction, eliminating the need to define events upfront and allowing for flexible, retroactive analysis.
When we had the idea for Heap, it wasn’t clear whether its underlying tech would be financially tenable.
Plenty of existing tools captured every user interaction, but none offered much beyond rigid, pre-generated views of the underlying data. And plenty of tools allowed for flexible analysis (funnels, segmentation, cohorts), but only by operating on pre-defined events that represent a small subset of overall usage.
To our knowledge, no one had built: 1) ad-hoc analysis, 2) across a userbase’s entire activity stream. This was intimidating. Before we started coding, we needed to estimate an upper bound on our AWS costs with order-of-magnitude accuracy. Basically: “Is there a sustainable business model behind this idea?”
To figure this out, we started with the smallest unit of information: a user interaction.
Estimating Data Throughput
Every user interaction triggers a DOM event. We can model each DOM event as a JSON object:
target: 'div#gallery div.next',
With all the properties Heap captures, a raw event occupies ~1 kB of space.
Our initial vision for Heap was to offer users unadulterated, retroactive access to the DOM event firehose. If you could bind an event handler to it, we wanted to capture it. To estimate the rate of DOM event generation, we wrote a simple script:
var start = Date.now(),
// Find all DOM events we can bind a listener to
console.log('Average events per second: ' + eventCount / elapsed);
Try it out yourself. With steady interaction, you’ll generate ~30 DOM events per second. Frenetic activity nets ~60 events per second. That’s a lot of data, and it resulted in an immediate bottleneck: the client-side CPU and network overhead.
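A minimal sketch of such a measurement script, under stated assumptions: the helper names `bindableEventNames` and `startEventCounter` are hypothetical, and discovering events through the `on*` handler properties of `window` is one common approach, not necessarily the one Heap used.

```javascript
// Hypothetical helper: derive event names from a target's `on*` handler
// properties (e.g. `onclick` -> `click`). In a browser, pass `window`.
function bindableEventNames(target) {
  const names = [];
  for (const key in target) {
    if (key.startsWith('on') && key.length > 2) names.push(key.slice(2));
  }
  return names;
}

// Hypothetical helper: count every event of the given names that reaches
// `target`, and report the average rate once stopped.
function startEventCounter(target, eventNames) {
  const start = Date.now();
  let eventCount = 0;
  const onEvent = () => { eventCount += 1; };
  // Listen in the capture phase so events that do not bubble are still seen.
  eventNames.forEach((name) => target.addEventListener(name, onEvent, true));
  return {
    stop() {
      eventNames.forEach((name) => target.removeEventListener(name, onEvent, true));
      const elapsed = (Date.now() - start) / 1000; // seconds
      const perSecond = elapsed > 0 ? eventCount / elapsed : 0;
      return { eventCount, elapsed, perSecond };
    },
  };
}

// Browser usage (assumes a global `window`):
//   const counter = startEventCounter(window, bindableEventNames(window));
//   setTimeout(() => {
//     const { perSecond } = counter.stop();
//     console.log('Average events per second: ' + perSecond);
//   }, 10000);
```

Attaching one listener per event type is itself a source of overhead, which is consistent with the client-side bottleneck described above.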
Luckily, this activity mostly consists of low-signal data: keypress, etc. Customers don’t care about these events, nor can they meaningfully quantify them. By restricting our domain to high-signal events (change, push state events, page views), we can reduce our throughput by almost two orders of magnitude with negligible impact on data fidelity.
With this subset of events, we found via manual testing that sessions rarely generate more than 1 event per second. We can use this as a comfortable upper bound. And how long is the average session? In 2011, Google Analytics provided aggregate usage benchmarks, and their latest figures claimed an average session lasted about 5 minutes and 23 seconds.
Note that the estimate above is the most brittle step of our analysis. It fails to account for the vast spectrum in activity across different classes of apps (playing a game of Cookie Clicker is more input-intensive than reading an article on The Economist). But we’re not striving for perfect accuracy. We just need to calculate an upper bound on cost that’s within the correct order of magnitude.
By multiplying the values above (1 event per second × 323 seconds × ~1 kB per event), we find that a typical web session generates at most about 323 kB of raw, uncompressed data.
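The multiplication can be written out as a short sketch. The figures come from the text; the constant and function names are illustrative.

```javascript
// Figures from the text; names are illustrative.
const EVENT_SIZE_KB = 1;             // ~1 kB per raw event
const EVENTS_PER_SECOND = 1;         // upper bound for high-signal events
const SESSION_SECONDS = 5 * 60 + 23; // 5 min 23 s average session

function sessionSizeKb(eventsPerSecond, sessionSeconds, eventSizeKb) {
  return eventsPerSecond * sessionSeconds * eventSizeKb;
}

const highSignalKb = sessionSizeKb(EVENTS_PER_SECOND, SESSION_SECONDS, EVENT_SIZE_KB);
console.log(highSignalKb + ' kB per session'); // 323 kB per session

// For contrast, the unfiltered firehose at ~30 events per second:
const firehoseKb = sessionSizeKb(30, SESSION_SECONDS, EVENT_SIZE_KB);
console.log(firehoseKb + ' kB per session'); // 9690 kB per session
```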
Architectural Assumptions and AWS
We have a sense of the total data generated by a session, but we don’t know the underlying composition. How much of this data lives in RAM? On SSD? On spinning disks?
To estimate, we made a few assumptions about our nascent infrastructure, making sure to err on the side of over-performance and increased costs:
Queries need to be fast. Because lots of data would be accessed in an ad-hoc fashion, we presumed our cluster would be I/O bound. Thus, we intended to keep as much of the working set in memory as possible.
Therefore, the last month of data needs to live in RAM. We assumed the lion’s share of analysis would take place on recent data. These queries need to be snappy, and the simplest way of ensuring snappiness is by throwing it all into memory. An aggressive goal, but not unreasonable.
Data older than a month needs to live on SSDs. Given AWS’s reputation for fickle I/O, we made the assumption that spinning disks wouldn’t suffice, on either EBS or ephemeral stores. Provisioned IOPS helps, but offers a maximum throughput of 4k IOPS per volume, which is far less than the 10k-100k IOPS we measured with SSDs.
We need to use on-demand instances for everything. If the business model only works with (cheaper) 1-year or 3-year reserved instances, then we’d need to commit much more capital upfront. We’d likely be cash-flow negative from day 1, thereby increasing the company’s risk and forcing us to raise more money. We also needed to assume any early-stage architecture would be in constant flux.
With AWS’s on-demand instances, we identified several storage options. (Note that the new I2 instances didn’t exist yet.)
RAM on High-Memory Quadruple Extra Large, which offers the cheapest cost/memory ratio at $10.33/GB/month.
SSD on High I/O Quadruple Extra Large, which offers the cheapest cost/SSD ratio at $1.09/GB/month.
Spinning disk on EBS or S3, at $0.10/GB/month.
The difference in cost across these options is stark.
Amazon’s pricing page is frustratingly inconducive to price analysis, so we consulted the always-wonderful ec2instances.info.
RAM is an order of magnitude more expensive than SSDs, which in turn are an order of magnitude more expensive than spinning disks. Each drop-off is roughly 10x ($10.33 to $1.09 to $0.10 per GB per month). Because memory is the dominant factor in our analysis, we can simplify calculations by focusing exclusively on the expected cost of RAM.
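As a quick check of those ratios, here is a small sketch using the on-demand prices quoted above. The object and function names are illustrative.

```javascript
// On-demand prices quoted in the text, in USD per GB per month.
const PRICE_PER_GB_MONTH = { ram: 10.33, ssd: 1.09, disk: 0.10 };

// Ratio between the per-GB prices of two storage tiers.
function priceRatio(expensive, cheap) {
  return PRICE_PER_GB_MONTH[expensive] / PRICE_PER_GB_MONTH[cheap];
}

console.log('RAM vs SSD:  ' + priceRatio('ram', 'ssd').toFixed(1) + 'x');
console.log('SSD vs disk: ' + priceRatio('ssd', 'disk').toFixed(1) + 'x');
```

Both ratios land close to 10, which is why treating RAM as the dominant cost is a reasonable simplification for an order-of-magnitude estimate.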
After calculating the expected size of a visit and the price of RAM, we estimated a cost of (323 kB/visit) × ($0.0000103/kB/month, i.e. $10.33/GB/month) = $0.0033 (0.33 cents) per visit per month. Put another way: for Heap’s business model to work, a visit needs to offer on average one-third of a cent of value to our customers.
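The same calculation as a sketch, assuming decimal units (1 GB = 1,000,000 kB), which matches the $0.0000103/kB figure. The names and the one-million-visit customer are illustrative.

```javascript
const SESSION_KB = 323;             // upper-bound session size from the text
const RAM_USD_PER_GB_MONTH = 10.33; // on-demand RAM price from the text
const KB_PER_GB = 1e6;              // assumption: decimal units

function ramCostPerVisit(sessionKb, usdPerGbMonth) {
  return sessionKb * (usdPerGbMonth / KB_PER_GB);
}

const usdPerVisit = ramCostPerVisit(SESSION_KB, RAM_USD_PER_GB_MONTH);
console.log('$' + usdPerVisit.toFixed(4) + ' per visit per month'); // $0.0033 per visit per month

// Hypothetical customer with 1,000,000 visits per month:
console.log('$' + (usdPerVisit * 1e6).toFixed(0) + ' per month');
```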
With this figure, we reached out to a range of companies – small to medium-sized, web and mobile, e-commerce/SaaS/social – and based on their monthly visits, explicitly asked each one “Would you pay $X to eliminate most manual event instrumentation?” Their enthusiastic responses gave us the confidence to start coding.
This estimate was indeed within the correct order of magnitude. But as our pricing page shows, we charge quite a bit less than 0.33 cents per visit. We aren’t burning money with each visit. Our estimates were just a bit off.
A few unforeseen factors reduced costs:
Compression. The complexity of an app or site’s markup doesn’t matter: when users click, they tend to click on the same things. This creates a lot of redundancy in our data set. In fact, we’ve seen a compression factor of up to 5x when storing data via Postgres.
CPU. Our queries involve a large amount of string processing and data decompression. Much to our surprise, this caused our queries to become CPU-bound. Instead of spending more money on RAM, we could achieve equivalent performance with SSDs (which are far cheaper). Though we also needed to shift our costs towards more CPU cores, the net effect was favorable.
Reserved Instances. Given the medium-term maturity of our infrastructure, we decided to migrate our data from on-demand instances to 1-year reserved instances. Our instances are heavily utilized, with customers sending us a steady stream of queries throughout the day. Per the EC2 pricing page, this yields 65% yearly savings.
On the other hand, there were a couple of unexpected factors that inflated costs:
AWS Bundling. By design, no single instance type on AWS strictly dominates another. For example, if you decide to optimize for cost of memory, you may initially choose cr1.8xlarge instances (with 244 GB of RAM). But you’ll soon find yourself outstripping its paltry storage (240 GB of SSD), in which case you’ll need to switch to hs1.8xlarge instances, which offer more disk space but at a less favorable cost/memory ratio. This makes it difficult to squeeze savings out of our AWS setup.
Data Redundancy. This is a necessary feature of any fault-tolerant, highly-available cluster. Each live data point needs to be duplicated, which increases costs across the board by 2x.
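To get a feel for how these factors pull against each other, here is a rough sketch. It is illustrative only: it assumes the quantified factors (compression, reserved-instance savings, redundancy) combine multiplicatively and independently, which the analysis above does not claim, and it ignores the shift from RAM to SSD and CPU as well as AWS bundling, neither of which is quantified here. The function name and option names are hypothetical.

```javascript
// Illustrative only: assumes the factors multiply independently.
function adjustedCostPerVisit(baseUsd, options = {}) {
  const { compression = 1, reservedSavings = 0, redundancy = 1 } = options;
  return (baseUsd * redundancy * (1 - reservedSavings)) / compression;
}

const baseUsd = 0.0033; // original RAM-only estimate per visit per month

// Best case using the figures quoted above: up to 5x compression,
// 65% reserved-instance savings, 2x redundancy.
const bestCase = adjustedCostPerVisit(baseUsd, {
  compression: 5,
  reservedSavings: 0.65,
  redundancy: 2,
});
console.log('$' + bestCase.toFixed(5) + ' per visit per month');
```

Because the 5x compression figure is a maximum, this is a lower bound on the adjusted cost under these assumptions, not a measured number.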
Sound estimation is critical, especially for projects that contain an element of technical risk. As we’ve expanded our infrastructure and scaled to a growing userbase, we’ve found these techniques invaluable in guiding our day-to-day work.