Speeding Up PostgreSQL With Partial Indexes

Did you know PostgreSQL supports indexing a subset of your table? This enables very fast reads from that subset with almost no index overhead. It's often the best way to index your data if you want to repeatedly analyze rows that match a given WHERE clause. This makes PostgreSQL a great fit for workflows that involve pre-aggregation with additional ad hoc analysis. In this post, I'll walk through an example query optimization for which partial indexes are well suited.

Consider a table of events with the following schema:
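
The post's schema itself isn't reproduced here, but based on the description that follows, it was presumably something along these lines (the user_id and event_id column names and the exact types are assumptions; the time and data columns match the queries used later):

CREATE TABLE event (
    user_id BIGINT,
    event_id BIGINT,
    time TIMESTAMPTZ NOT NULL,
    data JSON NOT NULL
);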

Each event is associated with a user and has an ID, a timestamp, and a JSON representation of the event. The JSON includes the page path, the event type (e.g. click, page view, form submission), and any other properties that describe the event.

We can use this table to store many different kinds of events, and let’s assume we have some automatic event tracking in place that logs every click, page view, and form submission so that we can analyze it later. We might want to have an internal dashboard that displays some high-value metrics, such as the number of weekly signups or the amount of revenue we are collecting per day. The events that are relevant to this dashboard make up a small fraction of this table — a very small percentage of clicks on your website are to finalize purchases! But they are mixed in with the rest of the table, so our signal-to-noise ratio is low.

We would like to index our data to make these dashboard queries fast.[1] Let’s start with a signup event, which we define to be a form submission on our /signup/ page. Getting the number of such signups in the first week of September translates to:
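
A sketch of that query (the exact September date range is an assumption):

SELECT COUNT(*)
FROM event
WHERE (data->>'type') = 'submit'
  AND (data->>'path') = '/signup/'
  AND time BETWEEN '2014-09-01' AND '2014-09-08';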

On a synthetic dataset with 10 million events, of which 3000 are signups, and without any indexes, this query takes 45 seconds.

Full Indexes On Single Columns: A Mixed Bag

A naive way to improve this performance is to create a single-column index for each of the relevant event features: (data->>'type'), (data->>'path'), and time. PostgreSQL can then combine the results of three index scans with a bitmap AND, which should be fast if the query is selective and the relevant portions of the indexes are in memory. Indeed, with these indexes in place, this query takes 200 ms initially, and 20 ms in subsequent runs on our synthetic dataset — a significant improvement over the 45 seconds required by a sequential scan.
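
For example (the index names are illustrative; the indexed expressions come from the query above):

CREATE INDEX event_type ON event ((data->>'type'));
CREATE INDEX event_path ON event ((data->>'path'));
CREATE INDEX event_time ON event (time);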

But this indexing strategy has a few significant downsides:

  • Write overhead. We need to write to all three indexes on every INSERT/UPDATE/DELETE run against the table.[2] For a write-heavy workload like the one in this example, that might be too expensive.
  • Limited query set. This strategy restricts our ability to define types of high-value events. It won’t work if we need something more complicated than a range query on one of those JSON fields. What if we want to match a regex, or all pageview paths that start with /signup/ but can include anything thereafter?
  • Disk use. The table in our test dataset takes up 6660 MB, and the three indexes take up a combined 1026 MB, a substantial increase in the amount of hard drive space we need to use to support this table.[3]

Enter Partial Indexes

We’re only analyzing the 0.03% of the table that constitutes signup events, but this strategy indexes all the rows. We want to be able to run efficient queries over a tiny fraction of a table. In a case like this, a partial index is going to give us the best results.

If we index an unrelated column and restrict our index to events that match our signup definition, PostgreSQL can easily determine where the signup event rows are, and we can query over them even more efficiently than with three complete indexes on the relevant fields. In particular, consider indexing the time field, but only on the rows that match the filter for signup events. That is:

CREATE INDEX event_signups ON event (time)
WHERE (data->>'type') = 'submit' AND (data->>'path') = '/signup/'

Using this index, our test query runs in 200 ms initially and 2 ms on subsequent runs, so there's a performance improvement to be had if we are running this query often. More importantly, the partial index addresses all three disadvantages of the triple-index approach described above.

  • It only takes up 96 kB, a 10,000x improvement over the 1026 MB required to fully index all three fields.
  • We only need to write to the partial index when new rows match our filter for signup events. Since those are 0.03% of our events, this constitutes a significant improvement in write performance: we essentially get this index for free.
  • The partial index allows for filtering with the full expressiveness of PostgreSQL. Our index's WHERE clause can use any row-filtering expression that we could otherwise use in a query. We can use this approach to match more complicated event definitions, such as regexes, function results, or the prefix-matching example mentioned above.

Avoid Indexing Predicate Results As Booleans

Another approach I’ve seen is to attempt to index the boolean expression

(data->>'type') = 'submit' AND (data->>'path') = '/signup/'

directly, with a second term for time. I.e., something like:

CREATE INDEX event_signup_time ON event
(((data->>'type') = 'submit' AND (data->>'path') = '/signup/'), time)

This is worse than either of the approaches above, as it turns out PostgreSQL’s query planner won’t figure out that our example query restricts our result to rows for which the first term in the index is true. That is, the planner doesn’t know that the WHERE clause:

WHERE (data->>'type') = 'submit' AND (data->>'path') = '/signup/'

is equivalent to saying that the

((data->>'type') = 'submit' AND (data->>'path') = '/signup/')

field of our index must be true. So it uses this index to restrict the time range of events in our scan, but it reads events for which the boolean expression is true as well as the events for which it is false, and then checks that condition for each row after loading it.[4]

So, we're reading a lot more rows from disk than necessary, and also executing a nontrivial check for each one. On our synthetic dataset, this query takes 25 seconds for the first run and 8 seconds on successive executions. In effect, this is slightly worse than just indexing the time field: we maintain a more complicated index but get comparable performance.

Partial indexes are a powerful way to precompute the subset of a table that matches a predicate. Judging by traffic in the #postgresql IRC channel, they are considerably underused. Compared to full indexes, they allow a greater range of predicates, and they can be dramatically more lightweight, requiring fewer writes and less disk space, especially for highly selective filters. Partial indexes should be your default strategy if you're repeatedly querying a table for a small slice of its rows.

Love PostgreSQL? Know of a feature the world should know more about? Ping me @danlovesproofs.

Interested in building systems that make powerful technology easy to use? Shoot us a note at jobs@heapanalytics.com.

[1] We might deal with this kind of situation with table partitioning, in which high-value events and low-value events live in separate child tables. But this approach becomes unwieldy if there are many different kinds of high-value events, and we would need to repartition the table every time we wanted to add a new type of high-value event.
[2] We may get some of the updates for free via the heap-only tuples optimization, but, at least, every INSERT and DELETE will require writing to all three indexes.
[3] We could index all three fields in one multi-column index, e.g. on ((data->>'type'), (data->>'path'), time). That would take up 755 MB, a 26% savings, but this is still large, and the other drawbacks still apply. What's more, this index would be less applicable to other queries we might want to run on this data, so, if we're supporting a few different kinds of high-value events, this might not end up saving us any space.
[4] The relevant query plan:

PostgreSQL’s Powerful New Join Type: LATERAL

PostgreSQL 9.3 has a new join type! Lateral joins arrived without a lot of fanfare, but they enable some powerful new queries that were previously only tractable with procedural code. In this post, I’ll walk through a conversion funnel analysis that wouldn’t be possible in PostgreSQL 9.2.

What is a LATERAL join?

The best description in the documentation comes at the bottom of the list of FROM clause options:

The LATERAL key word can precede a sub-SELECT FROM item. This allows the sub-SELECT to refer to columns of FROM items that appear before it in the FROM list. (Without LATERAL, each sub-SELECT is evaluated independently and so cannot cross-reference any other FROM item.)

When a FROM item contains LATERAL cross-references, evaluation proceeds as follows: for each row of the FROM item providing the cross-referenced column(s), or set of rows of multiple FROM items providing the columns, the LATERAL item is evaluated using that row or row set’s values of the columns. The resulting row(s) are joined as usual with the rows they were computed from. This is repeated for each row or set of rows from the column source table(s).

This is a bit dense. Loosely, it means that a LATERAL join is like a SQL foreach loop, in which PostgreSQL will iterate over each row in a result set and evaluate a subquery using that row as a parameter.

What can we do with this?

Consider a table of click events with the following schema:
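
The schema isn't reproduced here either, but based on the description below, it was presumably something like this sketch (the event_id column name and exact types are assumptions; user_id, time, and data match the queries later in the post):

CREATE TABLE event (
    user_id BIGINT,
    event_id BIGINT,
    time TIMESTAMPTZ NOT NULL,
    data JSON NOT NULL
);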

Each event is associated with a user and has an ID, a timestamp, and a JSON blob with the event’s properties. At Heap, these properties might include the DOM hierarchy of a click, the window title, the session referrer, and so forth.

Let’s say we want to optimize our landing page to increase signups. The first step is to figure out where we’re losing users in our conversion funnel.

An example conversion funnel between four steps in a signup flow.

We'll assume that we've instrumented our frontend to log events along this flow and that all of the data lives in the event table specified above.[1] As an initial question, let's figure out how many people view our homepage and what percentage of them enter a credit card within two weeks of that initial homepage view. If we were using an older version of PostgreSQL, we might write some custom functions in PL/pgSQL, PostgreSQL's built-in procedural language. But, in 9.3, we can use a lateral join to compute this in one efficient query, with no extensions or PL/pgSQL.
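
The query itself isn't reproduced here, but it presumably looked roughly like this sketch (the aliases, the interval arithmetic, and the view_homepage_time and enter_credit_card_time column names are assumptions; the event type names come from the text):

SELECT
    user_id,
    view_homepage_time,
    enter_credit_card_time
FROM (
    -- The first time each user viewed the homepage.
    SELECT user_id, MIN(time) AS view_homepage_time
    FROM event
    WHERE data->>'type' = 'view_homepage'
    GROUP BY user_id
) e1
LEFT JOIN LATERAL (
    -- That user's first enter_credit_card event within two weeks of that view.
    SELECT time AS enter_credit_card_time
    FROM event
    WHERE user_id = e1.user_id
      AND data->>'type' = 'enter_credit_card'
      AND time BETWEEN e1.view_homepage_time
                   AND e1.view_homepage_time + interval '2 weeks'
    ORDER BY time
    LIMIT 1
) e2 ON true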

Nobody likes 30-line SQL queries, so let’s break this down into pieces. The first chunk of this is vanilla SQL:
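
The corresponding piece of the sketch above:

SELECT user_id, MIN(time) AS view_homepage_time
FROM event
WHERE data->>'type' = 'view_homepage'
GROUP BY user_id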

That is, get the initial time each user did a view_homepage event. Then our lateral join allows us to iterate over each resulting row and perform a parametrized version of the next subquery. This is equivalent to taking the query below and running it for each resulting row:
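
In the sketch above, that is the lateral subquery, with e1.user_id and e1.view_homepage_time supplied by the current row of the first chunk:

SELECT time AS enter_credit_card_time
FROM event
WHERE user_id = e1.user_id
  AND data->>'type' = 'enter_credit_card'
  AND time BETWEEN e1.view_homepage_time
               AND e1.view_homepage_time + interval '2 weeks'
ORDER BY time
LIMIT 1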

I.e., for each user, get the first time he or she performed the enter_credit_card event within two weeks of view_homepage_time. Because this is a lateral join, our subquery can make reference to the view_homepage_time results from the previous subquery. Otherwise, the subqueries would be evaluated independently and there would be no way to access results from one when evaluating the other.

Then we wrap the whole thing in a select, which returns something like this:

Because this is a LEFT JOIN, the query still produces result rows for users with no matching enter_credit_card event, as long as there is a view_homepage event. If we aggregate over the numerical columns, we get a tidy summary of this conversion funnel:
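
Continuing the sketch, that aggregation just replaces the outer select list with counts; COUNT over a nullable column counts only non-NULL values, which is what turns the LEFT JOIN into funnel counts (the output column names are illustrative):

SELECT
    COUNT(*) AS viewed_homepage,
    COUNT(enter_credit_card_time) AS entered_credit_card
FROM (
    SELECT user_id, MIN(time) AS view_homepage_time
    FROM event
    WHERE data->>'type' = 'view_homepage'
    GROUP BY user_id
) e1
LEFT JOIN LATERAL (
    SELECT time AS enter_credit_card_time
    FROM event
    WHERE user_id = e1.user_id
      AND data->>'type' = 'enter_credit_card'
      AND time BETWEEN e1.view_homepage_time
                   AND e1.view_homepage_time + interval '2 weeks'
    ORDER BY time
    LIMIT 1
) e2 ON true;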

… which produces:

We can add intermediate steps to this funnel with more lateral joins to evaluate which portions of our flow we should focus on improving.[2] Let’s add a use_demo step between viewing the homepage and entering a credit card.
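
A sketch of that three-step version, chaining a second lateral join (the one-week windows follow the description below; the use_demo_time alias and output column names are assumptions):

SELECT
    COUNT(*) AS viewed_homepage,
    COUNT(use_demo_time) AS used_demo,
    COUNT(enter_credit_card_time) AS entered_credit_card
FROM (
    SELECT user_id, MIN(time) AS view_homepage_time
    FROM event
    WHERE data->>'type' = 'view_homepage'
    GROUP BY user_id
) e1
LEFT JOIN LATERAL (
    -- First use_demo event within a week of the homepage view.
    SELECT time AS use_demo_time
    FROM event
    WHERE user_id = e1.user_id
      AND data->>'type' = 'use_demo'
      AND time BETWEEN e1.view_homepage_time
                   AND e1.view_homepage_time + interval '1 week'
    ORDER BY time
    LIMIT 1
) e2 ON true
LEFT JOIN LATERAL (
    -- First enter_credit_card event within a week of using the demo.
    SELECT time AS enter_credit_card_time
    FROM event
    WHERE user_id = e1.user_id
      AND data->>'type' = 'enter_credit_card'
      AND time BETWEEN e2.use_demo_time
                   AND e2.use_demo_time + interval '1 week'
    ORDER BY time
    LIMIT 1
) e3 ON true;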

This yields:[3]

This gives us the three-step conversion funnel from viewing the homepage to using the demo within a week to entering the credit card within a week of that. From here, the expressive power of PostgreSQL allows us to drill down on these results and thoroughly analyze the performance of our website. We might follow up with:

  • Does using the demo increase the likelihood of a signup?
  • Do users who discover our homepage via an ad convert with the same likelihood as users from other sources?
  • How do these conversion rates change over time or with different A/B test variants?

The answers to these questions apply directly to product changes and can be determined in PostgreSQL now that it supports lateral joins.

Without lateral joins, we would need to resort to PL/pgSQL to do this analysis. Or, if our data set were small, we could get away with complex, inefficient queries. In an exploratory data science use case, you might just pull your data out of PostgreSQL and analyze it with your scripting language of choice. But there is considerable power in being able to express these questions in SQL, especially if you’re wrapping it all in an approachable UI and exposing the functionality to nontechnical users.

Note that these queries can be tuned so as to be very efficient. In this example, if we create a btree index on (user_id, (data->>'type'), time), we can evaluate each funnel step for each user with a single indexed lookup (see the sketch below). If you're using SSDs, on which seeks are cheap, this might be good enough. If not, you might need to schematize your data a bit differently, but I'll leave the details of that for another post.
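
Something like (the index name is illustrative):

CREATE INDEX event_user_type_time ON event (user_id, (data->>'type'), time);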

Have a favorite new PostgreSQL feature or a neat lateral join use case? Ping me @danlovesproofs.

Interested in building systems that make powerful features easy to use? Shoot us a note at jobs@heapanalytics.com.

[1] Or we can use Heap, which captures everything for us as soon as we install it! No need to write any logging code, and no risk of forgetting to log something that you’ll want to analyze later.
[2] Note that adding additional steps to a conversion funnel would be particularly easy if we were using a product like Heap, since we would already have the relevant data.
[3] The number of users with enter_credit_card events is lower in this query than in the previous one, as this query only returns enter_credit_card events for users who do so after doing the use_demo event, and 17 of the users signed up without using the demo.

Building Automated Analytics Logging for iOS Apps

Analytics is often the first tool developers add to their iOS app. A standard approach is to write logging code like this:

Manual logging calls

Let’s call this manual event-tracking. With manual event-tracking, you write logging code for each analytics event you care about. A logEvent: for signing in, a logEvent: for inviting a friend, a logEvent: for opening the Settings view, and so forth.

Here, we’ll describe a different approach: automatic event-tracking. With automatic event-tracking, client-side events are automatically logged. Usage can be analyzed without having to define events upfront, without writing code, without going through the App Store, and with virtually no performance overhead.

In this post, we’ll provide a blueprint for building automatic event-tracking into your own app. Heap’s analytics library for iOS offers automatic event-tracking, but this information should be useful to you even if you don’t use Heap.

The Problem with Manual Event-Tracking

Let’s say you’ve launched your new iOS app. You took painstaking care to craft the perfect launch. You designed a beautiful landing page, appeared in major press outlets, and generated 5,000 signups in 2 days. Woo.

But after the initial fanfare subsides, you start to think: “Where did all my visitors drop off in our signup form?” “Which source of traffic had the highest conversion rate?”

Oh, right. In the frenzy of shipping, you forgot to instrument your signup flow.

Now your next two weeks will be spent:

  1. Mourning launch data that’s forever lost.
  2. Instrumenting your signup flow with logging code.
  3. Waiting for your app to get approved by Apple.
  4. Waiting.
  5. Waiting some more for data to trickle in.
  6. Analyzing your data.
Broken logging calls

This is what happens when a bug in your code breaks logging. You lose data forever. Source.

This is the fundamental issue with manual event-tracking. Tagging each event of interest in advance is brittle and often results in stale data. And since you don’t know what you don’t know, every new question requires new logging code, resulting in a lengthy feedback loop.

How Do We Build Automatic Event-Tracking?

On web, there’s only one event abstraction, so the answer is obvious: log every DOM event.

On iOS, we have more options:

  1. Log every touch event, in the form of a UIEvent or UIGestureRecognizer.
  2. Log every call to an action method with target-action or Interface Builder.
  3. Log every appearance of a view controller.
  4. Log every notification issued through NSNotificationCenter.

A good automated framework should map as closely as possible to what developers typically log by hand. It should also capture enough context around each event that we can easily pinpoint salient interactions post-hoc.

Where do iOS developers normally place logging calls? A GitHub search for Flurry’s logEvent: method surfaces lots of code like this:

A search for analytics logging

There’s an obvious connection here between action methods and logging calls. Source.

This is only a single example, but the pattern is clear: analytics calls are typically made in action methods. If we log action methods as they occur, then we can save ourselves the overhead of sprinkling analytics code throughout our app.

(Note that Heap's iOS library not only auto-logs action methods, but also UIEvents and view controllers. The general approach is the same for all three event types, but in this blog post, we'll focus on implementing the first: action method logging.)

In UIKit, every user interaction on a control passes a sendAction:to:from:forEvent: message to the global UIApplication object. If we extend sendAction:to:from:forEvent: to log each occurrence with its parameters, we’ll have a complete picture of the user’s flow through our app.

There’s one problem, though. The sendAction:to:from:forEvent: method is a built-in Cocoa method. We can’t modify built-in methods, so how can we hope to log each occurrence?

We’ll need to use method swizzling.

What is Method Swizzling?

Simply put, method swizzling allows you to extend an existing method implementation.

Subclassing and categories are common techniques for extending methods. But method swizzling is unique in that it works without forcing you to modify existing calls to the original method and without requiring access to the original method’s implementation. This means we can modify the behavior of built-in Cocoa methods.

This is very powerful. And it’s exactly how we’ll auto-log sendAction:.

Method swizzling is made possible by Objective-C’s flexible runtime. With a single #import <objc/runtime.h>, we gain access to various low-level goodies. In particular, we’ll use method_exchangeImplementations, which (unsurprisingly) lets you swap the implementations of two methods.

The final result looks like:

Here’s what’s happening step-by-step:

  1. First, we extend UIApplication with an EventAutomator category that contains a new method heap_sendAction:. This method will replace the built-in sendAction:. (A unique prefix is crucial in order to avoid namespace collisions.)
  2. Then, in our category’s +load method, we swap our method with the original using method_exchangeImplementations. This works because the Objective-C runtime special-cases +load to be invoked for every category within a class, even if +load is defined multiple times.
  3. Finally, we implement heap_sendAction: to log the action selector argument and then call the original sendAction: method.

At first glance, heap_sendAction: appears to be infinitely recursive. But remember that we swapped the implementation of heap_sendAction: with sendAction:. Thus, at runtime, sending the heap_sendAction: message actually calls code within sendAction:.

Method swizzling can be a bit of a mind-bender.

Note that this implementation of method swizzling is incomplete. For instance, what if the method we’re swizzling is defined within a superclass? Then the swap will ultimately fail, because the original method is missing from the class.

To avoid hairy edge cases like this, just use a method swizzling library. You should have little reason to deviate from existing implementations like JRswizzle or NSHipster’s.

The Results in Action

Though we’ve only written a few lines of code, this works! Watch it in action on Coffee Timer:

Automatic event tracking in action!

Remember: we’re getting this data without having written any logging code!

Feel free to dig into the underlying source code.

Method swizzling is central to building out an automatic analytics framework. Using the code we just wrote, we won’t need to worry about defining events of interest upfront. And whenever we need to analyze a new interaction, we won’t need to go through the bother of shipping new code to the App Store.

The Heap iOS library further expands upon this idea by auto-logging other types of events, including all UIEvents and every appearance of a view controller.

Performance Optimizations

If you’re wondering whether automatic event-tracking affects your app’s performance, know this: the performance impact is negligible. Our approach needs a few refinements, though.

Here are some of the performance optimizations we implement in Heap’s iOS library. If you’re considering building your own analytics, we recommend you implement them as well.

Batch network requests.

Network requests power up a device’s cell radio. A naive approach to automatic event-tracking – where every event is transmitted as soon as it occurs – would keep the radio powered for the entirety of a session. This is wasteful.

The solution is to transmit bursts of data at fixed time intervals (say, every 15 seconds). This keeps the radio triggered for a much smaller portion of time and conserves battery life, at the cost of a minor decrease in data recency.

Using the Coffee Timer app as our testbed, we see that automatic event-tracking puts minimal strain on battery life and CPU. The graphs below profile the energy/CPU usage of two highly-active sessions. Heap is active during one of these sessions and disabled during the other.

Can you tell which set of metrics corresponds to which session? Neither can we. Even with frantic user input, our automatic event-tracker rarely exceeds 1% CPU usage.

A session with Heap enabled.
A session without Heap enabled.

Performance stats for nearly identical 1-minute sessions. Rate of activity was very high (~90 interactions).
Heap is on top, vanilla app is on bottom.

Process events off the main thread.

Method swizzling itself is lightweight, as it’s just a pointer swap at runtime. But before you send events to your servers, you’ll likely need to perform some client-side preprocessing. For instance, the Heap iOS library munges events into a URL format, tags events with metadata such as App Version, Carrier, and Device Model, and sanitizes event properties.

Don't do this work on the main thread. Instead, push processing work onto a background queue (using Grand Central Dispatch or an NSOperationQueue). This unblocks UI rendering on the main thread and ensures your app remains snappy even under heavy load.

The Future of Mobile Analytics

This approach works for iOS apps built on Cocoa. It’s extremely performant, provides a comprehensive view of a user’s activity, and saves you the hassle of writing and maintaining analytics code.

But there’s plenty to be done to make this more broadly applicable. How do we handle games built on Cocos2d or Unity? How is automatic event-tracking implemented in Android apps? We hope to answer these questions in a follow-up post, as well as open-source our own iOS integration.

Let us know what you think on Twitter. And of course, if you’d prefer to use a battle-tested automatic event-tracking tool for iOS, consider using Heap.

Creating PostgreSQL Arrays Without A Quadratic Blowup

At Heap, we lean on PostgreSQL for most of the backend heavy lifting.[1] We store each event as an hstore blob, and we keep a PostgreSQL array of events done by each user we track, sorted by time. Hstore lets us attach properties to events in a flexible way, and arrays of events give us great performance, especially for funnel queries, in which we compute the drop off between different steps in a conversion funnel.[2]

In this post, we’ll take a look at a PostgreSQL function that hangs on large inputs and rewrite it in an efficient, idiomatic way.

If you’re writing a PL/pgSQL function that returns an array, it can be tempting to create an empty array and build up results in a loop with array_append or array_cat. But, as is often the case, procedural idioms in a relational database result in bad times.

Consider an example function in which we create an array of hstores such that the entry at position i is "num=>i".
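
The function from the post isn't reproduced here, but judging by the description and the benchmark caption below, it was presumably shaped like this sketch (requires the hstore extension):

CREATE OR REPLACE FUNCTION blowup(n INTEGER) RETURNS hstore[] AS $$
DECLARE
    result hstore[] := '{}';
BEGIN
    FOR i IN 1..n LOOP
        -- Each array_append call copies the entire array built so far.
        result := array_append(result, hstore('num', i::text));
    END LOOP;
    RETURN result;
END;
$$ LANGUAGE plpgsql;

-- e.g. SELECT blowup(100000);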

Looks simple enough, but this is bad news. This takes a quadratic amount of time to run, blowing up to 36 seconds to generate an array of 100k elements.

Execution Times For blowup

Test queries were timed on a MacBook Pro with a 2.4GHz i7 and 16 GB of RAM.

What's going on here? It turns out the repeated calls to array_append cause this quadratic explosion. When we call result := array_append(result, ...), PostgreSQL allocates a new array that's wide enough for the result of the array_append call and then copies the data in. That is, array_append(array, new_element) is linear in the length of array, which makes the implementation above O(N²).

A lot of languages handle this idiom more gracefully. A common strategy is to resize the array that backs a list to double the existing size. With a list written this way, repeated appends would only require us to execute the “grow the array and copy over your data” operation a logarithmic number of times, and our amortized runtime would be linear.

So, PostgreSQL could be smarter, but this is not an idiomatic implementation, so we shouldn’t expect it to be. The correct way to do this is with array_agg — an aggregate function that takes a set and returns all of the entries as an array.
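
Something like this sketch generates the same 100k-element array in a single pass:

SELECT array_agg(hstore('num', i::text))
FROM generate_series(1, 100000) AS i;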

This is fast, and it scales linearly. It takes 300 ms to generate an array of 100k elements — a 100x reduction. This query allows PostgreSQL to generate the complete set of results before materializing any arrays, so it doesn’t need to do so more than once. PostgreSQL can compute upfront exactly how long the resulting array needs to be and only needs to do one allocation.

Lesson learned: if you find yourself calling array_append or array_cat in a loop, use array_agg instead.

When you’re working with any relational database, you should reconsider any procedural code you find yourself writing. Also, it helps to have an intuition for how long something “should” take. Generating a 100k element array (with around one megabyte of total data) shouldn’t take thirty seconds, and, indeed, it doesn’t need to.

We like learning stuff. Have any feedback or other PostgreSQL tips? Shoot us a note @heap.

[1] In particular, we use a lovely tool called Citus Data. More on that in another blog post!
[2] See: https://heapanalytics.com/features/funnels. In particular, computing a conversion funnel requires a single scan over the array of events a user has done and doesn’t require any joins.

Using PostgreSQL Arrays The Right Way

At Heap, we lean on PostgreSQL for most of the backend heavy lifting.[1] We store each event as an hstore blob, and we keep a PostgreSQL array of events done by each user we track, sorted by time. Hstore lets us attach properties to events in a flexible way, and arrays of events give us great performance, especially for funnel queries, in which we compute the drop off between different steps in a conversion funnel.[2]

In this post, we’ll take a look at a PostgreSQL function that unexpectedly hung on large inputs and rewrite it in an efficient, idiomatic way.

Your first instinct might be to treat arrays in PostgreSQL like their analogues in C-based languages. You might be used to manipulating data via array positions or slices. Be careful not to think this way in PostgreSQL, especially if your array type is variable length, e.g. json, text, or hstore. If you’re accessing a PostgreSQL array via its positions, you’re in for an unexpected performance blowup.

This came up a few weeks ago at Heap. We keep an array of events for each user tracked by Heap, in which we represent each event with an hstore datum. We have an import pipeline that appends new events to the right arrays. In order to make this import pipeline idempotent, we give each event an event_id entry, and we run our event arrays through a utility function that squashes duplicates. If we want to update the properties attached to an event, we just dump a new event into the pipeline with the same event_id.

So, we need a utility function that takes an array of hstores, and, if two events have the same event_id, it should take the one that occurs later in the array. An initial attempt to write this function looked like this:
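
That first attempt isn't reproduced here, but based on the analysis below (generate_subscripts plus positional access), it was roughly of this shape (a sketch; requires the hstore extension):

CREATE OR REPLACE FUNCTION dedupe_events_1(events hstore[]) RETURNS hstore[] AS $$
    SELECT array_agg(event ORDER BY sub)
    FROM (
        -- Index into the array for every subscript, ranking duplicates so that
        -- the last occurrence of each event_id wins.
        SELECT events[sub] AS event, sub,
               rank() OVER (PARTITION BY (events[sub] -> 'event_id') ORDER BY sub DESC) AS r
        FROM generate_subscripts(events, 1) AS sub
    ) ranked
    WHERE r = 1;
$$ LANGUAGE sql;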

This works, but it blows up for large inputs. It’s quadratic, and it takes almost 40 seconds for an input array of 100k elements!

Execution Times For dedupe_events_1

Test queries were timed on a MacBook Pro with a 2.4GHz i7 and 16 GB of RAM and generated with this script: https://gist.github.com/drob/9180760.

What's going on here? The issue is that PostgreSQL stores an array of hstores as an array of values, not an array of pointers to values. That is, an array of three hstores looks something like
{"event_id=>1,data=>foo", "event_id=>2,data=>bar", "event_id=>3,data=>baz"}
under the hood, as opposed to
{[pointer], [pointer], [pointer]}
For types that are variable length, e.g. hstores, json blobs, varchars, or text fields, PostgreSQL has to scan the array to find the Nth element. That is, to evaluate events[2], PostgreSQL parses events from the left until it hits the second entry. Then, for events[3], it re-scans from the first index all over again until it hits the third entry! So, evaluating events[sub] is O(sub), and evaluating events[sub] for each index in the array is O(N²), where N is the length of the array.

PostgreSQL could be smarter about caching intermediate parse results, or it could parse the array once in a context like this. The real answer is for arrays of variable-length elements to be implemented with pointers to values, so that we can always evaluate events[i] in constant time.

Even so, we shouldn’t rely on PostgreSQL handling this well, since this is not an idiomatic query. Instead of generate_subscripts, we can use unnest, which parses an array and returns a set of entries. This way, we never have to explicitly index into the array.
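
A sketch of the unnest-based rewrite; this version uses WITH ORDINALITY (available in PostgreSQL 9.4+) to keep each element's original position, which the post's original implementation may have handled differently:

CREATE OR REPLACE FUNCTION dedupe_events_2(events hstore[]) RETURNS hstore[] AS $$
    SELECT array_agg(event ORDER BY sub)
    FROM (
        -- unnest parses the array once; ORDINALITY preserves each element's position.
        SELECT event, sub,
               rank() OVER (PARTITION BY (event -> 'event_id') ORDER BY sub DESC) AS r
        FROM unnest(events) WITH ORDINALITY AS u(event, sub)
    ) ranked
    WHERE r = 1;
$$ LANGUAGE sql;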

This is efficient, and it takes an amount of time that’s linear in the size of the input array. It takes about a half a second for an input of 100k elements, compared to 40 seconds with the previous implementation.

This does what we want:

  • Parse the array once, with unnest.
  • Partition by event_id.
  • Take the last occurrence for each event_id.
  • Sort by input index.

Lesson learned: if you find yourself accessing positions in a PostgreSQL array, consider unnest instead.

We like to avoid stubbing our toes. Have any feedback or other PostgreSQL tips? Shoot us a note @heap.

[1] In particular, we use a lovely tool called Citus Data. More on that in another blog post!
[2] See: https://heapanalytics.com/features/funnels. In particular, computing a conversion funnel requires a single scan over the array of events a user has done and doesn’t require any joins.

The Event Visualizer

Today we’re launching the Event Visualizer, which makes analytics integration as simple as clicking around your own website. For the first time, people with zero coding knowledge can start tracking events and generate important metrics instantly.

Watch it in action below:

Some of the design goals of the Event Visualizer:

  • The easiest possible integration. There’s no need to write code. Just use the point and click interface to define events for use in funnels or graphs.
  • Instant, retroactive metrics. Forgot to log a certain interaction? Not an issue – just redefine an event via the visualizer, and it will include all historical data from day 1.
  • Data you’ll understand. Sometimes teammates assign bewildering names to events: SHOPPING_FLOW_START or Login Event v2 newest. Instead, with this tool, you can visually see which on-screen actions correspond with which events.

We built the Event Visualizer because we found that non-technical people are becoming more and more dependent on analytics to make decisions. Marketers need to measure engagement across traffic sources, product managers need to quantify feature usage, salespeople need to identify promising leads, and designers need to understand paths their users take.

But the people consuming this data aren’t the people collecting this data. As a result, companies are bottlenecked on engineers to manually instrument the right events for them.

We think we’ve overcome this bottleneck. The Event Visualizer is our take on bringing analytics to the masses.

To get started, just sign up at https://heapanalytics.com.

The Event Visualizer is currently only available for web. Tweet us @heap if you’re interested in beta-testing our iOS visualizer.

How We Estimated Our AWS Costs Before Shipping Any Code

Heap is a web and iOS analytics tool that automatically captures every user interaction, eliminating the need to define events upfront and allowing for flexible, retroactive analysis.

When we had the idea for Heap, it wasn’t clear whether its underlying tech would be financially tenable.

Plenty of existing tools captured every user interaction, but none offered much beyond rigid, pre-generated views of the underlying data. And plenty of tools allowed for flexible analysis (funnels, segmentation, cohorts), but only by operating on pre-defined events that represent a small subset of overall usage.

To our knowledge, no one had built: 1) ad-hoc analysis, 2) across a userbase’s entire activity stream. This was intimidating. Before we started coding, we needed to estimate an upper-bound on our AWS costs with order-of-magnitude accuracy. Basically: “Is there a sustainable business model behind this idea?”

To figure this out, we started with the smallest unit of information: a user interaction.

Estimating Data Throughput

Every user interaction triggers a DOM event. We can model each DOM event as a JSON object:

{
    referrer: 'https://www.google.com/search?q=banana',
    url: 'https://www.bananalytics.com/',
    type: 'click',
    target: 'div#gallery div.next',
    timestamp: 1387974232845
    ...
}

With all the properties Heap captures, a raw event occupies ~1 kB of space.

Our initial vision for Heap was to offer users unadulterated, retroactive access to the DOM event firehose. If you could bind an event handler to it, we wanted to capture it. To estimate the rate of DOM event generation, we wrote a simple script:

var start = Date.now(),
eventCount = 0;

for (var k in window) {
    // Find all DOM events we can bind a listener to
    if (k.indexOf('on') === 0) {
        window.addEventListener(k.slice(2), function(e){eventCount++});
    }
}

setInterval(function(){
    var elapsed = (Date.now() - start) / 1000;
    console.log('Average events per second: ' + eventCount / elapsed);
}, 1000);

Try it out yourself. With steady interaction, you’ll generate ~30 DOM events per second. Frenetic activity nets ~60 events per second. That’s a lot of data, and it resulted in an immediate bottleneck: the client-side CPU and network overhead.

Luckily, this activity mostly consists of low-signal data: mousemove, mouseover, keypress, etc. Customers don't care about these events, nor can they meaningfully quantify them. By restricting our domain to high-signal events – click, submit, change, pushState events, page views – we can reduce our throughput by almost two orders of magnitude with negligible impact on data fidelity.

With this subset of events, we found via manual testing that sessions rarely generate more than 1 event per second. We can use this as a comfortable upper bound. And how long is the average session? Google Analytics published aggregate usage benchmarks in 2011, and their figures put the average session at about 5 minutes and 23 seconds.

Note that the estimate above is the most brittle step of our analysis. It fails to account for the vast spectrum in activity across different classes of apps (playing a game of Cookie Clicker is more input-intensive than reading an article on The Economist). But we’re not striving for perfect accuracy. We just need to calculate an upper-bound on cost that’s within the correct order of magnitude.

Multiplying the values above (1 kB per event × 1 event per second × 323 seconds per average session), we find that a typical web session generates 323 kB of raw, uncompressed data.

Architectural Assumptions and AWS

We have a sense of the total data generated by a session, but we don’t know the underlying composition. How much of this data lives on RAM? On SSD? On spinning disks?

To estimate, we made a few assumptions about our nascent infrastructure, making sure to err on the side of over-performance and increased costs:

  1. Queries need to be fast. Because lots of data would be accessed in an ad-hoc fashion, we presumed our cluster would be I/O bound. Thus, we intended to keep as much of the working set in memory as possible.
  2. Therefore, the last month of data needs to live in RAM. We assumed the lion’s share of analysis would take place on recent data. These queries need to be snappy, and the simplest way of ensuring snappiness is by throwing it all into memory. An aggressive goal, but not unreasonable.
  3. Data older than a month needs to live in SSDs. Given AWS’s reputation for fickle I/O, we made the assumption that spinning disks wouldn’t suffice, on either EBS or ephemeral stores. Provisioned IOPS helps, but offers a maximum throughput of 4k IOPS per volume, which is far less than the 10k-100k IOPS we measured with SSDs.
  4. We need to use on-demand instances for everything. If the business model only works with (cheaper) 1-year or 3-year reserved instances, then we’d need to commit much more capital upfront. We’d likely be cash-flow negative from day 1, thereby increasing the company’s risk and forcing us to raise more money. We also needed to assume any early-stage architecture would be in constant flux.

With AWS’s on-demand instances, we identified several storage options. (Note that the new I2 instances didn’t exist yet.)

  • RAM on High-Memory Quadruple Extra Large, which offers the cheapest cost/memory ratio at $10.33/GB/month.
  • SSD on High I/O Quadruple Extra Large, which offers the cheapest cost/SSD ratio at $1.09/GB/month.
  • Spinning disk on EBS or S3, at $0.10/GB/month.

You can see a stark difference in costs across each:

Amazon’s pricing page is frustratingly inconducive to price analysis, so we consulted the always-wonderful ec2instances.info.

RAM is an order-of-magnitude more expensive than SSDs, which in turn are an order of magnitude more expensive than spinning disks. Each drop-off is almost exactly 10x. Because memory is the dominant factor in our analysis, we can simplify calculations by focusing exclusively on the expected cost of RAM.

Final Estimate

After calculating the expected size of a visit and the price of RAM, we estimated a cost of (323 kB/visit) × ($0.0000103/kB/month) = $0.0033 (0.33 cents) per visit per month. Put another way: for Heap’s business model to work, a visit needs to offer on average one-third a cent of value to our customers.

With this figure, we reached out to a range of companies – small to medium-sized, web and mobile, e-commerce/SaaS/social – and based on their monthly visits, explicitly asked each one “Would you pay $X to eliminate most manual event instrumentation?” Their enthusiastic responses gave us the confidence to start coding.

Unforeseen Factors

This estimate was indeed within the correct order of magnitude. But as our pricing page shows, we charge quite a bit less than 0.33 cents per visit. We aren’t burning money with each visit. Our estimates were just a bit off.

A few unforeseen factors reduced costs:

  1. Compression. The complexity of an app or site’s markup doesn’t matter: when users click, they tend to click on the same things. This creates a lot of redundancy in our data set. In fact, we’ve seen a compression factor of up to 5x when storing data via Postgres.
  2. CPU. Our queries involve a large amount of string processing and data decompression. Much to our surprise, this caused our queries to become CPU-bound. Instead of spending more money on RAM, we could achieve equivalent performance with SSDs (which are far cheaper). Though we also needed to shift our costs towards more CPU cores, the net effect was favorable.
  3. Reserved Instances. Given the medium-term maturity of our infrastructure, we decided to migrate our data from on-demand instances to 1-year reserved instances. Our instances are heavily utilized, with customers sending us a steady stream of queries throughout the day. Per the EC2 pricing page, this yields 65% yearly savings.

On the other hand, there were a couple of unexpected factors that inflated costs:

  1. AWS Bundling. By design, no single instance type on AWS strictly dominates another. For example, if you decide to optimize for cost of memory, you may initially choose cr1.8xlarge instances (with 244GB of RAM). But you’ll soon find yourself outstripping its paltry storage (240 GB of SSD), in which case you’ll need to switch to hs1.8xlarge instances, which offer more disk space but at a less favorable cost/memory ratio. This makes it difficult to squeeze savings out of our AWS setup.
  2. Data Redundancy. This is a necessary feature of any fault-tolerant, highly-available cluster. Each live data point needs to be duplicated, which increases costs across the board by 2x.

Sound estimation is critical, especially for projects that contain an element of technical risk. As we’ve expanded our infrastructure and scaled to a growing userbase, we’ve found these techniques invaluable in guiding our day-to-day work.

If this sort of thinking excites you, and you’re interested in building highly-scalable systems, reach out at jobs@heapanalytics.com.

Getting the details right in an interactive line graph

At Heap, we recently redesigned our line graphs with an eye towards user experience and data transparency. Line graphs are simple and well-known, and many of the changes we made may seem small or inconsequential. However, the net effect is quite dramatic, which you can see for yourself by interacting with the live graphs below.

Heap is a web and iOS analytics tool that captures every user interaction and lets you analyze it later. This means that, when you want to answer a question with data, you can do it immediately, instead of writing code, deploying it, and waiting for metrics to trickle in over the span of weeks.

Heap’s Old Line Graph

Below is the old version of our interactive line graph:

This is a usable line graph, and many of the design decisions we made seemed reasonable.

  • Hover targets. The chart allows users to mouse over the vertices (dots) in order to see a tooltip showing the numeric value of the data point.
  • Interpolation. We made a decision to use monotone cubic interpolation to draw the lines. This causes the lines to look “curved” or “smoothed” between points, instead of a jagged line that merely connects the dots. We did this for aesthetic reasons.
  • Animation. When hovering over different vertices, the tooltip has a 100ms transition animation to the new location. This helps the interaction feel more fluid.

Let’s take a look at how this version of the graph fares with multiple series:

Aside from the addition of a legend, there are almost no changes here. A big problem was that we didn’t treat a multiple line graph as a different design problem than the single line graph, and it suffered as a result. We’ll see below how reconsidering this approach led to a lot of improvements.

The New Line Graph

Customer feedback and our own usage of the line graph helped us uncover several problems. We addressed these problems with our new interactive line graph, shown below:

There are a number of improvements to the single line graph.

  • Target size. The old targets were far too small. A user needed to align their mouse exactly with a target with a 5 pixel radius. Both repeated use and Fitts’s Law told us that this was a suboptimal interaction. The new version of the line graph displays values in a tooltip when mousing over any part of the chart area. It uses the x-value of the mouse to determine which vertex to target.
  • Animation. We lowered the tooltip animation length to 50ms, to eliminate the jerky, distracting animations caused by the longer animation time on the old line graph. We didn't eliminate animations entirely, however, since they give an impression of continuity. The animation also uses linear easing instead of d3's default "cubic-in-out" easing, which allows for smoother transitions, especially when moving the mouse across many data points.
  • Less clutter. We removed the x-axis “Date” label, since the x-axis on our line graphs was always a time series, and people can recognize that an axis with labels like “Jun 1” or “March 8 – March 15” refers to time periods. There’s no need to include a label “Date” which takes up vertical space and adds nothing to comprehensibility. However, we retained the y-axis label, since units change across graph types (pageviews, visits, events, etc).
  • Linear interpolation. We got rid of the monotone cubic smoothing/interpolation of the lines, since it’s potentially misleading. Instead the lines between vertices are now straight.
  • Mouseleave interaction. When the mouse leaves the chart area, the tooltip disappears. This was an oversight in our previous version of the line graph.

The multiple line graph is also improved, and many of these improvements are a result of thinking about the design of the multiple line graph specifically. Here’s how it works now:

  • Hover interaction. One of the biggest problems with the old version of the multiple line graph was overlapping vertices. It was often impossible to hover over a vertex that was being covered by another vertex. For the new graph, we had the legend turn into a tooltip with values when mousing over the graph. The entire legend/tooltip is given an opacity of 0.8, so that lines may still be seen below when it overlaps.
  • Eliminating vertices. For line graphs with a large number of data series or a large time range, the large vertex size of the old graph (5 pixel radius) caused problems. The size of the vertices remained the same while the total number of vertices increased. This caused an increasing percentage of the lines between vertices to be covered up by the vertices, obscuring our ability to spot trends and changes in data.
  • Performance. For multiple line graphs over long time ranges, the old line graph required us to render sometimes hundreds of SVG circles. Eliminating vertices greatly improved performance, and also enabled graphs that weren’t possible before (for example, graphing something hourly over a month-long time range).
  • Mouseleave interaction. When the mouse leaves the chart area, the tooltip reverts to the initial position of the legend.

Despite these improvements, there are a number of tradeoffs we made and questions that remain.

  • Number of data series. The multiple line graph is limited to 10 data series (if there are more than 10 in the returned data, only the 9 largest and “Other” are shown). How can we simultaneously display more time series without overwhelming our users?
  • Tooltip/legend. In the multiple line graph, the legend often obscures the data. This is addressed somewhat with the lowered opacity of the legend and the ability to move it around, but there are other possibilities:
    • One option is to move the legend to the side of the graph, and keep it fixed there (like it was in the old version of the line graph). We chose not to do this, since this takes up horizontal space. Also, when mousing over the chart, the displayed values might be on the other side of the screen, which is suboptimal.
    • Another option would be to display a table of values below the chart and eliminate the hover interaction entirely. This is similar to how we redesigned our funnel visualization (which may be the topic of a future article). However this is suboptimal since there is no visceral connection between the line graph and the table. They’re just two different views of the same data, rather than a unified single visualization.

Hit us up @heap with questions, thoughts, or links to well-designed line graphs you’ve seen elsewhere. Or just leave them in the discussion on Hacker News.

Interested in designing tools or visualizing massive amounts of data? Reach out at jobs@heapanalytics.com!

Our pricing model was broken. Here’s how we fixed it.

Pricing your product correctly can be tricky, and a lot of technical people who build SaaS products get it wrong on their first few tries. Our story makes for a nice case study.

Heap is a web and iOS analytics tool that captures every user interaction on your website and in your app. Instead of requiring you to log events in code, Heap captures everything upfront and lets you analyze it later. This means that, when you want to answer a question with data, you can do it immediately, instead of writing code, deploying it, and waiting for metrics to trickle in.

Our Old Pricing Model

We designed our initial pricing model with a few goals in mind:

  • It should be simple. We didn’t want to make the “engineer” mistake of having a highly configurable pricing scheme that confused people or scared them off. Someone who runs a website should be able to visit our pricing page and get a good idea how much Heap would cost for them pretty quickly.
  • It shouldn’t discourage people from using Heap more. We didn’t want to charge by the event definition or by the API call. This would disincentivize people from getting more value out of Heap.
  • It should scale gradually. We were concerned that customers would worry about overages or going over discontinuities in their pricing plans.

We settled on a sliding scale based on the number of monthly unique users.

Ye olde pricing scheme.

Fancy widgets are nice, but they don’t make this a good pricing scheme.

This satisfies all of the design goals above. It’s based on a single, standard metric that owners of major websites know off the tops of their heads, and it doesn’t have discontinuities or perverse incentives.

This has some problems, though.

  • This plan charges people as soon as they sign up for Heap, when it hasn’t captured much data yet. Heap provides more value over time, as your analysis goes over more data, but we were asking for money before we had delivered any value.
  • Our initial model was too simple. It papers over serious differences in value provided to each customer. A social app making money off of ads might have a lifetime revenue per user on the order of $2, whereas a web store selling $1000+ items has a much larger one. Heap provides a lot more value per user to the latter than to the former. A pricing model that only considers the number of users is either undercharging the latter or overcharging the former.

Schools Of Thought

There are a number of standard approaches to this problem, and we thought a lot about two of them in particular.

  • Pricing based on cost. This can be intuitive from an engineering perspective, as we can accurately measure how much each customer costs to service in terms of hardware, but this can be opaque to customers. Pricing based on the amount of data stored (i.e., the number of events) is one way to go about this, but customers don’t have a great frame of reference for how many events Heap captures, especially before they’ve started to use it. This also isn’t a great fit for us, since our costs are dominated by an engineering team’s salaries. The portion of our AWS bill that goes towards any customer is easy to measure, but that customer’s share of our dev throughput is inscrutable.
  • Pricing based on value provided. This aligns our goals with the customer’s. The amount of value each customer gets out of using Heap corresponds with the amount they pay us. This has the reverse problem, though: it’s difficult to model in an explicit formula how much value another company gets out of using Heap. A lot of SaaS companies use the number of accounts or licenses as a proxy for value. If a customer dedicates a team of eight people towards using Heap full-time, they’re probably getting a lot more value out of it than a company in which only one person uses it.

Today’s Pricing Model

The wave of the future.

This includes a number of important changes.

Heap is now free for websites with fewer than 25k monthly visits. Free as in “don’t enter a credit card; we won’t be charging you.” Rather than trying to nickel and dime small websites, we’ve decided that the most important thing is to get Heap into as many peoples’ hands as possible.

We offer a 60-day free trial, because we want everyone to have a Heap experience. We’ve found that, once people try Heap, they usually don’t want to go back to manually defining events in code or waiting for new data. We settled on a trial period of 60 days, because this gives customers a chance to accumulate enough data that the power of doing analysis “retroactively” becomes apparent.

We charge by the visit, not by the monthly user. A “visit” occurs when someone loads a page on your website for the first time in at least 30 minutes. This gives us a closer approximation to how valuable that user is. Before, we were charging the same amount for someone who landed on your homepage and immediately left as for someone who came back ten times that month. In most cases, the latter is a lot more valuable.

This pricing model is simple, and it doesn’t disincentivize people from using Heap more. However, unlike our initial pricing model, it’s not gradual. We’re ok with this for now, since overages and pricing discontinuities don’t seem to be a serious concern for any of our customers. (The concern appears to be more common amongst tire-kickers.) But, it’s something we’re monitoring.

As for the choices of $149 and $499 as base prices for the tiers and 25k and 75k monthly visits as cutoffs between the tiers, the specific numbers are a little bit arbitrary. Looking at our existing ~500 customers, they made sense as parameter choices for us. For example, $149/month was close to what a lot of companies that would end up in the Growth tier were paying before, 75k monthly visits made sense as a cutoff after which we’d want to charge customers at ad-hoc levels, and so forth. A low “Enterprise Plan” cutoff is important, because it lets us attempt to charge our larger customers based on value provided.

For questions like “How many people are using our new feature?” or “Are people who signed up in the last two weeks more likely to convert to paying customers?” it should be possible to answer with data in a matter of minutes, instead of pushing new logging code and waiting for metrics over a span of days or weeks. In addition to answering your existing questions faster, a feedback loop on the order of minutes enables you to ask a whole new class of questions that would otherwise not be worth the effort of answering. This kind of analysis is empowering, and everyone with a website or an iOS app should be able to try it.

Like the technical pieces of a startup, our pricing plan is a work in progress, and we’re still refining it. We’re looking forward to seeing how this turns out.

We benefit tremendously from customer feedback. Give it a try and let us know what you think @heap!

Defining Events with Hierarchical CSS Selectors

Today, we’re introducing a new capability in Heap: defining web events with hierarchical CSS selectors. This makes it much easier for you to retroactively analyze user interactions, all without shipping code and without waiting for data.

Suppose you need to know whether customers interact with a gallery of images. The gallery is represented as a div.gallery with nested img elements.

Oh no – you forgot to assign classnames to these image tags! Not an issue with Heap. Just open the Define Event view, and define the “Click on Gallery Image” event like so:

This works exactly like a normal CSS selector (in this case, the selector would be div.gallery img). Just add spaces between each element to specify levels in the hierarchy.

And that’s it! Now you have retroactive data on this interaction computed instantly. You can segment on this event and use it to construct cohorts, as if you had been tracking it from the very beginning.

We updated Heap to start tracking hierarchical data on November 17, 2013. Thus, any query involving hierarchical events will extend back in time until this date. All non-hierarchical events will work the same as before (they’ll go back in time further and will only match the child-most element in the hierarchy).

How else should we let you define events? Let us know @heap!