Running 10 Million PostgreSQL Indexes In Production (And Counting)

Heap faces a unique set of challenges based on our approach to analytics: capturing every data point. After several attempts, we’ve managed to come up with a solution that works reliably for our use case. As a consequence of that solution, we have wound up with over 10 million indexes in our database!

What Heap Is

Heap is an analytics tool that captures everything that happens on a customer’s website. After the Heap snippet is added to a website, Heap automatically captures every action users take on it. That means every click, pageview, form submission, and so on is captured and logged into Heap. From there, the owner of the website can use Heap to answer questions about their users.

All questions in Heap are phrased in terms of “defined events”. An example of a defined event would be “Login”, which could be defined as a click where the text of the click target is “Login”. The event definition may also make use of more complicated restrictions such as requiring the CSS hierarchy of what was clicked on to match the CSS selector div.header button#login. After the login event is defined, it can be used to query all of the login events captured by Heap, ever since Heap was installed. It works just as if you had code that manually tracked logins the entire time.

While queries over historical data based on newly defined events are an amazing feature, they are extremely difficult to implement. We need to store all of the raw data indefinitely, and it is impossible to tell upfront what subset of it will end up being valuable, even though 90% of it will never be useful.

How Heap Solves This Problem

Our users’ queries are built in terms of these defined events, so one of the basic building blocks of our query infrastructure is the ability to find all instances of a particular defined event. We do this by making use of an amazing PostgreSQL feature: partial indexes.

A partial index is an index over a subset of the rows in a table. You define a partial index in nearly the same way you define a regular index; the only difference is that you specify a predicate in addition to the normal options. Only rows that satisfy the predicate will be included in the index. When a customer creates a new event definition, we create a partial index that contains only the events that match the event definition. As an example, if a customer were to define a login event, code similar to the following would be executed in our database:

CREATE INDEX ON events (time) WHERE type = 'click' AND text = 'login'

Since the index contains a given row if and only if the row matches the event definition, we can use it to quickly find all events that satisfy the definition. Additionally, since the index is constructed on time, we can simultaneously use the index to filter for only the events that occurred in a given time range.

Once we are able to find instances of an event definition that are in a given time range, it becomes easy to implement more complicated queries such as graphs and conversion funnels. Let’s take a moment to look at how funnels are implemented. Let’s say a customer wants to know what percentage of the users who have signed up in a given week later made a purchase that same week. This would be implemented with a SQL query similar to the following:
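(The original query isn’t reproduced here, so what follows is a minimal sketch of its shape. The column names user_id, type, text, and time, the event definitions, the date range, and the UDF name funnel_steps are all illustrative stand-ins, not Heap’s actual schema or function.)

-- Hedged sketch of a two-step funnel: 0 = signup, 1 = purchase.
SELECT sum(steps[1]) AS signed_up,
       sum(steps[2]) AS purchased
FROM (
  SELECT funnel_steps(                                   -- hypothetical UDF, described below
           array_agg(CASE WHEN type = 'submit' AND text = 'signup'
                          THEN 0 ELSE 1 END
                     ORDER BY time)
         ) AS steps
  FROM events
  WHERE time >= '2016-09-05' AND time < '2016-09-12'     -- "a given week"
    AND ((type = 'submit' AND text = 'signup')           -- matched by the signup partial index
      OR (type = 'click'  AND text = 'purchase'))        -- matched by the purchase partial index
  GROUP BY user_id
) per_user;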

This query works by first finding all events that match the signup and purchase event definitions and fall into the time range of the query. This can be done by making use of the partial indexes for the events. From there, the query constructs an array of zeros and ones for each user representing the order in which they did the events. In the array, a zero represents a signup and a one represents a purchase. For example, the array [0, 1, 1] would mean the user signed up, later made a purchase, then after that made another purchase. In the case of an N-step funnel, the numbers would range from 0 to N-1.

The query then passes those arrays through a PostgreSQL UDF we wrote in collaboration with the team at Citus Data. The UDF takes one of the arrays representing the order an individual did the events and returns an array that represents how far they got in the funnel. In the returned array, a one represents that the user completed that step, and a zero represents that they did not. The array [0, 0] would mean the user never signed up, the array [1, 0] would mean the user signed up but never made a purchase, and the array [1, 1] would mean the user completed the entire funnel. From there, the query sums all of those arrays to obtain an array with how many users made it through each step of the funnel. While the query looks and sounds complicated, the plan generated by the query is actually really simple since it makes use of the partial indexes:

It just fetches all of the events by using the two relevant partial indexes, groups the events by user, and performs an aggregation over each group.

Advantages Of Partial Indexes

A great feature of partial indexes is that Postgres will automatically maintain them for us, and they will always be consistent with our raw data. Whenever a new raw event comes in, PostgreSQL will automatically check the raw event against all of the existing partial indexes. When the event satisfies the predicate for one of those partial indexes, the event will be inserted into that index.

Another nice property of our indexing strategy is that it degrades gracefully. If a partial index exists, PostgreSQL will automatically use it. If we haven’t yet had time to create a partial index, PostgreSQL will fall back to the next best strategy. While the query will be slower when the partial index doesn’t exist, it can still be executed and will still be correct. Once the partial index is created, PostgreSQL will automatically start using it, and the query will become much faster.

What really makes partial indexes so great is that PostgreSQL handles all of the work around building, maintaining, and querying the partial indexes for us. Since partial indexes are so easy to create and work with, we’ve wound up with over 10 million partial indexes across our entire cluster.

This isn’t as scary as it sounds for two main reasons. First, we shard all of our data by customer. Each table in our database holds only one customer’s data, so each table has only a few thousand indexes at most. Second, events matching a definition are relatively rare. The most common defined events make up only a few percent of a customer’s raw events, and most are much rarer. This means that we perform relatively little I/O maintaining this schema, because most incoming events match no event definitions and therefore don’t need to be written to any of the indexes. Similarly, the indexes don’t take up much space on disk.

Downsides Of Partial Indexes

Although partial indexes work well for our use case, there are a few downsides to having millions of them across our entire cluster. The biggest cost is the amount of CPU required to build and maintain them. If a Heap customer has a thousand defined events, we have to evaluate all thousand predicates for every new event that comes in. Most of these event definitions contain a regular expression used to match the CSS hierarchy of the event, and these regexes are expensive to evaluate at our scale of data.

We’ve also run into some issues around creating the partial indexes. Ideally, we would use CREATE INDEX CONCURRENTLY to create the partial indexes in the background, without impacting writes to the table. Unfortunately, PostgreSQL can only perform one concurrent index build at a time per table, so when we send multiple CREATE INDEX CONCURRENTLY commands for the same table, PostgreSQL executes the index builds in series. This is in addition to concurrent index builds taking several times longer than normal index builds. Another issue we’ve repeatedly run into is that a concurrent index build must wait for all existing transactions to finish. That means if any long-running query is in flight (e.g. a customer bulk-exporting their data), every index creation has to wait for that query to finish.

Because of the issues around building indexes concurrently, we use the normal CREATE INDEX instead of CREATE INDEX CONCURRENTLY. To work around the inability to write to a table while a normal index build is taking place, our data ingestion pipeline defers writing a customer’s events to PostgreSQL whenever we are creating indexes for that customer. Even when running multiple index builds in parallel, it can still take a long time to create all of the indexes. Building all of the indexes for a table belonging to one of our largest customers can take a full 24 hours! Interestingly, partial index creation is CPU bound. This makes sense, since PostgreSQL has to evaluate the partial index predicate against every row in the table in order to construct the partial index.

Partial indexes are a great fit for our use case. With this feature, our customers are able to query hundreds of terabytes of data in seconds. All we have to do is send the `CREATE INDEX` command to PostgreSQL and it handles the construction, maintenance, and querying of the partial indexes for us.

If you want to learn more about our infrastructure, check out this talk where Dan discusses our backend in depth. Dan will also be speaking at PGConf SV 2016.

If all of this sounds interesting to you, Heap is hiring so please apply or reach out on twitter.

When To Avoid JSONB In A PostgreSQL Schema

PostgreSQL introduced the JSONB type in 9.4 with considerable celebration. (Well, about as much as you can expect for a new data type in an RDBMS.) It’s a wonderful feature: a format that lets you store blobs in the lingua franca of modern web services, without requiring re-parsing whenever you want to access a field, and in a way that enables indexing for complicated predicates like containment of other JSON blobs. It meaningfully extends PostgreSQL and makes it a viable choice for a lot of document store workflows. And it fits nicely in a startup engineering context: just add a properties column to the end of your table for all the other attributes you might want to store down the road, and your schema is now officially Future Proof TM.

We lean on JSONB heavily at Heap, and it’s a natural fit, as we have APIs that allow customers to attach arbitrary properties to events we collect. Recently, I’ve gotten a few questions about the benefits and drawbacks of using JSONB to store the entirety of a table – why have anything but an id and a data blob?

The idea of not having to explicitly manage a schema appeals to a lot of people, so it shouldn’t be surprising to see JSONB used this way. But there are considerable performance costs to doing so, some of which aren’t immediately obvious. There is great material for deciding which of JSON, JSONB, or hstore is right for your project, but the correct choice is often “none of the above.” [1] Here are a few reasons why.

Hidden Cost #1: Slow Queries Due To Lack Of Statistics

For traditional data types, PostgreSQL stores statistics about the distribution of values in each column of each table, such as:

  • the number of distinct values seen
  • the most common values
  • the fraction of entries that are NULL
  • for ordered types, a histogram sketch of the distribution of values in the column

For a given query, the query planner uses these statistics to estimate which execution plan will be the fastest. For example, let’s make a table with 1 million “measurements” of three values, each chosen at uniform random from {0, 1}. Each measurement was taken by one of 10,000 scientists, and each scientist comes from one of three labs:
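(The table-creation statements aren’t shown in this copy of the post; here is a hedged reconstruction that follows the description above. The table and column names match the prose below, but the exact original DDL and types may have differed.)

CREATE TABLE measurements (
  tick         bigserial,
  value_1      integer,
  value_2      integer,
  value_3      integer,
  scientist_id integer
);

CREATE TABLE scientist_labs (
  scientist_id integer PRIMARY KEY,
  lab_name     text
);

-- 10,000 scientists spread across three labs.
INSERT INTO scientist_labs (scientist_id, lab_name)
SELECT g, (ARRAY['alpha', 'beta', 'gamma'])[1 + trunc(3 * random())::int]
FROM generate_series(1, 10000) AS g;

-- 1 million measurements of three values chosen uniformly at random from {0, 1}.
INSERT INTO measurements (value_1, value_2, value_3, scientist_id)
SELECT trunc(2 * random())::int,
       trunc(2 * random())::int,
       trunc(2 * random())::int,
       1 + trunc(10000 * random())::int
FROM generate_series(1, 1000000);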

Let’s say we want to get the tick marks in which all three values were 0 — which should be about 1/8th of them — and see how many times each lab was represented amongst the corresponding scientists. Our query will look something like this:
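(The query itself didn’t survive extraction; here is a hedged sketch using the names from the DDL sketch above. The exact subquery structure is a guess.)

SELECT lab_name, count(*)
FROM (
  SELECT scientist_id
  FROM measurements
  WHERE value_1 = 0 AND value_2 = 0 AND value_3 = 0    -- roughly 1/8th of the rows
) matching
JOIN scientist_labs USING (scientist_id)
GROUP BY lab_name;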

And our query plan will look something like this: https://explain.depesz.com/s/H4oY

This is what we’d hope to see: the planner knows from our table statistics that about 1/8th of the rows in measurements will have value_1, value_2, and value_3 equal to 0, so about 125,000 of them will need to be joined with a scientist’s lab, and the database does so via a hash join. That is, load the contents of scientist_labs into a hash table keyed on scientist_id, scan through the matching rows from measurements, and look each one up in the hash table by its scientist_id value. The execution is fast — about 300 ms on my machine.

Let’s say we instead store our measurements as JSONB blobs, like this:
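(A hedged sketch of the JSONB variant; the table name measurements_jsonb and the record column are illustrative.)

CREATE TABLE measurements_jsonb (
  tick   bigserial,
  record jsonb
);

-- Copy the same data over, one JSONB blob per row.
INSERT INTO measurements_jsonb (record)
SELECT json_build_object(
         'value_1',      value_1,
         'value_2',      value_2,
         'value_3',      value_3,
         'scientist_id', scientist_id
       )::jsonb
FROM measurements;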

The analogous read query would look like this:
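(Again a hedged sketch, matching the JSONB DDL sketch above.)

SELECT lab_name, count(*)
FROM (
  SELECT (record->>'scientist_id')::int AS scientist_id
  FROM measurements_jsonb
  WHERE (record->>'value_1')::int = 0
    AND (record->>'value_2')::int = 0
    AND (record->>'value_3')::int = 0
) matching
JOIN scientist_labs USING (scientist_id)
GROUP BY lab_name;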

The performance is dramatically worse — a whopping 584 seconds on my laptop, about 2000x slower: https://explain.depesz.com/s/zJiT

The underlying reason is that PostgreSQL doesn’t know how to keep statistics on the values of fields within JSONB columns. It has no way of knowing, for example, that record ->> 'value_2' = 0 will be true about 50% of the time, so it relies on a hardcoded estimate of 0.1%. So, it estimates that 0.1% of 0.1% of 0.1% of the measurements table will be relevant (which it rounds up to ~1 row). As a result, it chooses a nested loop join: for each row in measurements that passes our filter, look up the corresponding lab_name in scientist_labs via the primary key of the latter table. But since there are ~125,000 such measurements, instead of ~1, this turns out to take an eternity. [2]

As always, accurate statistics are a critical ingredient to good database performance. In their absence, the planner can’t determine which join algorithms, join orders, or scan types will make your query fast. The result is that innocent queries will blow up on you. This is one of the hidden costs of JSONB: your data doesn’t have statistics, so the query planner is flying blind.

This is not an academic consideration. This caused production issues for us, and the only way to get around them was to disable nested loops entirely as a join option, with a global setting of enable_nestloop = off. Ordinarily, you should never do something like that.

This probably won’t bite you in a key-value / document-store workload, but it’s easy to run into this if you’re using JSONB along with analytical queries.

Hidden Cost #2: Larger Table Footprint

Under the hood, PostgreSQL’s JSON datatype stores your blobs as strings that it happens to know are valid JSON. The JSONB encoding has a bit more overhead, with the upside that you don’t need to parse the JSON to retrieve a particular field. In both cases, at the very least, the database stores each key and value in every row. PostgreSQL doesn’t do anything clever to deduplicate commonly occurring keys.

Using the above measurements table again, the initial non-JSONB version of our table takes up 79 mb of disk space, whereas the JSONB variant takes 164 mb — more than twice as much. That is, the majority of our table contents are the strings value_1, value_2, value_3, and scientist_id, repeated over and over again. So, in this case, you would need to pay for twice as much disk, not to mention follow-on effects that make all sorts of operations slower or more expensive. The original schema will cache much better, or might fit entirely in memory. The smaller size also means it requires half as much I/O for large reads or maintenance operations.

For a less contrived anecdote, we found a disk space savings of about 30% by pulling 45 commonly used fields out of JSONB and into first-class columns. On a petabyte-scale dataset, that turns out to be a pretty big win.

As a rule of thumb, each column costs about 1 bit of overhead for each row in your table, regardless of whether the column’s value is null.[3] So, for example, if an optional field is going to have a ten-character key in your JSONB blobs, and thus cost at least 80 bits to store the key in each row in which it’s present, it will save space to give it a first-class column if it’s present in at least 1/80th of your rows.

For datasets with many optional values, it is often impractical or impossible to include each one as a table column. In cases like these, JSONB can be a great fit, both for simplicity and performance. But, for values that occur in most of your rows, it’s still a good idea to keep them separate.

In practice, there is often additional context to inform how you organize your data, such as the engineering effort required to manage explicit schemas or the type safety and SQL readability benefits from doing so. But there is often an important performance penalty as well for unnecessarily JSONB-ing your data.

Know another innocuous change with big performance implications? Ping me @danlovesproofs.

We’re constantly evaluating alternative schemas and indexing strategies for serving ad hoc queries across hundreds of billions of events. Interested in working with us? Shoot us a note at jobs@heapanalytics.com.

[1] I recommend this post, for starters: https://www.citusdata.com/blog/2016/07/14/choosing-nosql-hstore-json-jsonb/
[2] As an aside, explain.depesz is a wonderful tool for finding problems like these in your queries. You can see in this example that the planner underestimated how many rows would be returned by this subquery by a factor of 124,616.
[3] This isn’t quite correct. PostgreSQL allocates one byte per row for the first 8 columns, and then 8 bytes / 64 columns at a time after that. So, for example, your first 8 columns are free, and the 9th costs 8 bytes per row in your table, and then the 10th through 72nd are free, and so forth. (H/t Michael Malis for the investigation into this.)

Goodbye CoffeeScript, Hello TypeScript

Web apps are becoming increasingly feature-rich, and the Heap frontend is no different. We expose an interface that lets users organize their data and build custom visualizations. Nearly every interaction changes an underlying model and there are subtle rules around how the UI behaves.

Our previous stack of Backbone and CoffeeScript wasn’t scaling well to a lot of common UI challenges, such as data syncing, subview management, and redrawable views. It was time to rethink how we could both improve maintainability and speed up future feature development.


It all starts with language

We wanted a language that would address 2 main concerns in our codebase:

  1. Code shouldn’t be hard to reason about

    We had too many implicit object schemas, mutative operations on inputs, and implicit cross-view assumptions. This caused many hard-to-reason-about bugs, and sometimes very unexpected behavior in production.

  2. Development should be fast

    The cycle of write code => hopefully encounter a runtime error when testing => repeat is inefficient in both speed and accuracy. Many common errors in code shouldn’t require a refresh of a test page and a mechanical series of clicks to surface.

CoffeeScript

CoffeeScript was wildly popular in the early days of Heap and regarded as one of the most mature alternative JS languages. Here’s how it stacks up:

Pros:

  • Minimalistic syntax with lots of syntax sugar
  • Already well-known within team
  • Easy to map output JS back to source CoffeeScript (before source maps)

Cons:

  • Variable initialization and reassignment are the same

    It’s easy to accidentally overwrite a variable from a higher scope as a codebase increases in depth. This is really bad, because it limits our ability to write more complex code. Safely creating a variable requires pressing Ctrl + F and examining the current file to make sure there isn’t a conflict.

  • Existential operator accessor (?.) is a leaky abstraction for null/undefined

    This is often added to ‘make things work’, without understanding why it’s necessary. If we could document in a single place the potential ‘emptiness’ of a value, reading related code would be much easier.

  • Ambiguous syntax

    It’s reasonable to assume foo bar and hello world will compile to either:

    • foo(bar) && hello(world)
    • foo(bar && hello(world))

    depending on what you’re hoping it’ll compile to. In the words of one teammate:

    This is the only language I’ve worked in where I need to compile my code and verify the output to make sure it does what I expect it to.

TypeScript

Given the shortcomings of CoffeeScript, we investigated many alternatives, including ClojureScript, Babel, Elm, and PureScript. TypeScript stood out as a clear winner:

Pros:

  • Built-in type annotations let you document interfaces and enforce their correctness at compile time, cutting down on logical errors
  • Already has several other tools built on it (e.g. Flow type annotations compiling down to TypeScript annotations, Angular 2.0)
  • Clear, maintained development roadmap with rapid releases
  • Community type definitions for adding types to third-party JavaScript

Cons:

  • Not yet 100% parity with ES6 (lagging behind Babel)
  • Some missing type system features: type bounds, half-baked local types, f-bounded polymorphism
  • Community type definitions are unversioned, so it’s difficult to find type annotations for older versions of libraries

Many other languages were disqualified for one or more of these reasons:

  • Overly complex external JavaScript workflow
  • Non-mainstream syntax
  • Small community – integrations may need to be self-written
  • Docs hard to navigate or find

Filling in the gaps

There are a few language features which we heavily missed in TypeScript:

Pattern matching is a common feature in functional languages that drastically reduces type casts, while encouraging consistent return types and discouraging mutation by directly returning an expression. We’ve built a small type-safe version of pattern matching in TypeScript and use it extensively throughout our codebase.

Implicit conversions are a safer form of monkey patching, a popular though frowned upon idiom in JavaScript. These exist in Ruby as refinements, and we can simulate them with a slightly more verbose syntax (inspiration heavily drawn from Scala). Here, we build a generic version of Immutable.js’ withMutations method:

While sometimes more verbose, consistent usage of these patterns leads to a codebase where nearly all bugs are logical errors, and not accidental organizational errors. This makes bugs easier to track down.


Uncaught TypeError: undefined is not a function

We had several bugs in our error tracker that were previously black holes – variables would take on a value of null or undefined, and it’d be nearly impossible to figure out why.

The easy solution: add more guarantees via types.

Options

Options declare whether something is allowed to be null/undefined. They completely solve the CoffeeScript issue of duplicating ?. operators throughout your codebase – you only need to declare the nullability of something once. After that, the compiler will warn you if you err in assuming that something won’t ever be null. This means a more reliable application and serves as living documentation that’ll auto-update as your application grows, decreasing ramp-up time into the codebase.

Since adopting these, we’ve eliminated the entire class of null/undefined errors. This is important because that class is among the hardest to track down, usually because the error occurrence and the error in logic end up far away from each other.

Trys

Trys declare whether something succeeded or failed. They’ve caused us to recognize and react to parsing errors when we try to convert data into client-side models. Parsing nested objects is easier as well, since we can compose parsers for each component of the aggregate object.

Futures

Futures (also known as Promises) declare whether an asynchronous operation succeeded or failed, and let you apply control flow to the result. This is far better than passing callbacks into other functions, because logic around onComplete and onFail only needs to be implemented once, instead of for every single asynchronous function. We made the more traditional choice of when.js, since it’s compliant with the Promises/A+ spec, which promises better integration with other libraries.

Monapt is the open-source library we use throughout our codebase. Feel free to stop by gitter chat with any questions!

Language is just the beginning of creating a solid frontend codebase. The structure and organization of views can make an enormous difference as well. In part 2, we’ll explain the evolution of our view architecture, and lessons learned from React and Elm.

Let us know if you have any questions @heap! And if you’re interested in building robust interfaces that turn data into insight for thousands of people, shoot us a note at jobs@heapanalytics.com.

Snapshots: Track rich data without writing code

Last year, we launched the Event Visualizer, making analytics integration as simple as clicking around your website or iOS app. Using it, you can start measuring analytics events retroactively, without writing any code.

Today we’re introducing snapshots, a new feature that lets you attach rich data to events using the same point-and-click interface.

With Heap’s automatic event tracking, you can easily measure the number of times a user wants to make a purchase from a store. But if you want to know what items were purchased, or how much they cost, you’d need to write a tiny bit of tracking code that looks like heap.track('Add to Cart', {Price: '$89.00'}).

With snapshots, you can simply:

  1. Click the Add to Cart button to define the ‘Add to Cart’ event.
  2. Attach a custom ‘Price’ property by hovering over the part of your page containing the price.
  3. That’s it!

Every time a user adds an item to their cart, the price is extracted from the ‘Price’ field on the page and attached to the event. You’ll be able to filter and group this event by price, without ever touching code. Watch it in action below:



You can snapshot relevant data on the page, attach it to an event, and later analyze it within the Heap dashboard. Here are some example use cases for snapshots:

  • User input: You can automatically snapshot input field values and attach them to an event every time it happens. Below, we’re sending along the departure and arrival airports when a user clicks ‘Search’.

  • JavaScript variables: Alternatively, use JavaScript to attach more complex data to events. Here, the page contains some key performance metrics which are captured via snapshots.

Snapshots are available today! To get started, just follow the directions in our docs. If you need help getting set up, let us know at support@heapanalytics.com.

Speeding Up PostgreSQL With Partial Indexes

Did you know PostgreSQL supports indexing a subset of your table? This enables very fast reads from that subset with almost no index overhead. It’s often the best way to index your data if you want to repeatedly analyze rows that match a given WHERE clause. This makes PostgreSQL a great fit for workflows that involve pre-aggregation with additional ad hoc analysis. In this post, I’ll walk through an example query optimization in which partial indexes are well suited.

Consider a table of events with the following schema:
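(The schema itself isn’t shown in this copy of the post, so here is a hedged sketch. The table and column names match the index definitions later in the post; the exact types are guesses.)

CREATE TABLE event (
  event_id bigserial,
  user_id  bigint,
  time     timestamptz,
  data     jsonb
);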

Each event is associated with a user and has an ID, a timestamp, and a JSON representation of the event. The JSON includes the page path, the event type (e.g. click, page view, form submission), and any other properties that describe the event.

We can use this table to store many different kinds of events, and let’s assume we have some automatic event tracking in place that logs every click, page view, and form submission so that we can analyze it later. We might want to have an internal dashboard that displays some high-value metrics, such as the number of weekly signups or the amount of revenue we are collecting per day. The events that are relevant to this dashboard make up a small fraction of this table — a very small percentage of clicks on your website are to finalize purchases! But they are mixed in with the rest of the table, so our signal-to-noise ratio is low.

We would like to index our data to make these dashboard queries fast.[1] Let’s start with a signup event, which we define to be a form submission on our /signup/ page. Getting the number of such signups in the first week of September translates to:
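(A hedged sketch of that query, using the schema sketch above; the dates are placeholders for whatever year the post’s dataset covered.)

SELECT count(*)
FROM event
WHERE (data->>'type') = 'submit'
  AND (data->>'path') = '/signup/'
  AND time >= '2014-09-01'
  AND time <  '2014-09-08';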

On a synthetic dataset with 10 million events, of which 3000 are signups, and without any indexes, this query takes 45 seconds.

Full Indexes On Single Columns: A Mixed Bag

A naive way to improve this performance is by creating single-column indexes for each of the relevant event features: (data->>'type'), (data->>'path'), and time. PostgreSQL can then combine the results of three bitmap index scans with a bitmap AND, which should be fast if the query is selective and the relevant index portions are in memory. Indeed, with these indexes in place, this query takes 200 ms initially, and 20 ms in subsequent runs on our synthetic dataset — a significant improvement over the 45 seconds required by a sequential scan.
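For reference, those three single-column indexes might be created like so (the index names are illustrative):

CREATE INDEX event_type ON event ((data->>'type'));
CREATE INDEX event_path ON event ((data->>'path'));
CREATE INDEX event_time ON event (time);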

But this indexing strategy has a few significant downsides:

  • Write overhead. We need to write to all three indexes on every INSERT/UPDATE/DELETE run against the table.[2] For a write-heavy workload like the one in this example, that might be too expensive.
  • Limited query set. This strategy restricts our ability to define types of high-value events. It won’t work if we need something more complicated than a range query on one of those JSON fields. What if we want to match a regex, or all pageview paths that start with /signup/ but can include anything thereafter?
  • Disk use. The table in our test dataset takes up 6660 mb, and the three indexes take up a combined 1026 mb, a substantial increase in the amount of hard drive space we need to use to support this table.[3]

Enter Partial Indexes

We’re only analyzing the 0.03% of the table that constitutes signup events, but this strategy indexes all the rows. We want to be able to run efficient queries over a tiny fraction of a table. In a case like this, a partial index is going to give us the best results.

If we index an unrelated column and restrict our index to events that match our signup definition, PostgreSQL can easily determine where the signup event rows are, and we can query over them even more efficiently than with three complete indexes on the relevant fields. In particular, consider indexing the time field, but only on the rows that match the filter for signup events. That is:

CREATE INDEX event_signups ON event (time)
WHERE (data->>'type') = 'submit' AND (data->>'path') = '/signup/'

Using this index, our test query runs in 200ms initially and 2ms on subsequent runs, so there’s a performance improvement to be had if we are running this query often. More importantly, the partial index addresses all three disadvantages of the triple-index approach described above.

  • It only takes up 96 kb, a 10000x improvement over the 1026 mb required to fully index all three fields.
  • We only need to write to the partial index when new rows match our filter for signup events. Since those are 0.03% of our events, this constitutes a significant improvement in write performance: we essentially get this index for free.
  • The partial index allows for filtering with the full expressiveness of PostgreSQL. Our index’s WHERE clause can use any row-filtering expression that we could otherwise use in a query. We can use this approach to match more complicated event definitions, such as regexes, function results, or the prefix-matching example mentioned above.

Avoid Indexing Predicate Results As Booleans

Another approach I’ve seen is to attempt to index the boolean expression

(data->>'type') = 'submit' AND (data->>'path') = '/signup/'

directly, with a second term for time. I.e., something like:

CREATE INDEX event_signup_time ON event
(((data->>'type') = 'submit' AND (data->>'path') = '/signup/'), time)

This is worse than either of the approaches above, as it turns out PostgreSQL’s query planner won’t figure out that our example query restricts our result to rows for which the first term in the index is true. That is, the planner doesn’t know that the WHERE clause:

WHERE (data->>'type') = 'submit' AND (data->>'path') = '/signup/'

is equivalent to saying that the

((data->>'type') = 'submit' AND (data->>'path') = '/signup/')

field of our index must be true. So it uses this index to restrict the time range of events in our scan, but it reads events for which the boolean expression is true as well as the events for which it is false, and then checks that condition for each row after loading it.[4]

So, we’re reading a lot more rows from disk than necessary, and also executing a nontrivial check for each one. On our synthetic dataset, this query takes 25 seconds for the first run and 8 seconds on successive executions. Under the hood, this is slightly worse than just indexing on the time field, and it performs comparably to doing just that.

Partial indexes are a powerful way to precompute the subset of a table that matches a predicate. Judging by traffic in the #postgresql IRC, they are considerably underused. Compared to full indexes, they allow a greater range of predicates. They can be dramatically more lightweight, requiring fewer writes and less disk space, especially for highly selective filters. Partial indexes should be your default strategy if you’re repeatedly querying a table for a small slice of rows.

Love PostgreSQL? Know of a feature the world should know more about? Ping me @danlovesproofs.

Interested in building systems that make powerful technology easy to use? Shoot us a note at jobs@heapanalytics.com.

[1] We might deal with this kind of a situation with table partitioning, in which high-value events and low-value events can live in separate child tables, but this approach is going to be unwieldy if there are many different kinds of high-value events, and we would need to repartition the table every time we wanted to add a new type of high-value event.
[2] We may get some of the updates for free via the heap-only tuples optimization, but, at least, every INSERT and DELETE will require writing to all three indexes.
[3] We could index all three fields in one multi-column index, e.g. on ((data->>'type'), (data->>'path'), time). That would take up 755 mb, a 26% savings, but this is still large, and the other drawbacks still apply. What’s more, this index would be less applicable to other queries we might want to run on this data, so, if we’re supporting a few different kinds of high-value events, this might not end up saving us any space.
[4] The relevant query plan:

PostgreSQL’s Powerful New Join Type: LATERAL

PostgreSQL 9.3 has a new join type! Lateral joins arrived without a lot of fanfare, but they enable some powerful new queries that were previously only tractable with procedural code. In this post, I’ll walk through a conversion funnel analysis that wouldn’t be possible in PostgreSQL 9.2.

What is a LATERAL join?

The best description in the documentation comes at the bottom of the list of FROM clause options:

The LATERAL key word can precede a sub-SELECT FROM item. This allows the sub-SELECT to refer to columns of FROM items that appear before it in the FROM list. (Without LATERAL, each sub-SELECT is evaluated independently and so cannot cross-reference any other FROM item.)

When a FROM item contains LATERAL cross-references, evaluation proceeds as follows: for each row of the FROM item providing the cross-referenced column(s), or set of rows of multiple FROM items providing the columns, the LATERAL item is evaluated using that row or row set’s values of the columns. The resulting row(s) are joined as usual with the rows they were computed from. This is repeated for each row or set of rows from the column source table(s).

This is a bit dense. Loosely, it means that a LATERAL join is like a SQL foreach loop, in which PostgreSQL will iterate over each row in a result set and evaluate a subquery using that row as a parameter.

What can we do with this?

Consider a table of click events with the following schema:
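(As above, the schema didn’t survive extraction; here is a hedged sketch whose names match the queries discussed below. The original post may have stored time differently, e.g. as epoch milliseconds.)

CREATE TABLE event (
  user_id  bigint,
  event_id bigserial,
  time     timestamptz,
  data     jsonb
);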

Each event is associated with a user and has an ID, a timestamp, and a JSON blob with the event’s properties. At Heap, these properties might include the DOM hierarchy of a click, the window title, the session referrer, and so forth.

Let’s say we want to optimize our landing page to increase signups. The first step is to figure out where we’re losing users in our conversion funnel.

An example conversion funnel between four steps in a signup flow.

We’ll assume that we’ve instrumented our frontend to log events along this flow and that all of the data lives in the event table specified above.[1] As an initial question, let’s figure out how many people view our homepage and what percentage of them enter a credit card within two weeks of that initial homepage view. If we were using an older version of PostgreSQL, we might write some custom functions in PL/pgSQL, PostgreSQL’s builtin procedural language. But, in 9.3, we can use a lateral join to compute this in one efficient query, with no extensions or PL/pgSQL.

Nobody likes 30-line SQL queries, so let’s break this down into pieces. The first chunk of this is vanilla SQL:
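(The query isn’t reproduced here; the following is a hedged sketch of that first chunk, assuming the schema sketch above and an illustrative view_homepage event type. The 1 AS view_homepage flag column is there so it can be summed later.)

SELECT user_id,
       1 AS view_homepage,
       min(time) AS view_homepage_time
FROM event
WHERE data->>'type' = 'view_homepage'
GROUP BY user_id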

That is, get the initial time each user did a view_homepage event. Then our lateral join allows us to iterate over each resulting row and perform a parametrized version of the next subquery. This is equivalent to taking the query below and running it for each resulting row:
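(A hedged sketch of that parametrized subquery. The reference to e1 only makes sense inside the lateral join shown a bit further down, where e1 is the alias of the previous subquery.)

SELECT 1 AS enter_credit_card,
       time AS enter_credit_card_time
FROM event
WHERE user_id = e1.user_id                         -- cross-reference into the outer FROM item
  AND data->>'type' = 'enter_credit_card'
  AND time BETWEEN e1.view_homepage_time
               AND e1.view_homepage_time + interval '2 weeks'
ORDER BY time
LIMIT 1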

I.e., for each user, get the first time he or she performed the enter_credit_card event within two weeks of view_homepage_time. Because this is a lateral join, our subquery can make reference to the view_homepage_time results from the previous subquery. Otherwise, the subqueries would be evaluated independently and there would be no way to access results from one when evaluating the other.
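Putting the two pieces together, the join skeleton might look something like this (still a hedged sketch; ON true is the required join condition for a LEFT JOIN LATERAL with no extra predicate):

SELECT *    -- the full query selects the funnel columns explicitly
FROM (
  SELECT user_id,
         1 AS view_homepage,
         min(time) AS view_homepage_time
  FROM event
  WHERE data->>'type' = 'view_homepage'
  GROUP BY user_id
) e1
LEFT JOIN LATERAL (
  SELECT 1 AS enter_credit_card,
         time AS enter_credit_card_time
  FROM event
  WHERE user_id = e1.user_id
    AND data->>'type' = 'enter_credit_card'
    AND time BETWEEN e1.view_homepage_time
                 AND e1.view_homepage_time + interval '2 weeks'
  ORDER BY time
  LIMIT 1
) e2 ON true;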

Then we wrap the whole thing in a select, which returns something like this:

Because this is a LEFT JOIN, the query still produces result rows for users with no matching enter_credit_card event, as long as there is a view_homepage event. If we aggregate over the numerical columns, we get a tidy summary of this conversion funnel:

… which produces:

We can add intermediate steps to this funnel with more lateral joins to evaluate which portions of our flow we should focus on improving.[2] Let’s add a use_demo step between viewing the homepage and entering a credit card.

This yields:[3]

This gives us the three-step conversion funnel from viewing the homepage to using the demo within a week to entering the credit card within a week of that. From here, the expressive power of PostgreSQL allows us to drill down on these results and thoroughly analyze the performance of our website. We might follow up with:

  • Does using the demo increase the likelihood of a signup?
  • Do users who discover our homepage via an ad convert with the same likelihood as users from other sources?
  • How do these conversion rates change over time or with different A/B test variants?

The answers to these questions apply directly to product changes and can be determined in PostgreSQL now that it supports lateral joins.

Without lateral joins, we would need to resort to PL/pgSQL to do this analysis. Or, if our data set were small, we could get away with complex, inefficient queries. In an exploratory data science use case, you might just pull your data out of PostgreSQL and analyze it with your scripting language of choice. But there is considerable power in being able to express these questions in SQL, especially if you’re wrapping it all in an approachable UI and exposing the functionality to nontechnical users.

Note that these queries can be tuned so as to be very efficient. In this example, if we create a btree index on (user_id, (data->>'type'), time), we can evaluate each funnel step for each user with a single indexed lookup. If you’re using SSDs, on which seeks are cheap, this might be good enough. If not, you might need to schematize your data a bit differently, but I’ll leave the details of that for another post.

Have a favorite new PostgreSQL feature or a neat lateral join use case? Ping me @danlovesproofs.

Interested in building systems that make powerful features easy to use? Shoot us a note at jobs@heapanalytics.com.

[1] Or we can use Heap, which captures everything for us as soon as we install it! No need to write any logging code, and no risk of forgetting to log something that you’ll want to analyze later.
[2] Note that adding additional steps to a conversion funnel would be particularly easy if we were using a product like Heap, since we would already have the relevant data.
[3] The number of users with enter_credit_card events is lower in this query than in the previous one, as this query only returns enter_credit_card events for users who do so after doing the use_demo event, and 17 of the users signed up without using the demo.

Building Automated Analytics Logging for iOS Apps

Analytics is often the first tool developers add to their iOS app. A standard approach is to write logging code like this:

Manual logging calls

Let’s call this manual event-tracking. With manual event-tracking, you write logging code for each analytics event you care about. A logEvent: for signing in, a logEvent: for inviting a friend, a logEvent: for opening the Settings view, and so forth.

Here, we’ll describe a different approach: automatic event-tracking. With automatic event-tracking, client-side events are automatically logged. Usage can be analyzed without having to define events upfront, without writing code, without going through the App Store, and with virtually no performance overhead.

In this post, we’ll provide a blueprint for building automatic event-tracking into your own app. Heap’s analytics library for iOS offers automatic event-tracking, but this information should be useful to you even if you don’t use Heap.

The Problem with Manual Event-Tracking

Let’s say you’ve launched your new iOS app. You took painstaking care to craft the perfect launch. You designed a beautiful landing page, appeared in major press outlets, and generated 5,000 signups in 2 days. Woo.

But after the initial fanfare subsides, you start to think: “Where did all my visitors drop off in our signup form?” “Which source of traffic had the highest conversion rate?”

Oh, right. In the frenzy of shipping, you forgot to instrument your signup flow.

Now your next two weeks will be spent:

  1. Mourning launch data that’s forever lost.
  2. Instrumenting your signup flow with logging code.
  3. Waiting for your app to get approved by Apple.
  4. Waiting.
  5. Waiting some more for data to trickle in.
  6. Analyzing your data.

Broken logging calls

This is what happens when a bug in your code breaks logging. You lose data forever. Source.

This is the fundamental issue with manual event-tracking. Tagging each event of interest in advance is brittle and often results in stale data. And since you don’t know what you don’t know, every new question requires new logging code, resulting in a lengthy feedback loop.

How Do We Build Automatic Event-Tracking?

On web, there’s only one event abstraction, so the answer is obvious: log every DOM event.

On iOS, we have more options:

  1. Log every touch event, in the form of a UIEvent or UIGestureRecognizer.
  2. Log every call to an action method with target-action or Interface Builder.
  3. Log every appearance of a view controller.
  4. Log every notification issued through NSNotificationCenter.

The correct automated framework will map as closely as possible to typical logging calls. There should also be enough context around each event, so that we can easily pinpoint salient interactions post-hoc.

Where do iOS developers normally place logging calls? A GitHub search for Flurry’s logEvent: method surfaces lots of code like this:

A search for analytics logging

There’s an obvious connection here between action methods and logging calls. Source.

This is only a single example, but the pattern is clear: analytics calls are typically made in action methods. If we log action methods as they occur, then we can save ourselves the overhead of sprinkling analytics code throughout our app.

(Note that Heap’s iOS library not only auto-logs action methods, but also UIEvents and view controllers. The general approach is the same for all three event types, but in this blog post, we’ll focus on implementing the former.)

In UIKit, every user interaction on a control passes a sendAction:to:from:forEvent: message to the global UIApplication object. If we extend sendAction:to:from:forEvent: to log each occurrence with its parameters, we’ll have a complete picture of the user’s flow through our app.

There’s one problem, though. The sendAction:to:from:forEvent: method is a built-in Cocoa method. We can’t modify built-in methods, so how can we hope to log each occurrence?

We’ll need to use method swizzling.

What is Method Swizzling?

Simply put, method swizzling allows you to extend an existing method implementation.

Subclassing and categories are common techniques for extending methods. But method swizzling is unique in that it works without forcing you to modify existing calls to the original method and without requiring access to the original method’s implementation. This means we can modify the behavior of built-in Cocoa methods.

This is very powerful. And it’s exactly how we’ll auto-log sendAction:.

Method swizzling is made possible by Objective-C’s flexible runtime. With a single #import <objc/runtime.h>, we gain access to various low-level goodies. In particular, we’ll use method_exchangeImplementations, which (unsurprisingly) lets you swap the implementations of two methods.

The final result looks like:

Here’s what’s happening step-by-step:

  1. First, we extend UIApplication with an EventAutomator category that contains a new method heap_sendAction:. This method will replace the built-in sendAction:. (A unique prefix is crucial in order to avoid namespace collisions.)
  2. Then, in our category’s +load method, we swap our method with the original using method_exchangeImplementations. This works because the Objective-C runtime special-cases +load to be invoked for every category within a class, even if +load is defined multiple times.
  3. Finally, we implement heap_sendAction: to log the action selector argument and then call the original sendAction: method.

At first glance, heap_sendAction: appears to be infinitely recursive. But remember that we swapped the implementation of heap_sendAction: with sendAction:. Thus, at runtime, sending the heap_sendAction: message actually calls code within sendAction:.

Method swizzling can be a bit of a mind-bender.

Note that this implementation of method swizzling is incomplete. For instance, what if the method we’re swizzling is defined within a superclass? Then the swap will ultimately fail, because the original method is missing from the class.

To avoid hairy edge cases like this, just use a method swizzling library. You should have little reason to deviate from existing implementations like JRSwizzle or NSHipster’s.

The Results in Action

Though we’ve only written a few lines of code, this works! Watch it in action on Coffee Timer:

Automatic event tracking in action!

Remember: we’re getting this data without having written any logging code!

Feel free to dig into the underlying source code.

Method swizzling is central to building out an automatic analytics framework. Using the code we just wrote, we won’t need to worry about defining events of interest upfront. And whenever we need to analyze a new interaction, we won’t need to go through the bother of shipping new code to the App Store.

The Heap iOS library further expands upon this idea by auto-logging other types of events, including all UIEvents and every appearance of a view controller.

Performance Optimizations

If you’re wondering whether automatic event-tracking affects your app’s performance, know this: the performance impact is negligible. Our approach needs a few refinements, though.

Here are some of the performance optimizations we implement in Heap’s iOS library. If you’re considering building your own analytics, we recommend you implement them as well.

Batch network requests.

Network requests power up a device’s cell radio. A naive approach to automatic event-tracking – where every event is transmitted as soon as it occurs – would keep the radio powered for the entirety of a session. This is wasteful.

The solution is to transmit bursts of data at fixed time intervals (say, every 15 seconds). This keeps the radio triggered for a much smaller portion of time and conserves battery life, at the cost of a minor decrease in data recency.

Using the Coffee Timer app as our testbed, we see that automatic event-tracking puts minimal strain on battery life and CPU. The graphs below profile the energy/CPU usage of two highly-active sessions. Heap is active during one of these sessions and disabled during the other.

Can you tell which set of metrics corresponds to which session? Neither can we. Even with frantic user input, our automatic event-tracker rarely exceeds 1% CPU usage.

A session with Heap enabled.
A session without Heap enabled.

Performance stats for nearly identical 1-minute sessions. Rate of activity was very high (~90 interactions). Heap is on top, vanilla app is on bottom.

Process events off the main thread.

Method swizzling itself is lightweight, as it’s just a pointer swap at runtime. But before you send events to your servers, you’ll likely need to perform some client-side preprocessing. For instance, the Heap iOS library munges events into a URL format, tags events with metadata such as App Version, Carrier, and Device Model, and sanitizes event properties.

Don’t do this work on the main thread. Instead, push processing work onto a background queue (using Grand Central Dispatch or an NSOperationQueue). This unblocks UI rendering on the main thread and ensures your app remains snappy even under heavy load.

The Future of Mobile Analytics

This approach works for iOS apps built on Cocoa. It’s extremely performant, provides a comprehensive view of a user’s activity, and saves you the hassle of writing and maintaining analytics code.

But there’s plenty to be done to make this more broadly applicable. How do we handle games built on Cocos2d or Unity? How is automatic event-tracking implemented in Android apps? We hope to answer these questions in a follow-up post, as well as open-source our own iOS integration.

Let us know what you think on Twitter. And of course, if you’d prefer to use a battle-tested automatic event-tracking tool for iOS, consider using Heap.

Creating PostgreSQL Arrays Without A Quadratic Blowup

At Heap, we lean on PostgreSQL for most of the backend heavy lifting.[1] We store each event as an hstore blob, and we keep a PostgreSQL array of events done by each user we track, sorted by time. Hstore lets us attach properties to events in a flexible way, and arrays of events give us great performance, especially for funnel queries, in which we compute the drop off between different steps in a conversion funnel.[2]

In this post, we’ll take a look at a PostgreSQL function that hangs on large inputs and rewrite it in an efficient, idiomatic way.

If you’re writing a PL/pgSQL function that returns an array, it can be tempting to create an empty array and build up results in a loop with array_append or array_cat. But, as is often the case, procedural idioms in a relational database result in bad times.

Consider an example function in which we create an array of hstores such that the entry at position i is "num=>i".
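(The function body isn’t shown in this copy of the post. Below is a hedged reconstruction of what such a function looks like; the name blowup follows the timing chart caption below, but the exact original code may have differed. It assumes the hstore extension is installed.)

CREATE OR REPLACE FUNCTION blowup(n integer) RETURNS hstore[] AS $$
DECLARE
  result hstore[] := '{}';
BEGIN
  FOR i IN 1..n LOOP
    -- Each iteration allocates a brand new, slightly wider array and copies
    -- the old contents into it.
    result := array_append(result, hstore('num', i::text));
  END LOOP;
  RETURN result;
END;
$$ LANGUAGE plpgsql;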

Looks simple enough, but this is bad news. This takes a quadratic amount of time to run, blowing up to 36 seconds to generate an array of 100k elements.

Execution Times For blowup

Test queries were timed on a MacBook Pro with a 2.4GHz i7 and 16 GB of RAM.

What’s going on here? It turns out the repeated calls to array_append cause this quadratic explosion. When we call result := array_append(result, ...), PostgreSQL allocates a new array that’s wide enough for the result of the array_append call and then copies the data in. That is, array_append(array, new_element) is linear in the length of array, which makes the implementation above O(N²).

A lot of languages handle this idiom more gracefully. A common strategy is to resize the array that backs a list to double the existing size. With a list written this way, repeated appends would only require us to execute the “grow the array and copy over your data” operation a logarithmic number of times, and our amortized runtime would be linear.

So, PostgreSQL could be smarter, but this is not an idiomatic implementation, so we shouldn’t expect it to be. The correct way to do this is with array_agg — an aggregate function that takes a set and returns all of the entries as an array.
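(A hedged sketch of the array_agg version; the function name is illustrative.)

CREATE OR REPLACE FUNCTION no_blowup(n integer) RETURNS hstore[] AS $$
  -- Generate the whole set first, then materialize it as an array exactly once.
  SELECT array_agg(hstore('num', i::text) ORDER BY i)
  FROM generate_series(1, n) AS i;
$$ LANGUAGE sql;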

This is fast, and it scales linearly. It takes 300 ms to generate an array of 100k elements — a 100x reduction. This query allows PostgreSQL to generate the complete set of results before materializing any arrays, so it doesn’t need to do so more than once. PostgreSQL can compute upfront exactly how long the resulting array needs to be and only needs to do one allocation.

Lesson learned: if you find yourself calling array_append or array_cat in a loop, use array_agg instead.

When you’re working with any relational database, you should reconsider any procedural code you find yourself writing. Also, it helps to have an intuition for how long something “should” take. Generating a 100k element array (with around one megabyte of total data) shouldn’t take thirty seconds, and, indeed, it doesn’t need to.

We like learning stuff. Have any feedback or other PostgreSQL tips? Shoot us a note @heap.

[1] In particular, we use a lovely tool called Citus Data. More on that in another blog post!
[2] See: https://heapanalytics.com/features/funnels. In particular, computing a conversion funnel requires a single scan over the array of events a user has done and doesn’t require any joins.

Using PostgreSQL Arrays The Right Way

At Heap, we lean on PostgreSQL for most of the backend heavy lifting.[1] We store each event as an hstore blob, and we keep a PostgreSQL array of events done by each user we track, sorted by time. Hstore lets us attach properties to events in a flexible way, and arrays of events give us great performance, especially for funnel queries, in which we compute the drop off between different steps in a conversion funnel.[2]

In this post, we’ll take a look at a PostgreSQL function that unexpectedly hung on large inputs and rewrite it in an efficient, idiomatic way.

Your first instinct might be to treat arrays in PostgreSQL like their analogues in C-based languages. You might be used to manipulating data via array positions or slices. Be careful not to think this way in PostgreSQL, especially if your array type is variable length, e.g. json, text, or hstore. If you’re accessing a PostgreSQL array via its positions, you’re in for an unexpected performance blowup.

This came up a few weeks ago at Heap. We keep an array of events for each user tracked by Heap, in which we represent each event with an hstore datum. We have an import pipeline that appends new events to the right arrays. In order to make this import pipeline idempotent, we give each event an event_id entry, and we run our event arrays through a utility function that squashes duplicates. If we want to update the properties attached to an event, we just dump a new event into the pipeline with the same event_id.

So, we need a utility function that takes an array of hstores, and, if two events have the same event_id, it should take the one that occurs later in the array. An initial attempt to write this function looked like this:
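(The original function isn’t reproduced here; what follows is a hedged reconstruction. The name dedupe_events_1 comes from the timing chart caption below, and the body is written to match the behavior described: when two events share an event_id, keep the one that occurs later in the array, preserving input order otherwise.)

CREATE OR REPLACE FUNCTION dedupe_events_1(events hstore[]) RETURNS hstore[] AS $$
  SELECT array_agg(event ORDER BY sub)
  FROM (
    -- Index into the array at every position and rank duplicates so that the
    -- latest occurrence of each event_id wins.
    SELECT events[sub] AS event,
           sub,
           rank() OVER (PARTITION BY (events[sub] -> 'event_id')
                        ORDER BY sub DESC) AS pos
    FROM generate_subscripts(events, 1) AS sub
  ) ranked
  WHERE pos = 1;
$$ LANGUAGE sql;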

This works, but it blows up for large inputs. It’s quadratic, and it takes almost 40 seconds for an input array of 100k elements!

Execution Times For dedupe_events_1

Test queries were timed on a MacBook Pro with a 2.4GHz i7 and 16 GB of RAM and generated with this script: https://gist.github.com/drob/9180760.

What’s going on here? The issue is that PostgreSQL stores an array of hstores as an array of values, not an array of pointers to values. That is, an array of three hstores looks something like
{"event_id=>1,data=>foo", "event_id=>2,data=>bar", "event_id=>3,data=>baz"}
under the hood, as opposed to
{[pointer], [pointer], [pointer]}

For types that are variable length, e.g. hstores, json blobs, varchars, or text fields, PostgreSQL has to scan the array to find the Nth element. That is, to evaluate events[2], PostgreSQL parses events from the left until it hits the second entry. Then, for events[3], it re-scans from the first index all over again until it hits the third entry! So, evaluating events[sub] is O(sub), and evaluating events[sub] for each index in the array is O(N²), where N is the length of the array.

PostgreSQL could be smarter about caching intermediate parse results, or it could parse the array once in a context like this. The real answer is for arrays of variable-length elements to be implemented with pointers to values, so that we can always evaluate events[i] in constant time.

Even so, we shouldn’t rely on PostgreSQL handling this well, since this is not an idiomatic query. Instead of generate_subscripts, we can use unnest, which parses an array and returns a set of entries. This way, we never have to explicitly index into the array.
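(A hedged sketch of the unnest-based rewrite; the function name dedupe_events_2 is illustrative. It relies on unnest returning elements in array order, with row_number() recording each element’s original position.)

CREATE OR REPLACE FUNCTION dedupe_events_2(events hstore[]) RETURNS hstore[] AS $$
  SELECT array_agg(event ORDER BY sub)
  FROM (
    SELECT event,
           sub,
           rank() OVER (PARTITION BY (event -> 'event_id')
                        ORDER BY sub DESC) AS pos
    FROM (
      -- Parse the array exactly once.
      SELECT event, row_number() OVER () AS sub
      FROM unnest(events) AS event
    ) unnested
  ) ranked
  WHERE pos = 1;
$$ LANGUAGE sql;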

This is efficient, and it takes an amount of time that’s linear in the size of the input array. It takes about half a second for an input of 100k elements, compared to 40 seconds with the previous implementation.

This does what we want:

  • Parse the array once, with unnest.
  • Partition by event_id.
  • Take the last occurrence for each event_id.
  • Sort by input index.

Lesson learned: if you find yourself accessing positions in a PostgreSQL array, consider unnest instead.

We like to avoid stubbing our toes. Have any feedback or other PostgreSQL tips? Shoot us a note @heap.

[1] In particular, we use a lovely tool called Citus Data. More on that in another blog post!
[2] See: https://heapanalytics.com/features/funnels. In particular, computing a conversion funnel requires a single scan over the array of events a user has done and doesn’t require any joins.

The Event Visualizer

Today we’re launching the Event Visualizer, which makes analytics integration as simple as clicking around your own website. For the first time, people with zero coding knowledge can start tracking events and generate important metrics instantly.

Watch it in action below:

Some of the design goals of the Event Visualizer:

  • The easiest possible integration. There’s no need to write code. Just use the point and click interface to define events for use in funnels or graphs.
  • Instant, retroactive metrics. Forgot to log a certain interaction? Not an issue – just redefine an event via the visualizer, and it will include all historical data from day 1.
  • Data you’ll understand. Sometimes teammates assign bewildering names to events: SHOPPING_FLOW_START or Login Event v2 newest. Instead, with this tool, you can visually see which on-screen actions correspond with which events.

We built the Event Visualizer because we found that non-technical people are becoming more and more dependent on analytics to make decisions. Marketers need to measure engagement across traffic sources, product managers need to quantify feature usage, salespeople need to identify promising leads, and designers need to understand paths their users take.

But the people consuming this data aren’t the people collecting this data. As a result, companies are bottlenecked on engineers to manually instrument the right events for them.

We think we’ve overcome this bottleneck. The Event Visualizer is our take on bringing analytics to the masses.

To get started, just sign up at https://heapanalytics.com.

The Event Visualizer is currently only available for web. Tweet us @heap if you’re interested in beta-testing our iOS visualizer.