Creating PostgreSQL Arrays Without A Quadratic Blowup

At Heap, we lean on PostgreSQL for most of the backend heavy lifting.[1] We store each event as an hstore blob, and we keep a PostgreSQL array of events done by each user we track, sorted by time. Hstore lets us attach properties to events in a flexible way, and arrays of events give us great performance, especially for funnel queries, in which we compute the drop off between different steps in a conversion funnel.[2]

In this post, we’ll take a look at a PostgreSQL function that hangs on large inputs and rewrite it in an efficient, idiomatic way.

If you’re writing a PL/pgSQL function that returns an array, it can be tempting to create an empty array and build up results in a loop with array_append or array_cat. But, as is often the case, procedural idioms in a relational database result in bad times.

Consider an example function in which we create an array of hstores such that the entry at position i is "num=>i".
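(The function body isn’t reproduced here; the following is a minimal sketch of the loop-and-array_append pattern being described. The name blowup matches the benchmark caption below; the rest is a reconstruction, not the post’s exact code.)

CREATE OR REPLACE FUNCTION blowup(n integer) RETURNS hstore[] AS $$
DECLARE
  result hstore[] := '{}';
BEGIN
  FOR i IN 1..n LOOP
    -- Each append allocates a brand-new array and copies the old one into it.
    result := array_append(result, hstore('num', i::text));
  END LOOP;
  RETURN result;
END;
$$ LANGUAGE plpgsql;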

Looks simple enough, but this is bad news. This takes a quadratic amount of time to run, blowing up to 36 seconds to generate an array of 100k elements.

Execution Times For blowup

Test queries were timed on a MacBook Pro with a 2.4GHz i7 and 16 GB of RAM.

What’s going on here? It turns out the repeated calls to array_append cause this quadratic explosion. When we call result := array_append(result, ...), PostgreSQL allocates a new array that’s wide enough for the result of the array_append call and then copies the data in. That is, array_append(array, new_element) is linear in the length of array, which makes the implementation above O(N²).

A lot of languages handle this idiom more gracefully. A common strategy is to double the size of the array that backs a list whenever it fills up. With a list implemented this way, repeated appends only require the “grow the array and copy over your data” operation a logarithmic number of times, and the amortized runtime is linear.

So, PostgreSQL could be smarter here, but this is not an idiomatic implementation, and we shouldn’t expect the database to optimize for it. The correct way to do this is with array_agg — an aggregate function that takes a set and returns all of the entries as a single array.
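(The exact query isn’t reproduced here either; a sketch of the array_agg approach, using generate_series to produce the set of rows, looks something like this.)

-- Build the whole set first, then aggregate it into an array in one pass.
SELECT array_agg(hstore('num', i::text))
FROM generate_series(1, 100000) AS i;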

This is fast, and it scales linearly. It takes 300 ms to generate an array of 100k elements, a speedup of over 100x. Because this query produces the complete set of results before materializing the array, PostgreSQL can compute upfront exactly how long the resulting array needs to be and perform a single allocation.

Lesson learned: if you find yourself calling array_append or array_cat in a loop, use array_agg instead.

When you’re working with any relational database, you should reconsider any procedural code you find yourself writing. Also, it helps to have an intuition for how long something “should” take. Generating a 100k element array (with around one megabyte of total data) shouldn’t take thirty seconds, and, indeed, it doesn’t need to.

We like learning stuff. Have any feedback or other PostgreSQL tips? Shoot us a note @heap.

[1] In particular, we use a lovely tool called Citus Data. More on that in another blog post!
[2] See: https://heapanalytics.com/features/funnels. In particular, computing a conversion funnel requires a single scan over the array of events a user has done and doesn’t require any joins.

Using PostgreSQL Arrays The Right Way

At Heap, we lean on PostgreSQL for most of the backend heavy lifting.[1] We store each event as an hstore blob, and we keep a PostgreSQL array of events done by each user we track, sorted by time. Hstore lets us attach properties to events in a flexible way, and arrays of events give us great performance, especially for funnel queries, in which we compute the drop off between different steps in a conversion funnel.[2]

In this post, we’ll take a look at a PostgreSQL function that unexpectedly hung on large inputs and rewrite it in an efficient, idiomatic way.

Your first instinct might be to treat arrays in PostgreSQL like their analogues in C-based languages. You might be used to manipulating data via array positions or slices. Be careful not to think this way in PostgreSQL, especially if your array type is variable length, e.g. json, text, or hstore. If you’re accessing a PostgreSQL array via its positions, you’re in for an unexpected performance blowup.

This came up a few weeks ago at Heap. We keep an array of events for each user tracked by Heap, in which we represent each event with an hstore datum. We have an import pipeline that appends new events to the right arrays. In order to make this import pipeline idempotent, we give each event an event_id entry, and we run our event arrays through a utility function that squashes duplicates. If we want to update the properties attached to an event, we just dump a new event into the pipeline with the same event_id.

So, we need a utility function that takes an array of hstores and, whenever two events have the same event_id, keeps the one that occurs later in the array. An initial attempt to write this function looked like this:
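(The original function isn’t reproduced here; this is a reconstruction of the positional approach described below, using generate_subscripts and direct array indexing, so details may differ from the post’s exact code.)

CREATE OR REPLACE FUNCTION dedupe_events_1(events hstore[]) RETURNS hstore[] AS $$
  SELECT array_agg(event ORDER BY sub)
  FROM (
    -- Index into the array by position and rank duplicates so that the
    -- last occurrence of each event_id wins.
    SELECT events[sub] AS event, sub,
           row_number() OVER (PARTITION BY events[sub] -> 'event_id'
                              ORDER BY sub DESC) AS rank
    FROM generate_subscripts(events, 1) AS sub
  ) indexed
  WHERE rank = 1;
$$ LANGUAGE sql;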

This works, but it blows up for large inputs. It’s quadratic, and it takes almost 40 seconds for an input array of 100k elements!

Execution Times For dedupe_events_1

Test queries were timed on a MacBook Pro with a 2.4GHz i7 and 16 GB of RAM and generated with this script: https://gist.github.com/drob/9180760.

What’s going on here? The issue is that PostgreSQL stores an array of hstores as an array of values, not an array of pointers to values. That is, an array of three hstores looks something like
{"event_id=>1,data=>foo", "event_id=>2,data=>bar", "event_id=>3,data=>baz"}
under the hood, as opposed to
{[pointer], [pointer], [pointer]}

For types that are variable length, e.g. hstores, json blobs, varchars, or text fields, PostgreSQL has to scan the array to find the Nth element. That is, to evaluate events[2], PostgreSQL parses events from the left until it hits the second entry. Then, for events[3], it re-scans from the first index all over again until it hits the third entry! So, evaluating events[sub] is O(sub), and evaluating events[sub] for each index in the array is O(N²), where N is the length of the array.

PostgreSQL could be smarter about caching intermediate parse results, or it could parse the array once in a context like this. The real answer is for arrays of variable-length elements to be implemented with pointers to values, so that we can always evaluate events[i] in constant time.

Even so, we shouldn’t rely on PostgreSQL handling this well, since this is not an idiomatic query. Instead of generate_subscripts, we can use unnest, which parses an array and returns a set of entries. This way, we never have to explicitly index into the array.
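(Again, the query isn’t reproduced here; this sketch uses unnest with WITH ORDINALITY, available since PostgreSQL 9.4, to keep each event’s original position. The original code may have tracked the index differently, but the shape is the same.)

CREATE OR REPLACE FUNCTION dedupe_events_2(events hstore[]) RETURNS hstore[] AS $$
  SELECT array_agg(event ORDER BY sub)
  FROM (
    -- unnest parses the array exactly once; the ordinality column
    -- preserves each event's original index.
    SELECT event, sub,
           row_number() OVER (PARTITION BY event -> 'event_id'
                              ORDER BY sub DESC) AS rank
    FROM unnest(events) WITH ORDINALITY AS u(event, sub)
  ) parsed
  WHERE rank = 1;
$$ LANGUAGE sql;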

This is efficient, and it takes an amount of time that’s linear in the size of the input array. It takes about half a second for an input of 100k elements, compared to 40 seconds with the previous implementation.

This does what we want:

  • Parse the array once, with unnest.
  • Partition by event_id.
  • Take the last occurrence for each event_id.
  • Sort by input index.

Lesson learned: if you find yourself accessing positions in a PostgreSQL array, consider unnest instead.

We like to avoid stubbing our toes. Have any feedback or other PostgreSQL tips? Shoot us a note @heap.

[1] In particular, we use a lovely tool called Citus Data. More on that in another blog post!
[2] See: https://heapanalytics.com/features/funnels. In particular, computing a conversion funnel requires a single scan over the array of events a user has done and doesn’t require any joins.

The Event Visualizer

Today we’re launching the Event Visualizer, which makes analytics integration as simple as clicking around your own website. For the first time, people with zero coding knowledge can start tracking events and generate important metrics instantly.

Watch it in action below:

Some of the design goals of the Event Visualizer:

  • The easiest possible integration. There’s no need to write code. Just use the point and click interface to define events for use in funnels or graphs.
  • Instant, retroactive metrics. Forgot to log a certain interaction? Not an issue – just redefine an event via the visualizer, and it will include all historical data from day 1.
  • Data you’ll understand. Sometimes teammates assign bewildering names to events: SHOPPING_FLOW_START or Login Event v2 newest. Instead, with this tool, you can visually see which on-screen actions correspond with which events.

We built the Event Visualizer because we found that non-technical people are becoming more and more dependent on analytics to make decisions. Marketers need to measure engagement across traffic sources, product managers need to quantify feature usage, salespeople need to identify promising leads, and designers need to understand paths their users take.

But the people consuming this data aren’t the people collecting this data. As a result, companies are bottlenecked on engineers to manually instrument the right events for them.

We think we’ve overcome this bottleneck. The Event Visualizer is our take on bringing analytics to the masses.

To get started, just sign up at https://heapanalytics.com.

The Event Visualizer is currently only available for web. Tweet us @heap if you’re interested in beta-testing our iOS visualizer.

How We Estimated Our AWS Costs Before Shipping Any Code

Heap is a web and iOS analytics tool that automatically captures every user interaction, eliminating the need to define events upfront and allowing for flexible, retroactive analysis.

When we had the idea for Heap, it wasn’t clear whether its underlying tech would be financially tenable.

Plenty of existing tools captured every user interaction, but none offered much beyond rigid, pre-generated views of the underlying data. And plenty of tools allowed for flexible analysis (funnels, segmentation, cohorts), but only by operating on pre-defined events that represent a small subset of overall usage.

To our knowledge, no one had built a tool that offered 1) ad-hoc analysis 2) across a userbase’s entire activity stream. This was intimidating. Before we started coding, we needed to estimate an upper bound on our AWS costs with order-of-magnitude accuracy. Basically: “Is there a sustainable business model behind this idea?”

To figure this out, we started with the smallest unit of information: a user interaction.

Estimating Data Throughput

Every user interaction triggers a DOM event. We can model each DOM event as a JSON object:

{
    referrer: 'https://www.google.com/search?q=banana',
    url: 'https://www.bananalytics.com/',
    type: 'click',
    target: 'div#gallery div.next',
    timestamp: 1387974232845
    ...
}

With all the properties Heap captures, a raw event occupies ~1 kB of space.

Our initial vision for Heap was to offer users unadulterated, retroactive access to the DOM event firehose. If you could bind an event handler to it, we wanted to capture it. To estimate the rate of DOM event generation, we wrote a simple script:

var start = Date.now(),
    eventCount = 0;

for (var k in window) {
    // Find all DOM events we can bind a listener to
    if (k.indexOf('on') === 0) {
        window.addEventListener(k.slice(2), function(e){eventCount++});
    }
}

setInterval(function(){
    var elapsed = (Date.now() - start) / 1000;
    console.log('Average events per second: ' + eventCount / elapsed);
}, 1000);

Try it out yourself. With steady interaction, you’ll generate ~30 DOM events per second. Frenetic activity nets ~60 events per second. That’s a lot of data, and it presented an immediate bottleneck: client-side CPU and network overhead.

Luckily, this activity mostly consists of low-signal data: mousemove, mouseover, keypress, etc. Customers don’t care about these events, nor can they meaningfully quantify them. By restricting our domain to high-signal events – click, submit, change, push state events, page views – we can reduce our throughput by almost two orders of magnitude with negligible impact on data fidelity.

With this subset of events, we found via manual testing that sessions rarely generate more than 1 event per second. We can use this as a comfortable upper bound. And how long is the average session? In 2011, Google Analytics published aggregate usage benchmarks, and their latest figures put the average session at about 5 minutes and 23 seconds (323 seconds).

Note that the estimate above is the most brittle step of our analysis. It fails to account for the vast spectrum in activity across different classes of apps (playing a game of Cookie Clicker is more input-intensive than reading an article on The Economist). But we’re not striving for perfect accuracy. We just need to calculate an upper-bound on cost that’s within the correct order of magnitude.

By multiplying the values above, we find that a typical web session generates 323 kB of raw, uncompressed data.
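Spelled out, using the figures above, that multiplication is:

(1 event / second) × (323 seconds / session) × (~1 kB / event) ≈ 323 kB per session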

Architectural Assumptions and AWS

We have a sense of the total data generated by a session, but we don’t know the underlying composition. How much of this data lives in RAM? On SSDs? On spinning disks?

To estimate, we made a few assumptions about our nascent infrastructure, making sure to err on the side of over-performance and increased costs:

  1. Queries need to be fast. Because lots of data would be accessed in an ad-hoc fashion, we presumed our cluster would be I/O bound. Thus, we intended to keep as much of the working set in memory as possible.
  2. Therefore, the last month of data needs to live in RAM. We assumed the lion’s share of analysis would take place on recent data. These queries need to be snappy, and the simplest way of ensuring snappiness is by throwing it all into memory. An aggressive goal, but not unreasonable.
  3. Data older than a month needs to live in SSDs. Given AWS’s reputation for fickle I/O, we made the assumption that spinning disks wouldn’t suffice, on either EBS or ephemeral stores. Provisioned IOPS helps, but offers a maximum throughput of 4k IOPS per volume, which is far less than the 10k-100k IOPS we measured with SSDs.
  4. We need to use on-demand instances for everything. If the business model only works with (cheaper) 1-year or 3-year reserved instances, then we’d need to commit much more capital upfront. We’d likely be cash-flow negative from day 1, thereby increasing the company’s risk and forcing us to raise more money. We also needed to assume any early-stage architecture would be in constant flux.

With AWS’s on-demand instances, we identified several storage options. (Note that the new I2 instances didn’t exist yet.)

  • RAM on High-Memory Quadruple Extra Large, which offers the cheapest cost/memory ratio at $10.33/GB/month.
  • SSD on High I/O Quadruple Extra Large, which offers the cheapest cost/SSD ratio at $1.09/GB/month.
  • Spinning disk on EBS or S3, at $0.10/GB/month.

You can see a stark difference in costs across each:

Amazon’s pricing page is frustratingly inconducive to price analysis, so we consulted the always-wonderful ec2instances.info.

RAM is an order-of-magnitude more expensive than SSDs, which in turn are an order of magnitude more expensive than spinning disks. Each drop-off is almost exactly 10x. Because memory is the dominant factor in our analysis, we can simplify calculations by focusing exclusively on the expected cost of RAM.

Final Estimate

After calculating the expected size of a visit and the price of RAM, we estimated a cost of (323 kB/visit) × ($0.0000103/kB/month) = $0.0033 (0.33 cents) per visit per month. Put another way: for Heap’s business model to work, a visit needs to offer on average one-third a cent of value to our customers.

With this figure, we reached out to a range of companies – small to medium-sized, web and mobile, e-commerce/SaaS/social – and based on their monthly visits, explicitly asked each one “Would you pay $X to eliminate most manual event instrumentation?” Their enthusiastic responses gave us the confidence to start coding.

Unforeseen Factors

This estimate was indeed within the correct order of magnitude. But as our pricing page shows, we charge quite a bit less than 0.33 cents per visit. We aren’t burning money with each visit. Our estimates were just a bit off.

A few unforeseen factors reduced costs:

  1. Compression. The complexity of an app or site’s markup doesn’t matter: when users click, they tend to click on the same things. This creates a lot of redundancy in our data set. In fact, we’ve seen a compression factor of up to 5x when storing data via Postgres.
  2. CPU. Our queries involve a large amount of string processing and data decompression. Much to our surprise, this caused our queries to become CPU-bound. Instead of spending more money on RAM, we could achieve equivalent performance with SSDs (which are far cheaper). Though we also needed to shift our costs towards more CPU cores, the net effect was favorable.
  3. Reserved Instances. Given the medium-term maturity of our infrastructure, we decided to migrate our data from on-demand instances to 1-year reserved instances. Our instances are heavily utilized, with customers sending us a steady stream of queries throughout the day. Per the EC2 pricing page, this yields 65% yearly savings.

On the other hand, there were a couple of unexpected factors that inflated costs:

  1. AWS Bundling. By design, no single instance type on AWS strictly dominates another. For example, if you decide to optimize for cost of memory, you may initially choose cr1.8xlarge instances (with 244GB of RAM). But you’ll soon find yourself outstripping its paltry storage (240 GB of SSD), in which case you’ll need to switch to hs1.8xlarge instances, which offer more disk space but at a less favorable cost/memory ratio. This makes it difficult to squeeze savings out of our AWS setup.
  2. Data Redundancy. This is a necessary feature of any fault-tolerant, highly-available cluster. Each live data point needs to be duplicated, which increases costs across the board by 2x.

Sound estimation is critical, especially for projects that contain an element of technical risk. As we’ve expanded our infrastructure and scaled to a growing userbase, we’ve found these techniques invaluable in guiding our day-to-day work.

If this sort of thinking excites you, and you’re interested in building highly-scalable systems, reach out at jobs@heapanalytics.com.

Getting the details right in an interactive line graph

At Heap, we recently redesigned our line graphs with an eye towards user experience and data transparency. Line graphs are simple and well-known, and many of the changes we made may seem small or inconsequential. However, the net effect is quite dramatic, which you can see for yourself by interacting with the live graphs below.

Heap is a web and iOS analytics tool that captures every user interaction and lets you analyze it later. This means that, when you want to answer a question with data, you can do it immediately, instead of writing code, deploying it, and waiting for metrics to trickle in over the span of weeks.

Heap’s Old Line Graph

Below is the old version of our interactive line graph:

This is a usable line graph, and many of the design decisions we made seemed reasonable.

  • Hover targets. The chart allows users to mouse over the vertices (dots) in order to see a tooltip showing the numeric value of the data point.
  • Interpolation. We made a decision to use monotone cubic interpolation to draw the lines. This causes the lines to look “curved” or “smoothed” between points, instead of a jagged line that merely connects the dots. We did this for aesthetic reasons.
  • Animation. When hovering over different vertices, the tooltip has a 100ms transition animation to the new location. This helps the interaction feel more fluid.

Let’s take a look at how this version of the graph fares with multiple series:

Aside from the addition of a legend, there are almost no changes here. A big problem was that we didn’t treat a multiple line graph as a different design problem than the single line graph, and it suffered as a result. We’ll see below how reconsidering this approach led to a lot of improvements.

The New Line Graph

Customer feedback and our own usage of the line graph helped us uncover several problems. We addressed these problems with our new interactive line graph, shown below:

There are a number of improvements to the single line graph.

  • Target size. The old targets were far too small. A user needed to align their mouse exactly with a target with a 5 pixel radius. Both repeated use and Fitts’s Law told us that this was a suboptimal interaction. The new version of the line graph displays values in a tooltip when mousing over any part of the chart area. It uses the x-value of the mouse to determine which vertex to target.
  • Animation. We lowered the tooltip animation length to 50ms, to eliminate the jerky, distracting animations caused by the longer animation times on the old line graph. We didn’t eliminate animations entirely, however, since they give an impression of continuity. The animation also uses linear easing instead of d3’s default “cubic-in-out” easing, which allows for smoother transitions, especially when moving the mouse across many data points.
  • Less clutter. We removed the x-axis “Date” label, since the x-axis on our line graphs is always a time series, and people recognize that an axis with labels like “Jun 1” or “March 8 – March 15” refers to time periods. There’s no need for a “Date” label that takes up vertical space and adds nothing to comprehensibility. However, we retained the y-axis label, since units change across graph types (pageviews, visits, events, etc).
  • Linear interpolation. We got rid of the monotone cubic smoothing/interpolation of the lines, since it’s potentially misleading. Instead the lines between vertices are now straight.
  • Mouseleave interaction. When the mouse leaves the chart area, the tooltip disappears. This was an oversight in our previous version of the line graph.

The multiple line graph is also improved, and many of these improvements are a result of thinking about the design of the multiple line graph specifically. Here’s how it works now:

  • Hover interaction. One of the biggest problems with the old version of the multiple line graph was overlapping vertices. It was often impossible to hover over a vertex that was covered by another vertex. For the new graph, we made the legend turn into a tooltip that displays values when mousing over the graph. The entire legend/tooltip is given an opacity of 0.8, so that the lines beneath remain visible when it overlaps them.
  • Eliminating vertices. For line graphs with a large number of data series or a large time range, the large vertex size of the old graph (5 pixel radius) caused problems. The size of the vertices remained the same while the total number of vertices increased, so an increasing percentage of the line between vertices was covered up by the vertices themselves, making it harder to spot trends and changes in the data.
  • Performance. For multiple line graphs over long time ranges, the old line graph required us to render sometimes hundreds of SVG circles. Eliminating vertices greatly improved performance, and also enabled graphs that weren’t possible before (for example, graphing something hourly over a month-long time range).
  • Mouseleave interaction. When the mouse leaves the chart area, the tooltip reverts to the initial position of the legend.

Despite these improvements, there are a number of tradeoffs we made and questions that remain.

  • Number of data series. The multiple line graph is limited to 10 data series (if there are more than 10 in the returned data, only the 9 largest and “Other” are shown). How can we simultaneously display more time series without overwhelming our users?
  • Tooltip/legend. In the multiple line graph, the legend often obscures the data. This is addressed somewhat with the lowered opacity of the legend and the ability to move it around, but there are other possibilities:
    • One option is to move the legend to the side of the graph, and keep it fixed there (like it was in the old version of the line graph). We chose not to do this, since this takes up horizontal space. Also, when mousing over the chart, the displayed values might be on the other side of the screen, which is suboptimal.
    • Another option would be to display a table of values below the chart and eliminate the hover interaction entirely. This is similar to how we redesigned our funnel visualization (which may be the topic of a future article). However this is suboptimal since there is no visceral connection between the line graph and the table. They’re just two different views of the same data, rather than a unified single visualization.

Hit us up @heap with questions, thoughts, or links to well-designed line graphs you’ve seen elsewhere. Or just leave them in the discussion on Hacker News.

Interested in designing tools or visualizing massive amounts of data? Reach out at jobs@heapanalytics.com!

Our pricing model was broken. Here’s how we fixed it.

Pricing your product correctly can be tricky, and a lot of technical people who build SaaS products get it wrong on their first few tries. Our story makes for a nice case study.

Heap is a web and iOS analytics tool that captures every user interaction on your website and in your app. Instead of requiring you to log events in code, Heap captures everything upfront and lets you analyze it later. This means that, when you want to answer a question with data, you can do it immediately, instead of writing code, deploying it, and waiting for metrics to trickle in.

Our Old Pricing Model

We designed our initial pricing model with a few goals in mind:

  • It should be simple. We didn’t want to make the “engineer” mistake of building a highly configurable pricing scheme that confused people or scared them off. Someone who runs a website should be able to visit our pricing page and quickly get a good idea of how much Heap would cost for them.
  • It shouldn’t discourage people from using Heap more. We didn’t want to charge by the event definition or by the API call. This would disincentivize people from getting more value out of Heap.
  • It should scale gradually. We were concerned that customers would worry about overages or about crossing discontinuities in their pricing plans.

We settled on a sliding scale based on the number of monthly unique users.

Ye olde pricing scheme.

Fancy widgets are nice, but they don’t make this a good pricing scheme.

This satisfies all of the design goals above. It’s based on a single, standard metric that owners of major websites know off the tops of their heads, and it doesn’t have discontinuities or perverse incentives.

This has some problems, though.

  • This plan charges people as soon as they sign up for Heap, when it hasn’t captured much data yet. Heap provides more value over time, as your analysis goes over more data, but we were asking for money before we had delivered any value.
  • Our initial model was too simple. It papers over serious differences in value provided to each customer. A social app making money off of ads might have a lifetime revenue per user on the order of $2, whereas a web store selling $1000+ items has a much larger one. Heap provides a lot more value per user to the latter than to the former. A pricing model that only considers the number of users is either undercharging the latter or overcharging the former.

Schools Of Thought

There are a number of standard approaches to this problem, and we thought a lot about two of them in particular.

  • Pricing based on cost. This can be intuitive from an engineering perspective, as we can accurately measure how much each customer costs to service in terms of hardware, but this can be opaque to customers. Pricing based on the amount of data stored (i.e., the number of events) is one way to go about this, but customers don’t have a great frame of reference for how many events Heap captures, especially before they’ve started to use it. This also isn’t a great fit for us, since our costs are dominated by an engineering team’s salaries. The portion of our AWS bill that goes towards any customer is easy to measure, but that customer’s share of our dev throughput is inscrutable.
  • Pricing based on value provided. This aligns our goals with the customer’s. The amount of value each customer gets out of using Heap corresponds with the amount they pay us. This has the reverse problem, though: it’s difficult to model in an explicit formula how much value another company gets out of using Heap. A lot of SaaS companies use the number of accounts or licenses as a proxy for value. If a customer dedicates a team of eight people towards using Heap full-time, they’re probably getting a lot more value out of it than a company in which only one person uses it.

Today’s Pricing Model

The wave of the future.

This includes a number of important changes.

Heap is now free for websites with fewer than 25k monthly visits. Free as in “don’t enter a credit card; we won’t be charging you.” Rather than trying to nickel and dime small websites, we’ve decided that the most important thing is to get Heap into as many peoples’ hands as possible.

We offer a 60-day free trial, because we want everyone to have a Heap experience. We’ve found that, once people try Heap, they usually don’t want to go back to manually defining events in code or waiting for new data. We settled on a trial period of 60 days, because this gives customers a chance to accumulate enough data that the power of doing analysis “retroactively” becomes apparent.

We charge by the visit, not by the monthly user. A “visit” occurs when someone loads a page on your website for the first time in at least 30 minutes. This gives us a closer approximation to how valuable that user is. Before, we were charging the same amount for someone who landed on your homepage and immediately left as for someone who came back ten times that month. In most cases, the latter is a lot more valuable.
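(As a purely illustrative sketch, and not Heap’s actual implementation: the 30-minute rule can be expressed in SQL over a hypothetical pageviews(user_id, time) table like this.)

SELECT count(*) AS visits
FROM (
  SELECT CASE
           -- A visit starts on a user's first pageview, or on any pageview
           -- more than 30 minutes after their previous one.
           WHEN lag(time) OVER (PARTITION BY user_id ORDER BY time) IS NULL
             OR time - lag(time) OVER (PARTITION BY user_id ORDER BY time)
                > interval '30 minutes'
           THEN 1 ELSE 0
         END AS starts_visit
  FROM pageviews
) flagged
WHERE starts_visit = 1;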

This pricing model is simple, and it doesn’t disincentivize people from using Heap more. However, unlike our initial pricing model, it’s not gradual. We’re ok with this for now, since overages and pricing discontinuities don’t seem to be a serious concern for any of our customers. (The concern appears to be more common amongst tire-kickers.) But, it’s something we’re monitoring.

As for the choices of $149 and $499 as base prices for the tiers and 25k and 75k monthly visits as cutoffs between the tiers, the specific numbers are a little bit arbitrary. Looking at our existing ~500 customers, they made sense as parameter choices for us. For example, $149/month was close to what a lot of companies that would end up in the Growth tier were paying before, 75k monthly visits made sense as a cutoff after which we’d want to charge customers at ad-hoc levels, and so forth. A low “Enterprise Plan” cutoff is important, because it lets us attempt to charge our larger customers based on value provided.

For questions like “How many people are using our new feature?” or “Are people who signed up in the last two weeks more likely to convert to paying customers?” it should be possible to answer with data in a matter of minutes, instead of pushing new logging code and waiting for metrics over a span of days or weeks. In addition to answering your existing questions faster, a feedback loop on the order of minutes enables you to ask a whole new class of questions that would otherwise not be worth the effort of answering. This kind of analysis is empowering, and everyone with a website or an iOS app should be able to try it.

Like the technical pieces of a startup, our pricing plan is a work in progress, and we’re still refining it. We’re looking forward to seeing how this turns out.

We benefit tremendously from customer feedback. Give it a try and let us know what you think @heap!

Defining Events with Hierarchical CSS Selectors

Today, we’re introducing a new capability in Heap: defining web events with hierarchical CSS selectors. This makes it much easier for you to retroactively analyze user interactions, all without shipping code and without waiting for data.

Suppose you need to know whether customers interact with a gallery of images. The gallery is represented as a div.gallery with nested img elements.

Oh no – you forgot to assign classnames to these image tags! Not an issue with Heap. Just open the Define Event view, and define the “Click on Gallery Image” event like so:

This works exactly like a normal CSS selector. Just add spaces between each element to specify levels in the hierarchy. For the gallery above, a selector like div.gallery img matches any img nested inside a div with the gallery class.

And that’s it! Now you have retroactive data on this interaction computed instantly. You can segment on this event and use it to construct cohorts, as if you had been tracking it from the very beginning.

We updated Heap to start tracking hierarchical data on November 17, 2013. Thus, any query involving hierarchical events will extend back in time until this date. All non-hierarchical events will work the same as before (they’ll go back in time further and will only match the child-most element in the hierarchy).

How else should we let you define events? Let us know @heap!

Heap for iOS

Heap launched a couple months ago with a new approach to user analytics: just capture everything. It lets businesses conduct event-based analytics without having to ship code or wait for data to trickle in.

Now we’re ready to bring Heap to native iOS apps.

Mobile analytics for iPhone and iPad can be particularly painful. If you haven’t explicitly tagged the “invite friend” event, for instance, but you need to analyze what type of users most often invite their friends, then you’re forced to:

  1. Hunt down and manually instrument the “invite friend” event within your Objective-C code.
  2. Submit updated code for App Store approval.
  3. Wait.
  4. Wait.
  5. Once the app is pushed live, wait some more for data to accumulate on the tracked event.
  6. Finally ask your question.

Heap for iOS, on the other hand, automatically captures native interactions such as taps and swipes, so that the process above reduces to “ask your question”.

How it Works

To integrate Heap into an iOS app, just add our library package.


Once the newly-updated project is shipped, Heap will automatically capture all touch events and gestures on all user sessions, even if the user is offline.

To define an event after-the-fact, all you need to do is specify a gesture type and the corresponding UIView instance variable.


The event applies retroactively to all the past user activity within Heap. You can include it in funnels or segment on it, as if you had defined it from the very beginning.


We’ve been especially relentless in keeping the footprint of our iOS library low. CPU and memory activity remains negligible, even in apps with frantic user input (such as games). We also preserve bandwidth by batching user activity and only periodically sending data over the wire.

Request an Invite

Heap is in invite-only mode. To request an invite for the iOS library, just send us your email below.

Folks who tweet us @heap have been known to get advance invites. :-)

We hope Heap for iOS makes mobile analytics much easier for you, so that you can focus more time on development and marketing. As always, drop us a line for any feedback or ideas.

What’s Heappening

10x Faster Queries

Our favorite part of Heap is the ability to define events and user cohorts post-hoc. This makes it much easier for us to ask arbitrary product questions on the fly and answer them in real-time.

So, to tighten Heap’s feedback loop even further, we’ve spent a considerable amount of time optimizing our query engine. We’ve cut query execution times by a factor of 10, and average query times are currently well below 400ms. Our performance goals are:

  • 50% of queries must finish in <250ms
  • 90% of queries must finish in <1s
  • 99% of queries must finish in <3s

Try it out! Digging into data should generally feel much, much snappier now.

Event Feed Trends 

“We know there are known unknowns; that is to say, we know there are some things we do not know.
But there are also unknown unknowns – the ones we don’t know we don’t know.”
- Donald Rumsfeld

Our newly revamped Event Feed is a first step toward solving a perennial issue: how can I make sure I’m asking the right questions? The new interface automatically displays trends for the most meaningful client-side events performed by your users, so that any anomalies visually pop out at you. Notice sinusoidal behavior in event A? A dramatic decrease in event B? Just tag the event in question, and you can start running segmentation on it without any delay. And of course, all of its past history will automatically be waiting for you.

 

Custom API

We’re offering a client-side API for logging custom events and user properties. Once these properties are logged, you can query them in exactly the same fashion as any other property or event.

    heap.identify({name: 'Frieza', gender: 'no idea', age: '50'})
    heap.track('Purchase', {cost: 50, item: 'Decommissioned nuclear stockpile'})

The interface should be reminiscent of other event-based tracking tools, but with the added benefit of Heap’s more flexible querying engine. Want to list all users who’ve paid you >$100? Or defeated 3 monsters but haven’t completed level 4? It’s all just a quick query away.

If we’re doing our jobs properly, you’ll need to rely on the custom API less and less over time. See more at: https://heapanalytics.com/docs

Data Pivoting

We’ve built a nifty new view that lets you quickly slice n’ dice the results of a given query. Switch from stacked area graphs to bar charts to tables. Expand the time range or home in on a particular date. Split a cumulative view into individual buckets. Make results even more granular. Visualize your data however you choose.
 

Heap Heap Hooray

Time Series

You can now graph the frequency of any event or cohort over time. Head over to the Define view, click on Graph, and choose amongst any of your definitions. With the appropriate cohort defined, you can track active user growth:


Or understand which sites are sending the most registrations:

Easy Cohorts

Understanding trends and peculiarities across user segments is really, really critical. But existing products suck at analyzing user cohorts. So Heap makes it as simple as possible with the HAS DONE option. Engaged users are users where HAS DONE upload profile pic or HAS DONE send chat. Registered users are users where HAS DONE register. Needy users are users where HAS DONE visit help page. (Lamentably, queries don’t quite read like perfect English yet.)


Regexes and Autocomplete

You can now leverage regexes to more precisely filter data and define objects. Use it to segment users referred from a marketing campaign (initial_referrer contains ‘utm_id=1234′) or events occurring on a certain page (path contains ‘/settings/’), among other things.
 

Additionally, query fields now autocomplete, so you can craft queries more quickly and effectively. Fields are sanitized on your behalf.