Growth monitoring done right - learn how to save your hard-earned metrics
There are a lot of articles about growing your user base, but how about protecting your hard-earned metrics? Many events can affect your growth numbers, such as accidental product changes, external service outages, Google search algorithm changes, and more. Growth monitoring is important because it is just as easy to lose metrics as to gain them. In the early stages of a company, most of these events aren’t even noticed, but as you grow, you’ll want to understand all significant metrics changes in order to mitigate negative ones and take advantage of positive ones. Growth monitoring is tough because it’s not just a technical problem. Not only do you have to have the right tools, but your team also has to have the right education and processes for handling metrics anomalies. This blog post will talk about a holistic approach to building an awesome growth monitoring system.
Growth Monitoring Framework Stages
In most tech companies, engineers keep a tight watch on the overall API success rate and other system metrics to maximize uptime. If the uptime drops even a little, engineers respond quickly to resolve the issue. At Pinterest, we view growth metrics the same way. However, growth metrics are in some ways harder to maintain than overall success metrics, for several reasons:
- Metrics fluctuate regularly throughout the day, week, and year so it can be hard to tell what’s wrong
- There are a lot of different growth metrics to maintain
- There are usually more external services that you rely on for growth that can affect your growth metrics
- There are more internal changes that can affect growth metrics such as simple copy changes
When incidents happen, we have lots of tools to detect and resolve them as quickly as possible. However, building a robust framework wasn’t easy and took a long time.
This diagram outlines the three stages that a growth team’s monitoring system is usually in. Getting from stage 1 to stage 2 is all about building the right tools, which we’ll talk about first. To get from stage 2 to stage 3, you need to learn the attributes of different types of metrics-impacting events, set up a playbook and process for handling them, and educate the team to execute on the plan.
Stage 1 to 2: Building the right tools
There are two main types of tools you want to build: real time graphs and alerts. First we’ll talk about real time graphs. Here are some attributes of great real time graphs:
- Easy to tell if something is abnormal - You will probably have many graphs, so you’ll want to be able to scan through them quickly while still being able to notice anomalies
- Easy to pinpoint time of incident - Usually the time of the incident is a good hint for what the root cause was
- Variety of depths of segmentation - You want graphs that give you the big picture, as well as graphs that help you pinpoint the exact segment that has a problem
- Good uptime - If logging breaks and is not fixed, in general people stop paying attention to those graphs
Here’s an example of one of the real time graphs at Pinterest growth:
After testing a few graph formats, we settled on this format because it showed most easily, at a quick glance, whether something was wrong. Now, most of our monitoring graphs are in this format. It’s easy to tell when something is wrong, as in the graph below:
As you can tell, something happened on day 11 that caused metrics to drop significantly. We can pinpoint the time to narrow down the cause. Having different segmentations also helps us pinpoint root causes. For example, we have different signup graphs for each page type and signup type, so we know exactly which pages and signup flows are affected.
Once you have a lot of these graphs due to good segment coverage, you will definitely start needing alerts because it takes too much time to constantly observe the graphs. The two most important attributes of good growth alerts are:
- Low false positives: False positives are dangerous because if they happen too often, people stop paying attention to the alerts. This is especially difficult in growth because of daily/weekly swings and seasonality.
- Good segmentation: When incidents happen, they usually affect only certain segments, for example a certain page breaks or a certain referrer changes its traffic. A 100% loss in traffic from a referrer that makes up 5% of your total traffic will only show up as a “small” 5% drop in an overall traffic graph, which might not trigger an alert.
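To make the segmentation point concrete, here is a minimal sketch in Python. The referrer names, traffic numbers, and the 20% threshold are all made up for illustration and are not our real data or tooling. It shows a referrer that makes up 5% of traffic disappearing completely: the overall check stays quiet, while the per-segment check catches it.

```python
def drop_fraction(current, baseline):
    """Fractional drop of current relative to baseline (0.0 means no drop)."""
    if baseline <= 0:
        return 0.0
    return max(0.0, (baseline - current) / baseline)


# Hypothetical visits per referrer; "partner" is 5% of total baseline traffic.
baseline = {"search": 600, "social": 350, "partner": 50}
current = {"search": 600, "social": 350, "partner": 0}

# Hypothetical rule: alert when a metric drops by more than 20%.
ALERT_THRESHOLD = 0.2

# Overall view: total traffic now vs. total baseline traffic.
overall_drop = drop_fraction(sum(current.values()), sum(baseline.values()))
overall_alert = overall_drop > ALERT_THRESHOLD

# Segmented view: the same check, applied to each referrer separately.
segment_drops = {name: drop_fraction(current[name], baseline[name]) for name in baseline}
segment_alerts = [name for name, drop in segment_drops.items() if drop > ALERT_THRESHOLD]

print(f"overall drop: {overall_drop:.0%}, alert: {overall_alert}")
print(f"segments alerting: {segment_alerts}")
```

The overall drop here is only 5%, well under the threshold, so only the segmented check flags the broken referrer.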
On Pinterest growth, our alerts closely follow our real time graphs. Our thresholds are always based on a ratio against the minimum of the previous two weeks. For example:
home_page_signups(now) < 0.8 * min(home_page_signups(-1week), home_page_signups(-2week))
Your first question might be, why is the ratio so low? Why not detect a 5% drop? Usually, the thresholds have to be fairly loose in order to reduce false positives. Keep in mind that you will have many alerts and that there are many internal and external events that affect each metric, and you want to prioritize the bigger deviations. Your next question might be, why look at the min of the two weeks? This is in order to prevent false positives after seasonal changes such as holidays. You might receive a large influx one week, and probably don’t want all your alerts to go off the next week. However, good alerts take a lot of iterations of tweaking, so just make sure to quickly tweak alerts that keep sending false positives. Once you have a good set of graphs and alerts, you’re in stage 2!
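The threshold rule above can be sketched in a few lines of Python. This is only an illustration of the idea, not our production alerting code: the function name, its parameters, and the sample numbers are assumptions made for the example.

```python
def should_alert(now, one_week_ago, two_weeks_ago, ratio=0.8):
    """Return True when the current value falls below ratio * the lower
    of the values from one week ago and two weeks ago."""
    baseline = min(one_week_ago, two_weeks_ago)
    return now < ratio * baseline


# Ordinary week: a 10% dip stays under the radar, a 30% dip alerts.
small_dip = should_alert(now=900, one_week_ago=1000, two_weeks_ago=1000)
big_dip = should_alert(now=700, one_week_ago=1000, two_weeks_ago=1000)

# Week after a holiday spike: last week was unusually high (1500), but
# taking the min keeps the baseline at the normal level (1000), so a
# return to normal does not set off the alert.
after_holiday = should_alert(now=1000, one_week_ago=1500, two_weeks_ago=1000)

# For comparison, using only last week as the baseline would have fired.
naive_after_holiday = 1000 < 0.8 * 1500

print(small_dip, big_dip, after_holiday, naive_after_holiday)
```

The comparison at the end shows why the min matters: a baseline built from the spike week alone would turn a perfectly normal week into a false positive.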
Stage 2 to 3: Building a knowledge base and process for reacting to anomalies
Now that you have a good set of tools, it’s time to learn how to use them effectively. You’ll probably find that in the beginning, some metrics change, but you don’t know why. It’s really important to spend the time to find the root cause, because anything you learn can be applied in the future, and eventually “debugging” metrics drops will happen much faster as you detect patterns. For example, we learned that when non-attributed app installs drop, it’s commonly an App Store / Google Play store outage, so we look there first. Here is another example of the thought process that happens when something like home page signups drops.
Eventually, you’ll learn the right steps to take to debug each kind of metrics change, so the next challenge is to scale that knowledge through the whole team. On my team at Pinterest, we have a weekly metrics rotation, similar to an engineering on-call, where one person is responsible for responding to the growth alerts. If they’re new, we may pair them with someone so that they can learn how the process is done. This way, everybody gets valuable hands-on practice debugging metrics drops. As with engineering teams that own core services, it’s important to spread the knowledge and responsibility to prevent a single point of failure. When an alert goes off, the person on rotation usually reports in our metrics channel that they are investigating it. That way, if someone knows about a recent change that could have caused it, they can chime in. Digging further into the metrics, checking the experience, and running git bisects are common next steps. Once the issue is diagnosed, the fix is usually a rollback or hotfix. Due to the changing nature of the internet ecosystem, we’ve found that documentation gets outdated quickly, so it’s important to train individual debugging skills through continuous learning and practice.
Companies take their service uptime seriously, and growth metrics need to be taken just as seriously. If a metrics drop goes undetected, it can accumulate losses bigger than some wins. It takes a long time to build a good monitoring system because of all the components that need to be built and the debugging knowledge that needs to grow, but it is well worth the investment. Take steps to improve your monitoring system so that you are maximizing your growth!
Want some advice with building your growth monitoring systems? Email me at email@example.com