A guide to regression analysis

As a product manager, you need to analyze data in different ways in order to make the best product roadmap decisions. For qualitative feedback, you might analyze customer support tickets or conduct user interviews. For quantitative feedback, you might conduct surveys or analyze product usage. You could analyze this data with common strategies like pie charts, bar charts, summaries, and growth trends.

But what if you aren’t just interested in measuring something, but rather in understanding the relationship that data has with an outcome you care about? For example, rather than knowing how many people used a feature, you want to know whether people who use a feature are more likely to convert.

This is where regression analysis comes in. Regression analysis is an important analytical strategy to incorporate in your product management toolkit. In this blog post, we’ll cover what it is and provide step-by-step guidance on how you can leverage it as a product manager.

What is regression analysis?

Regression analysis is a method of statistical analysis that helps you understand the relationship between a set of data points and an outcome that you care about. In math speak, the set of data points you collect are independent variables, while the outcome you care about is the dependent variable.

Regression analysis and use cases

There are many use cases for regression analysis. As a PM, you might leverage regression analysis for:

Understanding what drives successful adoption of your product
Identifying factors that result in good (or bad) retention
Segmenting customers to identify what your best customers look like

As defined above, regression analysis focuses on finding the relationship between independent and dependent variables. Independent variables are the data points that you plug into your analysis to try to understand how they impact your dependent variable. Dependent variables are the outcomes you care about.

Let’s use the first use case listed above as an example. Successful adoption of your product is the dependent variable — you can define this as converted to paid or invited X number of people into the product. All the factors that might drive this are independent variables — this could be the number of times a user used a feature, the actual features they used, their job title, their seniority, etc.

Once you conduct a regression analysis, you’ll be able to understand how each independent variable impacts the dependent variable, as well as predict the dependent variable for a set of known independent variables. Using our earlier example, you’ll know which features impact conversion to paid. And given a user, you’ll be able to predict whether or not that person will convert.

There are two types of regression analysis that you should know about: linear regression and logistic regression.

Linear regression is used for understanding how independent variables impact a dependent variable that is continuous. For example, in the use cases listed above, figuring out which features drive the most revenue will allow you to forecast and predict revenue, which is continuous.

Logistic regression is great for classification problems, where you’re trying to figure out if something is true or not. This is what we call a binary classification problem. So in the above use cases, predicting whether or not a user will convert is a binary outcome and a good use case for logistic regression.

As product managers, it’s simpler to start with linear regression over logistic regression, as logistic regression typically involves more powerful statistical packages outside of what’s offered by Excel and Google Sheets by default. We’ll cover this a bit more in the following section.

The best way to think about linear regression in general is to think about the equation for a simple line, y = ax + b. Here, y is your dependent variable and x is your independent variable. A regression analysis tries to figure out the values of a (the slope) and b (the intercept) that best fit your data.

When you have multiple independent variables, you’ll have multiple “a’s” (coefficients or weights), one per variable. The coefficient that sits in front of each independent variable roughly estimates the impact that variable has on the dependent variable. That said, it’s important not to just look at the weight and assume that higher weights mean more important variables.

There might be variables that you didn’t account for that skew those weights, or different weights might counteract each other, making two completely irrelevant variables appear both highly positive and highly negative.

3 examples of regression analysis in product management

Before we go deeper into exactly how you can run a regression analysis, let’s walk through the three use cases for regression analysis in product management:

Understanding what drives successful adoption of your product
Identifying factors that result in good (or bad) retention
Segmenting customers to identify what your best customers look like

Understanding what drives successful adoption of your product

This use case is important as a product manager because if you understand what drives successful adoption of your product, you can aim to make sure all your users achieve those success metrics. You can use these factors to define your ideal customer journey and create onboarding flows around these factors.

Dependent variable — Did a user adopt your product, yes or no

Independent variables — Feature usage, sign-ins, invites, role, title

Identifying factors that result in good (or bad) retention

If you can understand the factors that drive good or bad retention over time, you’ll be able to drive good behaviors and proactively address bad behaviors. This is powerful for designing nurture campaigns and in-product nudges. Retention is a bit trickier to define as an outcome, since it’s time-based.

Dependent variable — How long a user uses the product

Independent variables — Feature usage, sign-ins, invites, role, title, company type, industry

Segmenting customers to identify what your best customers look like

Successful customers aren’t just defined by how they use your product — who they are also matters. One persona might be more likely to be successful than another persona. By segmenting your customers and understanding which factors result in better customers, you can prioritize flows that appeal to your target customers.

Dependent variable — Revenue per customer

Independent variable — Company industry, company size, job title, location

Step-by-step guide for using Excel / Google Sheets to perform a regression analysis

Now that we’ve covered what regression analysis is and provided some examples that are relevant for product managers, let’s walk through exactly how you can use spreadsheet software like Excel or Google Sheets to run your own analysis.

Setting up your regression analysis involves three key steps:

Ensuring your data is “clean”
Performing the analysis
Measuring the results

1. Ensuring your data is “clean”

One of the biggest mistakes you can make is running your analysis on data that isn’t “clean,” because it will lead you to the wrong conclusions. First of all, you want to make sure that you’ve appropriately defined your dependent variable. For example, if your dependent variable is whether or not someone converted, you want to make sure that what you’re measuring is actually a conversion.

You might think that this is obvious, but I have seen people use event data like “subscription started” to track conversions without first pulling the full history of events. In this case, you might count accounts as new conversions even though they actually converted prior to the period you’re looking at.
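Here is a rough sketch of that check in Python. The event log, account IDs, and dates are all hypothetical. The idea is to find each account's first conversion event across its full history and only count it if it falls inside the period you're analyzing.

```python
from datetime import date

# Hypothetical event log: (account_id, event_name, event_date)
events = [
    ("acct_1", "subscription started", date(2023, 11, 5)),
    ("acct_1", "subscription started", date(2024, 2, 10)),  # e.g. a plan change
    ("acct_2", "subscription started", date(2024, 1, 20)),
    ("acct_3", "signed up", date(2024, 1, 3)),
]
window_start = date(2024, 1, 1)  # start of the analysis period (assumed)

# Find the earliest conversion event per account across the full history
first_conversion = {}
for account, name, day in events:
    if name == "subscription started":
        if account not in first_conversion or day < first_conversion[account]:
            first_conversion[account] = day

# Only accounts whose FIRST conversion is inside the window are new conversions
converted_in_window = sorted(a for a, d in first_conversion.items() if d >= window_start)
converted_before = sorted(a for a, d in first_conversion.items() if d < window_start)

print("New conversions:", converted_in_window)
print("Already converted earlier:", converted_before)
```

If you had only pulled events from the analysis period, acct_1 would have looked like a new conversion.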

On the independent variable side, there are a couple of important things to be aware of. First of all, just like with dependent variables, you want to make sure you’re accurately measuring things. You’ll also want to do what we call “feature engineering,” which allows models to work better with the data you give them. Some common examples include:

Removing data points that are too sparse (too few examples have values)
Bucketing data points that have too many unique values (often relevant for free-form survey answers)
Normalizing data points that are too large in magnitude
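The three clean-up steps above can be sketched in plain Python. The rows, column names, and thresholds below are invented for illustration, and the right cutoffs will depend on your own data.

```python
from collections import Counter
from statistics import mean, pstdev

# Hypothetical user records; None means the value is missing
rows = [
    {"job_title": "PM", "page_views": 1200, "nps_comment_len": None},
    {"job_title": "PM", "page_views": 300, "nps_comment_len": None},
    {"job_title": "Engineer", "page_views": 4500, "nps_comment_len": 40},
    {"job_title": "Chief Vibes Officer", "page_views": 800, "nps_comment_len": None},
]

# 1. Remove columns that are too sparse (assumed cutoff: under 50% filled in)
MIN_FILL_RATE = 0.5

def fill_rate(column):
    return sum(row[column] is not None for row in rows) / len(rows)

kept_columns = [c for c in rows[0] if fill_rate(c) >= MIN_FILL_RATE]

# 2. Bucket values that appear too rarely into a single "Other" group
MIN_COUNT = 2
title_counts = Counter(row["job_title"] for row in rows)
bucketed_titles = [
    row["job_title"] if title_counts[row["job_title"]] >= MIN_COUNT else "Other"
    for row in rows
]

# 3. Normalize large numbers to a common scale (mean 0, standard deviation 1)
views = [row["page_views"] for row in rows]
center, spread = mean(views), pstdev(views)
normalized_views = [(v - center) / spread for v in views]

print(kept_columns)
print(bucketed_titles)
print([round(v, 2) for v in normalized_views])
```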

Once you feel like you’re in a good place with your data, the next step is to actually run the analysis!

2. Performing the analysis

Let’s start with a linear regression. Google Sheets offers the LINEST() function, which allows you to perform linear regressions directly in Google Sheets. You’ll want to format your Sheet with several columns: one column for your dependent variable, and then one column for each of your independent variables. Here’s an example of what that might look like for an analysis that includes two independent variables:

Dependent variable	Variable A	Variable B
1	2	8
2	4	4
3	5	2
4	6	1
5	7	1
6	3	1

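LINEST() takes four arguments: the range holding your dependent variable, the range holding your independent variables, whether to calculate the intercept, and whether to return additional statistics. For example, assuming the table above starts in cell A1 with its header row, the formula would be =LINEST(A2:A7, B2:C7, TRUE, TRUE). I would recommend leaving the last argument as TRUE, as you’ll be able to see more statistics.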

The regression coefficient that gets displayed is what you can compare to figure out which variables are impacting the outcome, and by how much. Don’t worry too much about the actual specific values, but focus on comparing different independent variables with each other, as well as determining which ones are positive versus negative.

Excel also offers the LINEST() function that you can leverage to run a linear regression, and it functions just like the Google Sheets one.

In Excel, you can also add the Analysis ToolPak add-in. This adds a Data Analysis button to the Data tab, where you’ll be able to select a bunch of different data analysis methods. You’ll notice that “Regression” is one of the options you can select. You’ll be able to configure a regression analysis in the dialog that pops up, and the output will look a bit nicer than what LINEST() returns.
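If you want to sanity check a spreadsheet result outside of the spreadsheet, the same kind of fit can be sketched in Python with NumPy (assumed to be installed). This uses the example table from earlier in this section. One thing to watch for when comparing: LINEST() lists its coefficients in reverse column order, with the intercept last.

```python
import numpy as np

# The example table from this section
dependent = np.array([1, 2, 3, 4, 5, 6], dtype=float)
variable_a = np.array([2, 4, 5, 6, 7, 3], dtype=float)
variable_b = np.array([8, 4, 2, 1, 1, 1], dtype=float)

# One column per independent variable, plus a column of ones for the intercept
X = np.column_stack([variable_a, variable_b, np.ones(len(dependent))])

# Least-squares fit: finds the weights that minimize the squared prediction error
coefficients, _, _, _ = np.linalg.lstsq(X, dependent, rcond=None)
weight_a, weight_b, intercept = coefficients

print(f"Variable A weight: {weight_a:.4f}")
print(f"Variable B weight: {weight_b:.4f}")
print(f"Intercept:         {intercept:.4f}")
```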

When your dependent variable is binary (remember, this is when the value can only be one of two possible values), logistic regression is a better way to analyze the data. Because this is a more complex function, Excel and Google Sheets do not have a comparable built-in function for logistic regression, so you’ll need to look into additional statistics packages in order to run logistic regressions.

You can ask your data team to build a logistic regression for you. Or, if you’re familiar with Python, it has many data libraries that can do this, such as scikit-learn and statsmodels.
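To give you a feel for what happens under the hood, here is a hand-rolled sketch of a logistic regression using only NumPy (assumed to be installed). The data is made up, and the learning rate and iteration count are arbitrary choices for this toy example. In practice, a data team would more likely reach for an established library than write this loop themselves.

```python
import numpy as np

# Hypothetical data: times a user used a feature, and whether they converted (1) or not (0)
uses = np.array([0, 1, 1, 2, 3, 3, 4, 5, 6, 7], dtype=float)
converted = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)

# One column for the feature, one column of ones for the intercept
X = np.column_stack([uses, np.ones_like(uses)])
weights = np.zeros(X.shape[1])

def sigmoid(z):
    """Squash any number into a probability between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-z))

# Gradient descent: repeatedly nudge the weights to reduce prediction error
learning_rate = 0.1
for _ in range(5000):
    predicted = sigmoid(X @ weights)
    gradient = X.T @ (predicted - converted) / len(converted)
    weights -= learning_rate * gradient

probabilities = sigmoid(X @ weights)
predictions = (probabilities >= 0.5).astype(int)  # convert if probability is at least 50%

print("Weight for feature uses:", round(float(weights[0]), 3))
print("Predicted conversions:  ", predictions.tolist())
```

A positive weight here means that more feature usage goes along with a higher predicted chance of converting.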

Note that linear regression is a simpler and more constrained analysis. It assumes a linear relationship between variables, so it doesn’t work well when your dependent variable is categorical. Categorical independent variables also need to be converted into numbers first (for example, one 0/1 column per category) before you can include them.

3. Measuring the results

There are many different stats out there that measure the results of your regressions, and linear and logistic regressions use different ones. For example, with linear regression, R-squared (the share of the variation in your dependent variable that the model explains) is a useful measure, whereas the F1 score (which balances false positives against false negatives) is good for logistic regression.
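If you're curious what these two stats boil down to, both can be computed in a few lines of plain Python. The actual and predicted values below are made up for illustration.

```python
def r_squared(actual, predicted):
    """Share of the variation in the actual values that the predictions explain."""
    mean_actual = sum(actual) / len(actual)
    unexplained = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - unexplained / total

def f1_score(actual, predicted):
    """Balance of precision and recall for yes/no (1/0) predictions."""
    true_pos = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    false_pos = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    false_neg = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    denominator = 2 * true_pos + false_pos + false_neg
    return 2 * true_pos / denominator if denominator else 0.0

# Hypothetical linear regression output: revenue per customer
r2 = r_squared(actual=[10, 20, 30, 40], predicted=[12, 18, 33, 37])

# Hypothetical logistic regression output: converted (1) or not (0)
f1 = f1_score(actual=[1, 0, 1, 1, 0, 1], predicted=[1, 0, 0, 1, 1, 1])

print(f"R-squared: {r2:.3f}")
print(f"F1 score:  {f1:.2f}")
```

For both stats, values closer to 1 indicate that predictions line up more closely with what actually happened.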

However, all of these different acronyms sound incredibly complicated, and really aren’t practical for product managers to focus on. I like to tell people who are dabbling in regression analysis to focus on whether or not the results can actually drive outcomes. Even if your model has the perfect statistics, it doesn’t mean that it will necessarily perform in a way that addresses your real needs.

Instead, I think the best way to see if your results are impactful is to use those results to predict expected outcomes going forward, and to see how well those predictions match what actually happens. If you’re able to measurably improve business metrics, it doesn’t matter what your F1 score is.

Conclusion

Understanding what regression analysis is and how you can incorporate it into your day-to-day workflow is critical if you want to become a more data-driven product manager. Regression analysis is a powerful tool to include in your toolkit of data analysis!

But before you rely on regression analysis too much, remember that correlation does not imply causation. And don’t forget that the quality of your data matters just as much as how good your actual analysis is.

As your regression analysis gets more advanced, leverage your data teams as much as possible. They’ll understand what it takes to get a good regression model going.

Featured image source: IconScout
