Statistics By Jim

Making statistics intuitive

Proxy Variables: The Good Twin of Confounding Variables

By Jim Frost 10 Comments

Proxy variables are easily measurable variables that analysts include in a model in place of a variable that cannot be measured or is difficult to measure. A proxy variable might not be of any great interest itself, but it correlates closely with the variable of interest.

Photograph of a spider web to represent a web of correlations for proxy variables and confounders.
A web of correlations can help or hurt you. Be the spider, not the fly.

Imagine you have an important variable to include in your model but you can’t measure it. If you leave it out, it becomes a confounding variable that can flip your statistical results on their head through omitted variable bias. Random assignment in experiments can protect you from confounders in some cases.

However, what do you do when you can’t randomize and you can’t measure the important variable to include it in your model? Are you stuck with omitted variable bias?

Fortunately, proxy variables are a potential solution.

Confounding variables and proxy variables are related concepts because both involve variables that are correlated with other variables in your analysis. But there’s a huge difference between them:

  • Confounding variables affect your results in undesirable ways by not being included in the model. They are primarily a danger when you aren’t aware of them during the analysis.
  • Proxy variables benefit your analysis. You know about and intentionally include them in the model to improve your results.

Wise data analysts can find ways to avoid getting burned by confounding variables and instead use proxy variables to their advantage. Here’s a case where knowledge truly is power: specifically, knowledge of your subject matter and the correlation structure amongst your variables allows you to use these correlations to your advantage.

Prediction

Imagine you are mostly interested in predicting something and that you don’t care so much about identifying true cause-and-effect relationships. Fortunately, prediction doesn’t always require a causal relationship between predictor and response. Instead, a proxy variable that is simply correlated with the response, and is easier to obtain than a causally connected variable, might well do the job.

For example, an analyst I know uses regression analysis for fantasy football. Recently, he used a model that included one predictor variable, each player’s fantasy football points from the prior season, to predict that player’s points for the subsequent season. Clearly, the points from one season are not causing the points for the next season. Rather, the prior-season points are a proxy variable for a host of other variables, such as each player’s skills and capabilities, those of their team, and the teams they play against. It’s impossible to measure all of these, so a proxy variable is essential. His model for choosing quarterbacks has an R-squared of 73.68%. In this case, there is enough of a correlation from one year to the next that he can use the model for prediction, even though we don’t know or measure the exact causal variables.
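To make the idea concrete, here is a minimal sketch in Python using simulated data. It is not the analyst's model or data: the latent `skill` variable, the noise levels, and the number of players are all assumptions chosen for illustration. Each season's points are generated from the same unobserved skill, so prior-season points act as a proxy for it.

```python
import numpy as np

rng = np.random.default_rng(0)
n_players = 200  # hypothetical number of players

# Hypothetical latent "skill" that drives scoring but is never measured.
skill = rng.normal(loc=250.0, scale=60.0, size=n_players)

# Each season's points are skill plus independent season-specific noise.
prior_points = skill + rng.normal(scale=30.0, size=n_players)
next_points = skill + rng.normal(scale=30.0, size=n_players)

# Simple linear regression: next_points = intercept + slope * prior_points.
slope, intercept = np.polyfit(prior_points, next_points, deg=1)
predicted = intercept + slope * prior_points

# R-squared: share of the variation in next_points explained by the model.
ss_res = np.sum((next_points - predicted) ** 2)
ss_tot = np.sum((next_points - next_points.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"slope={slope:.2f}, intercept={intercept:.1f}, R-squared={r_squared:.2%}")
```

In this simulation, prior-season points never cause next-season points, yet the regression still predicts reasonably well because both seasons share the same underlying skill. The slope comes out below 1 because prior-season points are a noisy measure of that skill.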

Related post: Using Regression to make Predictions

Produce unbiased results

Now, imagine you are working on a research project where some of the variables are difficult, if not impossible, to measure. Remember, if you don’t include the intended variable in any form, omitted variable bias can produce inaccurate results. Including an imperfect proxy of a hard-to-measure variable is often better than not including an important variable at all. So, if you can’t include the intended variable, look for a proxy!
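A small simulation can illustrate this point. The sketch below assumes a made-up data-generating process in which the true effect of `x` on `y` is 2.0, an unmeasured variable `z` also affects `y` and is correlated with `x`, and `proxy` is a noisy but measurable stand-in for `z`. All names and coefficients are hypothetical.

```python
import numpy as np

def ols_coefficients(y, *predictors):
    """Return OLS coefficients [intercept, b1, b2, ...] via least squares."""
    X = np.column_stack([np.ones_like(y), *predictors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

rng = np.random.default_rng(42)
n = 5000

# Assumed data-generating process (illustrative only).
z = rng.normal(size=n)                       # important variable we cannot measure
x = 0.7 * z + rng.normal(size=n)             # predictor of interest, correlated with z
proxy = z + rng.normal(scale=0.5, size=n)    # imperfect, measurable stand-in for z
y = 2.0 * x + 3.0 * z + rng.normal(size=n)   # true coefficient on x is 2.0

# Coefficient on x under three modeling choices.
b_omitted = ols_coefficients(y, x)[1]        # z left out entirely
b_proxy = ols_coefficients(y, x, proxy)[1]   # proxy included in place of z
b_full = ols_coefficients(y, x, z)[1]        # ideal case: z itself included

print(f"omit z:      {b_omitted:.2f}")
print(f"use proxy:   {b_proxy:.2f}")
print(f"include z:   {b_full:.2f}")
```

Under these assumptions, the estimate that omits `z` lands well above 2.0, while the model with the proxy comes much closer. Because the proxy is imperfect, some bias remains, which is why a closer proxy is better than a looser one.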

Related post: Confounding Variables and Omitted Variable Bias

Examples of proxy variables

Intended variable | Proxy variable
Historical environmental conditions | Widths of tree rings
Quality of life | Per-capita GDP
True body fat percentage | Body Mass Index (BMI)
Cognitive ability | Years of education and/or GPA
Depth that light penetrates into the ocean over large areas | Satellite images of ocean surface color
Hormone levels in blood | Changes in height over a fixed time

Do you have examples of proxy variables that have helped you out in your analyses?

Filed Under: Regression Tagged With: conceptual

Reader Interactions

Comments

  1. Mariama Kamara (@konemariama1) says

    August 27, 2021 at 5:57 am

    what proxy variables can be used for food security (availability, access and utilization)?

  2. Ron says

    May 25, 2021 at 10:52 pm

    Great article – I download your intro to stat’s books to see if you cite the opening line but didn’t find it. Do you have a source or text book that can be referenced for using highly corelated proxy variables in a forecast model?
    “Proxy variables are easily measurable variables that analysts include in a model in place of a variable that cannot be measured or is difficult to measure. Proxy variables can be something that is not of any great interest itself, but has a close correlation with the variable of interest.”

    • Jim Frost says

      May 26, 2021 at 3:50 am

      Hi Ron, I write about using proxy variables in my regression book. There is more detail in it than I include in this post.

  3. SW says

    May 12, 2021 at 9:53 am

    Hi! Thanks for the great explanation. I was wondering if it possible to have two proxy confounders “between” the confounder and the outcome (or exposure)?

    E.g. exposure proxy confounder -> proxy confounder -> outcome

    Many thanks!
    SW

  4. Amanullah says

    March 20, 2021 at 3:00 am

    Thank you so much for this wonder full explanation

    • Jim Frost says

      March 23, 2021 at 3:42 pm

      You’re very welcome, Amanullah!

  5. Jeremy says

    March 15, 2021 at 12:23 pm

    Another type of variable, the opposite of this in a way, is an “index” variable. It’s a combination of several measurements into one variable. For example, the Bureau of the Census has a lot of data it collects on census tracts: average people per household, percent of people earning over or under a certain amount, percent below the poverty level, etc… these can be combined with a formula into a “socioeconomic index” score for each census tract. You can make the formula any way you want — weight some variables more than others, for example; or include mostly economic characteristics. This is used in demographics and epidemiology, often as a proxy for socio-economic status. “Socio-economic Status” can be anything, there’s no strict definition of it, so it can’t be measured directly in the first place.

    • Jim Frost says

      March 16, 2021 at 12:26 am

      Thanks for sharing! That sounds like a great way to incorporate a variety of information in your model.

  6. Thokozani Chimkono says

    March 15, 2021 at 1:31 am

    Very instrumental books indeed !

    • Jim Frost says

      March 16, 2021 at 12:27 am

      Thank you!


    Copyright © 2023 · Jim Frost · Privacy Policy