How do you compare regression lines statistically? Imagine you are studying the relationship between height and weight and want to determine whether this relationship differs between basketball players and non-basketball players. You can graph the two regression lines to see if they look different. However, you should perform hypothesis tests to determine whether the visible differences are statistically significant. In this blog post, I show you how to determine whether the differences between coefficients and constants in different regression models are statistically significant. [Read more…] about Comparing Regression Lines with Hypothesis Tests
You’ve settled on a regression model that contains independent variables that are statistically significant. By interpreting the statistical results, you can understand how changes in the independent variables are related to shifts in the dependent variable. At this point, it’s natural to wonder, “Which independent variable is the most important?” [Read more…] about Identifying the Most Important Independent Variables in Regression Models
Intervals are estimation methods in statistics that use sample data to produce ranges of values that are likely to contain the population value of interest. In contrast, point estimates are single value estimates of a population value. Of the different types of statistical intervals, confidence intervals are the most well-known. However, certain kinds of analyses and situations call for other types of ranges that provide different information. [Read more…] about Confidence Intervals vs Prediction Intervals vs Tolerance Intervals
My last birthday wasn’t one of those difficult ages that end with a zero. Thank goodness! However, the passage of another year got me thinking. At that point, I told myself that age is just a number. Can you do a mental double-take? I think I did one. Can a statistician say that age is just a number? After all, it’s through numbers that statisticians understand the world and how it works. [Read more…] about As a Statistician, Can I Say Age is Just a Number?
Data mining and regression seem to go together naturally. I’ve described regression as a seductive analysis because it is so tempting and so easy to add more variables in the pursuit of a larger R-squared. In this post, I’ll begin by illustrating the problems that data mining creates. To do this, I’ll show how data mining with regression analysis can take randomly generated data and produce a misleading model that appears to have significant variables and a good R-squared. Then, I’ll explain how data mining creates these deceptive results and how to avoid them. [Read more…] about Using Data Mining to Select Regression Models Can Create Serious Problems
When your regression model has a high R-squared, you assume it’s a good thing. You want a high R-squared, right? However, as I’ll show in this post, a high R-squared can occasionally indicate that there is a problem with your model. I’ll explain five reasons why your R-squared can be too high and how to determine whether one of them affects your regression model. [Read more…] about Five Reasons Why Your R-squared can be Too High
Despite the popular notion to the contrary, understanding the results of your statistical hypothesis test is not as simple as determining only whether your P value is less than your significance level. In this post, I present additional considerations that help you assess and minimize the possibility of being fooled by false positives and other misleading results. [Read more…] about Five P Value Tips to Avoid Being Fooled by False Positives and other Misleading Hypothesis Test Results
Overfitting a model is a condition where a statistical model begins to describe the random error in the data rather than the relationships between variables. This problem occurs when the model is too complex. In regression analysis, overfitting can produce misleading R-squared values, regression coefficients, and p-values. In this post, I explain how overfitting models is a problem and how you can identify and avoid it. [Read more…] about Overfitting Regression Models: Problems, Detection, and Avoidance
As my family and I were being rattled around in a four-wheel drive vehicle in the remote Osa Peninsula in Costa Rica, it struck me that traveling to exotic locations is just like manually adjusting the scales on graphs! That’s probably not what you were expecting, but let me explain! Unlike most of my statistical blog posts, this one gets a bit philosophical! [Read more…] about World Travel, Rough Roads, and Manually Adjusting Graph Scales!
Does your regression model have a low R-squared? That seems like a problem—but it might not be. Learn what a low R-squared does and does not mean for your model. [Read more…] about How to Interpret Regression Models that have Significant Variables but a Low R-squared
How high does R-squared need to be in regression analysis? That seems to be an eternal question. [Read more…] about How High Does R-squared Need to Be?
In regression analysis, curve fitting is the process of specifying the model that provides the best fit to the specific curves in your dataset. Curved relationships between variables are not as straightforward to fit and interpret as linear relationships. [Read more…] about Curve Fitting using Linear and Nonlinear Regression
P values determine whether your hypothesis test results are statistically significant. Statisticians use them all over the place. You’ll find P values in t-tests, distribution tests, ANOVA, and regression analysis. P values have become so important that they’ve taken on a life of their own. They can determine which studies are published, which projects receive funding, and which university faculty members become tenured!
Ironically, despite being so influential, P values are misinterpreted very frequently. What is the correct interpretation of P values? What do P values really mean? That’s the topic of this post! [Read more…] about P values and Statistical Significance
R-squared is a goodness-of-fit measure for linear regression models. This statistic indicates the percentage of the variance in the dependent variable that the independent variables explain collectively. R-squared measures the strength of the relationship between your model and the dependent variable on a convenient 0 – 100% scale. [Read more…] about How To Interpret R-squared in Regression Analysis
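As a rough illustration of that definition (a sketch, not code from the linked post), R-squared for a straight-line fit can be computed as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the small height and weight lists are made up for this example.

```python
def r_squared(x, y):
    """R-squared for a simple least-squares line fit of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Least-squares slope and intercept for one independent variable.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # Variation left unexplained by the line vs. total variation in y.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


# Hypothetical data, invented for illustration only.
heights = [1.0, 2.0, 3.0, 4.0, 5.0]
weights = [2.1, 3.9, 6.2, 7.8, 10.1]
print(f"R-squared: {r_squared(heights, weights):.1%}")
```

Data that fall exactly on a line give 100%, while data with no linear trend give a value near 0%.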
Hypothesis testing is a vital process in inferential statistics where the goal is to use sample data to draw conclusions about an entire population. In the testing process, you use significance levels and p-values to determine whether the test results are statistically significant.
You hear about results being statistically significant all of the time. But what do significance levels, P values, and statistical significance actually represent? Why do we even need to use hypothesis tests in statistics? [Read more…] about How Hypothesis Tests Work: Significance Levels (Alpha) and P values
P-values and coefficients in regression analysis work together to tell you which relationships in your model are statistically significant and the nature of those relationships. The coefficients describe the mathematical relationship between each independent variable and the dependent variable. The p-values for the coefficients indicate whether these relationships are statistically significant. [Read more…] about How to Interpret P-values and Coefficients in Regression Analysis
A confidence interval is calculated from a sample and provides a range of values that likely contains the unknown value of a population parameter. In this post, I demonstrate how confidence intervals and confidence levels work using graphs and concepts instead of formulas. In the process, you’ll see how confidence intervals are very similar to P values and significance levels. [Read more…] about How Hypothesis Tests Work: Confidence Intervals and Confidence Levels
Nonlinear regression is an extremely flexible analysis that can fit almost any curve that is present in your data. R-squared seems like a very intuitive way to assess the goodness-of-fit for a regression model. Unfortunately, the two just don’t go together. R-squared is invalid for nonlinear regression. [Read more…] about R-squared Is Not Valid for Nonlinear Regression
R-squared tends to reward you for including too many independent variables in a regression model, and it doesn’t provide any incentive to stop adding more. Adjusted R-squared and predicted R-squared use different approaches to help you fight that impulse to add too many. The protection that adjusted R-squared and predicted R-squared provide is critical because too many terms in a model can produce results that you can’t trust. These statistics help you include the correct number of independent variables in your regression model. [Read more…] about How to Interpret Adjusted R-Squared and Predicted R-Squared in Regression Analysis
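To make the penalty concrete, here is a minimal sketch of the standard adjusted R-squared formula, which shrinks R-squared according to the number of independent variables relative to the sample size. The function name and the example numbers are assumptions chosen for illustration. Predicted R-squared is not shown because it requires refitting the model with each observation left out.

```python
def adjusted_r_squared(r2, n_obs, n_predictors):
    """Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    dof = n_obs - n_predictors - 1
    if dof <= 0:
        raise ValueError("need more observations than predictors plus one")
    return 1 - (1 - r2) * (n_obs - 1) / dof


# Same hypothetical R-squared and sample size, different model sizes.
for k in (3, 10):
    adj = adjusted_r_squared(0.80, n_obs=30, n_predictors=k)
    print(f"{k} predictors: adjusted R-squared = {adj:.3f}")
```

With the same R-squared and sample size, the model with more independent variables gets the lower adjusted value.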
T-tests are statistical hypothesis tests that you use to analyze one or two sample means. Depending on the t-test that you use, you can compare a sample mean to a hypothesized value, the means of two independent samples, or the difference between paired samples. In this post, I show you how t-tests use t-values and t-distributions to calculate probabilities and test hypotheses.
As usual, I’ll provide clear explanations of t-values and t-distributions using concepts and graphs rather than formulas! If you need a primer on the basics, read my hypothesis testing overview. [Read more…] about How t-Tests Work: t-Values, t-Distributions, and Probabilities
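For readers who do want to see the arithmetic, the sketch below computes a one-sample t-value: the difference between the sample mean and the hypothesized value, divided by the standard error of the mean. The function name and the sample are hypothetical. Turning the t-value into a probability needs the t-distribution with the returned degrees of freedom, which this sketch leaves out.

```python
import math
import statistics


def one_sample_t(sample, hypothesized_mean):
    """Return (t-value, degrees of freedom) for a one-sample t-test."""
    n = len(sample)
    mean = statistics.mean(sample)
    sd = statistics.stdev(sample)          # sample standard deviation (n - 1)
    standard_error = sd / math.sqrt(n)
    t_value = (mean - hypothesized_mean) / standard_error
    return t_value, n - 1


# Hypothetical measurements, invented for illustration only.
t_value, df = one_sample_t([2, 4, 6], hypothesized_mean=2)
print(f"t = {t_value:.3f} with {df} degrees of freedom")
```

A t-value of zero means the sample mean equals the hypothesized value exactly. Values farther from zero indicate a larger difference relative to the variability in the sample.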