I’m thrilled to announce the release of my first book, Regression Analysis: An Intuitive Guide for Using and Interpreting Linear Models!
If you like the clear writing style I use on this website, you’ll love this book! The end of the post displays the entire table of contents!
Over the course of this full-length book (338 pages), you’ll progress from a beginner to a skilled practitioner. I’ll help you intuitively understand regression analysis by focusing on concepts and graphs rather than equations and formulas. I use everyday language so you can grasp regression at a deeper level.
You will learn practical tips for performing your analysis and interpreting the results. Feel confident that you’re analyzing your data properly and able to trust your results. Know that you can detect and correct problems that arise.
Regardless of your background, I will show you how to perform regression analysis. Whether you’re a student, a career changer, or a current analyst looking to take your skills to the next level, this book has everything you need to know for regression analysis.
Buy it on Amazon (US site)!! Or go to my Web Store for other locations.
My Book Covers a Lot About Regression Analysis!
In this book, you’ll learn many facets of regression analysis including the following:
- How regression works and when to use it.
- Selecting the correct type of regression analysis.
- Specifying the best model.
- Understanding main effects, interaction effects, and modeling curvature.
- Using continuous and categorical (nominal) variables.
- Interpreting the results.
- Assessing the fit of the model.
- Generating predictions and evaluating their precision.
- Checking the assumptions.
- Examples of different types of regression analyses.
- Downloadable datasets so you can try it yourself.
- Answers to common questions and concerns I’ve encountered over the years!
I’ve literally received thousands of requests from aspiring data scientists for guidance in performing regression analysis. This book is my answer – years of knowledge and thousands of hours of hard work distilled into a thorough, practical guide for performing regression analysis.
You’ll notice that there are not many equations in this book. After all, you should let your statistical software handle the calculations so you can focus on understanding your results instead of getting bogged down in arithmetic. I focus on the concepts and practices that you’ll need to know to perform the analysis and interpret the results correctly. I’ll use more graphs than equations!
Don’t get me wrong. Equations are important. Equations are the framework that makes the magic, but the truly fascinating aspects are what it all means. I want you to learn the true essence of regression analysis. I want you to understand the essential concepts, practices, and knowledge for regression analysis so you can analyze your data confidently. That’s the goal of my book.
I hope you’ll enjoy it! The table of contents is below! Buy it on Amazon (US site)!! Or go to my Web Store for other locations.
Bought an ebook about 24 hours ago. This book is just brilliant! Reads like I’m having a conversation with my research supervisor. Thank you Jim.
Thanks so much, Pickett! So glad to hear that it was helpful!
Hi Jim, do you cover mixed effects models concepts in your book? Trying hard to understand definitions of random versus fixed effects and how to properly build this type of model in R.
Hi Jim, may I know the city in which you published this book? This information is needed as part of the citation. Thank you!
Hi Camie, in the references section of the book I include a recommended citation for it.
Dear Jim, thank you so much. Your response has been useful. I have also read some of your posts and they are very useful and interesting. You have the ability to present complex statistical concepts in simple, non-statistical language. Thanks.
Dear Jim, I hope you will also help me in answering one question.
I have four categorical variables: education (primary=0, secondary=1, tertiary =2 and higher=3), place of residence (urban=1, rural=2), wealth (poorest=1, poorer=2, middle=3, richer=4, richest=5), empowerment (low=1, medium=2, high=3). I want to create a new variable by multiplying these four variables: education*residence*wealth*empowerment, giving a variable with 120 categories. How do I label these categories automatically in Stata? Or how do I know that for example it represents no education and poorest and low empowerment and urban area? I want to know the name of combinations of each category of the variable so that I would be able to analyze my analysis by the categories.
I would appreciate your help.
Regards,
Hello Dr, thank you.
I have one question.
In my cross tabulation, I have two categorical variables: one binary (yes, no), which is my dependent variable, and another with four categories. When I cross tab them, one cell (the “yes” of one variable and category #4 of the other variable) has only 8 observations. Can I collapse/merge some of the categories of the four-category variable into 3 or four for the cross tab? Could I use this variable (the one with few observations in one cell of the crosstab) in multiple regression without collapsing its categories?
your help is much appreciated.
Hi, eight observations in that crosstab group is pretty low, but it might be satisfactory. Chi-square tests typically require a minimum expected count of about 5 in each cell. While you’re not performing a chi-square test, it’s probably a similar requirement. Unfortunately, I don’t have a reference for that. You might be able to use the categories as they are.
You can collapse the categories. It might help because the larger number of observations for that combination provides a more precise estimate of that combined condition, which gives more statistical power. Ideally, you’d define your procedures, including collapsing the categories, in advance to avoid the appearance of manipulating the results. However, you can give it a try to see the effect. In your writeup, you’d need to explain what you did and why.
Kind of.
Regression analysis is used to predict values, but what’s left to predict if “it’s not safe to extrapolate outside that range”?
Hi Bengt,
The problem is that the relationship can change outside the range of your data, and you’d be completely unaware of that fact–because you don’t have data for that region. So, yes, you can generate predictions outside of your sample space easily. All you need to do is enter the input values into the equation. But you really don’t know how trustworthy the results are. One common guideline is that it’s ok to go 10% outside of your sample space.
Typically, you only want to make predictions within the ranges of your data because that’s the portion to which your model applies. You understand the relationships in that region because you have the data to model them with those values. Again, outside that range, the relationships can change!
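One way to act on this advice is to have your prediction code flag any input that falls outside the observed range. The following is only a minimal sketch: the heights and weights are invented for illustration, and `predict` is a hypothetical helper, not something from the book or from any statistical package.

```python
import numpy as np

# Hypothetical sample: heights (m) and weights (kg), invented for illustration.
heights = np.array([1.30, 1.38, 1.45, 1.52, 1.60, 1.70])
weights = np.array([30.1, 38.0, 45.9, 54.2, 61.8, 72.5])

# Least squares line: weight = intercept + slope * height.
# For deg=1, np.polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(heights, weights, deg=1)

def predict(height_m):
    """Return (prediction, inside_range) for a single height."""
    inside = heights.min() <= height_m <= heights.max()
    return intercept + slope * height_m, bool(inside)

for h in (1.50, 2.10):
    value, inside = predict(h)
    note = "within the data" if inside else "EXTRAPOLATION: treat with caution"
    print(f"height={h:.2f} m -> {value:.1f} kg ({note})")
```

The equation happily returns a number for 2.10 m, which is exactly the problem described above: nothing in the arithmetic warns you that the model has no data there, so the check has to be added by hand.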
Hi Jim!
On page 54 in your book you write: “Statistical software can’t take a categorical variable and directly analyze it. Instead, it converts categorical variables into indicator variables using a (0, 1) coding scheme.” Which software are you referring to? Excel and JASP do not seem to be able to do it for you.
Hi Bengt,
From my experience and reports by others, I know that Minitab, SPSS, Stata have this feature and I’m pretty sure JMP does.
I would not consider Excel to be statistical software even though it has statistical functions. I’ve written about How to Perform Regression Analysis using Excel.
I’m not familiar with JASP. I’ll take a look at it!
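If your software does not create indicator variables for you, the (0, 1) coding scheme is simple enough to build by hand. Here is a rough sketch in plain Python. The function name `indicator_columns` and the sample values are hypothetical; the only detail borrowed from this thread is that one level is left out as the reference level.

```python
def indicator_columns(values, reference):
    """Code a categorical variable as 0/1 columns, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    return {level: [1 if v == level else 0 for v in values] for level in levels}

# Hypothetical data for illustration.
major = ["Statistics", "Psychology", "Political Science", "Psychology", "Statistics"]
coded = indicator_columns(major, reference="Statistics")

for level, column in coded.items():
    print(level, column)
```

A variable with k levels produces k - 1 indicator columns. Rows belonging to the reference level are 0 in every column, which is why each coefficient is interpreted relative to that level.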
Hi Jim.
On page 43 in your book on regression analysis you write:
“The height coefficient in the regression equation is 106.5. This coefficient represents the mean increase of weight in kilograms for every additional one meter in height. This study sampled preteen girls in the United States. Consequently, if a preteen girl’s height increases by 1 meter, the average weight increases by 106.5 kilograms.
The regression line on the graph visually displays the same information. If you move to the right along the x-axis by one meter, the line increases by 106.5 kilograms.”
106 kg per meter. That’s a lot for a preteen girl. I don’t understand how the diagram says that.
Hi Bengt,
Be sure to read the next two paragraphs where I explain that in more detail. Specifically, I explain how the data range from 1.3 to 1.7m and, hence, it’s not safe to extrapolate outside that range. We can’t shift by a full meter. The technical explanation is as I describe, 106.5kg per meter. However, data limitations constrain that to less than a meter. That’s also an average. As you can see in the graph, some individuals are above and below the regression line.
I hope that helps clarify it!
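It can help to rescale the coefficient to height changes that actually occur in the data. The sketch below uses only the two figures quoted in this exchange (the 106.5 kg per meter slope and the 1.3 to 1.7 m range). The function name is hypothetical, and no intercept is used because none is given here.

```python
SLOPE_KG_PER_M = 106.5             # height coefficient quoted from the book
DATA_MIN_M, DATA_MAX_M = 1.3, 1.7  # observed height range mentioned in the reply

def mean_weight_change(delta_height_m):
    """Mean change in weight (kg) for a given change in height (m)."""
    return SLOPE_KG_PER_M * delta_height_m

print(f"{mean_weight_change(0.10):.2f} kg per 10 cm")
print(f"{mean_weight_change(DATA_MAX_M - DATA_MIN_M):.1f} kg across the observed range")
```

Read this way, the same slope says that girls who are 10 cm taller weigh about 10.65 kg more on average, and the largest difference the data can support is the 0.4 m span of the sample.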
Hi Jim,
I want to better understand the marginal effects of various regression models. I also have difficulty interpreting the chi-square statistic. I am not from a maths or stats background. Can you suggest resources to learn and understand these skills? Thanks
Hi Jim,
I am currently making my way through your new book and articles on here. These are very helpful and well explained. I do have a question that I am finding difficult to find the answer to within these resources.
Currently I am doing a systematic review of the predictors of clinically significant weight gain amongst those who take antipsychotics. This is usually defined as a 7% or more increase in weight. My inclusion criteria will allow for both RCTs and non-randomised studies of interventions. My question relates to distribution of predictor variables. My understanding is that randomisation will allow for even distribution of potential confounders and baseline characteristics of participants. In an observational study, this may not be the case. My question relates to whether I should be prioritising RCTs over non-randomised studies of interventions because of this. As regression analysis will just be conducted amongst those who are treated with the drug, will baseline data, some of which might function as a predictor variable e.g. sex, be more evenly distributed if the sample arises through a randomisation procedure or not?
I may be over-complicating this. If I include sex in the model, perhaps it accounts for this even if males vs. females are not evenly present within the sample?
If the answer is within a blog post or chapter of a book of yours, please direct me. Any help you could give would be much appreciated – you can probably tell I am quite confused!
Thank you,
Ita
Hi Ita,
I write about this extensively in my Introduction to Statistics book. But, yes, when you are performing an experiment, randomized controlled trials (RCTs) are the best. However, they’re not always possible. Observational studies are an option and you can control for confounders by including them in the model. However, the case for a causal relationship is always weaker with observational studies. And there’s always a chance that there are confounders that you don’t include in the model. Maybe because you didn’t even measure them! But, yes, if you can’t perform an RCT, you’re thinking along the right lines by identifying confounders and including them in the model.
I’d recommend my Introduction book because I talk about the entire issue in depth along with the pros and cons of RCTs vs. observational studies. In my regression book, which it sounds like you have, I write about confounding variables and tips for handling those in chapter 7. But I spend more time comparing and contrasting RCTs and observational studies in my Introduction to Statistics book.
I hope that helps!
Thank you Jim!
I’m sorry… you’re absolutely right about the transition from Adj SS to Adj MS! My problem was primarily regarding the ‘Experience’-row, as both my ‘Regression’ and ‘Major’-row were the same as yours, but given your answer it must be, that the ANOVA-function of my program is coded wrongly.
I guess I have to figure out how my program is entering the values in its ANOVA-function or fiddle around with the function itself!
Thanks for your time and, again, for some fantastically intuitive books!
Have a great one! Sincerely,
You’re very welcome! I’d start by checking which baseline value your program is using for Major.
Hey Jim!
Thanks for some awesome books!
In your regression book, I’ve found something I can’t explain. On page 68, in your statistical output table for an ANOVA test using your categoricalexample.csv data, I believe the numbers in the ‘Experience’ row are way off. The ASS and MSS are the same, which doesn’t make any sense, and when you sum the ASS and MSS for the ‘Major’ and ‘Experience’ rows, they don’t add up to the ‘Regression’ value as expected.
In addition, I get something else when running the ANOVA myself regarding the ‘Major’-row.
Can you confirm this is a mistake in the book, or is my statistical knowledge/output/program just playing games with me?
Sincerely, Peter.
Hi Peter,
Thanks! I’m glad you’re enjoying the books!
Those are all great questions because you have an eagle eye and you’re going beyond understanding only the most commonly used output. Yes, those values are all accurate!
Regarding the Experience row. Keep in mind that the Adj MS value is simply the Adj SS/DF. For Experience, the degrees of freedom is 1. With 1 in the denominator, those two values will be equal!
At first glance, you’d expect the Adj SS values to add up to the total. However, keep in mind that these are the adjusted values, which means that they are calculated for each variable after all the other variables are entered into the model. Technically, they’re not even from the same model. The variables are the same but the order they’re entered into the model is different. For this example, the adjusted values for Experience are calculated after Major is entered in the model. The values for Major are calculated after Experience is entered.
That fact is often overlooked but it is a crucial aspect throughout regression. Remember that the interpretation for a coefficient is that it is the mean change in the DV given a one-unit change in the IV after accounting for the effects of all other variables in the model. That’s true when you use the Adj SS, which is the standard procedure.
I can’t be sure why your program is giving you different answers for Major. However, if I had to guess, I’d say it’s because it’s using a different reference level. For my example, I use Statistics as the reference level for Major. If your application is using a different reference level, you’ll get a somewhat different answer.
Thanks for writing!
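Both points in this reply can be checked numerically: Adj MS equals Adj SS when DF is 1, and adjusted sums of squares need not add up to the regression sum of squares. The sketch below assumes that each term's Adj SS is the increase in error sum of squares when that term is dropped from the full model. The data are invented (they are not the book's categoricalexample.csv), the variable names are hypothetical, and this is not Minitab's code.

```python
import numpy as np

def sse(X, y):
    """Residual sum of squares from a least squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# Hypothetical data: one continuous predictor and a three-level factor.
experience = np.array([1, 3, 4, 6, 7, 9, 2, 5, 8, 10, 11, 12], dtype=float)
major = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])  # level 0 is the reference
response = np.array([31, 36, 37, 43, 50, 55, 40, 47, 60, 66, 67, 70], dtype=float)

ones = np.ones_like(response)
d1 = (major == 1).astype(float)  # indicator for level 1
d2 = (major == 2).astype(float)  # indicator for level 2

X_full = np.column_stack([ones, experience, d1, d2])
X_no_experience = np.column_stack([ones, d1, d2])
X_no_major = np.column_stack([ones, experience])
X_null = ones.reshape(-1, 1)

sse_full = sse(X_full, response)
# Adjusted SS: how much the error SS grows when the term is removed last.
adj_ss = {
    "Experience": sse(X_no_experience, response) - sse_full,
    "Major": sse(X_no_major, response) - sse_full,
}
df = {"Experience": 1, "Major": 2}
adj_ms = {term: adj_ss[term] / df[term] for term in adj_ss}
regression_ss = sse(X_null, response) - sse_full

for term in adj_ss:
    print(f"{term}: DF={df[term]}  Adj SS={adj_ss[term]:.2f}  Adj MS={adj_ms[term]:.2f}")
print(f"Regression SS={regression_ss:.2f}  sum of Adj SS={sum(adj_ss.values()):.2f}")
```

Because experience and major are correlated in this made-up sample, part of the explained variation is shared between them. Each adjusted value counts only what a term adds after the other is already in the model, so the two rows sum to less than the regression total.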
Hello! I was wondering what type of test you would recommend to analyse a categorical DV and two categorical IVs. All have 4 levels.
How do I reference your eBook & comments?
Hi Godwin,
Here’s the reference that I’m recommending. I’m actually in the process of adding the recommended citations into the books themselves for everyone’s convenience!
Frost, J. (2019). Regression Analysis, An intuitive guide for using and interpreting linear models. Statistics By Jim Publishing.
Thank you Sir for the sample copy.
I’m interested in this book. Does it cover Poisson and negative binomial regression?
Hi, I cover the conditions under which you’d use these types of regression but I don’t go through examples of them. This book is mainly about least squares regression, but I do talk about when to use other types and many of the same principles apply to them.
Hi Jim,
Thank you for sharing your knowledge through this blog. It’s really helpful!
I’d like to ask if you know the Wald test for testing equality of regression coefficients between groups. Is it discussed in your new book? I submitted a research article to a peer-reviewed journal in which I used multiple regression modelling to understand technology integration in science and mathematics teaching. I have two separate models for science and mathematics teachers. The reviewer wanted to know if the regression coefficients of the independent variables differ between the science and math teachers.
Thank you in advance.
I’d like a paper copy (of both books) as well. If you print them, I will buy them. Thank you.
Greetings Jim,
I love the simplicity and yet rich content of your e-books. I would like to order some of them. Would you kindly email me your complete list? I will be grateful.
Getrude
I have data of wildlife electrocutions on power lines (>300 events). The IVs of the individual electrocuted would be species, age, weight, rainfall on the day that the individual was electrocuted. I am looking to model that data to see which IV increases the model fit. But I am confused on how to input the DV as it appears to be yes to occurrence of the event, but there isn’t any other level such as no there wasn’t an event. So how do I code the DV? Thank you!
Jim,
Once we get a regression curve in Minitab, with whatever model, is there any way we could extend that curve automatically in Minitab? For example, by changing the range of the independent variable.
Thanks.
Hi,
Typically, you don’t want to extend the curve beyond the observed data. It’s possible that the relationship changes beyond your observed dataset. Consequently, I don’t believe there is a way to extend the curve beyond the dataset in Minitab. You could graph the function using other software. However, again, be wary of trusting predictions beyond the range of observed data! That’s a highly suspect practice.
Hi Jim,
What additional reading does your book offer compared with ISLR?
Congrats on your book. I just visited your blog and find your explanations very clear and useful. Btw, do you also discuss fixed effects regression in your book? Thank you.
Hi Akhmad,
Yes! In fact, because this is an introductory book on regression, all the effects are fixed effects. Mixed effects models are something that would be covered in a more advanced book.
Hello sir, I have started to learn data science and was searching for a good starting point for statistics. After a lot of googling, I found your website today, and there is really good stuff here. Thanks for your hard work sharing knowledge in layman’s terms.
Hi Mr. Frost,
I hope this message finds you well.
Could you kindly respond to this question: is it problematic to run several different multiple regression models (something along the lines of data mining?) until the researcher finds a model to their liking? You hinted at this in your book and I hope you can clarify.
Hi Alice
Thanks for buying my ebook.
Yes, I cover that issue in my book. That’s known as data mining, and it takes advantage of chance correlations in the data. These chance correlations exist in your sample but don’t actually exist in the population. The concern for this problem increases when you try fitting many models and then pick ones based mostly on statistical significance rather than letting theory guide you. Look in the book for the section about data mining. Also, in the section about how to specify the model, I write about the importance of letting theory guide you and not relying solely on significance.
I hope this helps!
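A quick simulation makes the chance-correlation problem concrete. In this sketch, the response and all 1,000 candidate predictors are pure random noise, so any apparent relationship is spurious by construction. The sample size, seed, and number of candidates are arbitrary choices for illustration, and 0.361 is the approximate two-sided 5% critical value of a correlation coefficient when n = 30.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_candidates = 30, 1000

# The response and every candidate predictor are independent random noise,
# so no real relationship exists in the "population".
y = rng.normal(size=n_obs)
X = rng.normal(size=(n_obs, n_candidates))

# Pearson correlation of each candidate predictor with y.
y_c = y - y.mean()
X_c = X - X.mean(axis=0)
r = (X_c.T @ y_c) / (np.linalg.norm(X_c, axis=0) * np.linalg.norm(y_c))

# Approximate two-sided 5% critical value of |r| for n = 30.
CRITICAL_R = 0.361
false_positives = int(np.sum(np.abs(r) > CRITICAL_R))
print(f"{false_positives} of {n_candidates} noise predictors look 'significant'")
```

Roughly 5% of the noise predictors should clear the significance bar. If you search through enough candidates and keep only the winners, you end up with a model built on relationships that exist in the sample alone.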
Great book Jim,
Very intuitive and easy to read.
I just noticed a typo on page 73 where you used the subheading “The case of including it as a continuous variable ” twice.
Thank you, Waqas! I’m glad you’re enjoying the book!
I’ll fix the typo. Not sure how I missed that. Thanks!
Hi Jim. What kind of a mathematical background do you need to grasp this book? Should I already have taken a statistics course? Or do I need calculus or what?
Hi Todd,
You definitely don’t need calculus. A tiny bit of algebra would be helpful for the context about equations for lines. A little knowledge about polynomials and transformations like the natural log might be helpful but not required. I explain them in the book.
A bit of basic statistical knowledge would also be helpful. Such as a basic understanding of p-values, hypothesis tests, confidence intervals, and correlation. Again, it would be helpful to start with some of that knowledge, but I do explain how those concepts apply to regression.
My book focuses on the practical usage of regression and understanding the concepts. It doesn’t focus on the equations behind how it works. There are a few equations where they help explain concepts. My goal is to get readers to be able to understand what regression analysis does, understand it conceptually, and then know how to use it correctly, identify and resolve problems, interpret the results, and defend the results.
In short, I don’t assume extensive knowledge about math or statistics. But, a few basics for both would be helpful.
Good day. I’m glad to know that you deal with the interpretation of regression results. That topic is the one students struggle with most, as I see with my own students. Congratulations, Jim! Greetings from Jujuy, in northern Argentina. Graciela.
You should sell this in Amazon! I am going to get this.
Thank you, Anoop!
Hi Jim, could you please show the table of content of the book?
Hi David,
Thanks for writing. I just added screenshots of the complete table of contents at the end of this post (after the text but before the comments section).
You can also get a free sample of this ebook that you can find in My Store. This sample contains the full Table of Contents and the first two chapters.
What’s the difference between a regression model and an ANOVA model?
Thanks for clarifying Jim!
Hi Jim! Have you used any platform to show these concepts? Python or R or Matlab?
Yes, I use Minitab. However, the output I show will be recognizable to users of all other software (coefficients, p-values, ANOVA table, many graphs, etc.). This ebook is more about teaching how to use and interpret regression analysis, identifying and resolving problems, etc. rather than teaching a particular analytical package.
Hey Jim,
Any chance you could turn this into an actual paper book? I haven’t had much luck reading ebooks in the past. Thank you for your contributions.
Hi Bhavik,
If there’s enough demand for a print version, I’ll create one in the future! Thanks for writing!
Hey Jim! Congrats on your first book. Would you say it would be helpful for someone dealing with lots of econometrics? I would definitely appreciate some intuitive explanations for penalized linear regression methods, logistic regression interpretations and so on… 🙂 Thank you!
Hi Sasha, thank you! It is good for someone who is studying econometrics. In fact, going way back, my first experience with regression was in an econometrics class. However, I do not cover penalized linear models as that is more of an advanced method. I do include an example of binary logistic regression, but not in depth. Perhaps I can cover those more advanced topics in a follow-up book!
Awesome! I can’t wait to read this book. I’ve been waiting for you to write a book for a long time now. You are, hands down, the best stats teacher I’ve ever come across. Where were you when I was struggling with this subject in university?!
Hi Leonard, thank you so much! Your kind words made my day! I hope you enjoy the book!
Dear Jim,
I’ve bought your book, at least as a support of your work that you are doing here on this website. Your posts are very useful and provide me answers that I cannot find in any books about regression analysis.
Best wishes,
Jiri
Hi Jiri, Thank you so much! I appreciate that tremendously! 🙂
Hi Jim,
Congrats!!!
Thank you, Ciro!!
First, congratulations!
Secondly, can users of other software benefit from using your book to grasp the concepts and contents?
Hi Eric,
Yes! The main goals of the book are to teach the concepts, best practices, model specification, interpretation, assumptions, problem solving, avoiding common traps, and so on for regression analysis. In short, I want readers to learn the skills for using regression. These goals all apply to regression analysis regardless of the statistical software someone uses. One thing I don’t do is detail the procedures for performing all of this in Minitab. That’s not the focus. The Minitab output I show should be relatable to users of other statistical software packages. Lots of graphs. And output tables that are similar to other applications such as ANOVA, coefficients, goodness-of-fit, etc.
In my book, I use the same approach that I use throughout my blog posts. So, those should give you a good idea of what to expect.
Congrats! Such an achievement
Thank you!
As a student, I prefer to study through books. So please let me know when the print version is available. And of course come to India once again, Sir. We will be happy to see you.
Hi Ratnadeep, if there’s enough demand for one, I’ll create a print version in the future. Thanks for writing!
Hi Jim! Congratulations on your new book! Looking forward to getting it!
Are there any plans to publish it in print as well?
Hi Chuck,
Thank you and thanks for ordering it!
If there’s enough demand for a print version, I’ll create one in the future. It’ll involve a bunch of work on my part to get that ready.
Hello Jim.
Please what kind of program do you use in this book to perform your Regression Analysis? I am hoping for STATA. ☺
Hi Howard,
I use Minitab throughout the book. I do like the idea of creating companion books for different software packages.
Looks great – I’ll certainly buy a copy today as your way of explaining statistics has been a lifeline on so many occasions during my PhD.
Congratulations on your first book – a massive achievement!
Hi Neil,
Thank you so very much for your kind words. I’m glad to have been of assistance on your journey to a PhD!
Hi Jim, I am lecturing in econometrics. How long will it take to get the book if I pay now?
Hi,
It’s an ebook so you can download it immediately!
Hi Jim,
I am a great fan of your statistics writing.
Your level of explanation is awesome.
Happy to hear about your ebook.
Hearty wishes to you.
Rajesh J
Hi Rajesh,
Thank you so much! I really appreciate your kind words!
Best wishes for you as well.
Congratulations
Thank you, Salwa!
Hi Jim,
Congratulations!
Thank you very much!! 🙂
Hello Jim Sir, I am a post graduate student from India. Your blogs helped me a lot to understand lots of complex topics. I hope the book will also serve the same purpose. I need this book. Kindly let me know the price in Indian currency.
Hi Ratnadeep,
I’m so happy to hear that you found my blog to be helpful. I strive to make complex topics easier to understand. I use the same approach in my book as I do in my blog posts, so I think you’ll like it!
I Googled a currency converter and found that the book is 627 Indian Rupees.
By the way, I’ve had the pleasure of traveling to India several times and loved it! 🙂
Congrats! I was eagerly waiting for your book. Is it available on Amazon?
Hi Adil,
Thank you! Currently, it’s only available for sale on my website. Were you looking for a print version?
Is this available in hard copy too?
Hi Ishita,
It’s not currently available in hard copy. However, if there are enough requests, I might create a physical book version in the future.
Thanks for asking!
Congrats! Will get my copy soon and hope to recommend it to my students
Thank you!
Congratulations🎉🎊
Thank you, Hemant!
I love your style of explaining statistics. Nice to see the eBook now on my laptop.
Thanks for the great work.
Hi Sheng-Leun,
Thank you so much! Both for the kind words and buying the ebook. 🙂
Enjoy!
Very nice to see this. Eager for it and will buy soon.
Thank you, Tesfakiros! I really appreciate that! 🙂