Linear regression is a useful tool for modeling quantitative responses. It has been around for a long time, but it is still a widely used technique. In this episode, we discuss what linear regression is, what we can achieve with it, and how to interpret the results. We start with the base case of one continuous predictor and then discuss the inclusion of a binary predictor. Other interesting topics are discussed as well, such as goodness of fit, adding higher-order terms, working on the log scale, and diagnostics.

The basic principles of regression models are covered, including ordinary least squares estimates, linearity, interpretation of the parameters, and the numerical stability of the fitting procedures.

In this GitHub repository, you will find R scripts that help you understand the basics of linear regression.

Other resources:

Regression Modeling Strategies by Frank E. Harrell

Linear Models with R by Julian J. Faraway, 2nd Edition

**Transcript:**

[00:00:00] **Alexander:** You are listening to the Effective Data Scientist Podcast with Alexander Schacht and Paolo Eusebi.

[00:00:13] Has it ever frustrated you to not understand enough of the statistics part of your job? Does it feel hard for you to convince others of your ideas? No judgment here. Through this show, you will understand statistics and how to implement it without having to study lots of textbooks. You will also learn how to win others over for your ideas.

[00:00:40] This will make you not only more successful at work, but also more satisfied.

[00:00:49] Welcome to this episode, where we are talking about linear regression: an introduction to linear regression. [00:01:00] Paolo, when did you first come across linear regression in your life?

[00:01:05] **Paolo:** Yeah, sure. It was during my first course in statistics at university. My first degree was in economics, but of course, statistics and quantitative methods played a big role in this curriculum.

[00:01:27] So, some introduction to descriptive statistics. And then we jumped into linear regression, which I think was the first inferential engine we learned about. Yeah. Yeah, and I really liked it at that time.

[00:01:50] **Alexander:** Yeah, it was similar for me. First, we learned about mean and standard deviation and things like that, and then, [00:02:00] in the next hour of the lecture, we went on to linear regression. It’s a really powerful tool,

[00:02:07] and it’s something that really everybody should know about. It is as simple as it is powerful, but you can actually make it quite complex. And even with the simple stuff, there are some things to take care of because, in theory, lots of things look really easy.

[00:02:28] But here we’ll also talk about the practical aspects and what can actually go wrong. Paolo, do you know where linear regression was first used or where it comes from?

[00:02:39] **Paolo:** Oh, yeah. I think everything mainly started in the 19th century with Laplace and Gauss, these big people, discovering new ideas, proposing new methods in

[00:02:55] this challenging time without software and [00:03:00] other stuff. Galton, I think, did some regression analysis, proposing some inference on the height of the children as compared to the height of the parents. And this is where the word “regression” comes from, because Galton discovered a nice thing: you have some correlation, let’s say,

[00:03:33] between the height of the children and the height of the parents. But this correlation is not perfect, because we have this regression to the mean effect, which at that time Galton called regression to mediocrity. Galton, you know, was an odd duck for that time. He worked in different areas. Maybe

[00:03:58] not all the work [00:04:00] from Galton was a hundred percent valuable, because he also proposed something about genetics and eugenics and stuff like that. But I think that he did a lot for regression. Yeah. And he also gave the name to this model. Basically, it’s the model for predicting and describing what we have in one variable, given some information in other variables.

[00:04:28] **Alexander:** Yeah. Yeah. So here it’s pretty clear: the height of the parents influences the height of the children and not the other way around. But very often it’s actually not that clear what is driving what. Yeah? So there you need to have a little bit more of a business understanding to really make sure: okay, what is driving what, and what are the outcome parameters that you are really interested in? [00:05:00]

[00:05:00] And so that is something to give a little bit of thought. Now, the other thing: it’s called linear regression, and I think when we think about linear regression, we think about this scatter plot where we have the two variables that we are talking about on the vertical and horizontal axes.

[00:05:25] And then we have these dots that represent all the different samples that we have, the data that we have. And there is this line through them that describes the linear regression. But what does linear actually mean in mathematical terms?

[00:05:48] **Paolo:** It’s the relationship between the covariate (the predictor, the independent variable) and the outcome, which is our y. [00:06:00]

[00:06:00] So if we have many variables or if we have a single variable, the effects of these variables on the y can be described as a sum of different effects. Yeah.

[00:06:19] **Alexander:** Yeah. And it’s linear because it’s linear in its parameters. Yeah. Yeah. So that is, I think a very important point. It’s linear in the parameters.

[00:06:30] So if you think of y equals mu plus beta x, where y and x are the different variables that you’re interested in, then it’s linear in mu and beta. So mu is your intercept and beta is your regression slope. It’s not about x being linear, because you could also have something like x squared or [00:07:00] a square root of x or log x or something like this, and you would still call it a linear model.

[00:07:06] Yeah, because it’s linear in the parameters. So I think that is something to have in mind. What wouldn’t be a linear model, for example, is if you had mu plus beta squared times x. Yeah. That’s not linear anymore. Yeah. So something to have in mind. Let’s talk about the interpretation of these parameters, so you have mu and beta.
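Written out with the symbols from the conversation (mu for the intercept, beta for the slope), the distinction might look as follows. The error term epsilon is an assumption added here for completeness; it is not mentioned in the conversation.

```latex
\begin{align}
  % Linear in the parameters \mu and \beta:
  y &= \mu + \beta x + \varepsilon \\
  % Still linear models: only x is transformed, the parameters enter linearly.
  y &= \mu + \beta x^{2} + \varepsilon \\
  y &= \mu + \beta \log(x) + \varepsilon \\
  % Not linear in the parameters: \beta enters as a square.
  y &= \mu + \beta^{2} x + \varepsilon
\end{align}
```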

[00:07:34] How can you best describe these parameters?

[00:07:39] **Paolo:** Mainly we can say that mu is the intercept and beta is the regression slope. So if we consider just one variable on the right-hand side of the equation, [00:08:00] we can see that mu is what we have in general, without considering the effect of x on y, and beta is the change in y when we have a unit change in x.

[00:08:21] Yeah. And so it’s quite simple to interpret.

[00:08:25] **Alexander:** Yeah. And here now it really depends on how you put your data into the model. Yeah. For example, let’s say, the height again. You have the height of the parents as a covariate, the x here, and y is the height of the children.

[00:08:46] If you just add that in terms of centimeters into your model, then mu would give you the height of children whose parents are [00:09:00] zero centimeters tall, which may not be that helpful. Yeah.

[00:09:04] **Paolo:** Yeah. It doesn’t make any sense. Yeah. For me, yeah.

[00:09:08] **Alexander:** So what you could do, instead of putting the data in as raw centimeters, yeah,

[00:09:15] is just subtract the mean height from all the parents. Yeah.

[00:09:21] **Paolo:** Or maybe some minimum value. Yeah.

[00:09:25] **Alexander:** Or the minimum of the range. Yeah. And then, if you take the minimum value, you know that mu is the average height of the children for the smallest parent that you have in your sample.

[00:09:40] Yeah. Or if you subtract the average, mu is the estimated height of the children at the average height of the parents. Yeah. And so that is, let’s say, a very simple trick that you can use to [00:10:00] make your parameters more interpretable. Yeah. The other thing is the beta, the regression coefficient, which gives you the increase in, here, the height of the children for a unit increase in the height of the

[00:10:19] parents. Yeah. So if you put it in centimeters and you have, let’s say, a beta of 0.9, then for a one-centimeter increase in the height of the parents, the children increase their height by 0.9 centimeters. If you put the children’s height in meters, yeah, then you get 0.009.

[00:10:52] Yeah, 0.009. The coefficient is divided by 100, [00:11:00] so it changes with the scales that you put in. And that is really important to have in mind if you want to have easily interpretable numbers. Yeah. Especially if you round things to display them in a table.

[00:11:13] I very often see that people don’t use meaningful units, because they could be anything. Yeah. If you think about it, let’s say you have sales in dollars as something in there. Yeah. Then you probably don’t want to get it in terms of units of a $1 sales increase. Yeah. But maybe you want to get it in units of a 1 million sales increase, or a 10,000 sales increase, or whatever.

[00:11:40] Yeah. Have a look into this so that the parameters actually make sense for you.
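The centering and unit effects discussed above can be checked with a small sketch. It uses only the Python standard library; the helper `fit_line` and the height values are hypothetical and made up for illustration, not data from the episode.

```python
from statistics import mean


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y = mu + beta * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sxy / sxx
    return y_bar - beta * x_bar, beta


# Hypothetical heights in centimeters (made up for illustration).
parent_cm = [160.0, 165.0, 170.0, 175.0, 180.0, 185.0]
child_cm = [163.0, 169.0, 171.0, 178.0, 179.0, 186.0]

mu_raw, beta_raw = fit_line(parent_cm, child_cm)

# Centering the covariate leaves the slope unchanged; the intercept becomes
# the estimated child height at the average parent height.
parent_centered = [p - mean(parent_cm) for p in parent_cm]
mu_centered, beta_centered = fit_line(parent_centered, child_cm)

# Changing units rescales the slope: outcome in meters divides it by 100,
# covariate in meters multiplies it by 100.
child_m = [c / 100 for c in child_cm]
parent_m = [p / 100 for p in parent_cm]
_, beta_child_in_m = fit_line(parent_cm, child_m)
_, beta_parent_in_m = fit_line(parent_m, child_cm)

print(f"raw:        mu = {mu_raw:.2f}, beta = {beta_raw:.4f}")
print(f"centered:   mu = {mu_centered:.2f}, beta = {beta_centered:.4f}")
print(f"child in m: beta = {beta_child_in_m:.6f}")
print(f"parent in m: beta = {beta_parent_in_m:.2f}")
```

Note that the direction of the rescaling depends on which variable changes units: a slope that shrinks by a factor of 100 corresponds to recording the outcome in meters while the covariate stays in centimeters.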

[00:11:45] **Paolo:** And this is especially useful when you have more covariates in your model. When you look at the table, maybe you have some parameters in terms of, I don’t know, maybe [00:12:00] 20. Then you have another parameter, which is 0.0003, and stuff like that.

[00:12:07] So in general, knowing that you can transform your variables, subtract and divide and stuff like that, can help in the overall interpretation of the model. And in general, it also helps the numerical stability of the modeling. So if your covariates are more or less on the same scale, let’s say,

[00:12:40] the numerical stability of the fitting procedures is a bit better, and they tend to be more stable. Yeah.

[00:12:47] **Alexander:** Yeah. Yeah. So one thing to really have a comparable understanding here: if you have a couple of different variables, yeah, and all are linear, what you could [00:13:00] do is normalize all of these.

[00:13:02] Yeah. First, subtract the mean from all of them, and then divide by the standard deviation of each of them. Yeah. And then, for all of the parameters, you measure your mu at the average of all your covariates, and every regression coefficient corresponds to a change of one standard deviation.

[00:13:31] Yeah. Yeah. And that is maybe something you can think about. And therefore it’s somehow easier, maybe, to compare regression coefficients. Yeah. I think that’s the point of it. But yeah, then of course it always depends on your sample. Yeah. So that’s the other thing.
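A minimal sketch of this normalization, shown for a single covariate to keep it short (the conversation is about several). The helpers `fit_line` and `standardize` and the data are hypothetical; only the Python standard library is used.

```python
from statistics import mean, stdev


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y = mu + beta * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sxy / sxx
    return y_bar - beta * x_bar, beta


def standardize(values):
    """Subtract the mean, then divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical heights in centimeters (made up for illustration).
parent_cm = [160.0, 165.0, 170.0, 175.0, 180.0, 185.0]
child_cm = [163.0, 169.0, 171.0, 178.0, 179.0, 186.0]

mu, beta = fit_line(parent_cm, child_cm)
mu_z, beta_z = fit_line(standardize(parent_cm), child_cm)

# After standardizing, the intercept is the mean outcome and the slope is the
# change in the outcome per one standard deviation of the covariate.
print(f"original:     mu = {mu:.2f}, beta = {beta:.4f}")
print(f"standardized: mu = {mu_z:.2f}, beta = {beta_z:.4f}")
print(f"beta * sd(x) = {beta * stdev(parent_cm):.4f}")
```

Because the standard deviation is computed from the sample at hand, the standardized coefficient changes with the sample, which is the caveat raised at the end of the turn above.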

[00:13:51] **Paolo:** Yeah. And it depends maybe on the purpose. So maybe you can do both, because, yeah, you can have your coefficient, which is [00:14:00] interpretable. And you can sell your table to your stakeholder, and your stakeholder can easily interpret that: if the change in investment is this amount, then you have this amount in return.

[00:14:17] But then you also have this other interpretation with the partial correlation coefficients, and then you can compare the parameters with each other. So you can see which covariate is maybe more important in the model in terms of the strength of the association, because everything is normalized.

[00:14:42] Yeah.

[00:14:43] **Alexander:** Yeah. Basically. So, in terms of the strength of the impact, and the model itself. Yeah. So linear regression is a model, and as Box once said, all models are wrong, but some are useful. [00:15:00] It’s really important to check how wrong your model is. Yeah. So we often just assume that this model is linear.

[00:15:09] Yeah. But very often people forget to actually check it. Yeah. I always recommend that you have a scatter plot and you put your line through it to at least see whether there’s anything going wrong there, and whether this line really represents the data. So that’s an important first step to have a look into.

[00:15:34] **Paolo:** Yeah, this is an important first step. Then we also have some maybe more advanced approaches, because this is quite useful if you have one covariate. But when you have more covariates, it becomes challenging to see the impact of one covariate on the [00:16:00] y variable without considering the others, because

[00:16:03] everything in this relationship is also influenced by the other covariates in your model. And we have these graphical devices, like partial regression plots or partial residual plots, where, basically, you can isolate your y and your x without the influence of the other covariates.

[00:16:34] Yep. So it’s something that you can maybe check, but often, in your daily routine in terms of modeling, the first plot you mentioned is the basic, standard, and useful approach.
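The idea behind a partial regression plot can be sketched numerically: remove the influence of the other covariate from both y and x, then relate the two sets of residuals. The data below are hypothetical and noise-free by construction (y = 1 + 2x + 3z), so that the expected partial slope of 2 is known; the helper names are assumptions for illustration.

```python
from statistics import mean


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y = mu + beta * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sxy / sxx
    return y_bar - beta * x_bar, beta


def residuals(x, y):
    """Residuals of y after a least squares fit on x."""
    mu, beta = fit_line(x, y)
    return [yi - (mu + beta * xi) for xi, yi in zip(x, y)]


# Hypothetical, noise-free data: y = 1 + 2 * x + 3 * z, with x and z correlated.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
z = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [1 + 2 * xi + 3 * zi for xi, zi in zip(x, z)]

y_given_z = residuals(z, y)  # part of y not explained by z
x_given_z = residuals(z, x)  # part of x not explained by z

# A partial regression plot would show y_given_z against x_given_z.
# The slope through those points is the effect of x adjusted for z.
_, partial_slope = fit_line(x_given_z, y_given_z)

# Ignoring z mixes its effect into the slope for x.
_, naive_slope = fit_line(x, y)

print(f"slope of y on x, ignoring z:  {naive_slope:.3f}")
print(f"slope of y on x, adjusted for z: {partial_slope:.3f}")
```

With real, noisy data the adjusted slope would match the coefficient of x from the multiple regression rather than a known true value.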

[00:16:48] **Alexander:** Yeah. Let’s talk a little bit further about this plot, because you’ll see a couple of things in there, and you just mentioned this

[00:16:57] word, residual. Yeah. [00:17:00] Residuals are basically the errors, what you can’t explain in y with your linear model. Yeah, so a residual basically is the difference between what you can explain through mu plus beta x and your y. Let’s say you have children again, a child again, who is 1.8 meters tall.

[00:17:30] Yeah. And based on your model, so mu plus beta x for the parent, you would get 1.85 meters. Yeah, so 1.8 is the real measured height, and 1.85 is what you calculate through the model. Then there’s this difference of five centimeters. That is your residual.

[00:17:58] Yeah. And [00:18:00] so you can see that in your scatter plot when you look vertically: the differences between the data points and your linear regression line are the residuals. And just as a side note, the linear regression is actually fitted so that the sum of the squares of these residuals is minimal.

[00:18:33] That’s why it’s also called the least squares approach.
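The residual from the example above, and the claim that the fitted line has the smallest sum of squared residuals, can be sketched as follows. The sign convention (observed minus fitted) is the usual one and is an assumption here, as are the helper names and the made-up heights.

```python
from statistics import mean


def fit_line(x, y):
    """Return (intercept, slope) of the least squares line y = mu + beta * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    beta = sxy / sxx
    return y_bar - beta * x_bar, beta


def sum_of_squares(mu, beta, x, y):
    """Sum of squared residuals for the line y = mu + beta * x."""
    return sum((yi - (mu + beta * xi)) ** 2 for xi, yi in zip(x, y))


# The example from the conversation: measured 1.80 m, model says 1.85 m.
observed, fitted = 1.80, 1.85
residual = observed - fitted  # negative: the child lies below the line

# Hypothetical heights in centimeters (made up for illustration).
parent_cm = [160.0, 165.0, 170.0, 175.0, 180.0, 185.0]
child_cm = [163.0, 169.0, 171.0, 178.0, 179.0, 186.0]

mu_hat, beta_hat = fit_line(parent_cm, child_cm)
residuals = [c - (mu_hat + beta_hat * p) for p, c in zip(parent_cm, child_cm)]
best = sum_of_squares(mu_hat, beta_hat, parent_cm, child_cm)

# Nudging the intercept or the slope away from the fit increases the sum.
nudged = [
    sum_of_squares(mu_hat + d_mu, beta_hat + d_beta, parent_cm, child_cm)
    for d_mu, d_beta in [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.01), (0.0, -0.01)]
]

print(f"residual in the example: {residual:.2f} m")
print(f"sum of squares at the fit: {best:.3f}")
print(f"sum of squares for nudged lines: {[round(v, 3) for v in nudged]}")
```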

[00:18:38] **Paolo:** Why square the residuals and not just take them as they are?

[00:18:43] **Alexander:** Oh, that is actually a very interesting thing. I actually had a couple of mathematical statistics courses that showed that this square is really

[00:18:55] a very unstable thing. Yeah. If, instead of taking to the power [00:19:00] of two, you take to the power of two plus or minus a little bit, it already doesn’t work anymore. There’s something magical about this two, so to say. Yeah. So it makes all these different things work. Yeah, it’s really interesting.

[00:19:17] And if you really want to learn more about that, then dive into some mathematical statistics.

[00:19:22] **Paolo:** And then I think that maybe we don’t care about the sign of the residuals, because it can be negative, it can be positive. But in the end, we are mainly interested in the magnitude of the residual.

[00:19:46] Yeah. Which is bad, yeah, in general, if we have large residuals. And even if these residuals are changing, looking at different dimensions.

[00:19:59] **Alexander:** [00:20:00] Yeah. Yeah. So what you’re also basically doing here is minimizing the variability of these residuals. Yeah. So if you think about the variance of the residuals themselves, it is

[00:20:17] minimal for this linear regression. So the mu and the beta are chosen in such a way that the variance of the residuals is minimal. And that’s another aspect to have a look into.

[00:20:32] **Paolo:** Yeah. And then we have this nice property of these least squares, these ordinary least squares estimates, because you get your beta and mu by minimizing the sum of the squares of these residuals, and it is the best linear unbiased

[00:20:57] estimator we can get, compared [00:21:00] with any other approach. Yeah.
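For one covariate, the least squares estimates described above have a well-known closed form, written here with the mu and beta of the conversation. The index i over the n observations and the bar notation for sample means are notation added for this sketch.

```latex
\begin{align}
  (\hat{\mu}, \hat{\beta})
    &= \arg\min_{\mu,\,\beta} \sum_{i=1}^{n} \left( y_i - \mu - \beta x_i \right)^2 \\
  \hat{\beta}
    &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
            {\sum_{i=1}^{n} (x_i - \bar{x})^2} \\
  \hat{\mu}
    &= \bar{y} - \hat{\beta}\,\bar{x}
\end{align}
```

“Best” in best linear unbiased estimator means smallest variance among all linear unbiased estimators. This is the Gauss-Markov result, and it assumes errors with mean zero, equal variance, and no correlation between them; it does not require the errors to be normally distributed.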

[00:21:02] **Alexander:** Yeah. And unbiased means that on average you get what you want to get, so to say. Yeah. In plain language. And the other thing is that these estimates that you get here, what your Python or whatever kind of code puts out for you, have some really nice properties.

[00:21:23] Say, they are actually normally distributed. So they follow a normal distribution, and based on that we can do all sorts of interesting things with them. So much for a short introduction to linear regression. We talked about where this is coming from. We had a really nice example with the parents and the kids, and saw that you can do lots of other things with it.

[00:21:51] We talked about the parameters, the mu and the beta, and how it actually works. So I hope you [00:22:00] liked this first episode. In the next episode, we will go deeper into linear regression, because it’s such an important topic for data scientists, and there are a lot of further things you can actually learn about.

[00:22:17] So if you enjoyed this episode, then listen to the next one and tell your friends and colleagues about us.

[00:22:30] Thanks for listening. Check out the show notes at theeffectivedatascientist.com. Did you like this episode? Tell your peers about it. Thanks to Reine who helps us with the show in the background. Act on what you have learned and be an effective data scientist.