In this episode, we continue to dive into the linear regression model. What are the real-world applications? When is our model fit good enough? What are the pros and cons of increasing the complexity of our model?
We also discuss the basic principles of covariate transformations (e.g. the logarithm) and how these played a pivotal role in the modeling of the Covid-19 epidemic.
Furthermore, we discuss how to model and interpret interactions between covariates.
In this GitHub Repository, you will find R scripts that help you understand the basics of linear regression.
Other resources:
Linear Models with R by Julian J. Faraway, 2nd Edition
Regression Modeling Strategies by Frank E. Harrell
Transcript:
[00:00:00] Alexander: This is the second part about becoming familiar with linear regression.
[00:00:06] Welcome to the next episode of the Effective Data Scientist. Again, it’s Paolo and myself. Hi Paolo. How are you doing today?
[00:00:18] Paolo: Very well, Alexander.
[00:00:20] Alexander: Thanks. Yeah. If you enjoyed the last episode of this podcast, you know that we are talking about linear regression today, and linear regression is a really important topic.
[00:00:34] If you want to get to the introduction of it, just scroll a little bit back in your podcast player, go to the episode before this one, and get a little bit of an overview. In this episode, we will build on what we discussed last week. So last week we talked about this example of the heights of parents and kids, but of [00:01:00] course that could be any other pair of parameters, for example, sales versus investment, or it could be the number of email subscribers versus the number of posts on social media, or something like this. Yeah. All kinds of different continuous variables that we look into.
[00:01:24] And so we talked a little bit last time about what can go wrong, but today we want to go a little bit further into this, starting from the goodness-of-fit statistics. So if you have a linear regression, yeah, and just take, for example, the exercise again with the heights of the parents and the kids: how can you then determine whether you have a good fit for your model?
[00:01:52] Paolo: I’m thinking about the R squared. Yeah, which in general [00:02:00] is a measure of fit of our model. It can be zero with no fit at all. So, like, our data points are just a cloud in our scatter plot, and our regression line is a line running at the intercept level, which just describes the mean of the data points. It’s really bad.
[00:02:31] Alexander: Yeah. So basically you have no slope whatsoever. Yeah. Yeah. So this means it doesn’t matter whether you take your covariate into account. It has actually no influence whatsoever on the variable of interest.
[00:02:47] Paolo: There is no relationship to describe and report. And then the R squared can increase up to one, which is the maximum. And if you have this fit, all your data points are on the regression line. So there is a perfect regression describing what you have in your data set, and your residuals are zero. Yeah, everything is perfectly described, so it’s something that never happens.
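The zero-to-one range Paolo describes is easy to see in code. The show notes link R scripts; here is a quick Python sketch instead, with invented toy data, computing the R squared from the closed-form least-squares line:

```python
# Toy data, invented for illustration: a perfectly linear relationship.
def fit_line(x, y):
    """Least-squares estimates (intercept, slope) for a simple linear regression."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope

def r_squared(x, y):
    """R^2 = 1 - SS_residual / SS_total: 0 means no fit, 1 a perfect fit."""
    intercept, slope = fit_line(x, y)
    my = sum(y) / len(y)
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return 1 - ss_res / ss_tot

x = [1, 2, 3, 4, 5]
y = [2.0, 4.0, 6.0, 8.0, 10.0]   # every point sits exactly on y = 2x
print(r_squared(x, y))            # 1.0: all residuals are zero
```

With a shapeless cloud of points, the same function would return something near zero, which is the flat-line case from the discussion.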
[00:03:27] Alexander: If that happens in real life, you usually have some intrinsic correlation in there, I don’t know, sales in dollars versus sales in euros or something like that. So something completely obvious. Yeah.
[00:03:41] Paolo: Or maybe. I found something like that in biochemistry. Sometimes you have this experiment where maybe you have some concentration, I don’t know, and you measure some quantity, then you have some dilution, and then you measure this quantity once again. And sometimes you expect that there is some near-perfect correlation between the concentration and the protein or stuff like that. Maybe your example about economics is interesting, because mainly your R squared depends on the model, but maybe also on the field, on your research field, on your research question.
[00:04:25] Because when I was at university studying economics, of course, if you study correlations between economic quantities like employment versus economic growth, then of course you have a strong relationship, and maybe your R squared is something like 0.7, 0.8, yeah, or 0.9 sometimes.
[00:04:51] Alexander: Also, if you look into physics and you do physical experiments, yeah, very often you have [00:05:00] very high R squared values. But if you, for example, look into psychology, then you very often have very small R squared values. Yeah. And that’s very often the best you can do. So if you look into, let’s say, how EQ or IQ, yeah,
[00:05:23] determines income. Yeah. Of course, there are lots of other factors influencing income, and so, you know, there may be only a very small R squared, but even that is of interest. Yeah. So it always depends on which field you are in. And so don’t rely on these rules of thumb that say this R squared is small, this R squared is big, and this R squared is medium.
[00:05:53] I think that really depends on where you are looking and also on what the business environment is. Yeah, so [00:06:00] for example, is it big enough to invest something here? Yeah. Is it big enough that it really drives some decisions? Or is it on a scale where it actually gets lost in the surroundings and we shouldn’t care about it? Yeah.
[00:06:18] So here, understanding the business environment is really important to assess this goodness of fit. So don’t only think about statistical outcomes here, but take into account what can happen here and see the bigger picture.
[00:06:36] Paolo: Maybe the R squared could be useful if you run different models for answering the same question. Then you can see how your fit increases depending on, [00:07:00] I don’t know, the increasing number or types of regressors, or maybe just because you are doing something right with the transformation of your regressors, stuff like that. And comparing your models within the same data science exercise could be useful.
[00:07:17] Alexander: Yeah, so actually, that was one of my first exercises in my first hands-on statistics course. I had this linear relationship that we looked into, and then we were thinking, okay, we could have just this line, but maybe it’s not a line, maybe it’s something different.
[00:07:41] And so a good thing to look into is whether you would like to add higher-order terms in here. So maybe it’s a quadratic or cubic approach. And you can very easily check this by just [00:08:00] adding further variables to your dataset. Yeah. So if you have, let’s say, the height of the parents, there you are also adding height to the power of two, to the power of three,
[00:08:12] to the power of four. Yeah. Or one divided by height, or the log of height. Yeah. As variables to your dataset. And you put these also into the regression. Yeah. And then, with the R squared, which Paolo talked about, you can see whether this R squared actually increases or not.
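As a sketch of that (in Python rather than the linked R scripts, and with hypothetical toy data), adding a squared term is just an extra column in the design matrix, and you can watch the R squared move when you refit:

```python
# Hypothetical toy data generated from a quadratic relationship plus noise.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 2 + 0.5 * x + 0.3 * x**2 + rng.normal(0, 1, size=x.size)

def r_squared(X, y):
    """Fit ordinary least squares and return the R squared."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))

ones = np.ones_like(x)
linear = np.column_stack([ones, x])           # y ~ x
quadratic = np.column_stack([ones, x, x**2])  # y ~ x + x^2
print(r_squared(linear, y), r_squared(quadratic, y))
# A nested model with extra terms can never have a lower R squared.
```

One caveat the comparison hides: the R squared of a nested model never goes down when you add terms, which is part of why comparing models on R squared alone can mislead.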
[00:08:36] Paolo: In general, it’s hard, at least for me, to go beyond the square. Beyond that, it becomes really difficult to guess what we can have, at least for me; maybe the listeners or you can have a better understanding of what’s happening just by looking at the scatter plots.
[00:08:58] Alexander: So there’s actually an interesting research area that is called Fractional Polynomials.
[00:09:06] It’s maybe a little bit more advanced, but there are these publications by Royston and Sauerbrei, and they also have a lot of material publicly available for free. And their idea is that just, I think, eight different powers, so to say, yeah, so I think they talk about an x squared, an x cubed, a linear term, a log term, an inverse term, one over x squared, and square root terms.
[00:09:38] I think that was it. Just looking into these is probably sufficient to model pretty much all of the different dependencies that you can have. Yeah. And as a rough first approach, yeah, that might be good enough for you to come up with a better fit for your model. Yeah. [00:10:00] But of course there’s also a cost to it. Always. Yeah. So there’s never a free lunch.
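For reference, the usual first-degree fractional-polynomial power set is {-2, -1, -0.5, 0, 0.5, 1, 2, 3}, where 0 is read as log x. A minimal Python sketch (toy data and helper names invented for illustration) that screens these powers by R squared:

```python
import math

# The standard FP1 power set; by convention, power 0 is read as log(x).
POWERS = [-2, -1, -0.5, 0, 0.5, 1, 2, 3]

def fp(x, p):
    """Apply one fractional-polynomial transformation to positive data."""
    return [math.log(v) if p == 0 else v ** p for v in x]

def r2(x, y):
    """R squared of a simple linear regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((v - mx) ** 2 for v in x)
    slope = sum((v - mx) * (w - my) for v, w in zip(x, y)) / sxx
    intercept = my - slope * mx
    ss_res = sum((w - (intercept + slope * v)) ** 2 for v, w in zip(x, y))
    ss_tot = sum((w - my) ** 2 for w in y)
    return 1 - ss_res / ss_tot

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [math.sqrt(v) for v in x]   # the true relationship is a square root
best = max(POWERS, key=lambda p: r2(fp(x, p), y))
print(best)                      # 0.5: the square-root power wins here
```

Real fractional-polynomial software (e.g. the methods in Royston and Sauerbrei’s publications) does more careful model selection than this brute-force R squared scan, but the idea of trying a small fixed set of powers is the same.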
[00:10:07] Paolo: Yeah. But I was wondering, what is the cost of this approach? Because it looks awesome.
[00:10:14] Alexander: Yeah, at first glance, you think, well, add in everything. Yeah. But of course, that makes your model very clunky and much more difficult to interpret. Yeah. The beauty of the linear model is that it’s really easy to communicate,
[00:10:34] very easy to explain. Yeah. Because for one unit more on this covariate, you have beta units more on the outcome. Yeah. And so it’s a really simple thing to describe.
[00:10:51] Paolo: Oh yeah. Maybe this explains why fractional polynomials are mainly used in prediction and prognostic research maybe.
[00:11:01] Alexander: And that is the difference between prediction and estimation, something we’ll dive into later in this series on linear regression. But for now, think about it: you always need to pay a cost in terms of complexity, but actually also in terms of a couple of the statistical properties here. Yeah. So you can think of your model as becoming more and more unstable
[00:11:29] the more terms you add. So think about this. One specific thing is to look into the logarithmic scale. Yeah. So that is something that sometimes is really interesting. So you have this linear regression, very straightforward, yeah, mu plus beta x. And instead of the x you use
[00:11:52] the log of the x. That gives you a very different interpretation of things, because then the beta is not anymore a one-unit-more thing, where for one added unit you get this increase. Basically, it means every time you double, well, depending on which log you take. Yeah.
[00:12:19] Every time the covariate is ten times higher, if you take the decimal log, or every time the covariate is two times higher, if you take the binary log, the outcome increases by beta. So if you have these types of relationships, yeah, these kinds of exponential things, then that is something interesting to have a look into.
[00:12:43] Paolo: Yeah. When maybe you have this exponential growth, yeah, also in the initial phase of an epidemic, you want to have some explanation of the doubling of the cases, or maybe of when the cases are multiplied by 10. And you can see, yeah, you can explain a bit more of what’s happening in this outbreak.
[00:13:13] Alexander: Yeah. Or maybe you have a successful startup, yeah, and the growth of the startup, the sales or something like this, is growing on an exponential scale, at least in the beginning. Yeah. And time is your covariate, and that is something to look into. Yeah. And you can also rescale your axis accordingly.
[00:13:38] Yeah. So you see that sometimes in these dashboards on the pandemic: you don’t see a linear scale, but you see a log scale. Yeah, and in the same way you could do that for your linear regression here. It’s sometimes these really simple kinds of transformations of the parameters that can help you [00:14:00] to have a much better fit of your model.
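A tiny Python illustration of the doubling interpretation (the case numbers are hypothetical): regress the outcome on the base-2 log of the covariate, and the slope reads as "change per doubling":

```python
import math

# Hypothetical outbreak numbers: the outcome rises by 3 per doubling of cases.
cases = [100, 200, 400, 800, 1600]
outcome = [10.0, 13.0, 16.0, 19.0, 22.0]

# Regress on log2(cases) so the slope reads "change per doubling".
log_cases = [math.log2(c) for c in cases]
n = len(log_cases)
mx, my = sum(log_cases) / n, sum(outcome) / n
slope = sum((a - mx) * (b - my) for a, b in zip(log_cases, outcome)) / sum(
    (a - mx) ** 2 for a in log_cases
)
print(slope)  # ~3.0: plus three on the outcome every time the cases double
```

Swapping `math.log2` for `math.log10` would make the same slope read as "change per tenfold increase", which is the decimal-log variant from the discussion.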
[00:14:04] Okay. Let’s step a little bit further in terms of having a number of variables. Yeah. So far we talked only about having one covariate, but now if you look into this parent and child thing, you could say, maybe I have the height of the mother, or I have the height of the father. Yeah. So instead of having just one regressor, you have either the height of the mother or the height of the father.
[00:14:38] Yeah. And you basically have two variables that you need to take into account: the gender of the parent and the height of the parent. And now you have the height of the children, and maybe you want to have both as covariates in your model. Now, what happens then?
[00:15:01] Paolo: Yeah, we have a multivariable model, and then of course maybe we have correlated covariates. Yeah. I think that this is the main issue I have when modeling stuff with many covariates.
[00:15:22] Alexander: Yeah. Yeah. So that is something where your model can break down. Yeah. So imagine you have two covariates that are very highly correlated with each other. Then what happens is that one variable, yeah,
[00:15:43] explains most of the outcome, and then the other variable gets nearly no beta, or with a few changes, it completely switches around. Yeah. So depending on your sample. Yeah. Yeah.
[00:15:59] Paolo: At first sight, it can be that the second variable is negatively related to the outcome when it should be positively related, yeah, at least in your understanding.
[00:16:11] Alexander: Yeah. And so this is something to check first, for sure. Yeah. To look into the correlations of all the parameters that you want to look into. Yeah. So you can do that by using, for example, these correlation plots.
[00:16:27] Yeah, it’s a really nice way to look into the correlations, or at least the one-to-one correlations, of many endpoints or many variables at the same time. These correlation plots are something where we can put some pictures and some code into the show notes. So just head over to The Effective Data Scientist and then you’ll see what we are talking about.
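As a sketch of that check, here is a plain-Python pairwise correlation scan (the heights and variable names are invented toy data); a correlation plot just visualizes this same matrix:

```python
# Invented toy numbers; a correlation plot shows this matrix visually.
def pearson(x, y):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

data = {
    "height_mother": [160.0, 165.0, 170.0, 175.0, 180.0],
    "height_father": [170.0, 176.0, 181.0, 185.0, 191.0],  # tracks the mother's closely
    "n_siblings":    [3.0, 1.0, 2.0, 0.0, 2.0],
}
cols = list(data)
for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        print(f"{a} vs {b}: {pearson(data[a], data[b]):+.2f}")
```

A pair with a correlation near +1 or -1 here is exactly the situation where the betas can flip around between samples, as discussed above.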
[00:16:50] Very nice visual impression. The other thing, of course, is having an understanding of what all these different variables in your data set actually mean. That is also really important, because just from your understanding of the field that you’re working in, you’ll already see: okay, that variable is highly related to this other variable. Yeah.
[00:17:17] Paolo: This makes me think about the regression modeling exercise in general. So it’s more about thinking about a possible explanation. It’s impossible to just run different models and decide everything according to some statistical outputs.
[00:17:41] You need to have a clear understanding of your business or of your problem, and then of what’s important, of what we need to explain. Because I think that when we have this problem of correlated predictors, often one of them is just taken out. And when you sacrifice one potential explanatory variable, one way to decide could be related to your expertise, your subject matter expertise.
[00:18:24] Alexander: And if you don’t know exactly which one to take out, then have a discussion with your business partner who actually benefits from this regression model. Yeah. What would they like to see as an explanatory variable, A or B, if A and B are highly correlated? So these are a couple of practical things to have a look into.
[00:18:48] But let’s go back to the very, very simple case where you have a binary covariate and a continuous covariate. Yeah. So what then basically happens is that in your scatter plot you have two clouds: one where you see everything from the first category, and one where you see everything from the second category of this binary covariate.
[00:19:17] And now there are a couple of different things that you can do. First, you can assume that both variables influence the outcome independently.
[00:19:33] Paolo: Yeah. So there is no interaction
[00:19:34] Alexander: Yeah, there is no interaction. So irrespective of the binary covariate, the influence of the continuous covariate is the same.
[00:19:50] Paolo: So we have mu plus beta one x one plus beta two x two plus the error term.
[00:19:56] Alexander: Exactly. And x two is an indicator variable, set as one for the first group and zero for the other group. Yeah. Yeah. So you could also rewrite it so that you have mu plus beta one x one, plus beta two for the first group, or plus zero for the other group.
[00:20:21] Yeah, so it’s in kind of these very simple terms. So what then is the interpretation? It is that the beta two, the second term, which is in front of the binary covariate, is just the vertical difference between the two lines. Yeah, so in your scatter plot you would see two parallel lines, and the difference between them is the effect, so to say, of the binary variable, because the unit is just, we made it one.
[00:20:59] Paolo: So we have the same slope, but different intercepts.
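Paolo's "same slope, different intercepts" model can be checked numerically. A Python sketch where the toy data are built exactly from mu + beta1*x1 + beta2*x2, so the fit recovers the coefficients:

```python
import numpy as np

# Toy data built exactly from mu + beta1*x1 + beta2*x2 with mu=1, beta1=2, beta2=5.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])  # continuous covariate
x2 = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])  # 0/1 group indicator
y = 1.0 + 2.0 * x1 + 5.0 * x2

X = np.column_stack([np.ones_like(x1), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # ~[1, 2, 5]: one shared slope, and intercepts 1 and 1 + 5
```

Both groups share the slope 2; the indicator's coefficient 5 is exactly the vertical gap between the two parallel lines.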
[00:21:03] Alexander: Exactly. That assumes that there are no interactions, as we say, between these two different covariates. Now, that is one thing that might happen. But how do we deal with a potential interaction, and what does that actually mean? Yeah.
[00:21:21] Paolo: I think that it makes total sense to have some interaction between different variables.
[00:21:30] Of course. Let’s think about a business problem. Maybe you want to invest in Facebook advertising and then you want to invest in Twitter advertising. Yeah. And maybe investing on both social networks can have an interaction. [00:22:00] Your popularity is growing; maybe there is some group of people looking at both social networks, and the chance of your visibility increases.
[00:22:06] So yeah, maybe you can have this interaction. Or maybe think about medical problems: maybe you have one treatment and you have one biomarker, yeah, which is, I don’t know, one gene or one physical characteristic like BMI. And maybe the treatment works very well for your outcome, but especially well if you are, maybe, not obese.
[00:22:41] Alexander: Yep. So coming back to the Twitter and Facebook example. Yeah. So what you might be interested in is: if you invest $100 on Facebook ads, yeah, you get $80 back. No, you should get more back, so $180 back? Yeah. And if you invest in Twitter, you get $200 back. Yeah. So your regression coefficient actually depends on your binary variable,
[00:23:16] Facebook versus Twitter. Yeah. And so here the interaction is actually the interesting thing. Yeah. It’s not so much about the average; it’s really the interaction that is interesting. Now, an interaction is a really strange thing. Yeah. Because if you think about it, if there’s an interaction, that means the
[00:23:42] two lines cross each other at some point. Yeah. And now, the important thing here: it basically means that depending on how much you invest, yeah, you get a differential outcome. Yeah, of course. For every dollar you invest more, yeah, you get more returns from Twitter and less from Facebook.
[00:24:11] But that doesn’t mean that you should invest in Twitter, because it depends on where your investment will sit. Yeah, think about it.
[00:24:24] Paolo: Before or after the crossing point.
[00:24:28] Alexander: Exactly. Yeah. So that is an interesting piece. Yeah. Where are all the data that you have? Yeah. Is that all before the crossing point or after the crossing point?
[00:24:40] Yeah. Or think about your scatter plot. Yeah. You have your scatter plot with all your data samples, and then you have the two lines in there. Yeah. And where you have your data, maybe Facebook is always higher than Twitter, [00:25:00] yeah, although the slope is not as steep. Yep. Yeah. And then of course, for the data that you have, on average it’ll always be better to invest in Facebook, although the incremental investment gives a much steeper curve for Twitter. Yeah. Or it could be the other way around: Twitter has a steeper curve and it always sits on top of Facebook. Or it could also be that your data is all scattered around the
[00:25:37] crossing point, and so it really depends on where your investment is. Yeah. Yeah. So that is something to have a look into. Here we also very often speak about qualitative versus quantitative interactions. Yeah. So when the crossing point is in the middle of your data, in the middle of the area
[00:26:01] that is interesting for you, then we speak about a qualitative interaction. Yeah. Because it really depends on whether you are right or left of this crossing point; it changes your decision. Yeah. So it’s not just about how much you get, yeah, it really changes your decision: on the one side Facebook is better than Twitter, and on the other side Twitter is better than Facebook. Yeah. So it really turns around. Whereas if all your data points and your area of interest are on one side of the crossing point, then you speak about a quantitative interaction.
[00:26:45] So of course there is a relationship, but it only affects the size, not the direction. Yeah, not the direction. Yeah. So have a look [00:27:00] into this, and that gives you a much better understanding of what’s actually going on. Yeah. So that was a first step in adding a little bit more complexity to the linear model that we discussed in this episode.
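The Facebook/Twitter crossing point can be sketched the same way (all numbers hypothetical): the interaction is a product column in the design matrix, and it lets the fitted slope differ between the two groups:

```python
import numpy as np

# Hypothetical ad-spend data: return = 100 + 80*spend - 60*twitter + 40*spend*twitter,
# i.e. Facebook's line is 100 + 80*spend and Twitter's is 40 + 120*spend.
spend = np.array([1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0])
twitter = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])  # 0 = Facebook
ret = 100 + 80 * spend - 60 * twitter + 40 * spend * twitter

# The interaction is the product column: it lets the slope depend on the group.
X = np.column_stack([np.ones_like(spend), spend, twitter, spend * twitter])
b0, b1, b2, b3 = np.linalg.lstsq(X, ret, rcond=None)[0]

crossing = -b2 / b3  # spend level where the two fitted lines meet
print(crossing)      # 1.5: Facebook wins below it, Twitter wins above it
```

Whether this interaction is qualitative or quantitative for you depends on where your data sit relative to that crossing point: straddling it means the better channel flips (qualitative); all on one side means only the size of the gap changes (quantitative).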
[00:27:18] And just by adding one further variable, there are already lots of interesting things happening. If you loved this episode, then tell your colleagues about it, and stay tuned for next week’s episode, where we will talk further about linear regression and other things.
[00:27:41] It’s a hugely important topic. It’s one of the key tools that everybody should know about, and it’s this hammer in your toolkit that helps you with lots of different problems.