In this episode, we move from the logistic regression model to the proportional odds model, with emphasis on
interpretation and the checking of assumptions (visually and analytically). We also speak about the
opportunities and challenges of dealing with the dichotomization of ordinal or continuous variables.
Resources:
- McCullagh, Peter, and John A. Nelder. Generalized linear models. Routledge, 1983.
- Agresti, Alan. Categorical data analysis. John Wiley & Sons, 2003.
- Faraway, Julian J. Extending the linear model with R: generalized linear, mixed effects, and nonparametric regression models. Chapman and Hall/CRC, 2016. [https://julianfaraway.github.io/faraway/ELM/]
Transcript
Episode 010_Dichotomization and Proportional Odds Model
[00:00:00] Alexander: Welcome to another episode and we are continuing our discussions about logistic regressions, linear regressions, and some further thoughts about that. Paolo, great to have you again.
[00:00:37] Paolo: Thanks, Alexander. I’m very excited.
[00:00:40] Alexander: Always fun to talk about these things and that probably speaks to our nerdiness, that we love to talk about these things.
Last time we talked about logistic regression, and that works if the endpoint you’re looking into, your variable of interest, is a binary variable. Now, sometimes what people do, because they want to do this logistic regression, is take a continuous outcome and dichotomize it. Yeah. So let’s say you have a variable that is, I don’t know, income or something like this. Yeah. And then you have a distribution of your income and maybe you want to look into all those people that have income versus all those people that have no income. Yeah. So…
[00:01:33] Paolo: That’s really easy.
[00:01:34] Alexander: That’s very easy. Dichotomize at zero or bigger than zero. Yeah. Or maybe you wanna say below, let’s say, some kind of threshold that is set by society in terms of who’s called poor or so. Yeah.
[00:01:52] Paolo: Some kind of low income category. Yeah. Which is not so easy, but still feasible.
[00:02:00] Alexander: And so that is something that very often is done. Now, what are your thoughts from a statistical point of view about that Paolo?
[00:02:08] Paolo: Yeah, I think that in general it’s quite risky to dichotomize something which is not binary in its nature. And sometimes I did see really bad things with dichotomization. For example, you have a continuous covariate instead of a continuous outcome, and you wanna show that this covariate affects the outcome. And then if you analyze the covariate as continuous, you don’t get enough results. And then you look for a particular threshold. So it’s quite risky, and this is one situation. Then you have also other situations in which you are maybe constrained to do some classification of people, for example, in terms of political attitude. Or maybe it’s not left or right, maybe it’s about something like being very liberal, slightly liberal, moderate, slightly conservative, very conservative. So maybe you have more categories for the same concept.
[00:03:14] Alexander: And yeah, I think also you might lose certain relationships in there. Yeah. Maybe it’s not a straight line, but some kind of curve, things like that you might not really pick up if you dichotomize it. So before dichotomizing anything, I always encourage people to look into it in a non-dichotomized way and also explore it visually, look into scatter plots, things like that, to get a better feeling for the data and whether it really makes sense to dichotomize it. Yeah. And of course, with dichotomization, when someone else then wants a different threshold, you need to redo everything. And of course, lots of these thresholds are arbitrary, so it’s always a problem that maybe people don’t agree with it or whatever. And that can lead to all kinds of different problems.
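A minimal sketch of the point about curved relationships, assuming made-up data in which the outcome depends on the covariate through a U-shaped curve. The variable names and the threshold of 0 are illustrative only and do not come from the episode.

```python
from statistics import mean

# Hypothetical data: the outcome is high at both extremes of the covariate.
x = [i / 10 for i in range(-20, 21) if i != 0]  # -2.0 ... 2.0, skipping 0
y = [xi ** 2 for xi in x]

# Dichotomize the covariate at an arbitrary threshold of 0.
low = [yi for xi, yi in zip(x, y) if xi < 0]
high = [yi for xi, yi in zip(x, y) if xi > 0]

# Both groups have the same mean outcome, so the split suggests "no effect",
# although y clearly changes with x.
diff_after_split = mean(high) - mean(low)
spread_of_y = max(y) - min(y)
print(f"difference between groups: {diff_after_split:.3f}")
print(f"range of the outcome:      {spread_of_y:.3f}")
```

A scatter plot of y against x would show the curve immediately, which is why looking at the undichotomized data first is worthwhile.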
[00:04:15] Paolo: And also, this is not true only for dichotomizing continuous variables. You can have maybe different classifications for the same concept. Take the categories for BMI: you have overweight people, and then you have obese. But then maybe within the obese category, you could have also very obese people. And so sometimes you just use different classifications because it’s convenient for you, or sometimes you don’t see that your sample size is really small, for example, and you are trying to analyze six categories for 20 people. This doesn’t make any sense in general. So you should bear in mind that you lose something when you dichotomize a continuous variable or you cut the data in many categories that make no sense. Yeah.
[00:05:08] Alexander: Yeah. And maybe you can even run into these problems with quasi-complete separation and empty cells. Yeah. So if you think about it, you categorize something and then in your two-by-two table there are empty cells in the top right and bottom left corner, and then you have complete separation. And then,
[00:05:33] Paolo: you have bad messages from your computer.
[00:05:36] Alexander: Yeah.
[00:05:38] Paolo: You encounter separation, and yeah, it’s a basic concept, but maybe you can struggle a bit when you start doing this kind of logistic regression or other kinds of regression for classification variables.
[00:05:51] Alexander: Yeah. Then of course you have perfect prediction and
[00:05:55] Paolo: which is not perfect, maybe.
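A minimal sketch of why complete separation produces those messages, assuming a tiny made-up data set in which two cells of the two-by-two table are empty. The helper `log_likelihood` is hypothetical and written only for this illustration. The log-likelihood keeps increasing as the slope grows, so there is no finite maximum likelihood estimate for the fitting routine to converge to.

```python
import math

# Hypothetical, perfectly separated data: every x = 0 has y = 0 and every
# x = 1 has y = 1, so two cells of the two-by-two table are empty.
x = [0, 0, 0, 1, 1, 1]
y = [0, 0, 0, 1, 1, 1]

def log_likelihood(alpha, beta):
    """Log-likelihood of the logistic model logit P(y = 1) = alpha + beta * x."""
    total = 0.0
    for xi, yi in zip(x, y):
        eta = alpha + beta * xi
        # log P(y = 1) = -log(1 + exp(-eta)); log P(y = 0) = -log(1 + exp(eta))
        if yi == 1:
            total += -math.log1p(math.exp(-eta))
        else:
            total += -math.log1p(math.exp(eta))
    return total

# Walk along alpha = -beta / 2 and make the slope steeper and steeper.
slopes = [1, 2, 5, 10, 20]
values = [log_likelihood(-b / 2, b) for b in slopes]
for b, v in zip(slopes, values):
    print(f"beta = {b:>2}: log-likelihood = {v:.6f}")
```

The printed values climb toward zero without ever reaching it, which is the "perfect prediction" that is really a sign of too little data.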
[00:05:57] Alexander: Yeah. Which is actually in fact not perfect, because you just don’t have enough data, and that’s the actual problem behind it. Yeah. Dichotomization is one thing. Sometimes people also wanna divide it into more than just two categories. Yeah. Like you just said, maybe it’s underweight, normal weight, overweight, obese. Yeah. And then you have four different categories and these are actually ordered. So you can’t say this is green, blue, red, and black, or whatsoever, which is not ordered. There’s some natural ordering in it, and if you wanna retain that ordering, the problem is you can’t do the logistic regression because there you need a binary outcome, not something with four outcomes. Yeah. You can’t really do the linear regression because the difference between normal and overweight and obese might not be the same. Yeah. And so what do you do in these kinds of situations?
[00:07:03] Paolo: You could do the proportional odds model, which is basically, if you think about it, like applying logistic regression at each step of the change in your variable. So you have many categories. You have this step between one category and the following. Then you can model the logit of the probability of being in category two, for example, as compared to the probability of being in category one. And then you move forward and then you have the probability of being in category three versus being in one or two. So it’s just moving into the next step, and I think that is what is really tricky, because you assume that the change in each step is modeled by the same beta. So the probability of moving to the next category is the same as switching from that category to the one above again, and you end up with the same beta, but with different intercepts.
Yeah, different alphas. But in the end, of course, it’s something you are assuming here. It’s not for free, this assumption that everything is going proportionally. So this is not really, maybe, how things work in nature and in society.
[00:08:36] Alexander: Yeah. Just to get from there: imagine you have an outcome that is zero, one, or two. Yeah. And you would apply a logistic regression. Then you can say we model the probability of zero versus the probability of one and two, and have a logistic regression there. You get two coefficients, the alpha and the beta, for this first model. Or you do a logistic regression for zero and one versus having a two, and then you get another alpha and beta. So now you end up with two alphas and two betas. And what the proportional odds model is then doing is that it keeps the two alphas, but it forces the betas to be the same.
[00:09:28] Paolo: Yeah.
[00:09:29] Alexander: And that basically means that the odds ratios, yeah, are independent of which threshold you’re looking into, whether you cut between the zero and the one, or the one and the two.
[00:09:43] Paolo: And we have one odds ratio here.
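Written out for the three-category example, the idea can be sketched as follows. The notation is an assumption made for illustration: $Y \in \{0, 1, 2\}$ is the outcome, $x$ a single covariate, and the model is stated for the probability of being at or above a cut point. Some textbooks and software use the opposite direction or sign convention.

```latex
% Two separate logistic regressions, one per cut point:
\operatorname{logit} P(Y \ge 1 \mid x) = \alpha_1 + \beta_1 x, \qquad
\operatorname{logit} P(Y \ge 2 \mid x) = \alpha_2 + \beta_2 x .

% Proportional odds model: keep both intercepts, force a common slope
% \beta_1 = \beta_2 = \beta:
\operatorname{logit} P(Y \ge j \mid x) = \alpha_j + \beta x, \qquad j = 1, 2 .

% The odds ratio for a one-unit increase in x is then the same at every cut point:
\frac{\operatorname{odds}(Y \ge j \mid x + 1)}{\operatorname{odds}(Y \ge j \mid x)}
  = e^{\beta}, \qquad j = 1, 2 .
```

Because $P(Y \ge 1 \mid x) \ge P(Y \ge 2 \mid x)$ for every $x$, the intercepts in this parameterization satisfy $\alpha_1 \ge \alpha_2$.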
[00:09:46] Alexander: Yeah. You just have one odds ratio here. And the basic underlying thinking there is that you’re cutting this latent variable. Yeah. And here, with the obesity example, we know this latent variable. It’s just the weight in kilograms, but very often we can’t observe it. Yeah. And it’s more like the example that Paolo mentioned with this political attitude: very conservative, moderate conservative, moderate liberal, very liberal, or liberal and very liberal. Yeah. And there’s probably some kind of underlying latent variable, which you can’t really observe, and you can only observe these kinds of categories. That’s the thought behind it, and that makes it, yeah, actually a very nice and powerful tool for these ordered categorical variables.
[00:10:41] Paolo: Yeah.
[00:10:42] Alexander: So one of the problems with it is if you do not have a lot of observations in some of these categories. Yeah. But the nice thing is then you can combine them. Yeah. Let’s say you have very few that are very conservative. Yeah. Then you just combine them with conservatives. Yeah.
[00:11:04] Paolo: Yeah. I came across this situation very recently. We were doing this kind of model for items of a questionnaire. So you have a 0, 1, 2, 3. Okay. You have impact on quality of life from not at all to very much impact on the quality of life, but we were quite naive. I think that we didn’t know the data very well. And when we did our model, we found that after the treatment, many patients resolved the problem with the quality of life. So we ended up with many zeros and a very low number of observations in the other categories. So we ended up with just a simple logistic regression, looking at zero versus one combined with two and three. So excluding all the other possibilities. So yeah, it mainly depends on how much data you have for the categories.
[00:12:05] Alexander: Yep. So that’s why it’s important to look into your data beforehand. The other important thing is to check whether your proportionality assumption, that the betas are the same across all these cut points, is really true.
And of course you can very easily do that: for each of these cut points, run a logistic regression and see whether all your betas more or less fall into the same place, or whether there’s a strong trend in one or the other direction. And of course there are also some clever goodness-of-fit statistics, because these betas are actually normally distributed and you can do some kind of test of whether they’re all the same or not, like you can do with any normally distributed variables, just the standard test. But the problem is this p-value doesn’t give you a lot. Yeah. So I’m not a big fan of this kind of goodness-of-fit statistics. I really wanna see and get a feel for the data and whether it makes a big impact.
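A minimal sketch of the cut-point check described above, assuming made-up counts for a four-category outcome in two groups and no empty cells. With a single binary covariate, the slope of a logistic regression on the outcome dichotomized at a cut point equals the log odds ratio of the corresponding two-by-two table, so the sketch needs no model-fitting library. The function name `cut_point_log_odds_ratios` and the group labels are hypothetical.

```python
import math

# Hypothetical counts of an ordinal outcome (categories 0..3) in two groups.
counts = {
    "control": [40, 30, 20, 10],
    "treated": [20, 30, 30, 20],
}

def cut_point_log_odds_ratios(reference, comparison):
    """Log odds ratio of being at or above each cut point, comparison vs reference.

    Each entry is what a logistic regression of the dichotomized outcome on a
    binary group indicator would estimate as its slope at that cut point.
    """
    results = []
    for cut in range(1, len(reference)):
        a, b = sum(comparison[cut:]), sum(comparison[:cut])  # at or above / below
        c, d = sum(reference[cut:]), sum(reference[:cut])
        log_or = math.log((a * d) / (b * c))
        se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # large-sample standard error
        results.append((cut, log_or, se))
    return results

results = cut_point_log_odds_ratios(counts["control"], counts["treated"])
for cut, log_or, se in results:
    print(f"outcome >= {cut}: log OR = {log_or:.3f} (SE {se:.3f}), "
          f"OR = {math.exp(log_or):.2f}")
```

If the estimates sit close together relative to their standard errors, a single common beta looks reasonable. A clear trend across cut points would argue against the proportionality assumption.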
[00:13:18] Paolo: Yeah. Another way is to, for example, contrast your proportional odds model with a multinomial model, in which you don’t assume the proportionality of odds. So basically you contrast the results of these two different models and you can have an idea of the reasonability of your assumption, whether it’s really something that goes proportionally from one category to the others.
[00:13:52] Alexander: Yep. And by the way, these alphas are of course, always sorted. Yeah. Because just by how it’s defined, the probability of being greater than two is always bigger than the probability of being greater than three. Of course. Yeah. That’s maybe something. Okay. Any final point about this?
[00:14:16] Paolo: Yeah, I think that we covered the whole thing. I think that you can play with the data, try something else, and keep in mind that this is still a really powerful tool. Yep. That you should have in your toolkit.
[00:14:31] Alexander: Yep. And of course, again, it’s a linear model. You can again apply all these different fancy things in terms of prediction and estimation and things like that, that we talked about earlier. And yeah, very often the real variable of interest here is the odds ratio, and these odds ratios are independent of these cut points. So even if you have an ordinal variable of five or six or seven categories, you always have just one odds ratio, which makes it really nice.
[00:15:06] Paolo: Really nice. I agree.
[00:15:08] Alexander: Yeah. There’s maybe a side note: an alternative way to analyze these ordinal variables is with rank-based methods. But we’ll cover that in a future episode.
[00:15:20] Paolo: Yeah, I know you are really keen on that. So happy to cover this in another episode.
[00:15:27] Alexander: So stay tuned.