Dimension Reduction with PCA (Episode 21)



Welcome to the Effective Data Scientists podcast. The podcast is designed to help you improve your skills, stay focused, manage successful projects, and have fun at work. Be an effective data scientist now.


Welcome to a new episode of the Effective Data Scientist. Again, it’s Paolo and myself. Today we are talking about something that comes up very often, especially in big data sets. Paolo, what’s the topic for us today? It’s principal component analysis, with emphasis on


dimensionality reduction. Yeah. So, what is a typical problem that you might face? Can you give an example, Paolo? Yeah, for example, if you fit a model, whether in the classical statistical framework or in the statistical learning, machine learning framework, you can have lots and lots of


variables in your covariate set. So, for example, you have many features, and many of them can be correlated, or measured on different ranges, stuff like that. So this is mainly the problem of the curse of


dimensionality. So in general, if you fit interactions, main effects, stuff like that, then you really need lots and lots of data to have enough data points in each interval of the covariates, and this can cause trouble in


both fitting the model and interpreting it. So usually we need some tool to reduce this dimensionality, often also because many of these variables are highly correlated and belong to the same dimension, if you want to put it that way.


For example, if you have a survey with lots of different items, maybe 50 items that ask about all kinds of different things, it’s pretty impractical to always look at all 50. Usually, researchers have some underlying concepts that they want to look into.


And these different items of the survey all contribute to these different, let’s say, topics to different extents. Some might contribute a lot to a certain dimension, others less. Just as an example, I’ve worked a lot on ADHD.


And there was one questionnaire with 18 questions about the different symptoms of ADHD. And some were really asking about the hyperactivity part of it, and others were more about the inattention problems. And so it made a lot of sense to ask: can we, you know,


reduce the 18 items to maybe two dimensions, one speaking to hyperactivity and one to inattention? Right. And in terms of questionnaires, PCA has been used mainly for interpreting them, in the sense that you need to understand whether your


questionnaire responses reflect a unidimensional latent trait of the individuals, or whether they can be interpreted along different dimensions, like, I don’t know, activities of daily living or emotions, for example, two different dimensions, which are quite…


related to the underlying dimension of general health, but also have their own interpretation and specificities. And here this tool can be helpful, because in general what we need is to have more


independence among the covariates, right? Because we already know that multicollinearity is a big problem for fitting our models, and one of the advantages of using this tool is that it makes our covariate space different. So we have


an orthogonal projection into a lower-dimensional space in which the covariates are more independent, like the different dimensions of your questionnaire. And this can help a lot in the model fitting. It’s really about…


So you can use these kinds of matrix decomposition techniques, like, you know, eigendecomposition or singular value decomposition, to get independent dimensions. So you have a kind of simplified and more powerful set of information to predict your final outcome. Yeah.
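As a minimal sketch of the two decomposition routes just mentioned, the Python snippet below runs PCA once through the eigendecomposition of the covariance matrix and once through the singular value decomposition of the centred data. The simulated data, the seed, and all variable names are assumptions made up for illustration; nothing here comes from the episode or its show notes. Strictly speaking, what the procedure guarantees is that the new dimensions are uncorrelated, which is the same as independent only for Gaussian data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 observations of 4 covariates, the first two strongly correlated.
n = 200
z = rng.normal(size=(n, 4))
X = np.column_stack([z[:, 0], z[:, 0] + 0.3 * z[:, 1], z[:, 2], z[:, 3]])

# Centre each column; PCA works on deviations from the column means.
Xc = X - X.mean(axis=0)

# Route 1: eigendecomposition of the sample covariance matrix.
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns eigenvalues in ascending order
order = np.argsort(eigvals)[::-1]           # reorder so the largest variance comes first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Route 2: singular value decomposition of the centred data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_vars = s**2 / (n - 1)                   # squared singular values give the same variances

# Scores: the data expressed in the rotated coordinate system.
scores = Xc @ eigvecs
score_cov = np.cov(scores, rowvar=False)    # diagonal up to rounding: scores are uncorrelated

print("variances via eigendecomposition:", np.round(eigvals, 4))
print("variances via SVD:               ", np.round(svd_vars, 4))
```

Both routes give the same component variances; the SVD route is often preferred numerically because it never forms the covariance matrix explicitly.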


Yeah, exactly. If you think about it in, let’s say, the simplest possible way: you have only two variables, x and y, and you think about a scatterplot of these two variables.


And there’s this cloud of points that sits on the scatterplot. And this cloud has this typical, let’s say, Gaussian distribution. But the Gaussian distribution is shaped so that it’s oval. And it’s kind of


pointing to the, let’s say, upper right. So as you can obviously see, there’s some correlation: the bigger x becomes, the bigger y becomes, and vice versa. And now the principal components basically look into this correlated space and ask: how can we find two lines that…


Yeah, that capture the cloud’s variance in the best way. Yeah.
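To make the two-variable picture concrete, here is a small Python sketch. It simulates an oval Gaussian cloud that points to the upper right and finds the two orthogonal lines with PCA. The correlation of 0.8, the sample size, and the names are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical cloud: x and y are Gaussian with unit variance and correlation 0.8.
cov_true = np.array([[1.0, 0.8],
                     [0.8, 1.0]])
xy = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov_true, size=500)
xy_c = xy - xy.mean(axis=0)

# Eigenvectors of the sample covariance are the two principal directions.
eigvals, eigvecs = np.linalg.eigh(np.cov(xy_c, rowvar=False))
pc1 = eigvecs[:, -1]        # direction of largest variance (long axis of the oval)
pc2 = eigvecs[:, 0]         # direction of smallest variance (short axis)
if pc1[0] < 0:              # the sign of an eigenvector is arbitrary; fix it for readability
    pc1 = -pc1

angle_deg = np.degrees(np.arctan2(pc1[1], pc1[0]))
share_pc1 = eigvals[-1] / eigvals.sum()

print(f"first component points at about {angle_deg:.1f} degrees from the x axis")
print(f"it carries about {share_pc1:.0%} of the total variance")
```

With equal variances and positive correlation, the first direction runs close to the diagonal of the scatterplot, and the second is perpendicular to it.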


Yeah, basically you rotate the dimensions and you make the final variable space orthogonal. So you have this independence given by the procedure. Yeah. So in the end, you will…


try different decompositions, you can see how things work with different dimensions, and for each dimension you have some, you know, correlation with the original variables, so you can interpret these dimensions


by looking at the original variables. For example, weight is correlated with the underlying dimension. This means that this dimension speaks to how big the individual is overall. Yeah, exactly. Or going back to the ADHD example, you will then see that


one dimension really corresponds to and correlates with the items that are about attention, and the other one with those that are about hyperactivity. This makes it easier to interpret these new variables that you are creating. So what do you do then as a next step here?
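The interpretation step described above, correlating each new dimension with the original items, could look roughly like the sketch below. The six items and the two latent traits are simulated and entirely hypothetical. The traits are assumed to be uncorrelated and to differ in spread, which is why the split comes out so cleanly here; real questionnaire data rarely separate this neatly.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300

# Hypothetical latent traits (assumed uncorrelated, with different spread).
inattention = rng.normal(scale=1.5, size=n)
hyperactivity = rng.normal(scale=1.0, size=n)

# Six made-up items: the first three driven by inattention, the last three by hyperactivity.
noise = rng.normal(scale=0.5, size=(n, 6))
items = np.column_stack([inattention] * 3 + [hyperactivity] * 3) + noise

# PCA on the centred items, keeping the two components with the largest variance.
items_c = items - items.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(items_c, rowvar=False))
eigvecs = eigvecs[:, ::-1]                  # largest variance first
scores = items_c @ eigvecs[:, :2]

# Correlation of every item (rows) with each of the two components (columns).
full_corr = np.corrcoef(np.column_stack([items, scores]), rowvar=False)
item_pc_corr = full_corr[:6, 6:]

print(np.round(item_pc_corr, 2))
```

Reading the table row by row shows which items drive which component. The sign of a whole column is arbitrary, so only the size of the correlations matters.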


You fit this dimension reduction model with PCA, so in the end you can extract these final dimensions. So, for example, you start with one questionnaire, 15 items. So you may be tempted to use


the final score or the single items in your model, but you can instead use these two dimensions, for example. So you extract these sub-dimensions from your data and you can use them in your model, in your regression model, for example, to make things more efficient. Of course,


this is more about predicting things than interpreting things, but it can give you some advantages also when you try to interpret things. But it’s mainly used to solve issues in fitting models to big data, for example.
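A minimal sketch of this idea, often called principal component regression, is shown below in Python. The 15 items, the two latent traits, and the outcome are simulated under assumptions made up for this example. One caveat worth keeping in mind: the components are chosen without looking at the outcome, so the leading ones are not guaranteed to be the most predictive.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 300

# Hypothetical questionnaire: 15 items driven by two made-up latent traits.
trait_a = rng.normal(scale=1.5, size=n)
trait_b = rng.normal(scale=1.0, size=n)
items = np.column_stack([trait_a] * 8 + [trait_b] * 7) + rng.normal(scale=0.7, size=(n, 15))

# Made-up outcome that depends on the traits rather than on any single item.
y = 2.0 * trait_a - 1.0 * trait_b + rng.normal(scale=1.0, size=n)

# PCA on the centred items; keep the first two component scores.
items_c = items - items.mean(axis=0)
U, s, Vt = np.linalg.svd(items_c, full_matrices=False)
scores = items_c @ Vt[:2].T                 # shape (n, 2)

# Ordinary least squares on an intercept plus two scores instead of 15 items.
design = np.column_stack([np.ones(n), scores])
beta, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ beta
r2 = 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

# The two scores are uncorrelated by construction, so there is no collinearity between them.
score_corr = np.corrcoef(scores, rowvar=False)[0, 1]

print(f"coefficients: {np.round(beta, 3)}")
print(f"R^2: {r2:.3f}, correlation between the two scores: {score_corr:.1e}")
```

The regression has three coefficients instead of sixteen, and the covariates that enter it are uncorrelated with each other.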


Yeah, yeah, that is really helpful. So in the end, instead of putting your 15 variables into your regression model as covariates, you only put two or three into it. Basically, they are the outputs of your PCA. And with that, you, of course, throw away a little bit of the variability. But you also have


no, or at least less, collinearity. And that’s a big gain that makes things much more stable and easier to work with. But of course, as you said, it’s then really about prediction, not so much about interpretability. If it’s about interpretability, then you may think, OK, I look into just those


variables that really have a big influence on these different dimensions. And I put them together as new scores and just add them up, so to speak. Or maybe a weighted sum. And anything that has a really low weight, you know, you take out. So everything with a…


correlation to the new dimension that is really, really weak, you just take out. Yeah, in terms of interpretation, for example, you can say that your y is positively correlated with this dimension, like, I don’t know, humor or mood or other kinds of…


dimensions. But of course it doesn’t make any sense to say that you have a 10% increase in your y for a change in your latent dimension one, for example. It can give you some general insights, but you lose the clear interpretation of the coefficients that you get


using the original variables. Yeah, but you get a better sense of, okay, here humor, for example, plays an important role, or here mood plays an important role, or any other aspect, whatever you got out as key dimensions. Yeah, and if you think about where you started, with a lot of covariates.


So many of them are correlated. So in the end, if you fit the standard model, interpretability becomes a nightmare. So in terms of overall interpretation, maybe PCA is also good for you. Yeah, yeah, yeah. And it’s really good, if you want to do this, to start by looking into the correlations of all these


different items. What I really love there are these corrgrams. So it’s basically a matrix where you look into the correlation of each item with every other item, or each covariate with every other covariate, and within this matrix you basically


see, for example, pie charts that show you how strong the correlation is. And they are sometimes colored in red or blue depending on the direction of the correlation. And that gives you directly a nice way to understand it. And if you then even put some clustering on top of these corrgrams,


then you directly see, oh, here’s a bunch of variables that are highly correlated, there’s another bunch of variables that are highly correlated, and there’s another one. And that might be another in-between step that you can do to understand your correlations, and it already guides you in terms of the PCA. Yeah, and again,


I think that An Introduction to Statistical Learning, we will refer to this book many times, is a wonderful reference for starting with the theory but also practicing a lot with the R code. And of course, there is nothing, you know, for free. You need to experiment a bit,


plotting, trying different things with your PCA. And also the final number of dimensions, the final scores you will use in your final model, is not so easy to select. So it’s also guided a bit by


common sense and experience. Yeah. But it’s very useful. Yes, you can have a look at how much variability you explain, yeah? And sometimes you really see that if I add this, you know, if I step from three to four explanatory variables, yeah.


It doesn’t make a lot of sense. So the gain that you get in additional explained variability is not worth it. So that’s how you can look at this. Where are the biggest gains that you get? Is it from one to two? Is it from two to three? And where does it kind of plateau, so that it really doesn’t make a lot of sense to…


describe the variability further, because most of the variability is described by, you know, maybe just a few dimensions. Yeah, you have this simple bar plot showing how much variance you explain with one, two, three, or four dimensions, for example. And maybe you end up having…


75% of the variability explained by two dimensions, and that’s enough, because if you add the third dimension it is only a tiny increase in terms of variance explained, like two percent, so it’s not worth it. Yeah, yeah, yeah, keep things simple. Awesome. Yeah, so check out the show notes; there will be lots of stuff there that you can play around with.
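The "where does it plateau" check can be sketched in a few lines of Python. The simulated ten-item data set, the helper names `explained_variance_ratio` and `components_worth_keeping`, and the 5% cut-off are all assumptions for illustration; as discussed above, the final choice remains a judgment call.

```python
import numpy as np

def explained_variance_ratio(X):
    """Share of total variance carried by each principal component, largest first."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)     # singular values, in descending order
    variances = s**2 / (X.shape[0] - 1)
    return variances / variances.sum()

def components_worth_keeping(ratios, min_gain=0.05):
    """Count leading components that each add at least min_gain of the variance.

    min_gain is an arbitrary, hypothetical threshold, not a recommended value.
    """
    k = 0
    for r in ratios:
        if r < min_gain:
            break
        k += 1
    return k

rng = np.random.default_rng(4)
n = 300

# Hypothetical data: 10 items driven by two made-up latent traits plus noise.
trait_a = rng.normal(scale=1.5, size=n)
trait_b = rng.normal(scale=1.0, size=n)
items = np.column_stack([trait_a] * 5 + [trait_b] * 5) + rng.normal(scale=0.7, size=(n, 10))

ratios = explained_variance_ratio(items)
cumulative = np.cumsum(ratios)
k = components_worth_keeping(ratios)

for i, (r, c) in enumerate(zip(ratios, cumulative), start=1):
    print(f"component {i:2d}: adds {r:6.1%}, cumulative {c:6.1%}")
print("components kept with a 5% minimum gain:", k)
```

Plotting `ratios` as a bar chart gives the kind of plot described above; the printed table carries the same information.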


And it will be quite a nice way to dig into such data. It’s really nice to play with it, to visualize it, to understand it, to try it with different data sets. So sometimes surprises come up when you look into these types of data. And it’s


really kind of a neat way to understand big data. And yeah, big data especially in the sense that you have lots of different variables that you look at. Right. See you next week, or listen to us next week. Bye.