Cluster Analysis (Episode 22)
Welcome to the Effective Data Scientists podcast. The podcast is designed to help you improve your skills, stay focused, manage successful projects, and have fun at work. Be an effective data scientist now.
So, how are you doing, Paolo? I’m doing very well, Alexander. How about you? Very, very good. I always love these recordings, and it’s cool to talk about this pretty technical topic. It’s one of these topics that starts pretty simple, but as you go more into the details…
Yes, the devil is in the details, as usual. So let’s start with a simpler one. When you think about cluster analysis, what comes to mind for you? What are the typical things you look for? First of all, when I think about cluster analysis, I think about continuous data and how to group together different observations that may be similar according to different variables. And maybe this tool can help me interpret what’s going on in a different way when I look at different variables.
Yeah, for me it’s the same. I think about continuous data, many variables, usually larger sample sizes, and trying to see whether there are groups, or clusters as we call them, that are somehow similar.
And it’s really about trying to find patterns in the data. One of the important things is that this is a technique under the umbrella term of unsupervised learning. So it’s not that we look for the relationship to an outcome of interest, for example a response or some other variable that interests us. We just have a number of variables that describe our subjects, and we’re interested in understanding whether there are groups within these subjects that are somehow more similar. So let’s go into what “similar” actually means.
Yes, it’s about measuring distances between objects, responses, or individuals according to different variables, and finding a good metric for measuring this similarity. That could be, for example, the Euclidean distance or in terms of the… And of course there are many ways of doing clustering.
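To make the distance idea concrete, here is a minimal sketch of the Euclidean distance between every pair of subjects. The data values and the function name `euclidean_distance_matrix` are made up for illustration, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical data: 4 subjects measured on 2 continuous variables.
X = np.array([
    [1.0, 2.0],
    [1.5, 1.8],
    [8.0, 8.0],
    [9.0, 9.5],
])

def euclidean_distance_matrix(data):
    """Return the matrix of Euclidean distances between all pairs of rows."""
    diff = data[:, None, :] - data[None, :, :]   # shape (n, n, p)
    return np.sqrt((diff ** 2).sum(axis=2))

D = euclidean_distance_matrix(X)
print(np.round(D, 2))
```

Small entries in `D` mean two subjects are similar on these variables; subjects 0 and 1 are close to each other and far from subjects 2 and 3.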
For example, there is the k-means clustering approach, which is a method for finding clusters and cluster centers in a set of data without labels. You start with the desired number of clusters and you assign each unit to a cluster, minimizing the variance of the differences within a cluster. Then, of course, each new cluster will have a new center, and you restart once again, computing the shortest distance between each unit and this center. You end up with a new set of means, which can be defined as the new centers for the clusters, and so on. You apply this iterative procedure until you optimize everything and you have your desired clusters, well separated.
And of course, I think the tricky part is understanding that you will always find clusters if you start by doing so. If you start this k-means clustering algorithm, for example, there’s no way you get the answer from the algorithm that there’s no clustering. So you start with your number in mind and you end up with the best solution for the specified number of clusters. Yeah, yeah. You need to take that into account while you are performing these kinds of exploratory techniques.
Yeah, you basically start with… If you have 1,000 subjects, you start with 1,000 clusters. In the first step, you combine the two most similar objects into one cluster, so then you have 999 clusters. And then you do it again and again and again until you have the number of clusters you want, or until you have just one cluster where everybody is together.
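The merging procedure just described is usually called agglomerative hierarchical clustering. A minimal sketch follows. The episode does not say how the distance between two clusters is measured once they hold more than one subject, so average linkage is an assumption here, as are the function name `agglomerate` and the simulated data.

```python
import numpy as np

def agglomerate(data, n_clusters):
    """Merge the two closest clusters repeatedly until n_clusters remain.

    Assumption: the distance between two clusters is the average Euclidean
    distance between their members (average linkage).
    """
    clusters = [[i] for i in range(len(data))]   # every subject starts alone
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = dist[np.ix_(clusters[a], clusters[b])].mean()
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]   # combine the closest pair
        del clusters[b]                           # b > a, so index a stays valid
    return clusters

rng = np.random.default_rng(0)
# Hypothetical data: two well separated groups of 10 subjects each.
X = np.vstack([rng.normal(0.0, 0.3, size=(10, 2)),
               rng.normal(5.0, 0.3, size=(10, 2))])

groups = agglomerate(X, n_clusters=2)
print([sorted(g) for g in groups])
```

Asking for `n_clusters=1` runs the merging to the end, where everybody is in one cluster. This naive version is only meant for small examples, since it rescans all pairs of clusters at every step.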
And I think that directly shows a really important problem that we always have within statistics. You can go to both extremes. You can say, oh, we want as many clusters as possible. Then you have a lot of clusters, and within the clusters the objects are very, very similar, because you have very small clusters. But then any description that you want to do within a cluster will depend on only very few subjects. So you have very homogeneous clusters, but any description of them will be very imprecise. You have minimized your bias, but at the cost of a lot of imprecision.
Or, the other way, you say, I want to have fewer clusters. Then you allow for much more heterogeneity, or in other words bias, within the clusters, but at least now you have sufficient sample size to gain some precision. And that is, so to say, the art here: to get the right balance between the number of clusters and the size of the clusters. That’s also why I think you need a reasonable number of subjects. If you’re dealing with just a couple of dozen subjects, it probably gets a little bit difficult. So, yeah, that’s a little bit…
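The trade-off between homogeneous clusters and clusters that are large enough to describe can be made visible with a small experiment. The sketch below merges subjects step by step and records the partition at a few cluster counts. Average linkage, the simulated data, the chosen cluster counts, and the function names are all assumptions for illustration.

```python
import numpy as np

def partitions_by_merging(data, wanted):
    """Merge step by step and keep the partition for each cluster count in wanted."""
    clusters = [[i] for i in range(len(data))]
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    snapshots = {}
    while True:
        if len(clusters) in wanted:
            snapshots[len(clusters)] = [list(c) for c in clusters]
        if len(clusters) == 1:
            break
        # Average distance between every pair of current clusters.
        pairs = [(dist[np.ix_(clusters[a], clusters[b])].mean(), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return snapshots

def summarize(data, partition):
    """Pooled within-cluster sum of squares and the size of the smallest cluster."""
    within_ss = sum(((data[idx] - data[idx].mean(axis=0)) ** 2).sum()
                    for idx in partition)
    smallest = min(len(idx) for idx in partition)
    return within_ss, smallest

rng = np.random.default_rng(1)
X = rng.normal(size=(60, 2))   # hypothetical data: 60 subjects, 2 variables

snaps = partitions_by_merging(X, wanted={2, 6, 20})
for k in sorted(snaps):
    ss, smallest = summarize(X, snaps[k])
    print(f"{k:2d} clusters: within-cluster SS = {ss:7.2f}, smallest cluster = {smallest}")
```

With more clusters the within-cluster sum of squares goes down (more homogeneous), while the smallest cluster contains fewer subjects, so any description of it rests on less data.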
And of course, there are many ways. When we talk about Euclidean distances, for example, we are in the nearest-neighbor framework in terms of algorithms. Then you have k-means clustering.
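The k-means procedure described earlier in the episode (assign each unit to the nearest center, recompute the centers as cluster means, repeat) can be sketched as follows. Starting from randomly chosen observations is an assumption, as are the names `k_means` and `nearest_center`. The data is uniform noise with no real groups, to illustrate the point that the algorithm reports the requested number of clusters anyway.

```python
import numpy as np

def nearest_center(data, centers):
    """Index of the closest center (squared Euclidean distance) for every unit."""
    d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)

def k_means(data, k, n_iter=100, seed=0):
    """Alternate between assigning units and recomputing centers."""
    rng = np.random.default_rng(seed)
    # Assumption: start from k distinct, randomly chosen observations.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        labels = nearest_center(data, centers)
        # New center = mean of the units assigned to it (kept as is if empty).
        new_centers = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break   # centers stopped moving
        centers = new_centers
    return nearest_center(data, centers), centers

rng = np.random.default_rng(42)
noise = rng.uniform(0.0, 1.0, size=(300, 2))   # hypothetical data with no real groups

labels, centers = k_means(noise, k=3)
print(np.bincount(labels, minlength=3))   # cluster sizes
```

The result can depend on the starting centers, so in practice the procedure is often run from several starts and the solution with the smallest within-cluster variance is kept.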
And then you have, for example, Gaussian mixtures, in which you assume that each cluster has its own means, variances, and covariances, which define the distribution of the variables you have. So for each cluster, you have a set of mean and variance-covariance parameters relating to these multiple variables. And of course you can define different clusters with different algorithms, and with each algorithm you may achieve a specific goal.
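As a rough illustration of the Gaussian mixture idea, the sketch below fits a two-cluster mixture with the EM algorithm, estimating a mean vector and a full variance-covariance matrix per cluster. EM is the usual way to fit such models, but the episode does not name a fitting method, so treat it as an assumption, along with the simulated data, the starting values, and the function names.

```python
import numpy as np

def gaussian_pdf(data, mean, cov):
    """Multivariate normal density evaluated at every row of data."""
    p = data.shape[1]
    diff = data - mean
    inv = np.linalg.inv(cov)
    mahalanobis = np.einsum("ij,jk,ik->i", diff, inv, diff)
    norm = np.sqrt(((2.0 * np.pi) ** p) * np.linalg.det(cov))
    return np.exp(-0.5 * mahalanobis) / norm

def fit_mixture(data, start_means, n_iter=50):
    """EM for a Gaussian mixture with one full covariance matrix per cluster."""
    k, p = start_means.shape
    means = start_means.astype(float)
    # Assumption: start every cluster with the per-variable variances of the data.
    covs = np.array([np.diag(data.var(axis=0)) for _ in range(k)])
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: probability that each subject belongs to each cluster.
        dens = np.column_stack([weights[j] * gaussian_pdf(data, means[j], covs[j])
                                for j in range(k)])
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: update weights, means, and covariances from those probabilities.
        totals = resp.sum(axis=0)
        weights = totals / len(data)
        means = (resp.T @ data) / totals[:, None]
        for j in range(k):
            diff = data - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / totals[j]
            covs[j] += 1e-6 * np.eye(p)   # small ridge for numerical stability
    return weights, means, covs, resp    # resp comes from the last E-step

rng = np.random.default_rng(7)
# Hypothetical data: two groups of 150 subjects around (0, 0) and (5, 5).
X = np.vstack([rng.normal(0.0, 1.0, size=(150, 2)),
               rng.normal(5.0, 1.0, size=(150, 2))])
# Assumption: start from the observations with the smallest and largest first variable.
start = X[[X[:, 0].argmin(), X[:, 0].argmax()]]

weights, means, covs, resp = fit_mixture(X, start)
labels = resp.argmax(axis=1)
print(np.round(means, 2))
print(np.round(weights, 2))
```

Unlike k-means, each subject gets a probability of belonging to each cluster (`resp`), and the hard label is simply the most probable cluster.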
And the end product will be quite different. Of course, you are in the unsupervised learning framework, so you are still in the statistical learning approach, in which you need to optimize everything for the bias-variance trade-off. So you have all the algorithms for doing cross-validation, for example, measuring training error and validation error.
For me, the most interesting part comes at the end, when you have solved the algorithmic part. Finally you have the clusters, and you need to interpret them and label them. It’s a kind of post hoc exercise. Yeah, yeah, because your cluster algorithm just tells you: this is cluster A, this is cluster B, this is cluster C, this is cluster D.
But these are not really helpful labels. So now it’s interesting to look into the characteristics of these clusters. What I usually do is look at all the variables that I used to create the clusters and visualize them, so that I can see where the differences between the clusters are. What are the unique characteristics here? Let’s say I have five continuous variables that I used to build the clusters. I would do some kind of line chart where the different clusters are the different lines across these five variables.
And then I can see, oh, OK, this cluster is very high on this endpoint, this variable, and all the others are very low. Or maybe this cluster is consistently high on all endpoints, whereas all the others are much lower. And then you can label it: this is the high-value cluster, or whatever. Yeah, we can give them labels according to the variables that describe them. Something that is understandable to the people who work with the data: physicians, business people, whoever you’re talking to. Okay, that’s it.
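The numbers behind such a line chart are just the mean of each variable within each cluster. Here is a minimal sketch, assuming the cluster labels are already available; the variable names, the three simulated profiles, and the noise level are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
variables = ["v1", "v2", "v3", "v4", "v5"]   # hypothetical variable names
# Hypothetical, already standardized data with known cluster labels 0, 1, 2.
labels = np.repeat([0, 1, 2], 40)
true_profiles = np.array([
    [2.0, 2.0, 2.0, 2.0, 2.0],   # high on everything
    [0.0, 0.0, 3.0, 0.0, 0.0],   # high on v3 only
    [0.0, 0.0, 0.0, 0.0, 0.0],   # low on everything
])
X = true_profiles[labels] + rng.normal(0.0, 0.5, size=(120, 5))

# One row per cluster, one column per variable: the points of the line chart.
profiles = np.array([X[labels == c].mean(axis=0) for c in np.unique(labels)])
for c, row in enumerate(profiles):
    values = "  ".join(f"{name}={value:5.2f}" for name, value in zip(variables, row))
    print(f"cluster {c}: {values}")
```

Each row of `profiles` can be drawn as one line across the five variables. In this made-up example, cluster 0 would earn a label like "high on everything" and cluster 1 "high on v3 only".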
Yes, I think it’s quite tricky to label them after this exercise. And I have found it quite useful to reduce the number of clusters a bit, usually, because this makes it easier to communicate with the users of the analysis. But yeah, I think it’s a nice exercise. I do think it’s hard to find problems where it could be really, really useful. I don’t know if it’s the same for you.
I don’t know. I think there are many ways we can look into this. One of the common things I have seen is that you want to understand, for example, whether there are certain unidentified subgroups of patients. Coming from a medical background, that’s my number one area. You have, let’s say, a questionnaire with 10 different questions in it, and you apply it to a group of patients. Are there specific phenotypes, specific groups within the patients that are very similar and really form clusters? Patients that have a specific set of symptoms that are different from others?
So for example, I’ve worked in psychiatry quite a lot. If you think about depression, it’s a very, very heterogeneous population. You have certain patients that are very, very different from others. Maybe these are the patients that have a lot of physical symptoms, like pain and so on, and they describe their depression mostly through these physical symptoms. Then you have others that only speak about mood and things like that. Then you have patients that have very different coping strategies with depression. What do they do about it?
And then you can understand how these are different. You may see that there’s a correlation with gender, or with geography or culture. These kinds of things may have an impact on these phenotypes. And understanding these different phenotypes is quite useful for physicians, so they can think: if a patient comes with this unexplained pain, maybe that has something to do with depression, and has not a physical but much more a mental background. So these kinds of things help people quite a lot to understand it. Another thing is if you look into clients, for example.
If you want to understand certain subgroups within your clients, maybe there is a client set that values certain things quite a lot, and you then want to focus your marketing efforts on these clients. Maybe there are clients where the typical thing is quality versus cost, or speed, or other aspects of your service or your product. What’s most important? Some clients will value this quite a lot. I’ve seen that in the medical space as well, where the aim was to understand: you have these different products and they have different features. Which physicians will be most attracted by your features? What do they value? What do they look like? Where can you find them? What are the best channels to reach them? All these kinds of questions. So whenever you want to break a big heterogeneous group into something more manageable, where you can become more targeted because you understand a subgroup much better, that’s where I think these kinds of algorithms play a big role.
In some sense it’s like solving a non-linear, non-regular problem, in the sense that it’s… optimize everything. Usually you don’t have, for example, one big cost-benefit function for your market. Maybe you have subspaces, clusters, and you can target your efforts to each cluster, or maybe discard one cluster if you want, depending on the market, because it’s not working to address its needs, for example.
Yeah, yeah, this is kind of common. It’s a typical saying: if you talk to everybody, you talk to nobody. So if you want to address people specifically, understand the different subgroups in your audience. Your audience could be your target clients, or your patients, or whoever. If you can clearly identify subgroups, you can much better customise, specialise, and focus all your efforts on these, because you better understand where they’re coming from and what their needs are. And of course, you then need to know what they look like.
Here comes the next part. Once you have these clusters, you can look into all kinds of other variables which you haven’t used for defining your clusters, maybe things like age or sex or geography or whatever. You can say, okay, I’ve defined these clusters based on just a questionnaire, a survey, whatever. Now let’s look at what other variables might have an influence on that. And then, of course, you step into supervised learning, where your clusters are basically your outcome variable and you want to understand which variables are predictors for those clusters. So, for example, going back to the depression example: those depressed patients that mostly express physical symptoms, are these more often male patients? Are these younger or older patients?
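One simple way to treat cluster membership as the outcome is a logistic regression of "belongs to the cluster of interest" on variables that were not used for clustering. The episode does not name a specific model, so logistic regression is an assumption here, and the data below is purely simulated: the variable names and effect sizes are invented and say nothing about real patients.

```python
import numpy as np

rng = np.random.default_rng(11)
n = 400
# Simulated external variables that were NOT used to build the clusters.
is_male = rng.integers(0, 2, size=n).astype(float)
age = rng.normal(45.0, 12.0, size=n)
# Simulated membership in one cluster of interest (1 = in the cluster).
true_logit = -1.0 + 1.5 * is_male          # invented effect, for illustration only
in_cluster = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

# Design matrix: intercept, sex, standardized age.
Z = np.column_stack([np.ones(n), is_male, (age - age.mean()) / age.std()])

coef = np.zeros(Z.shape[1])
for _ in range(5000):
    p = 1.0 / (1.0 + np.exp(-Z @ coef))
    coef += 0.1 * Z.T @ (in_cluster - p) / n   # gradient ascent on the log-likelihood

for name, value in zip(["intercept", "is_male", "age_std"], coef):
    print(f"{name:>9}: {value:+.2f}")
```

A clearly positive coefficient for a variable suggests that subjects with higher values are more likely to fall into that cluster; in this simulation only `is_male` was given an effect.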
Are these patients that have a different socioeconomic background? Are these patients that come from different cultures? Do they have a different understanding of themselves? All these kinds of things. You can look into many other variables that then signal for you: this is a middle-aged man, that might be a candidate for this type of depression. I’m just making that up. I’m not a psychiatrist, so I can’t directly speak to this. It’s just an example.
Yeah, and I’m wondering, in terms of algorithms and measures: so far we have always been speaking about something that is useful for continuous variables, right? Because we have spoken about Euclidean distances, and in mixture-based clustering we are defining means and variances, which are quite useful for continuous variables, but maybe not for other kinds of variables. So what can we do if we also have non-continuous variables?
…or that kind of categorical variable, then using ranks is quite helpful. You could potentially just use the ranks for all your variables, and by doing this you have basically harmonized everything. Ranking, of course, only within each variable, not across the variables. That way you have harmonized the variability for all the different variables, and you now look more into whether a value is bigger or smaller, not into how much bigger it is. That, of course, also helps with a couple of other problems that you might have.
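Ranking within each variable can be sketched as follows. Giving tied values their average rank is an assumption (the episode does not say how ties are handled), and the data and the function name `average_ranks` are invented for illustration.

```python
import numpy as np

def average_ranks(column):
    """Ranks 1..n within one variable; tied values share their average rank."""
    _, inverse, counts = np.unique(column, return_inverse=True, return_counts=True)
    last = np.cumsum(counts)      # highest rank position in each tie group
    first = last - counts + 1     # lowest rank position in each tie group
    return ((first + last) / 2.0)[inverse]

# Hypothetical data: a variable with an extreme value and a 1-5 rating.
X = np.array([
    [20.0, 1.0],
    [25.0, 3.0],
    [30.0, 3.0],
    [900.0, 5.0],
])

# Rank each column separately, never across columns.
ranked = np.column_stack([average_ranks(X[:, j]) for j in range(X.shape[1])])
print(ranked)
```

After ranking, the extreme value 900 is simply "the largest" (rank 4), so it no longer dominates the distances, and both variables live on the same 1 to n scale.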
Even if you have only continuous endpoints, you might have very, very different variability. Just to explain that in a very simple way, imagine you have just two continuous variables. If you think about clusters, you basically think about a scatterplot of these two variables, and you ask where the clusters of patients really are: where do the dots in your scatterplot come together? Now imagine your scatterplot is just one centimeter high and very, very wide. Then pretty much everything will be determined by the differences on the x-axis and not the y-axis. So if you want to identify similarity, you need to somehow standardize your variables, so that you have very similar variability across the different variables.
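The flat, wide scatterplot can be reproduced with four made-up points. The sketch below compares Euclidean distances before and after standardizing each variable to mean 0 and standard deviation 1 (z-scores). Z-scoring is one common choice of standardization and an assumption here; the data values are invented.

```python
import numpy as np

# Hypothetical data: x spans hundreds of units, y only fractions of a unit.
X = np.array([
    [100.0, 0.1],
    [110.0, 0.9],
    [300.0, 0.1],
    [310.0, 0.9],
])

def distance(a, b):
    """Euclidean distance between two subjects."""
    return float(np.sqrt(((a - b) ** 2).sum()))

# Raw scale: the x-axis decides almost everything.
raw_01, raw_02 = distance(X[0], X[1]), distance(X[0], X[2])

# z-scores: subtract each variable's mean and divide by its standard deviation.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
std_01, std_02 = distance(Z[0], Z[1]), distance(Z[0], Z[2])

print(f"raw:          d(0,1) = {raw_01:.2f}, d(0,2) = {raw_02:.2f}")
print(f"standardized: d(0,1) = {std_01:.2f}, d(0,2) = {std_02:.2f}")
```

On the raw scale, subject 0 looks about twenty times closer to subject 1 than to subject 2, although subjects 0 and 1 sit at opposite ends of y. After standardizing, the two distances are roughly equal, so both variables contribute to the similarity.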
Of course, that also depends a little bit on which metric you use to define similarity. But at least make sure that there’s not a huge difference in variability across the different characteristics you’re basing your cluster analysis on.
The next topic is categorical variables, so basically 0/1 variables, or A, B, C, or blue, green, and yellow. You can use dummy variables here, with the typical so-called reference coding to define them. Let’s say you have the outcomes blue, yellow, and green. Then instead of one variable that says blue, yellow, or green, you have two variables that say it’s blue or not and it’s green or not, and if both are “not”, then of course it’s yellow. That way you have two variables instead of one, and you can use your algorithm here again.
Another interesting thing is weighting variables. As you have seen with the scatterplot example, variables that have a big variance get a lot of weight. Now you can also think about whether you want to weight your variables. Do you want to give certain variables more variability? That can happen, for example, if you have a questionnaire with three questions that are more or less about the same thing, and for something else you have just one question. Maybe then you want to down-weight these three questions, because they all cover the same concept. If you have three answers for one concept and just one for the other, the first basically gets three times the weight, even though the three maybe measure more or less the same thing. So have a look into these kinds of things as well. Are there certain features that you want to give more weight?
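Two of the preparation steps discussed above, reference-coded dummy variables and variable weights, can be sketched in a few lines. The example values are invented. Applying a weight by multiplying a column by the square root of that weight is one possible convention for squared Euclidean distances, and is an assumption rather than something stated in the episode.

```python
import numpy as np

# Reference coding: yellow is the reference level, so it is coded (0, 0).
colours = ["blue", "yellow", "green", "blue"]   # hypothetical data
levels = ["blue", "green"]                      # the non-reference levels
dummies = np.array([[1.0 if c == level else 0.0 for level in levels]
                    for c in colours])
print(dummies)

# Weighting: questions q1-q3 cover one concept, q4 covers another
# (hypothetical, already standardized answers of two subjects).
answers = np.array([
    [1.0, 1.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
weights = np.array([1 / 3, 1 / 3, 1 / 3, 1.0])   # each concept gets total weight 1

# Multiplying a column by sqrt(w) multiplies its share of the
# squared Euclidean distance by w.
weighted = answers * np.sqrt(weights)

def squared_distance(a, b):
    return float(((a - b) ** 2).sum())

unweighted_d2 = squared_distance(answers[0], answers[1])
weighted_d2 = squared_distance(weighted[0], weighted[1])
print(unweighted_d2, round(weighted_d2, 6))
```

Without weights, the three related questions contribute three quarters of the squared distance between the two subjects. With the weights, each concept contributes one half.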
That all feeds into how you define your clusters. So that was quite a lot of discussion about clusters, and I think it’s more of an intro episode. There will be more coming about clusters, so look out for that, because it’s a pretty nice and interesting area of unsupervised learning. That will be exciting.
Paolo, what’s your key takeaway for all our listeners? I think it’s important to start from the basics, to build on a basic understanding of your data, and to build these kinds of models with a clear purpose in mind. Then you can explore many, many ways to do it. As we will see in the future, there are a lot of algorithms, but it’s important to start with your why and to be curious about the data.
Yeah, yeah, completely agree. When I first did it, I tried lots and lots of different things, just because the first approach didn’t make any sense. So don’t be discouraged if you try these kinds of things and they don’t directly give you any reasonable results. As we said at the beginning, the devil is in the details. Thanks so much, Paolo. Thanks. Bye.