We have outstanding news for you!
Thomas Neitmann joins The Effective Data Scientist as a co-host of the show! Thomas is an Associate Director of Data Science at Denali Therapeutics. He contributed significantly to the adoption of the R software in the pharmaceutical industry – especially during his time at Roche.
In this episode, Paolo and Thomas dive into the fundamental principles for a well-structured data science project. These include practical advice on:
- organizing files into folders,
- documenting and commenting code,
- using version control systems, and much more.
Although the episode focuses on applying these fundamental principles in R projects, you can apply the same principles to any data science project, regardless of the language used.
Advanced R 2nd Ed. (http://adv-r.hadley.nz)
R for Data Science 2nd Ed. (http://r4ds.hadley.nz)
Episode011_How to Effectively Structure Data Science Projects in R
[00:00:00] Paolo: Hi, Thomas. How are you doing? So I want to introduce you quickly. We are here today with Thomas Neitmann, who is now a co-host of The Effective Data Scientist podcast. I’m really happy to record a bunch of episodes with him, and it’s nice to talk with him because he’s really experienced in data science and programming in general, and mostly in R programming. But Thomas, could you please introduce yourself for our audience?
[00:00:37] Thomas: Sure. Thank you so much, Paolo. Yeah, first of all, I’m very delighted to be now a co-host of this podcast. I’ve been following the work that you and also Alexander have been doing for a while, so it’s exciting to be now on this side of it, so to speak.
So my journey in the industry started around four years ago when I joined the pharmaceutical industry, working first of all at a CRO, a contract research organization, so providing services for large pharmaceutical companies. And then after a short period there, I had the pleasure of joining Roche in Basel, Switzerland, where I started off as what used to be called a statistical programmer.
Nowadays you call it data scientist, with the general flow of things in the industry at large. And I spent around three and a half years there, and just recently I transitioned to a new role with a biotech company called Denali Therapeutics, where I’m an associate director for data science. And speaking of R, as you mentioned, it’s somewhat of my passion and something I really enjoy doing, and in pharma it didn’t use to be so popular if I think a couple of years back. But it definitely has gained a lot of traction, and some of the work that we did at Roche really paved the way for that. And I’m really excited also that in my new role R will be a big focus. So we have a lot to talk about on that front.
[00:01:53] Paolo: Yeah. Thanks for this introduction, Thomas. And it’s nice to speak with you. You’re so young, but you have already achieved a lot, and of course a lot of the popularity of R in pharma is probably also due to the work you did with your colleagues at Roche. It was amazing, all the webinars. So really great stuff.
So let’s start now with the first episode with you. Something I really want to talk about is how to effectively structure data science projects in R, and probably not only in R, because this could be a general episode. Everyone interested in programming will find great content for sure. So why is it important?
[00:02:46] Thomas: Yeah, I think there’s a couple of points to that, but probably one of the largest ones is this theme of reproducible research. I think that already was a topic in previous episodes of The Effective Data Scientist. And really, if you think about it, if you run any kind of data through a data processing and analysis pipeline, at the end of the day you want to make sure that if you hand over your code to someone else, they can not only scrutinize it, but actually run it and come to the same results. If you don’t manage to do that, then in a way you failed at your job as a data scientist, I would say.
And structuring your project is one part of that. There are certainly lots of other things to that, but in the first place, if you don’t have a well-structured project that someone else, or also you in the future, can look at and understand what happened there, you already failed on that front. So in that sense, it also becomes a way for you to keep your sanity while working on your project, because when you go in, you see something that is well-structured. You maybe have set up a certain folder structure: you have your data over there, your code over there, maybe some custom functions in another bucket. And if you use that consistently across projects as well, regardless of what you’re working on, you always have that consistency and can easily come to the point where you understand things. And also very importantly, most of the time people don’t just work alone; they have collaborators, and it really facilitates working in a team if you have a structure that everyone can rely on and that is easily understandable. So these are just a couple of points why I think it is really important to structure your data science projects well.
[00:04:19] Paolo: This is really connected to reproducibility and probably also efficiency in how you program, how you create and build stuff for your company, and we have a responsibility. This is really important both in industry and in academia, because many times in the past I asked colleagues from other universities or research centers for code or data, and they weren’t able to find the code or data anymore because, I dunno, the statistician left, or for any other reason. And it’s a matter of integrity in what we do, and also of the value we create for our companies, because someday we may leave the company and people should still be able to build on our previous work as programmers. So it’s really important. Are there some principles, or is it just important to have a kind of structure?
[00:05:27] Thomas: Yeah, maybe. First of all, I think what you mentioned, when you cannot even find the code or the data that led to a certain analysis, that’s obviously the worst-case scenario. And hopefully it’s not that bad most of the time. But I can tell you also from my experience in academia, even if it’s somewhere there, that doesn’t mean that someone else can just readily pick it up, because oftentimes it’s all over the place. There is some folder where some code lives, then you have your data somewhere else, and then maybe people hard-code certain file paths starting with their C drive and then their username on the Windows machine. So even if you have the code, it just doesn’t work if you put it on your computer. So there are a lot of intricacies of what can go wrong there.
But if we talk about overarching principles, and maybe if we just stick to R for the time being, but later on I think we can certainly generalize that. I think something that is invaluable is actually to start off by using an R project. So what does that even mean? If you use an IDE such as RStudio and you open it, by default you’re in your user home directory, and you can certainly create scripts there, save them somewhere on your computer, and have your data somewhere else.
But this principle of a project is that you say, no, there is a dedicated folder for this particular analysis that I’m performing. And inside that project, you can have a bunch of subfolders where you, for example, store your data in one folder. Then you have your main analysis scripts in a second folder and maybe your functions in a third folder. And that way everything that is part of the analysis is encapsulated within that single project. So if I compress that project, which is basically just a folder on your computer, and send it over to you, you can unzip it and then you should be able to run the analysis, because everything that you need should be self-contained within there.
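To illustrate the point about hard-coded paths, here is a minimal sketch in Python, one of the languages the episode says these principles carry over to. The folder name `my_analysis`, the file name `measurements.csv`, and the helper `project_path` are hypothetical and not from the episode; the idea is only that every path is built relative to the project folder rather than to one person’s C drive.

```python
from pathlib import Path


def project_path(project_root: Path, *parts: str) -> Path:
    """Build a path inside the project folder from relative parts."""
    return project_root.joinpath(*parts)


# Hypothetical layout: <project>/data/measurements.csv
# In a script stored at <project>/scripts/analysis.py, the root could be
# derived with Path(__file__).resolve().parents[1] instead of being typed out.
project_root = Path("my_analysis")  # illustrative folder name
data_file = project_path(project_root, "data", "measurements.csv")
print(data_file.as_posix())
```

Because nothing in the path refers to a user name or drive letter, the same script should keep working after the project folder is zipped and unpacked on another machine.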
So, first of all, put everything in that project. And then the second point I already alluded to is this kind of organization: try not to have everything at the top level, mixing your R scripts with your data, with whatever else you need, with documentation. Instead, try to create a structure that makes sense. And certainly it helps to then also give very meaningful names to whatever you have there, so not only the folders, but also individual analysis scripts or R functions, such that even if people just quickly scan through that, they get a feeling for what this is actually all about. And maybe to continue on that: oftentimes when you start off a project, you just write a bunch of code and analysis. You obviously understand it at that point because it’s something you did, but if you come back a couple of months later, or even worse, if someone else looks at it and you didn’t document the way that you’ve done things, that’s really detrimental.
So try to also either document your code or use a literate programming tool like R Markdown, or a Jupyter Notebook in Python, where you intermingle a description of the things that you’re doing with the actual code that does it, such that documentation and code actually become one unit and are not separate from each other. And maybe something related to R in particular, again: try to set up all your scripts in a way such that they do not rely on anything implicit. For example, you saved your workspace at the end of the previous session, and then you assume that some data frame that you need is already in the global environment.
Because as soon as I send it over to you, Paolo, and you run it, you don’t have that in your global environment. So again, it will not work. So you need to make sure that if someone starts from a blank state and runs a certain script, it actually runs through from the top to the bottom and produces the results. So these are just some principles. I would say there’s probably much more to mention, but I certainly think these are important ones to consider.
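The blank-state rule can be checked mechanically. The sketch below is a Python analogue, not the R workflow discussed in the episode: it uses the standard library’s `runpy.run_path` to execute a script in a fresh namespace, so anything the script silently expects to exist already will cause a failure. The two tiny scripts and the helper `runs_from_clean_state` are made up for illustration.

```python
import runpy
import tempfile
from pathlib import Path


def runs_from_clean_state(script: Path) -> bool:
    """Run a script in a fresh namespace and report whether it finished without error."""
    try:
        runpy.run_path(str(script), run_name="__main__")
    except Exception:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    # This script assumes that `measurements` already exists in memory.
    implicit = Path(tmp, "implicit.py")
    implicit.write_text("total = sum(measurements)\n")

    # This script creates everything it needs itself.
    explicit = Path(tmp, "explicit.py")
    explicit.write_text("measurements = [1.0, 2.5, 4.0]\ntotal = sum(measurements)\n")

    implicit_ok = runs_from_clean_state(implicit)
    explicit_ok = runs_from_clean_state(explicit)

print(implicit_ok, explicit_ok)
```

The first script fails because `measurements` was never defined inside it, which is the same situation as an R script that relies on a data frame left over in a saved workspace.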
[00:09:14] Paolo: And what about, for example, meaningful names? Because sometimes when I program, I’d like to see clean code, and I’m trying to save space, for example, by shortening names. I try to find abbreviations for all the objects that I have. Maybe it’s difficult to find a compromise between having clean code and having meaningful names. So maybe covariates instead of covs or x, for example, is better. I dunno. Do you have some suggestions, or do you think it’s not an issue?
[00:09:53] Thomas: Oh, I think it’s definitely a great issue. There is this famous quote by a computer scientist, something along the lines of: there are only two hard things in computer science, and one of them is giving meaningful names to stuff. So I think that just goes to show that this is indeed a problem. And you mentioned the infamous x, right? What the hell is x? Could be anything depending on the context. So covariates is then of course something better, I would say, because it’s descriptive. If whatever you do is intended for a data science or statistician audience, then probably covs is a good abbreviation.
But maybe not c, just c. So there is certainly a balance you have to strike, because if you give very descriptive names, then you could potentially end up with something very long, with lots of underscores in between. And probably that’s too much. But then x, y, z, i, j, these kinds of things are too little. Or if you work on a Unix terminal, you have commands like pwd. What does pwd mean? If you know that it’s an abbreviation for print working directory, it makes sense. But if you use it for the first time, it’s like: where does this come from? What is pwd? Keep your audience in mind and be mindful that abbreviations can work, but they can also make it very hard for other people to understand. So I would say, when in doubt, err on the side of giving somewhat longer and quite explicit names.
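A small Python sketch of the naming trade-off, using invented data and function names (none of them come from the episode). Both functions compute the same thing; only the names differ.

```python
from statistics import mean


# Hard to read: what are f, x and c?
def f(x, c):
    return {k: mean(v[k] for v in x) for k in c}


# Same logic with descriptive names. For a statistical audience,
# "covs" might be an acceptable shorthand; a bare "c" probably is not.
def mean_by_covariate(patients, covariates):
    """Average each covariate across all patient records."""
    return {
        covariate: mean(patient[covariate] for patient in patients)
        for covariate in covariates
    }


# Made-up records, purely for illustration.
patients = [{"age": 40, "weight_kg": 70}, {"age": 60, "weight_kg": 80}]
print(mean_by_covariate(patients, ["age", "weight_kg"]))
```

The second version is longer, but a reader can tell what goes in and what comes out without opening the function body.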
[00:11:14] Paolo: Another topic is how to organize your functions. This could be R, could be Python: how many functions, and what scope for each function? So should we write many functions with a small scope, or fewer functions with a broader scope? I don’t know. It’s something which is also really challenging for me sometimes. And also how to organize functions in scripts, like one script per function or many functions within a script.
[00:11:47] Thomas: Yeah, again, a very good question. So I think, first of all, the important thing is that you should write functions. And then the second question is about the scope. I can certainly remember when I worked in academia and people were using languages like R or MATLAB or others, generally they had one very large script. So we’re talking several thousand lines, which does everything from data ingestion, processing, and analyses to creating some tables and graphs, everything in one large block, nothing modularized at all. And I think that’s certainly something you want to avoid at all costs because, first of all, that makes it very hard to understand what is going on there.
And also it’s hard to maintain in a way. So if there’s some block of logic, try to encapsulate that, maybe first of all just in a dedicated script. So maybe have one data ingestion script, one data processing script, and then one data analysis script. But then going further, also saying, okay, there’s some stuff here that fits together, and especially if it’s something that you need to do more than once inside your script, do put it in a function.
So definitely feel empowered to write your own functions. Put them in a dedicated folder and then, at the top of your script, source them in such that they are available. Then on the question of how broad the scope of your functions should be, I think it’s a tricky one to answer in general. As I said, first of all, write functions. That’s the important point. And then I think there are certainly people who argue that the functions you write should be narrowly focused, because then you can also test them, which is important at some point, and I think we’ll talk about R packages in another episode. But especially if you are getting started with this, just try to write one, two, three, four functions, even if they have a fairly large scope at that point. I think that’s a good starting point. And then if you see that a function actually does two things, it’s probably time to think, hey, let’s separate it into two functions, because I think that’s where you want to end up at the end of the day: a function has a particular purpose. And if you feel like, for example, it does both reading in your data and then processing it, then maybe it’s time to say, hey, actually that’s two separate pieces of logic.
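The read-versus-process example can be sketched in Python as two narrowly scoped functions. The function names, the CSV columns, and the rule "drop rows with an empty value" are assumptions chosen for illustration; `io.StringIO` stands in for an open file so the sketch runs on its own.

```python
import csv
import io
from typing import TextIO


def read_measurements(source: TextIO) -> list[dict[str, str]]:
    """Only read: return the rows of a CSV source unchanged."""
    return list(csv.DictReader(source))


def keep_complete_rows(rows: list[dict[str, str]]) -> list[dict[str, str]]:
    """Only process: drop rows that contain an empty value."""
    return [row for row in rows if all(value != "" for value in row.values())]


# Made-up data; subject B has a missing weight.
raw = io.StringIO("subject,weight_kg\nA,70\nB,\nC,82\n")
rows = read_measurements(raw)
complete = keep_complete_rows(rows)
print(len(rows), len(complete))
```

Keeping the two steps apart means the processing rule can be tested with a hand-written list of rows, without touching any file at all.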
[00:14:00] Paolo: Yeah. And what about managing your projects with workflow tools? Do you have any experience or suggestions about that?
[00:14:09] Thomas: Yeah, that’s an interesting one. And the larger your project gets, I think the more important that becomes. So if you have a small analysis where you have one or two data sets and one script, which in the end generates a couple of plots, I don’t think you need a workflow tool. But I can remember when I interviewed some people at Roche who had a bioinformatics background, and they were presenting on the kinds of analyses they did. Those are quite elaborate. There are a lot of steps in between their different scripts, and you somehow want to make sure that you organize your workflow such that you can easily run all things. And also, what a lot of these workflow tools offer is that even if you tell them to run everything, they will actually only run the parts of the analysis that changed since the last time you ran the analysis. And that can be very important if you work with large data. So imagine you have to read in 10 gigabytes of raw data, then you process it and save it, and then do some analysis.
So then if you change your analysis and you run everything again, you would again have to read the 10 gigabytes of raw data, even though they didn’t change. But because only the analysis script changed, these workflow tools are often smart enough to figure out, hey, the raw data didn’t actually change, so we can just take the processed data from there and just run the analysis again. So that can also save a huge amount of time in the development process, because you’re not always forced to run everything from A to Z. Instead, you can maybe start somewhere in the middle of the alphabet and then run it through to the end.
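The "only rerun what changed" idea can be sketched in a few lines of Python. This is a toy, not how any particular workflow tool is implemented: it records a SHA-256 digest of each input file and skips a step when the digests match the last run. The names `run_if_changed`, `state.json`, and the step name are hypothetical, and real workflow tools track much more, such as the code itself and intermediate outputs.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Callable


def file_digest(path: Path) -> str:
    """Fingerprint a file by the SHA-256 hash of its contents."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def run_if_changed(
    step_name: str, inputs: list[Path], action: Callable[[], None], state_file: Path
) -> bool:
    """Run `action` only if an input file changed since the last recorded run."""
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    digests = {str(path): file_digest(path) for path in inputs}
    if state.get(step_name) == digests:
        return False  # inputs unchanged, so the step is skipped
    action()
    state[step_name] = digests
    state_file.write_text(json.dumps(state))
    return True


with tempfile.TemporaryDirectory() as tmp:
    raw_file = Path(tmp, "raw.csv")
    state_file = Path(tmp, "state.json")
    raw_file.write_text("subject,weight_kg\nA,70\n")
    runs = []

    def process_raw_data() -> None:
        runs.append("processed")  # stand-in for an expensive processing step

    first = run_if_changed("process", [raw_file], process_raw_data, state_file)
    second = run_if_changed("process", [raw_file], process_raw_data, state_file)
    raw_file.write_text("subject,weight_kg\nA,70\nB,81\n")  # the raw data changes
    third = run_if_changed("process", [raw_file], process_raw_data, state_file)

print(first, second, third)
```

The second call is skipped because the raw file is untouched, and the third runs again because its contents changed, which mirrors the 10-gigabyte example on a very small scale.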
[00:15:37] Paolo: I guess that also using a version control system is pretty important.
[00:15:42] Thomas: Yeah, definitely. So the most popular one being Git, with GitHub or GitLab, these kinds of hosting services. But regardless of what you use, I think it’s always good, if you come to a state where you accomplished something new, to save it in a way. Make a commit with a message saying, now I achieved X, Y, Z. And at that point you know it works, and then if you try to change something and it actually doesn’t work, you can always go back. I think that’s really important. And especially if you have a project where you have repetitive deliverables. So let’s say you are working in pharma and you are doing these annual safety reports, and whenever you do them, you have a milestone in your repository. You can always go back to the exact state of the analysis at that point in time and see what happened there. So overall, version control I think is really important.
And also, again, when you start working with collaborators, having something like GitLab or GitHub, a Git hosting service which is really built around the concept of, hey, there are multiple people working on this piece of code, makes it so much easier. Because what do you do otherwise? You send around scripts in emails, and then you get final_script_1, final_script_1_2. That’s not a place where you wanna be.
[00:16:53] Paolo: 1, 2, 3, underscore.
[00:16:56] Thomas: Yeah. And this is really final whatever.
[00:16:59] Paolo: Really final. And do you have a default structure in mind when you work on a project? Of course you may want to adapt the structure, but maybe by default you already have a good project structure in mind.
[00:17:17] Thomas: Yeah, I think when you start Googling that, you will find different flavors of it. But in general, something I would propose is this. First of all, you should have something like a README file where you just say what this is all about. Because again, think about future you and also your collaborators, maybe even ones that are not within your company or academic institution just yet, but who see this code two, three years down the road. They need something to enter into the project and understand what is going on there.
So first of all, have this kind of README file which documents things. Then in general, you have some kind of raw, unprocessed data, which, if it’s in files, I would put in a separate folder and call it something like raw data or data raw. Then you would probably have some scripts, so maybe have a scripts folder, and in general one of the first scripts is reading in that raw data and doing some kind of processing before you actually do the analysis. And you certainly want to save the processed data then, so you would have a second folder for processed data. So right now we have raw data, processed data, and your scripts folder. And then we talked about the importance of writing functions. And I find it often easier if you put the functions in a separate folder from the scripts, because then it’s very clear to see these are functions. If you source these files, nothing will actually happen. It’s just a definition of the functions, whereas in the scripts folder, everything that you source will actually execute some code and do logic. And then oftentimes, at the end of the analysis, you have some kind of outputs, whether it’s just a model object that you want to save, a table, or some kind of data visualizations. And therefore you should have something like an output folder, which holds the final results of your analysis.
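The default layout described above can be written down as a short Python sketch that creates the skeleton. The exact folder names (`data_raw`, `data_processed`, `scripts`, `functions`, `output`), the `README.md` file name, and the helper `create_project_skeleton` are assumptions; the episode only names the roles of the folders, not their spelling.

```python
import tempfile
from pathlib import Path

# Folder names are illustrative; the episode describes their roles only.
PROJECT_FOLDERS = ["data_raw", "data_processed", "scripts", "functions", "output"]


def create_project_skeleton(root: Path, title: str) -> None:
    """Create the folders and a starter README for a new analysis project."""
    for folder in PROJECT_FOLDERS:
        (root / folder).mkdir(parents=True, exist_ok=True)
    readme = root / "README.md"
    if not readme.exists():  # never overwrite an existing README
        readme.write_text(f"# {title}\n\nWhat is this project about?\n")


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "my_analysis"  # hypothetical project name
    create_project_skeleton(root, "My analysis")
    created = sorted(path.name for path in root.iterdir())

print(created)
```

Running the same helper at the start of every project is one way to get the consistency across projects mentioned earlier in the conversation.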
[00:19:02] Paolo: Thanks. Very practical advice, Thomas. And we’re getting to the end of the episode: are these principles generalizable to other software?
[00:19:13] Thomas: I would certainly say so, especially if you use one of these languages which are very popular for data science, for example Python or Julia, which I would say for the most part are very similar to R. So for example, something that we just talked about, the folder structure, is really nothing that is specific to a particular language. So you could just as well use that within a project where you use Julia or Python. Giving meaningful names, something we touched upon, is certainly something again that you need to do in every language, but then maybe an object-oriented language has some intricacies compared to a functional language. But again, if you take the 10,000-foot view of all of this, I think for the most part the principles are the same.
And then as soon as you zoom in on the specifics of a particular language, there will certainly be nuances. But I think the overarching principle is that you want to structure your project well, such that it is suited to be reproduced by someone else and be understandable to someone else. That’s really the key message, and that’s the same across whatever tool you use, as long as it’s a tool where you write code. I would say if you try to do data science in a spreadsheet, which I would never recommend, then I’m not sure if you can use these kinds of recommendations.
[00:20:27] Paolo: Thanks a lot, Thomas. It was a really interesting discussion, and I can’t wait to start a new project with these principles in mind. And see you soon for the next episode.
[00:20:39] Thomas: Thank you so much, Paolo. It was a pleasure. Bye.