In this episode, Paolo interviewed Thomas on why sharing your code with R packages for some data science projects is essential. When is it important? Where to start? What are the main steps and best practices? This skill could be a game-changer in your career as a data scientist, and nowadays, it’s much easier due to the introduction of new tools.
R Packages (2e) https://r-pkgs.org/
Sharing your Code with R Packages
[00:00:00] Paolo: Hi, Thomas.
[00:00:01] Thomas: Hey Paolo.
[00:00:01] Paolo: How you doing?
[00:00:03] Thomas: Very well.
[00:00:03] Paolo: We are here today to discuss something which is really important: how to share your code with R packages. So let’s start and discuss first: what is an R package, Thomas?
[00:00:24] Thomas: Sure. So I think the easiest way to visualize that is this: most people have moved at some point in their life, so they know these boxes where you put in your stuff. And really, an R package is just that. It’s a container where you put certain things in, and the stuff that you put in is not your clothes or whatever. Instead it’s basically R functions and/or data. So typically an R package contains a lot of functions. But there are also R packages which contain only data, or which contain a mix of both.
That basically means it either contains a lot of useful functionality in terms of the functions, or some interesting data that you want to share with someone else. And in general, you create an R package for a particular purpose. So the scope should ideally be rather narrow. There are also packages out there which are called miscellaneous X, Y, Z.
But I think in general, that’s not something you should strive for. Instead, think of something like ggplot2, which is specifically dedicated to data visualization, or dplyr, which is specifically dedicated to processing your data, or maybe an R package for text mining. So that’s always a narrow scope.
And then for that scope you put in R functions and sometimes accompanying data to be able to, at the end of the day, share that with someone else such that they can reuse whatever you have done. And that’s really at the core of it. You write an R package when you want to share your code. That can either be within your own organization, where you maybe have a team of data scientists working on similar projects and you encapsulate a certain piece of logic in that R package. Or you embrace the open source spirit and actually upload it to something like CRAN and share it with the data science community at large.
[00:02:12] Paolo: Okay. Thanks a lot. And what are the main differences in creating R packages between now and a few years ago? Because, for example, when I started with R almost 20 years ago in academia, it was the beginning, and creating R packages was really intimidating for most people. Are things different now?
[00:02:38] Thomas: I would definitely say so. There are a lot of tools out there now that actually make developing an R package very easy, and again, they come in the form of several R packages, each of which encapsulates a particular logic. So for example, there is a very popular package called roxygen2, which makes it easy to write documentation for your R package. But taking a step back, I think the real problem maybe a decade or more ago was that there was not a lot of, let’s call it, accessible documentation out there on how to do that.
I think certainly in the official R documentation there is some description of how you do it, but now think of something like the R Packages book by Hadley Wickham and Jenny Bryan, which is a very accessible, well-written book that really gives you a step-by-step approach on how to build an R package.
Again, using a lot of the novel tools that are out there, it’s much easier to get started now. Whereas a decade ago, you probably had to do everything from scratch. You had to manually create all the folders that you need and all these different files which are part of an R package. At this point in time you can use a particular package for that, the usethis package.
And then there’s a function called create_package(), and that creates a skeleton for you with a lot of the files that you need and the different folders which are required. So it’s much, much easier. And in general there is this one package called devtools, short for development tools, which actually is a combination of multiple packages, but it bundles everything up in this one package such that you can run install.packages("devtools") and suddenly you have all these great tools at your disposal.
And I already mentioned roxygen2 for writing documentation, which is part of that. And there’s the testthat package, which makes it very easy to write unit tests in R. You have something called pkgdown, which takes the documentation, which you will write anyway, and puts it into a format such that at the end of the day you have a webpage up and running, and people can also browse the documentation of your package on the web.
Writing vignettes or long-form documentation I think is much easier now with tools like R Markdown and knitr. So in general, there’s just a lot of great tooling out there now, which makes it quite easy to write R packages. So I always say, if people tell me that they are intimidated to write their first R package: as long as you know how to write a function, that’s really the only prerequisite I think you need to have. Everything else, using all these tools which I just mentioned, you can readily learn while you’re doing it. And that way I think the barrier to entry right now to building an R package is actually fairly low.
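For readers following along, the starting point Thomas describes can be sketched in a few lines of R. The package name `mypkg` and the temporary directory are placeholders chosen for illustration, not something mentioned in the episode.

```r
# One-time setup (not run here): devtools pulls in usethis, roxygen2,
# testthat and pkgdown as dependencies.
# install.packages("devtools")

# Hypothetical name and location for a first package.
pkg_path <- file.path(tempdir(), "mypkg")

# Create the package skeleton (DESCRIPTION, NAMESPACE and an R/ folder,
# among others). open = FALSE avoids launching a new RStudio session.
if (requireNamespace("usethis", quietly = TRUE)) {
  usethis::create_package(pkg_path, open = FALSE)
  print(list.files(pkg_path))
}
```

In RStudio, creating a new project of type "R Package" achieves a similar result through the menus.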
[00:05:19] Paolo: And right now I think that reading a documentation website and vignettes is really nice. It makes it way easier to benefit from other people’s code, because you can see real examples and more text explaining what’s in the package, how the package works, and what the different functions do. I think it’s really nice to see the evolution of software documentation across the years. It’s really an interesting development in the space. And of course the growth of the community is helpful in that sense. It’s a really interesting ecosystem when you see the progress that has been made in this field. And when do you share code with an R package? Because sometimes I feel that creating an R project is enough, and maybe creating an R package is not worth it.
[00:06:26] Thomas: Yeah, I think that’s a very good question. So there are some people who are of the opinion that if you do a particular analysis, you should actually wrap it up into an R package. Personally I think that is not a good approach, because at the end of the day these are two different purposes. If you do a project, then really it’s about performing some kind of analysis and producing some kind of outputs that you want to share, because they distill certain insights you got from your data. Whereas I think of an R package as code in terms of reusable functions that you want to share, again, either within your own organization or with the data science community at large. So oftentimes when you work on a particular R project and analysis, you may find that you need to implement new functions. And if it’s a couple of functions, I think it’s totally fine to just keep them within the R project.
But then if you end up writing a bunch of functions and you feel like they could be reused not only for this particular analysis project, but maybe for the next project you do, or maybe also for the next project that someone else out there does, then it makes sense to say: okay, let’s take out these functions, which are right now in my R project, and put them in a dedicated package.
But again, I would not try to then morph your existing analysis R project into an R package. I just think the benefits you gain from that don’t outweigh the additional overhead and burden of creating an R package. So in that sense, I really like to think of it as two different things.
An R project is really focused on performing a particular kind of analysis, and an R package is about sharing functionality and code with someone else such that they can reuse this functionality. But certainly, as I mentioned, you can start off writing a function within an R project and later on turn that into an R package because you feel like this is actually something very much reusable.
[00:08:22] Paolo: And where to start, if you want to start creating R packages?
[00:08:31] Thomas: Yeah. First of all I would say, just don’t be intimidated. Just take on the challenge. And then I already mentioned this book, which is literally called R Packages, by Hadley Wickham and Jenny Bryan. You can order a physical copy of that, but it’s also freely available on the internet. So if you Google "R packages book" you will find it there. It’s freely available for you to browse. And then don’t try to read through that whole thing and only at the end of it try to implement your R package. Just go through maybe one or two chapters and then try to do it yourself. Open up RStudio. Click on New Project and create it as an R package. And then it doesn’t have to be something very complicated. You can start off with an R package which has one or two functions, but then you get a feeling for: hey, this is where the code needs to live. This is how to write documentation.
This is how to write your unit tests. And take it one step at a time. If you start off just writing the functions themselves, that’s a good first step. And then if you want to read a little bit more into the book and start writing some documentation, definitely feel free to do so. Think of it really as an iterative process. You don’t have to be perfect the first time. I think that’s often something that holds people back, that they try to get it right immediately, but that’s just not how you learn things. I’m very much an advocate of learning by doing, making mistakes along the way and learning from them.
That way you should very quickly come to a point where you feel rather comfortable writing your R package. And just from my own experience, I think if you’ve done it once and went through the process, that doesn’t mean that the next time you do it you have all the steps in your head, but you just know roughly: hey, I need to do this and that. And then you can always refer back to the other great resources that are out there and read up on little details. But just going through the process once is probably the most important thing. So again, just embrace the challenge, start small and then go from there.
[00:10:23] Paolo: I think it’s also important to practice, to have a sort of daily commitment, because it’s easy to forget things in this space. It’s like a language: we must practice to master it. And in general you can go back to your package, improve your documentation one day and just commit. Then the next day it could be a small tweak to one function. And then it becomes natural in your workflow, and after practicing for a few days all the components start to fall into the right place in your mind. And the R Packages book is now in its second edition. I think that the online version is really what I needed when I started creating R packages, because you can refer to it very easily.
And you have everything in front of you, and you can also copy and paste into your R window. And it’s nice to have a small chapter, “The Whole Game”, which in a few lines tells you the whole story. It’s about creating something which is really small, but in the end it’s really what a package is. And then of course you can build more complex stuff and more complex documentation, but it gives you an idea of what’s going on there. And I think that you will also give a course on how to create R packages in the near future as part of the Effective Data Scientist Academy.
[00:12:15] Thomas: That is very correct. So that’s something I’m currently preparing and working on. I recently gave a workshop on R package development at a conference, so that was a good starting point. But there we only had, I think, one and a half hours, so that’s quite short. On the Effective Data Scientist Academy we will have something larger in size and content, but again broken down into multiple modules. Really the idea is to build an R package as part of the course. So it’s not you listening to a lecture and then trying to somehow distill from that how to do it in practice. It’ll be very much hands-on, and we’ll together embark on this journey of creating your first little R package, which then should really make you feel empowered to take that knowledge, apply it to some other problem you face within your organization and put that into an R package.
[00:13:08] Paolo: Getting practical, what are the main steps for creating an R package?
[00:13:14] Thomas: Yeah. So first of all, an R package follows a certain structure. For example, there’s an R folder where the functions live. There’s a tests folder where the unit tests live. There’s a man folder with documentation. So you first of all need to create that skeleton. There’s this function I already mentioned, called create_package(), from the usethis package, and that sets up this general structure for you. And then you really start off with creating a new R file, saving it in this R folder, and then writing a function within there.
And that doesn’t have to be a super complex function. It can be something quite small to start off with. So then it’s in a file, but you somehow want to use it, right? So instead of calling library() with the name of your package, which wouldn’t work because you haven’t actually installed it yet,
you call a function called load_all() from the devtools package. That takes whatever you have written in terms of functions and loads it as if it were an installed package. And then you can experiment with that, either on an ad hoc basis within the R console or, preferably, you also write an accompanying test file for your R function where you specify: hey, given this input, by calling this function I’m expecting that this output is created. That formalizes the ad hoc tests that many people typically do in the R console, but saves them in a dedicated script, also for future reuse.
And then you maybe write a second test and you get a failing result, because you didn’t actually take care of this particular case with your function. So you go back into your initial function, you make some adjustments, you again call load_all(), and then you see: do my tests pass now? And if yes, perfect.
That’s good. So I think already this short example shows you that it’s an iteration loop. It’s not like you write this one function or multiple functions out at first and then expect it to just work. Instead, you always implement a little bit of logic, see if it works, and then maybe try to expand upon that.
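The loop described here might look roughly as follows. The function `add_one()` and the file names are made up for illustration; the function and its tests are defined inline so the snippet runs on its own, whereas in a real package they would live in the R/ and tests/testthat/ folders.

```r
# In a package this would be saved as R/add_one.R (hypothetical example).
add_one <- function(x) {
  if (!is.numeric(x)) {
    stop("`x` must be numeric.")
  }
  x + 1
}

# In a package this would be saved as tests/testthat/test-add_one.R.
if (requireNamespace("testthat", quietly = TRUE)) {
  library(testthat)

  test_that("add_one() increments numeric input", {
    expect_equal(add_one(1), 2)
    expect_equal(add_one(c(0, 10)), c(1, 11))
  })

  # A second test for a case that a first draft of a function often misses.
  test_that("add_one() rejects non-numeric input", {
    expect_error(add_one("a"), "must be numeric")
  })
}

# Inside the package directory, the iteration loop is then (not run here):
# devtools::load_all()  # load the functions as if the package were installed
# devtools::test()      # run everything under tests/testthat/
```

If the second test fails, the fix goes into the function, followed by another `load_all()` and another test run.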
And then ideally always try to have test cases which check the logic, and that way have this iterative loop. So then we also talked about writing documentation. Using the roxygen2 package, you can basically put the documentation on top of your actual functions, so it sits together with your function.
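As an illustration, a function documented this way could look like the sketch below. The function `add_one()` is a hypothetical example; the tags shown (`@param`, `@return`, `@examples`, `@export`) are standard roxygen2 tags.

```r
#' Add one to a number
#'
#' @param x A numeric vector.
#'
#' @return A numeric vector of the same length as the input, with 1 added
#'   to each element.
#'
#' @examples
#' add_one(1)
#' add_one(c(0, 10))
#'
#' @export
add_one <- function(x) {
  x + 1
}

# Run from the package directory (not run here): devtools::document()
# converts the comments above into man/add_one.Rd and updates NAMESPACE.
```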
So it’s a special way to comment things. And then if you want to build the actual documentation page, which you can access by typing a question mark followed by the name of your function in the R console, you would use the document() function from the devtools package. And then again you maybe run that and you look at how this looks on the documentation page.
And you say: oh yeah, this looks good already, but I need to add a little bit more explanation of what this certain parameter looks like. So you go back to the function, edit the documentation and again call the document() function. So again, an iterative process. And then one thing that is really important is to rather frequently run this function, which is called check(), from the devtools package.
That actually runs a shell command called R CMD check, which takes your R package, builds it, installs it, and then checks it against a bunch of criteria. For example, there is a criterion for R packages that every parameter of a function needs to have a dedicated entry in the documentation.
And if it’s not there, then this function will flag it and tell you: hey, this is what you need to take care of. And you want to make sure at the end of the day that when you run devtools::check(), you get no errors, no warnings, and ideally also no what are called notes. And then at some point you want to share your package with someone else.
So if it’s just within your organization, you maybe just build the package and then send it over to collaborators to install. But if it should be publicly available, then you would upload it to this package repository called CRAN. And once it gets accepted there, everyone else can install the package using the install.packages() command. So that’s a rough overview of the lifecycle, the different steps of a package.
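A hedged sketch of this last part of the lifecycle is shown below. The helper `check_is_clean()` and the package name `mypkg` are hypothetical; the sketch assumes that `devtools::check()` returns a result object with `errors`, `warnings` and `notes` fields.

```r
# Hypothetical helper: run R CMD check via devtools and report whether the
# result is free of errors, warnings and notes. `pkg` is assumed to be the
# path to the package root.
check_is_clean <- function(pkg = ".") {
  res <- devtools::check(pkg, error_on = "never")
  length(res$errors) == 0 &&
    length(res$warnings) == 0 &&
    length(res$notes) == 0
}

# Typical use from the package directory (not run here):
# if (check_is_clean()) {
#   tarball <- devtools::build()  # source bundle (.tar.gz) to send around
#   # A collaborator installs the file they received:
#   # install.packages(tarball, repos = NULL, type = "source")
# }
#
# Once a package has been accepted on CRAN, anyone can install it by name:
# install.packages("mypkg")
```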
[00:17:12] Paolo: And of course, if you start, I think it’s important to say that as soon as you set up the package skeleton, write a function and call load_all(), you already have the feeling of having an R package. All the next steps are of course needed if you want to be efficient and if you want to share your package. But with the very first steps you already have the feeling of having created something, which is very nice. And do you have practical suggestions?
[00:17:50] Thomas: Yeah, so on that theme that you just mentioned, where you just write a function, then you call load_all(), and then you already have something working: that’s really good. So you start small, and then you immediately see a success in a way, and try to embrace that. Again, don’t try to have this huge package of multiple functions
where everything needs to work. Just start small and then work your way iteratively towards your end goal. So that’s probably the most important suggestion, for beginners especially. Then I would suggest that if people write R packages, they make use of a version control system. Preferably I would use Git and then a Git hosting service such as GitHub or GitLab. That just makes it much easier for you to experiment with stuff, because you can commit at a point where everything works, and then you can experiment and see whether changing a certain thing actually improves stuff. If not, then you can just go back to the previous version of your code.
And also, even if you’re not at a point where you want to publish the package on CRAN just yet, if it’s on something like GitHub or GitLab, other people can still install it. So in a way that already becomes a place where you can share your code. Then I already touched upon unit testing. That’s something people typically don’t get too excited about, especially in the beginning, and I think that’s okay. If you start off just writing a bunch of functions and do some ad hoc testing in the console and it works, that’s great. But you will inevitably get to a point where you have a working function and for some reason you want to refactor it, maybe because the logic internally is too difficult.
Maybe you see that the performance is actually not good enough. And then having tests which pass right now really means these tests have your back. Because if you then change something and you run the tests and you see that one test now fails, you immediately know that your refactoring wasn’t successful.
And I think a lot of times people don’t want to change their code because they know it works right now, but they’re not confident that if they change something it will still work in all the cases that they previously tested just in the console. So writing tests and formalizing them is really something I would highly recommend.
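To make the point about refactoring concrete, here is a small sketch in which the same expectations are run against an original and a refactored implementation. Both functions and the helper `expect_cumulative_mean()` are invented for this illustration.

```r
# Original implementation (hypothetical): correct, but loops over elements.
cumulative_mean_v1 <- function(x) {
  out <- numeric(length(x))
  total <- 0
  for (i in seq_along(x)) {
    total <- total + x[i]
    out[i] <- total / i
  }
  out
}

# Refactored implementation: vectorised, intended to behave identically.
cumulative_mean_v2 <- function(x) {
  cumsum(x) / seq_along(x)
}

if (requireNamespace("testthat", quietly = TRUE)) {
  library(testthat)

  # The same expectations are applied to whichever implementation is passed in.
  expect_cumulative_mean <- function(f) {
    expect_equal(f(c(2, 4, 6)), c(2, 3, 4))
    expect_equal(f(5), 5)
    expect_equal(f(numeric(0)), numeric(0))
  }

  test_that("original implementation passes", {
    expect_cumulative_mean(cumulative_mean_v1)
  })

  test_that("refactored implementation still passes", {
    expect_cumulative_mean(cumulative_mean_v2)
  })
}
```

If the refactored version broke one of these cases, the failing test would show it immediately, which is the safety net described above.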
And then, just as something to make your life easier, I would certainly suggest using an IDE such as RStudio or Visual Studio Code. That’s probably not specific to R packages; it just makes writing code so much easier. But especially in RStudio, there are a lot of built-in utilities to make it easier to write R packages. For example, you can put your cursor inside a function definition and then click to generate a roxygen skeleton, and it will prepopulate a documentation skeleton for you.
So you don’t have to do that by hand. And then I already alluded to this command called R CMD check, which checks your R package against a bunch of criteria. Run it fairly often, because that way you can be confident that if you add something, the package still follows all the requirements that an R package has.
It can be quite frustrating if you develop a lot of code, maybe even for several weeks, and only after that you run this command and end up with three errors, ten warnings and five notes. That’s not a place you want to be in. Instead, whenever you make a meaningful change, run it again and see if everything still passes. Yeah, and I think we can cap it here.
[00:21:14] Paolo: Okay. Thanks a lot, Thomas. It was really interesting and we’ll put a lot of resources in the show notes for sure, for the benefit of our listeners. Thanks a lot.
[00:21:25] Thomas: Bye Paolo. Thank you.