Bioinformatics for Beginners
Enhancing your workflow with RStudio and Git
Reproducibility is a critical aspect of bioinformatics [1],[2],[3]. If, like me, you are beginning your journey into bioinformatics, now is a great time to get into the good habits of annotating, documenting and streamlining your own workflows.
The Reproducible Bioinformatics Project [4] outlines 10 steps to produce easy-to-use bioinformatics workflows. Here I will outline how I implement some basic methods to improve the reproducibility of my work by producing easy-to-read and easy-to-share workflow documentation.
R Markdown
Most bioinformaticians use R (and RStudio [5]) as a tool for analysis, but it is also a great tool for presenting and documenting your work with R Markdown. You can create nice-looking HTML documents without having to write any HTML, simply by specifying a style in the document header. Indeed, this blog was written in markdown using the “readthedown” theme from the rmdformats package.
Chunks!
It’s not uncommon to have long, complex pipelines that combine bash, Python and R. In .Rmd documents you can add chunks (executable or not) for all of these and more, like so:
# bash, rename file
cd myprojectfolder/rnaseqfiles
rename "s/reallylongfilename/betterfilename/g" *reallylongfilename*
# for reproducibility include package versions
packageVersion("rmdformats")
Making python chunks executable does require some additional steps as detailed here: R Markdown Python Engine, but for documenting your workflow, it is still worth inserting and disabling its execution in the chunk header with:
```{python, eval=FALSE}
# step 2, we used the following script to do some things
myprojectfolder/somecoolscript.py
```
To Knit or not to Knit!
For your own records it’s not necessary to knit your document to HTML, but if you want to present your work, or make the colleague who asked “how did you run that pipeline again?” happy, it is definitely worth doing. Knitting the document will display all plots, code, comments and text. However, the knit will only succeed if all code chunks run without error. If you don’t want to run or display a chunk, there are chunk options for that: for example, eval=FALSE shows the code without running it, echo=FALSE runs the code but hides it, and include=FALSE runs the code but leaves both code and output out of the document.
Using GitHub
GitHub is a site that provides hosting and version control for software development.
Why use it?
- At the most basic level, this is a remote store of your work.
- To easily share your work with others via the web.
- To collaborate and make use of branches and version control.
- To showcase your work to prospective employers.
Setup
To get started with GitHub, create an account (it’s free!) at github.com.
Here you can create project folders (repositories) where you can upload your files (individual file size limit of 100MB). You could just treat this as a cloud service, but it is absolutely worth getting familiar with its extra features.
To set up and use GitHub with RStudio:
- Install git.
- Open RStudio, and select Tools > Global Options…
- In the window that pops up, select Git/SVN and check “Enable version control interface for RStudio projects”. In the “Git Executable” box, browse to where your git.exe (Windows) or git (macOS/Linux) is located.
- This will then require RStudio to restart.
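If you are not sure what to put in the “Git Executable” box, the path can usually be found from a terminal. This is a minimal sketch, assuming git is already installed and on your PATH (on Windows, run it in Git Bash); the variable name git_path is just for illustration.

```shell
# Look up the full path of the git executable; this is the value
# that goes in RStudio's "Git Executable" box.
git_path="$(command -v git || true)"

if [ -n "$git_path" ]; then
  echo "git found at: $git_path"
  git --version
else
  echo "git not found on PATH; install it first"
fi
```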
Using RStudio and Git
A new project
After selecting “New Project” from the File menu, select New Directory > New Project and in the following window write the name of your directory and ensure the “Create a git repository” box is checked. This will create a local git repository in a new directory with that name; it does not create anything on GitHub.
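On the command line, the rough equivalent of ticking that box is git init, which sets up version control inside the project directory. The sketch below uses a throwaway temporary directory and a made-up project name (my_rnaseq_project), so nothing in your real projects is touched.

```shell
# Hypothetical project name, created inside a temporary directory
workdir="$(mktemp -d)"
proj="$workdir/my_rnaseq_project"
mkdir -p "$proj"

# Turn the directory into a git repository
git init --quiet "$proj"

# The repository data lives in a hidden .git folder inside the project
ls -a "$proj"
```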
An existing project
The easiest way to do this is to:
1. Create a repository on GitHub and manually upload your project files to it.
2. Then in RStudio select “New Project” from the File menu and select Version Control > Git.
3. In “Repository URL”, paste the HTTPS URL of the repository, which can be found by clicking the “Code” button on the repository main page.
Syncing files
When you have made changes to your work, you can synchronize them with GitHub through RStudio by going to the Git tab (it’s near Environment, History etc.) and ticking the checkboxes next to the files you wish to update. Then click “Commit”. In the window that opens you can add a message describing the changes before clicking Commit again. To send the committed changes to the online repository, simply click the Push icon.
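For reference, the checkbox, Commit and Push steps correspond to three git commands. The sketch below is self-contained: it uses a local bare repository in a temporary directory as a stand-in for GitHub, and the file name, commit message and identity are placeholders. In a real project, origin would already point at your GitHub HTTPS URL and only the three numbered commands would be needed.

```shell
# Stand-in for the GitHub repository: a local bare repo in a temp directory
workdir="$(mktemp -d)"
git init --quiet --bare "$workdir/remote.git"
git init --quiet "$workdir/project"
cd "$workdir/project"
git remote add origin "$workdir/remote.git"

# Placeholder identity, stored for this repository only
git config user.name "Your Name"
git config user.email "you@example.com"

# 1. Ticking the checkbox: stage the changed file
echo "# RNA-seq analysis notes" > analysis.Rmd
git add analysis.Rmd

# 2. Commit, with a message describing the change
git commit --quiet -m "Add analysis notes"

# 3. The Push icon: send the commit to the remote repository
git push --quiet -u origin HEAD

# List the commit messages the remote now holds
git --git-dir="$workdir/remote.git" log --all --format=%s
```

Pushing HEAD rather than a named branch keeps the sketch independent of whether your git calls the default branch main or master.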
These are relatively simple methods to implement, but I have found them to be of great value in improving the readability and reproducibility of my work. While I only use Git in a basic manner this will lay the groundwork for learning and using its more advanced features.
References
[1] Kanwal, S., Khan, F.Z., Lonie, A. and Sinnott, R.O. (2017) ‘Investigating reproducibility and tracking provenance - A genomic workflow case study’, BMC Bioinformatics, 18(1), pp. 337. doi: 10.1186/s12859-017-1747-0.
[2] Lawlor, B. and Sleator, R.D. (2020) ‘The democratization of bioinformatics: A software engineering perspective’, GigaScience, 9(6). doi: 10.1093/gigascience/giaa063.
[3] Lin, X. (2020) ‘Learning Lessons on Reproducibility and Replicability in Large Scale Genome-Wide Association Studies’, Harvard Data Science Review, 2(4). doi: 10.1162/99608f92.33703976.
[4] Kulkarni, N., Alessandri, L., Panero, R., Arigoni, M., Olivero, M., Ferrero, G., Cordero, F., Beccuti, M. and Calogero, R.A. (2018) ‘Reproducible bioinformatics project: a community for reproducible bioinformatics analysis pipelines’, BMC Bioinformatics, 19(S10), pp. 349. doi: 10.1186/s12859-018-2296-x.
[5] RStudio Team (2020) RStudio: Integrated Development for R. RStudio, PBC, Boston, MA. URL: http://www.rstudio.com/.
Getting R Package versions and citations
Here’s a quick workflow to extract all the packages that you used in your project.
Assumptions:
- Your scripts and markdowns for one project are all in one directory.
- All the packages you used in your project are currently installed in R (if you only have one computer, or haven’t uninstalled and reinstalled R, then this shouldn’t be a problem). If you aren’t sure, you can run section 3.1 to check if all your project packages are installed.
1) Use grep in the project directory. To include subfolders, search recursively instead, e.g. grep -Rh --include="*.Rmd" "library(" .
If you are not on Linux, you can just use the terminal in RStudio for the step below.
# bash / terminal
# Collect every line containing "library(" and drop duplicates
grep -h "library(" *.Rmd | sort -u > packlist.txt
# Change .Rmd to .R if you use R scripts instead of markdown.
Note: You might want to have a quick look at this file and remove anything weird that isn’t library(packagename)
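If you would rather skip the manual clean-up, the package names can also be pulled out directly in the terminal. This is an alternative sketch rather than part of the workflow described here: it assumes every call looks like library(packagename), optionally with quotes around the name, and it writes to a hypothetical file, packnames.txt. The two demo .Rmd files are made up for illustration.

```shell
# Demo input in a temporary directory (hypothetical files)
demo="$(mktemp -d)"
cd "$demo"
printf 'library(stringr)\nx <- 1\nlibrary("dplyr")\n' > a.Rmd
printf 'library(stringr)\n# my library notes\n' > b.Rmd

# grep: -o prints only the matching text, -h drops file names, -E enables extended regex
# sed:  keeps just the text between the parentheses
# tr:   strips any quotes around the package name
# sort: -u removes duplicates
grep -ohE 'library\([^)]+\)' *.Rmd \
  | sed -E 's/^library\((.*)\)$/\1/' \
  | tr -d "\"'" \
  | sort -u > packnames.txt

cat packnames.txt
```

With the demo files above, this leaves dplyr and stringr in packnames.txt, one per line; the comment line that merely mentions the word "library" is ignored.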
2) Read this list into R and remove everything except the package name
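# Read the grep output, one library() call per line
packs <- read.delim("packlist.txt", header = FALSE)
library(stringr)
# Keep only the text between the parentheses, i.e. the package name
project_libs <- na.omit(str_extract(string = packs$V1, pattern = "(?<=\\().*(?=\\))"))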
3) Get a list of the packages currently installed in R and their versions.
pack_inst <- as.data.frame(installed.packages()[,c(1,3:4)])
3.1) Check whether any project packages are NOT currently installed and, if necessary, install them.
installed_libs <- unique(pack_inst$Package)
not_installed <- setdiff(project_libs, installed_libs)
print(not_installed)
If there is something weird in the not_installed list, add it to the exclude vector below to remove it.
exclude <- "weird_thing"
not_installed <- not_installed[!(not_installed %in% exclude)]
print(not_installed)
Install missing packages
install.packages(not_installed)
Check if it worked. Note: if you had a weird thing, it will show up again here, but it doesn’t matter.
pack_inst <- as.data.frame(installed.packages()[,c(1,3:4)])
installed_libs <- unique(pack_inst$Package)
not_installed <- setdiff(project_libs, installed_libs)
print(not_installed)
If something still didn’t install, it’s probably best to just do it manually.
4) Extract only the packages in your list
result <- pack_inst[pack_inst$Package %in% project_libs, ]
# Print onscreen
print(result)
# or save to file (write() cannot handle a data frame, so use write.table())
write.table(result, "package_vers.txt", quote = FALSE, row.names = FALSE)
5) Get citations
res <- result$Package
# Print the citation for each package used in the project
for (i in res){
  print(citation(i))
}
The output can then be copy & pasted to file.