Love the lab, but hate data analysis? Want to start getting more creative with your figures but don’t know where to start? In this How-to, we outline the basics of R, one of the most commonly used data analysis tools in the life sciences. We will cover how to install IDEs and packages, outline some of the most common commands and functions and highlight some important uses of the tool so that you can get started on your data analysis journey.
What is R?
R is a programming language specifically designed for statistical computing and data analysis. It is widely used in various fields, including the life sciences, because of its versatility and extensive statistical capabilities. R is useful for importing, cleaning and transforming data, which makes it the perfect tool for scientists working with vast amounts of information. It also hosts an impressive suite of packages and functions for statistical analysis, data visualisation, hypothesis testing and more. It is widely used by bioinformaticians to assess swathes of genomic and other biological data.
But despite its ubiquitous nature, many researchers still struggle to use R. In this feature, we provide a rundown of the basics to get you started.
Installing an IDE
At its core, R is simply a programming language and can be used via the command line, but most people will find it easier to use R through an integrated development environment (IDE). There are a number of IDEs out there, but the most commonly used is RStudio, which, as the name suggests, is designed specifically for this purpose. For the purposes of this feature, we’ll assume you are using the RStudio desktop version.
RStudio provides a user-friendly interface and tools to make data analysis a smoother process. After downloading the IDE (remember to install R itself first!), you will typically open RStudio to reveal a series of panels. These are:
- Script Editor (top left): This is where you write and edit your R code. You can create new scripts or open existing ones. You can run the current line or selection by pressing Ctrl+Enter (Cmd+Enter on macOS).
- Console (bottom left): The console is where you can execute R commands and see their output in real-time. You can also use it to interact with R directly; however, using a script that can be saved and edited is recommended.
- Environment/History (top right): This panel displays information about the objects in your R workspace (e.g., data frames, variables). You can also view your command history here.
- Files/Plots/Packages/Help (bottom right): These panels allow you to navigate your file system, view plots, manage R packages and access help documentation.
Figure 1: Image of the RStudio interface showing each different panel. Taken from Wikipedia, credit to: cdhowe.
Create a script
You will usually run analyses in R by creating a script. To create a new script in RStudio, navigate to the menu bar at the top of the window and click File > New File > R Script. This will open a new, empty script in the editor panel.
An R script operates with many of the same principles as other languages, but the language has its own unique features. Below, we detail some of the common aspects of R scripts and their basic syntax.
Keywords: Keywords (reserved words) are special words in R that hold a predetermined meaning and cannot be used as variable names, such as ‘if’, ‘function’ or ‘while’.
Comments: Comments start with a # symbol. Everything after the # on that line is descriptive text that R ignores when the code is run.
Defining a variable: Variables are used to store your data. Use the assignment operator <- (or, less commonly, =) to assign a value to a name, e.g., a <- 1 or b = 2.
Functions: Some functions are built into the R language or supplied by a package, whilst others can be created by the user. A function call consists of the function’s name followed by its arguments in brackets. For example, the function ‘print’, which outputs results to the console, is written as print(argument). The argument can be a value or a variable defined elsewhere in the script (a global variable) or inside a function (a local variable); either way, its value is output to the console.
To create your own function, assign ‘function(arguments)’ to a name of your choice, then define in curly braces what should happen to these arguments, e.g., adding them together. You can then include code in the function that prints your result to the console (print(result)). See the example below.
Figure 2: Image showing a basic example of a newly created function. To call this function, you would run the command ‘myfunction(a,b)’. Note: R already has an addition operator (+) and a sum() function, and as such it is unlikely you would need to build such a function yourself.
Hint: Function names in R are case-sensitive and should follow R’s naming conventions; usually lowercase letters with words separated by underscores or full stops.
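As a rough illustration of the syntax described above, the sketch below combines a comment, variable assignment and a user-defined function. The name my_function and the values used are illustrative assumptions and are not taken from the original figure.

```r
# Lines starting with # are comments and are ignored by R

# Assign values to two variables
a <- 1
b <- 2

# Define a function that adds its two arguments and prints the result
my_function <- function(x, y) {
  result <- x + y    # 'result' is a local variable inside the function
  print(result)      # output the result to the console
  invisible(result)  # also return the value so it can be stored
}

# Call the function with the variables defined above
total <- my_function(a, b)
```

Storing the returned value in total means the result can be reused later in the script rather than only being read from the console.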
Organising your data
One of the most important steps of any analysis is organising your data. Transforming a messy or unannotated dataset into something suitable for further work is vital when dealing with complex results. The most common place to start is by assessing your dataset. Below, we detail some common commands to help you understand what you’re dealing with.
head() – inputting the name of your dataset into the brackets allows you to view the first six rows by default (use head(data, n) to choose a different number). This can be useful in helping you to spot any obvious errors with your setup.
tail() – similar to the above, but allows you to view the last six rows by default.
str() – this command displays the internal structure of an object or dataset. For example, for a vector it shows the data type, the length and the first few values; for a data frame it lists each column with its type.
summary() – this command succinctly summarises your dataset, e.g., giving the minimum, quartiles, mean and maximum of each numeric column.
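The four commands above can be tried on any data frame. The sketch below uses iris, a small example dataset that ships with R, purely as a stand-in for your own data.

```r
# iris is an example dataset included with base R
data(iris)

head(iris)         # first six rows by default
tail(iris, n = 3)  # last three rows
str(iris)          # column types and a preview of the values
summary(iris)      # summary statistics for each column
```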
Once you have used the above commands to grasp the structure and contents of your dataset, you can begin to tidy. One of the most common ways to do this is via tidyverse packages (more details on how to install these packages below). The dplyr package, part of tidyverse, contains functions to select, remove and rename columns, such as select() and rename(). Base R provides na.omit() and complete.cases() to drop or flag rows with missing values, and the tidyr package offers equivalents such as drop_na() and replace_na(). Further, if you wish to combine multiple datasets, you can use dplyr’s bind_rows() or bind_cols() functions to attach these datasets along either their rows or columns. These functions can also be used to add new rows or columns to preexisting data frames.
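A minimal sketch of these tidying steps is shown below. It assumes the dplyr and tidyr packages are already installed; the data frame and its column names are made up for illustration.

```r
library(dplyr)
library(tidyr)

# Made-up measurements with one missing value
samples <- data.frame(
  sample_id = c("s1", "s2", "s3"),
  conc_ng_ul = c(12.5, NA, 8.1),
  notes = c("ok", "ok", "repeat")
)

tidy_samples <- samples %>%
  select(-notes) %>%                      # remove a column
  rename(concentration = conc_ng_ul) %>%  # rename a column (new = old)
  drop_na()                               # drop rows containing missing values

# A second, made-up batch with the same columns
more_samples <- data.frame(sample_id = "s4", concentration = 10.2)

# Stack the two datasets by row, then attach a new column
all_samples <- bind_rows(tidy_samples, more_samples)
with_batch <- bind_cols(all_samples, data.frame(batch = c(1, 1, 2)))

print(with_batch)
```

Note that bind_cols() matches rows by position, so both inputs need the same number of rows in the same order.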
Packages can be installed in RStudio in one of two ways – using the install.packages() function or manually through the packages section of the bottom right panel. To use the install.packages() function, type it into your script or console with the name of the package in quotation marks inside the brackets, e.g., install.packages("ggplot2").
Some packages have prerequisites (dependencies): other packages that must be installed alongside them. In most cases install.packages() will install the required dependencies automatically, but be sure to double check. Installing packages can take a while, so don’t be alarmed!
Hint: Keep your packages up to date by using the function ‘update.packages()’.
Several packages can be installed at once by passing their names as a character vector, e.g., install.packages(c("dplyr", "tidyr")). Sometimes, R will ask you to select a CRAN mirror from which to download the packages. Choose one that is located close to you for faster downloads.
To begin using a package, load it with the function ‘library(packagename)’. You need to do this every time you start a new session.
Hint: You can access a package’s help pages with ‘help(package = "packagename")’, and the help page for an individual function with ‘help(functionname)’ or ‘?functionname’.
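One possible way to put these steps together in a script is sketched below. The helper missing_packages() is a hypothetical convenience function, not part of R, and the repository URL is an assumption (the CRAN cloud mirror); substitute the mirror you prefer.

```r
# Hypothetical helper: return the names in `pkgs` that are not yet installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Packages this script relies on
wanted <- c("dplyr", "tidyr", "ggplot2")

# Install only what is missing, passing the names as a character vector
to_install <- missing_packages(wanted)
if (length(to_install) > 0) {
  install.packages(to_install, repos = "https://cloud.r-project.org")
}

# Load a package for the current session
library(ggplot2)
```

Checking first avoids reinstalling packages every time the script is run, which can be slow.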
Popular uses in the life sciences
R is a widely used tool for data analysis in the life sciences, and there are several popular packages that cater to specific needs in various subfields. Here are some popular R packages used in the life sciences and their typical applications:
- Bioconductor Packages: Bioconductor is a collection of R packages specifically designed for the analysis of genomic and bioinformatics data. These packages are installed with BiocManager::install() rather than install.packages(). Some commonly used Bioconductor packages include:
  - limma: For analysing microarray data and detecting differential gene expression.
  - GenomicRanges: For working with genomic intervals and annotations.
  - GenomicFeatures: For manipulating genomic features like genes and transcripts.
  - flowCore: For assessing flow cytometry data.
- dplyr and tidyr: These packages are part of the ‘tidyverse’ ecosystem and are essential for data wrangling, cleaning and transformation. They help prepare data for analysis, which is a crucial step in the life sciences when dealing with large datasets.
- ggplot2: Also part of the ‘tidyverse’, this package is used for creating highly customisable and publication-quality plots and graphs. It’s valuable for visualising data in the life sciences, such as gene expression profiles, protein structures and epidemiological data.
- phyloseq: This package is used for the analysis and exploration of microbiome data, including data from 16S rRNA sequencing and metagenomics studies.
- shiny: This package is used for building interactive web applications with R. It’s beneficial for creating data dashboards and tools to visualise and share research findings.
- qqman: A popular package for visualising GWAS results as Manhattan and Q-Q plots.
These are just a few examples of popular R packages used in the life sciences. Depending on the specific subfield, researchers may rely on different packages to perform data analysis, visualisation and statistical modelling. Packages hosted on CRAN or Bioconductor come with documentation as standard practice, including examples of how to use them.
Setting up your environment
One of the most important steps when using R is to set up your environment. This involves importing any data that you may need. CSV and Excel files are two of the most common file types in the life sciences. The classic R command to read in a CSV file is read.csv(file_name). Base R has no built-in function for Excel files; a common choice is read_excel(file_name) from the readxl package. Alternatively, you can take advantage of the RStudio GUI by going to the top right-hand panel and clicking ‘Import Dataset’, then choosing the file in your explorer.
Hint: Make sure to correctly set your working directory! Click ‘Session’ > ‘Set Working Directory’ > ‘Choose Directory’ and select the directory you wish to work in, or call setwd() in your script.
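The import step might look like the sketch below. To keep the example self-contained it first writes a small, made-up CSV to a temporary file; in practice you would point read.csv() at your own file. The readxl line is commented out because it assumes that package is installed and that a file called my_file.xlsx (a placeholder name) exists.

```r
# Write a small made-up CSV to a temporary file so the example runs anywhere
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(gene = c("geneA", "geneB"), expr_level = c(2.4, 5.1)),
  csv_path,
  row.names = FALSE
)

getwd()  # show the current working directory

# Read the CSV into a data frame and inspect it
expression_data <- read.csv(csv_path)
head(expression_data)

# For Excel files, one option is the readxl package (assumed to be installed):
# excel_data <- readxl::read_excel("my_file.xlsx")
```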
Some packages may come with their own functions to import specialised file types. Additionally, you can create your own objects in R, usually a data frame. This can be done by using one of the most common R functions, data.frame(). Columns and values are defined within the brackets, such as in the example below:
Figure 3: Example code that creates a dataframe (new_dataframe) with three columns, each containing personal information (name, age and score). Note that in this example, the = symbol is used to define the columns, rather than <-.
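A data frame like the one described in the caption could be written as follows. The names, ages and scores are made-up values for illustration and are not taken from the original figure.

```r
# Create a data frame with three columns; = names each column inside data.frame()
new_dataframe <- data.frame(
  name = c("Alice", "Bob", "Carol"),
  age = c(25, 30, 35),
  score = c(88, 92, 79)
)

print(new_dataframe)  # show the whole data frame
new_dataframe$score   # access a single column with $
```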
New objects created by your commands will also appear in the environment, alongside any variables or values that you have defined.
Deep dive: ggplot2
ggplot2 is a commonly used tool for data visualisation in the life sciences. Through a number of specific commands, you can tell ggplot2 exactly how to set up your figure. The package is built on the ‘grammar of graphics’ concept that assumes you can build a plot with standard components by adding layers to the figure. These components are data, a coordinate system and ‘geoms’.
All ggplot2 commands begin with the creation of a ‘ggplot’ – using the function ggplot(). In the brackets, you will specify the arguments that will distinguish your ggplot from others.
The first argument will be the dataset you wish to plot. You will define this by using the term data = yourdataname.
Next, you will usually use the mapping = aes() argument. This contains information about the ‘aesthetics’ of your plot. This can include defining your x and y coordinates (x = … and y = …), a colour scheme, or the sizes and shapes of your plot’s components. This will be specified within the brackets of the initial ggplot function.
Next, you will add your ‘geoms.’ Geometric objects, or geoms, are the graphical elements that represent the data points in your plot. Common geoms include geom_point() for scatter plots, geom_line() for line plots and geom_bar() for bar charts. You add geoms as layers to the plot using the + operator after the ggplot() brackets have been closed.
You can also add other layers such as labels and titles, which are added by using the + operator and the labs() function.
If you assign your plot to a name, you can view it in the Plots tab of the bottom right panel of RStudio by running a line containing just that name. You can also save the most recently displayed plot using the command ggsave(filename).
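Putting the layers described above together, a scatter plot might be built as in the sketch below. It assumes ggplot2 is installed and uses the built-in iris dataset as a stand-in for your own data. The object name scatter, the labels and the output file name are illustrative choices.

```r
library(ggplot2)

# Data and aesthetics go inside ggplot(); geoms and labels are added with +
scatter <- ggplot(
  data = iris,
  mapping = aes(x = Sepal.Length, y = Petal.Length, colour = Species)
) +
  geom_point() +
  labs(
    title = "Sepal length versus petal length",
    x = "Sepal length (cm)",
    y = "Petal length (cm)"
  )

# Running the object's name displays the plot
scatter

# Save the plot to a file; width and height are in inches by default
out_file <- file.path(tempdir(), "iris_scatter.png")
ggsave(out_file, plot = scatter, width = 6, height = 4)
```

Passing plot = scatter to ggsave() makes it explicit which figure is saved, rather than relying on the most recently displayed plot.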
Helpful in-built functions
Let’s take a look at some helpful in-built R functions.
R can perform some basic arithmetic operations.
- Adding: sum <- x + y
- Subtracting: difference <- x - y
- Multiplying: product <- x * y
- Dividing: quotient <- x / y
You can create a vector with the below command.
- vectorname <- c(arg1,arg2,arg3…)
Some basic plots can be built using the commands below.
- Scatter plot: plot(x, y)
- Histogram: hist(data)
- Bar plot: barplot(table(data))
And there are some basic statistical tests that come built into the base R installation.
- Two-sample t-test (Welch’s by default): t.test(sample1, sample2)
- Paired t-test: t.test(sample1, sample2, paired = TRUE)
- Chi-squared test of independence: chisq.test(data), where data is a contingency table of counts
- Wilcoxon rank-sum test: wilcox.test(sample1, sample2)
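The built-in functions listed above can be tried together in a short script. In the sketch below, all numbers are simulated or made up for illustration, and the variable names are arbitrary.

```r
# Basic arithmetic
x <- 10
y <- 4
total <- x + y
quotient <- x / y

# Create a vector with c()
doses <- c(0.5, 1, 2)

# Simulate two samples so the example is reproducible
set.seed(1)
sample1 <- rnorm(20, mean = 5)
sample2 <- rnorm(20, mean = 6)

# Basic plots
hist(sample1)
plot(sample1, sample2)

# Statistical tests on the two samples
welch_result <- t.test(sample1, sample2)
paired_result <- t.test(sample1, sample2, paired = TRUE)
ranksum_result <- wilcox.test(sample1, sample2)

# Chi-squared test on a made-up 2 x 2 contingency table of counts
counts <- matrix(
  c(30, 10, 20, 40),
  nrow = 2,
  dimnames = list(treatment = c("drug", "placebo"),
                  outcome = c("improved", "not improved"))
)
chisq_result <- chisq.test(counts)

# Each test returns an object; the p-value is stored in $p.value
welch_result$p.value
chisq_result$p.value
```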
R Project. 2023. Available at: https://www.r-project.org/
Tidyverse. 2023. ggplot2. Available at: https://ggplot2.tidyverse.org/
Cdhowe. Own work, CC BY-SA 4.0. Available at: https://commons.wikimedia.org/w/index.php?curid=101293607
Posit. 2023. RStudio. Available at: https://posit.co/download/rstudio-desktop/