Introduction to R and RStudio
R, RStudio, R-scripts and Quarto-documents
File names should end in .R and be meaningful.
GOOD:
predict_ad_revenue.R
BAD:
foo.R
Use underscores ( _ ) to separate words within a name (see more here: http://adv-r.had.co.nz/Style.html)
The maximum line length is 80 characters.
# This is to demonstrate that at about eighty characters you would move off of the page

# Also, if you have a very wide function
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg + bmi * hgt + wgt * hgt + wgt * hgt * bmi, data = boys)

# it would be nice to pose it as
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg + bmi * hgt
          + bmi * wgt + wgt * hgt + wgt * hgt * bmi, data = boys)

# or
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg + bmi * hgt
          + bmi * wgt
          + wgt * hgt
          + wgt * hgt * bmi,
          data = boys)
When indenting your code, use two spaces. RStudio does this for you!
Never use tabs or mix tabs and spaces.
Exception: When a line break occurs inside parentheses, align the wrapped line with the first character inside the parenthesis.
apply(boys,
      MARGIN = 2,
      FUN = length)
Place spaces around all binary operators (=, +, -, <-, etc.).
Exception: Spaces around =’s are optional when passing parameters in a function call.
lm(age ~ bmi, data=boys)
or
lm(age ~ bmi, data = boys)
Do not place a space before a comma, but always place one after a comma.
GOOD:
tab.prior <- as_tibble(d[d$x < 2, "x"])
total <- sum(d[, 1])
total <- sum(d[1, ])
BAD:
# Needs spaces around '<'
tab.prior <- table(df[df$days.from.opt<0, "campaign.id"])
# Needs a space after the comma
tab.prior <- table(df[df$days.from.opt < 0,"campaign.id"])
# Needs a space before <-
tab.prior<- table(df[df$days.from.opt < 0, "campaign.id"])
# Needs spaces around <-
tab.prior<-table(df[df$days.from.opt < 0, "campaign.id"])
# Needs a space after the comma
total <- sum(x[,1])
# Needs a space after the comma, not before
total <- sum(x[ ,1])
Place a space before a left parenthesis, except in a function call.
GOOD:
if (debug)
BAD:
if(debug)
Extra spacing (i.e., more than one space in a row) is okay if it improves alignment of equals signs or arrows (<-).
plot(x    = x.coord,
     y    = data.mat[, MakeColName(metric, ptiles[1], "roiOpt")],
     ylim = ylim,
     xlab = "dates",
     ylab = metric,
     main = (paste(metric, " for 3 samples ", sep = "")))
Do not place spaces around code in parentheses or square brackets.
Exception: Always place a space after a comma.
Use common sense and BE CONSISTENT.
The point of having style guidelines is to have a common vocabulary of coding.
If the code that you add to a script looks drastically different from the existing code around it, the discontinuity will throw readers out of their rhythm when they go to read it. Try to avoid this.
if (cond) {cons.expr} else {alt.expr}
for (var in seq) {expr}
Loops in R often happen under the hood, using apply functions:
apply(): apply a function to margins of a matrix
sapply(): apply a function to elements of a list, returns vector or matrix (if possible)
lapply(): apply a function to elements of a list, returns list
Operation of an if statement:
Source: datamentor.io
Code of an if statement:
value <- 3
if (value > 3) {                  # test expression
  print("Value greater than 3")   # body of if
}
Operation of an if-else statement:
Source: datamentor.io
Code of an if-else statement:
value <- 3
if (value > 3) {                       # test expression
  print("Value greater than three")    # body of if
} else {
  print("Value <= 3")                  # body of else
}
## [1] "Value <= 3"
Operation of an if-else if statement:
Source: CS161 oregonstate.edu
Code of an if-else if statement:
value <- 3
if (value > 3) {                  # condition 1
  print("Value greater than 3")   # condition 1 statements
} else if (value > 1) {           # condition 2
  print("Value greater than 1")   # condition 2 statements
} else if (value > 0) {           # condition 3
  print("Value greater than 0")   # condition 3 statements
}
## [1] "Value greater than 1"
You can also add an else at the end.
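As a minimal sketch of that final else branch: the helper name classify_value and the wording of the last message are made up for illustration and are not part of the original example.

```r
# Hypothetical helper: classify a number with an if / else if / else chain.
classify_value <- function(value) {
  if (value > 3) {
    label <- "Value greater than 3"
  } else if (value > 1) {
    label <- "Value greater than 1"
  } else {
    # The final else catches everything the earlier conditions missed.
    label <- "Value is 1 or smaller"
  }
  return(label)
}

classify_value(3)   # "Value greater than 1"
classify_value(0)   # "Value is 1 or smaller"
```

Because the final else has no condition, exactly one branch always runs.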
Remember our example from last time
example_vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
example_vector > 3
## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
example_vector[example_vector > 3]
## [1] 4 5 6 7 8 9
The computer keeps the value of the elements of example_vector if the corresponding elements in the condition (example_vector > 3) are TRUE.
For loops are used when we want to perform some repetitive calculations.
# Let's print the numbers 1 to 6 one by one.
print(1)
## [1] 1
print(2)
## [1] 2
print(3)
## [1] 3
print(4)
## [1] 4
print(5)
## [1] 5
print(6)
## [1] 6
For-loops allow us to automate this!
For each element of 1:6, print the element:
for (i in 1:6) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
You can use any variable name; i is a convention for counting/index.
for (some_var_name in 1:6) {
  print(some_var_name)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
Source: datacamp.com
example_vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
example_vector > 3
## [1] FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
example_vector[example_vector > 3]
## [1] 4 5 6 7 8 9
For each element in example_vector, keep the value if the corresponding element of the condition (example_vector > 3) is TRUE.
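To make the link between for-loops and logical subsetting explicit, here is a rough sketch that writes the same filtering step as a loop. The variable name kept is an illustrative choice; in practice the one-line subsetting above is the idiomatic form.

```r
example_vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Start with an empty result and add every element that passes the test.
kept <- c()
for (element in example_vector) {
  if (element > 3) {
    kept <- c(kept, element)
  }
}
kept

# The loop and the logical subsetting give the same result.
same_result <- identical(kept, example_vector[example_vector > 3])
same_result
```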
Often you don’t want to iterate over a range, but over an object
for (element in c("Amsterdam", "Rotterdam", "Eindhoven")) {
  print(element)
}
## [1] "Amsterdam"
## [1] "Rotterdam"
## [1] "Eindhoven"
for (element in c("Amsterdam", "Rotterdam", "Eindhoven")) {
  print(element)
  if (element == "Amsterdam") {
    print("Terrible football team.")
  } else {
    print("No comments.")
  }
}
## [1] "Amsterdam"
## [1] "Terrible football team."
## [1] "Rotterdam"
## [1] "No comments."
## [1] "Eindhoven"
## [1] "No comments."
Something a bit more useful
df <- data.frame("V1" = rnorm(5),
                 "V2" = rnorm(5, mean = 5, sd = 2),
                 "V3" = rnorm(5, mean = 6, sd = 1))
head(df)
##           V1       V2       V3
## 1 -1.1555706 3.987039 6.974957
## 2 -1.6462298 5.041256 5.808818
## 3  0.4646996 6.665295 4.938216
## 4 -0.8045881 4.739685 4.422943
## 5  1.0180784 7.675262 4.360817
Doing an operation on each column
for (col in names(df)) {
  print(col)
}
## [1] "V1"
## [1] "V2"
## [1] "V3"
for (col in names(df)) {
  print(col)
  print(mean(df[, col]))
}
## [1] "V1"
## [1] -0.4247221
## [1] "V2"
## [1] 5.621708
## [1] "V3"
## [1] 5.30115
Doing an operation on each row
for (row in 1:nrow(df)) {
  row_values <- df[row, ]
  print(row_values)
  print(sum(row_values > 5))
}
##          V1       V2       V3
## 1 -1.155571 3.987039 6.974957
## [1] 1
##         V1       V2       V3
## 2 -1.64623 5.041256 5.808818
## [1] 2
##          V1       V2       V3
## 3 0.4646996 6.665295 4.938216
## [1] 1
##           V1       V2       V3
## 4 -0.8045881 4.739685 4.422943
## [1] 0
##         V1       V2       V3
## 5 1.018078 7.675262 4.360817
## [1] 1
Do something forever until a condition is (not) met
i <- 0
while (i < 10) {
  i <- i + 1
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
More info on loops: https://www.datamentor.io/r-programming/break-next/
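The linked page covers break and next. As a small sketch of what they do (the numbers and the name visited are arbitrary choices for illustration): next skips to the following iteration, while break leaves the loop entirely.

```r
visited <- c()
for (i in 1:10) {
  if (i %% 2 == 0) {
    next    # skip even numbers and continue with the next i
  }
  if (i > 7) {
    break   # stop the whole loop once an odd i exceeds 7
  }
  visited <- c(visited, i)
}
visited
## [1] 1 3 5 7
```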
For loops in R are often slow compared with vectorized code.
Operations in R are much faster when applied at once to a whole vector:
example_vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
ifelse(example_vector > 5.5, "Pass", "Fail")
## [1] "Fail" "Fail" "Fail" "Fail" "Fail" "Pass" "Pass" "Pass" "Pass"
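For comparison, this is roughly what the same Pass/Fail labelling looks like as an explicit loop. It is a sketch to show that the two approaches agree, not a benchmark; the names result_loop and result_vectorized are illustrative.

```r
example_vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Loop version: pre-allocate the result, then fill it element by element.
result_loop <- character(length(example_vector))
for (i in seq_along(example_vector)) {
  if (example_vector[i] > 5.5) {
    result_loop[i] <- "Pass"
  } else {
    result_loop[i] <- "Fail"
  }
}

# Vectorized version: one call handles the whole vector.
result_vectorized <- ifelse(example_vector > 5.5, "Pass", "Fail")

identical(result_loop, result_vectorized)
## [1] TRUE
```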
The apply() family
The apply family is a group of very useful functions that allow you to easily execute a function of your choice over a list of objects, such as a list, a data.frame, or matrix.
We will look at three examples:
apply
sapply
lapply
There are more:
- vapply
- mapply
- rapply
- …
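Two of the other family members from the list above, sketched briefly with made-up example data (my_numbers is not part of the original material). vapply() works like sapply() but you state the expected type of each result, and mapply() walks over several vectors in parallel.

```r
my_numbers <- list(A = c(4, 2, 1), B = c(10, 20))

# vapply: like sapply, but FUN.VALUE declares that every result
# must be a single number; anything else raises an error.
means <- vapply(my_numbers, mean, FUN.VALUE = numeric(1))
means

# mapply: apply a function to the first elements of both vectors,
# then to the second elements, and so on.
sums <- mapply(function(x, y) x + y, 1:3, 4:6)
sums
## [1] 5 7 9
```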
apply()
apply is used for homogeneous matrices/dataframes. It applies a function to each row or column. It returns a vector or a matrix.
head(df, 1)
##          V1       V2       V3
## 1 -1.155571 3.987039 6.974957
Apply it by row (MARGIN = 1):
apply(df, MARGIN = 1, mean)
## [1] 3.268809 3.067948 4.022737 2.786013 4.351386
Apply it by column (MARGIN = 2):
apply(df, MARGIN = 2, mean)  # Identical to colMeans(df), which is much faster
##         V1         V2         V3
## -0.4247221  5.6217075  5.3011502
apply()
It doesn’t need to aggregate:
apply(df, MARGIN = 2, sqrt)
## Warning in FUN(newX[, i], ...): NaNs produced
##             V1       V2       V3
## [1,]       NaN 1.996757 2.641014
## [2,]       NaN 2.245274 2.410149
## [3,] 0.6816888 2.581723 2.222210
## [4,]       NaN 2.177082 2.103079
## [5,] 1.0089987 2.770426 2.088257
sapply()
sapply() is used on list-objects. It returns a vector or a matrix (if possible).
my_list <- list(A = c(4, 2, 1), B = "Hello.", C = TRUE)
sapply(my_list, class)
##           A           B           C
##   "numeric" "character"   "logical"
my_list <- list(A = c(4, 2, 1),
                B = c("hello", "Hello", "Aa", "aa"),
                C = c(FALSE, TRUE))
sapply(my_list, range)
##      A   B       C
## [1,] "1" "aa"    "0"
## [2,] "4" "Hello" "1"
Why is each element a character string?
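One way to investigate the question is to compare lapply(), which keeps each result separate, with sapply(), which tries to simplify the results into a single matrix. The sketch below reuses my_list from above. It relies on the fact that an R matrix can hold only one type, so mixed results are coerced to the most general one; the names ranges_list, element_classes and ranges_matrix are illustrative.

```r
my_list <- list(A = c(4, 2, 1),
                B = c("hello", "Hello", "Aa", "aa"),
                C = c(FALSE, TRUE))

# lapply keeps every result as its own list element, so types survive.
ranges_list <- lapply(my_list, range)
element_classes <- sapply(ranges_list, class)
element_classes   # one class per element: numeric, character, integer

# sapply simplifies the three ranges into one matrix. A matrix holds a
# single type, so the numbers are converted to character strings.
ranges_matrix <- sapply(my_list, range)
typeof(ranges_matrix)
## [1] "character"
```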
sapply()
Any data.frame is also a list, where each column is one list-element.
This means we can use sapply on data frames as well, which is often useful.
sapply(df, mean)
##         V1         V2         V3
## -0.4247221  5.6217075  5.3011502
lapply()
lapply() is exactly the same as sapply(), but it returns a list instead of a vector.
lapply(df, class)
## $V1
## [1] "numeric"
##
## $V2
## [1] "numeric"
##
## $V3
## [1] "numeric"
Functions are reusable pieces of code that
We have been using a lot of functions: code of the form something() is usually a function.
mean(1:6)
## [1] 3.5
We can make our own functions as follows:
squared <- function(x) {
  x.square <- x * x
  return(x.square)
}
squared(4)
## [1] 16
x, the input, is called the (formal) argument of the function. x.square is called the return value.
If there is no return(), the last line is automatically returned, so we can also just write:
square <- function(x) {
  x * x
}
square(-2)
## [1] 4
I do not recommend this. Please always specify what you return, unless you have a one-line function.
# Python (np.percentile expects a percentage from 0 to 100)
df.apply(lambda x: np.percentile(x, 42))
# R
sapply(df, function(x) quantile(x, .42))
##     V1.42%     V2.42%     V3.42%
## -0.9169025  4.9447532  4.7733288
is_contained <- function(str_1, str_2, print_input = TRUE) {
  if (print_input) {
    cat("Testing if", str_1, "contained in", str_2, "\n")
  }
  return(str_1 %in% str_2)
}
is_contained("R", "rstudio")
## Testing if R contained in rstudio
## [1] FALSE
is_contained("R", "rstudio", print_input = TRUE)
## Testing if R contained in rstudio
## [1] FALSE
is_contained("R", "rstudio", print_input = FALSE)
## [1] FALSE
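A note on what %in% inside is_contained() actually tests, as a small sketch: it checks whether the left-hand value equals one of the elements of the right-hand vector, not whether it is a substring. If a substring test is what you want, base R's grepl() with fixed = TRUE is one option; the strings below are illustrative.

```r
# %in% compares whole elements.
in_vector <- "R" %in% c("R", "RStudio")   # TRUE: "R" is one of the elements
in_string <- "R" %in% "RStudio"           # FALSE: "R" is not equal to "RStudio"

# grepl with fixed = TRUE tests for a literal substring (case-sensitive).
sub_upper <- grepl("R", "RStudio", fixed = TRUE)   # TRUE
sub_lower <- grepl("R", "rstudio", fixed = TRUE)   # FALSE: no uppercase R
```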
## Python
def square(x):
    """
    Squares a number

    Parameters:
    x (float): Number (or vector)

    Returns:
    float: Squared numbers
    """
    return x ** 2
## R (more info at https://r-pkgs.org/man.html)
#' Squares a number
#'
#' @param x A number.
#' @returns A numeric vector.
#' @examples
#' square(3)
square <- function(x) {
  x * x
}
Your first self-written for-loop, or function, will probably not work.
Don’t panic! Just go line-by-line, keeping track of what is currently inside each variable.
Stack Overflow and LLMs are your friends.
R
When you write the name of a variable, R needs to find the value.
In interactive computation (outside of functions, e.g., in your console), this happens in the following order:
search()
##  [1] ".GlobalEnv"        "package:lubridate" "package:forcats"
##  [4] "package:stringr"   "package:dplyr"     "package:purrr"
##  [7] "package:readr"     "package:tidyr"     "package:tibble"
## [10] "package:ggplot2"   "package:tidyverse" "package:stats"
## [13] "package:graphics"  "package:grDevices" "package:utils"
## [16] "package:datasets"  "package:methods"   "Autoloads"
## [19] "package:base"
The order of packages is important.
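A rough illustration of why the order matters, using only base R: R uses the first match it finds along the search path, and the global environment comes first. The replacement median below is a deliberately silly stand-in, and the result variable names are illustrative.

```r
# Define our own 'median' in the global environment.
median <- function(x) {
  "my own median"
}

# R finds our version first, because .GlobalEnv is searched before stats.
masked_result <- median(c(1, 2, 10))

# The :: operator skips the search path and asks the package directly.
explicit_result <- stats::median(c(1, 2, 10))

# After removing our version, the lookup reaches stats::median again.
rm(median)
restored_result <- median(c(1, 2, 10))

masked_result     # "my own median"
explicit_result   # 2
restored_result   # 2
```

The same thing happens between packages: when two attached packages export a function with the same name, the one attached later sits earlier on the search path and is used.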
R: Functions
Inside a function, this happens in the following order:
y <- 3
test_t <- function() {
  print(y)
}
test_t()
## [1] 3
y <- 3
test_t <- function() {
  y <- 2
  print(y)
}
test_t()
## [1] 2
R: Functions
What happens inside a function stays within the function (unless you specify it differently).
y <- 3
test_t <- function() {
  y <- 2
  print(y)
}
test_t()
## [1] 2
y
## [1] 3
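One way to "specify it differently" is the super-assignment operator <<-, which modifies a variable that already exists outside the function instead of creating a local copy. The sketch below is only an illustration (the function name set_outer_y is made up); changing outside variables from within a function is usually best avoided.

```r
y <- 3

# Hypothetical function: <<- changes the existing y outside the function.
set_outer_y <- function() {
  y <<- 2
}

set_outer_y()
y
## [1] 2
```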
R: Packages
Packages are neatly contained/isolated, so they are not affected by your code. They do so through namespaces:
::
dplyr::n_distinct(c(1,2,3,4,2))
## [1] 4
R: Packages (good practices)
BAD
shifted_mean <- function(numbers) {
  return(mean(numbers) + shift_by)
}
shift_by <- 3
shifted_mean(c(1, 2, 3))
GOOD
shifted_mean <- function(numbers, shift_by) {
  return(mean(numbers) + shift_by)
}
shift_by <- 3
shifted_mean(c(1, 2, 3), shift_by)
RStudio
purrr::map for convenience)
#) to clarify what you are doing
Each project uses specific versions of the packages.
What happens if the function that you are using is deprecated in a new version?
We should separate the packages we use in each project.
mamba / poetry (Python): Use virtual environments to compartmentalize projects
renv (R): Load the right version of packages when you open the project
mamba create -n my_cool_project python=3
mamba activate my_cool_project
mamba install jupyter pandas scipy
mamba env export -n my_cool_project > my_cool_project.yml
mamba env remove -n my_cool_project
mamba env create --file my_cool_project.yml
renv::init()
install.packages("tidyverse")
renv::snapshot()
renv::restore()
I just messed up something and closed the file. How do I go back?
Solutions:
git (e.g. provided by GitHub)
Workflow (for one person, not for teams):
R, RStudio, R-scripts and R-notebooks