Introduction to R and RStudio
R, RStudio, R-scripts and Quarto-documents
File names should end in .R and be meaningful.
GOOD:
predict_ad_revenue.R
BAD:
foo.R
Use underscores ( _ ) to separate words within a name (see more here: http://adv-r.had.co.nz/Style.html)
The maximum line length is 80 characters.
# This is to demonstrate that at about eighty characters you would move off of the page
# Also, if you have a very wide function
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg + bmi * hgt + bmi * wgt + wgt * hgt + wgt * hgt * bmi, data = boys)
# it would be nice to pose it as
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg + bmi * hgt
          + bmi * wgt + wgt * hgt + wgt * hgt * bmi, data = boys)
# or
fit <- lm(age ~ bmi + hgt + wgt + hc + gen + phb + tv + reg
          + bmi * hgt
          + bmi * wgt
          + wgt * hgt
          + wgt * hgt * bmi,
          data = boys)
When indenting your code, use two spaces. RStudio does this for you!
Never use tabs or mix tabs and spaces.
Exception: When a line break occurs inside parentheses, align the wrapped line with the first character inside the parenthesis.
apply(boys,
      MARGIN = 2,
      FUN = length)
Place spaces around all binary operators (=, +, -, <-, etc.).
Exception: Spaces around =’s are optional when passing parameters in a function call.
lm(age ~ bmi, data=boys)
or
lm(age ~ bmi, data = boys)
Do not place a space before a comma, but always place one after a comma.
GOOD:
tab.prior <- as_tibble(d[d$x < 2, "x"])
total <- sum(d[, 1])
total <- sum(d[1, ])
BAD:
# Needs spaces around '<'
tab.prior <- table(df[df$days.from.opt<0, "campaign.id"])
# Needs a space after the comma
tab.prior <- table(df[df$days.from.opt < 0,"campaign.id"])
# Needs a space before <-
tab.prior<- table(df[df$days.from.opt < 0, "campaign.id"])
# Needs spaces around <-
tab.prior<-table(df[df$days.from.opt < 0, "campaign.id"])
# Needs a space after the comma
total <- sum(x[,1])
# Needs a space after the comma, not before
total <- sum(x[ ,1])
Place a space before a left parenthesis, except in a function call.
GOOD:
if (debug)
BAD:
if(debug)
Extra spacing (i.e., more than one space in a row) is okay if it improves alignment of equals signs or arrows (<-).
plot(x    = x.coord,
     y    = data.mat[, MakeColName(metric, ptiles[1], "roiOpt")],
     ylim = ylim,
     xlab = "dates",
     ylab = metric,
     main = (paste(metric, " for 3 samples ", sep = "")))
Do not place spaces around code in parentheses or square brackets.
Exception: Always place a space after a comma.
Use common sense and BE CONSISTENT.
The point of having style guidelines is to have a common vocabulary of coding.
If the code that you add to a script looks drastically different from the existing code around it, the discontinuity will throw readers out of their rhythm when they go to read it. Try to avoid this.
if (cond) {cons.expr} else {alt.expr}
for (var in seq) {expr}
Loops in R often happen under the hood, using apply functions:
- apply(): apply a function to margins of a matrix
- sapply(): apply a function to elements of a list, returns vector or matrix (if possible)
- lapply(): apply a function to elements of a list, returns list
Operation of an if statement:
Source: datamentor.io
Code of an if statement:
value <- 3
if (value > 3) { # test expression
  print("Value greater than 3") # body of if
}
Operation of an if-else statement:
Source: datamentor.io
Code of an if-else statement:
value <- 3
if (value > 3) { # test expression
  print("Value greater than three") # body of if
} else {
  print("Value <= 3") # body of else
}
## [1] "Value <= 3"
Operation of an if-else if statement:
Source: CS161 oregonstate.edu
Code of an if-else if statement:
value <- 3
if (value > 3) { # condition 1
  print("Value greater than 3") # condition 1 statements
} else if (value > 1) { # condition 2
  print("Value greater than 1") # condition 2 statements
} else if (value > 0) { # condition 3
  print("Value greater than 0") # condition 3 statements
}
## [1] "Value greater than 1"
You can also add an else at the end.
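As a minimal sketch of that final else branch: the function name describe_value and the wording of the last message are assumptions made for illustration, not part of the original slides. The else block runs only when every condition above it is FALSE.

```r
# Hypothetical helper (not from the slides): label a single number.
describe_value <- function(value) {
  if (value > 3) {
    label <- "Value greater than 3"
  } else if (value > 1) {
    label <- "Value greater than 1"
  } else if (value > 0) {
    label <- "Value greater than 0"
  } else {
    # Reached only if all conditions above are FALSE
    label <- "Value is 0 or negative"
  }
  return(label)
}

describe_value(3)   # "Value greater than 1"
describe_value(-2)  # "Value is 0 or negative"
```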
Remember our example from last time
example_vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
example_vector > 3
## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
example_vector[example_vector > 3]
## [1] 4 5 6 7 8 9
The computer keeps the value of the elements of example_vector if the corresponding elements in the condition (example_vector > 3) are TRUE.
For loops are used when we want to perform some repetitive calculations.
# Let's print the numbers 1 to 6 one by one.
print(1)
## [1] 1
print(2)
## [1] 2
print(3)
## [1] 3
print(4)
## [1] 4
print(5)
## [1] 5
print(6)
## [1] 6
For-loops allow us to automate this!
For each element of 1:6, print the element:
for (i in 1:6) {
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
You can use any variable name; i is simply the conventional name for a counter or index.
for (some_var_name in 1:6) {
  print(some_var_name)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
Source: datacamp.com
example_vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
example_vector > 3
## [1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
example_vector[example_vector > 3]
## [1] 4 5 6 7 8 9
For each element in example_vector, keep the value if the corresponding element of the condition (example_vector > 3) is TRUE.
Often you don’t want to iterate over a range, but over an object:
for (element in c("Amsterdam", "Rotterdam", "Eindhoven")) {
  print(element)
}
## [1] "Amsterdam"
## [1] "Rotterdam"
## [1] "Eindhoven"
for (element in c("Amsterdam", "Rotterdam", "Eindhoven")) {
  print(element)
  if (element == "Amsterdam") {
    print("Terrible football team.")
  } else {
    print("No comments.")
  }
}
## [1] "Amsterdam"
## [1] "Terrible football team."
## [1] "Rotterdam"
## [1] "No comments."
## [1] "Eindhoven"
## [1] "No comments."
Something a bit more useful
df <- data.frame("V1" = rnorm(5),
                 "V2" = rnorm(5, mean = 5, sd = 2),
                 "V3" = rnorm(5, mean = 6, sd = 1))
head(df)
##           V1       V2       V3
## 1 -1.1555706 3.987039 6.974957
## 2 -1.6462298 5.041256 5.808818
## 3  0.4646996 6.665295 4.938216
## 4 -0.8045881 4.739685 4.422943
## 5  1.0180784 7.675262 4.360817
Doing an operation on each column
for (col in names(df)) {
  print(col)
}
## [1] "V1"
## [1] "V2"
## [1] "V3"
for (col in names(df)) {
  print(col)
  print(mean(df[, col]))
}
## [1] "V1"
## [1] -0.4247221
## [1] "V2"
## [1] 5.621708
## [1] "V3"
## [1] 5.30115
Doing an operation on each row
for (row in 1:nrow(df)) {
  row_values <- df[row, ]
  print(row_values)
  print(sum(row_values > 5))
}
##          V1       V2       V3
## 1 -1.155571 3.987039 6.974957
## [1] 1
##         V1       V2       V3
## 2 -1.64623 5.041256 5.808818
## [1] 2
##          V1       V2       V3
## 3 0.4646996 6.665295 4.938216
## [1] 1
##           V1       V2       V3
## 4 -0.8045881 4.739685 4.422943
## [1] 0
##         V1       V2       V3
## 5 1.018078 7.675262 4.360817
## [1] 1
Do something forever until a condition is (not) met
i <- 0
while (i < 10) {
  i <- i + 1
  print(i)
}
## [1] 1
## [1] 2
## [1] 3
## [1] 4
## [1] 5
## [1] 6
## [1] 7
## [1] 8
## [1] 9
## [1] 10
More info on loops: https://www.datamentor.io/r-programming/break-next/
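The linked page covers break and next. As a rough sketch of what they do (the variable names here are chosen for illustration only): next skips the rest of the current iteration, and break leaves the loop altogether.

```r
# next: skip the rest of the current iteration (here: skip even numbers)
odd_numbers <- c()
for (i in 1:10) {
  if (i %% 2 == 0) {
    next
  }
  odd_numbers <- c(odd_numbers, i)
}
odd_numbers  # 1 3 5 7 9

# break: leave the loop entirely once a condition is met
counter <- 0
while (TRUE) {
  counter <- counter + 1
  if (counter >= 10) {
    break
  }
}
counter  # 10
```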
For loops in R are often slow compared to vectorized code.
Operations in R are much faster when applied at once to a whole vector:
example_vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
ifelse(example_vector > 5.5, "Pass", "Fail")
## [1] "Fail" "Fail" "Fail" "Fail" "Fail" "Pass" "Pass" "Pass" "Pass"
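To make the comparison concrete, here is a small sketch that computes the same Pass/Fail labels once with an explicit loop and once with ifelse(). The names result_loop and result_vectorized are illustrative. No timings are claimed here; the point is only that both routes give the same answer, while the vectorized call is a single line.

```r
example_vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)

# Loop version: fill a pre-allocated result one element at a time
result_loop <- character(length(example_vector))
for (i in seq_along(example_vector)) {
  if (example_vector[i] > 5.5) {
    result_loop[i] <- "Pass"
  } else {
    result_loop[i] <- "Fail"
  }
}

# Vectorized version: one call handles the whole vector
result_vectorized <- ifelse(example_vector > 5.5, "Pass", "Fail")

identical(result_loop, result_vectorized)  # TRUE
```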
apply() family
apply()
The apply family is a group of very useful functions that allow you to easily execute a function of your choice over a collection of objects, such as a list, a data.frame, or a matrix.
We will look at three examples:
- apply
- sapply
- lapply
There are more:
- vapply
- mapply
- rapply
- …
apply()
apply is used for homogeneous matrices/dataframes. It applies a function to each row or column. It returns a vector or a matrix.
head(df, 1)
##          V1       V2       V3
## 1 -1.155571 3.987039 6.974957
Apply it by row (MARGIN = 1):
apply(df, MARGIN = 1, mean)
## [1] 3.268809 3.067948 4.022737 2.786013 4.351386
Apply it by column (MARGIN = 2):
apply(df, MARGIN = 2, mean) # Identical to colMeans(df), which is much faster
##         V1         V2         V3
## -0.4247221  5.6217075  5.3011502
apply()
It doesn’t need to aggregate:
apply(df, MARGIN = 2, sqrt)
## Warning in FUN(newX[, i], ...): NaNs produced
##             V1       V2       V3
## [1,]       NaN 1.996757 2.641014
## [2,]       NaN 2.245274 2.410149
## [3,] 0.6816888 2.581723 2.222210
## [4,]       NaN 2.177082 2.103079
## [5,] 1.0089987 2.770426 2.088257
sapply()
sapply() is used on list-objects. It returns a vector or a matrix (if possible).
my_list <- list(A = c(4, 2, 1), B = "Hello.", C = TRUE)
sapply(my_list, class)
##           A           B           C
##   "numeric" "character"   "logical"
my_list <- list(A = c(4, 2, 1), B = c("hello", "Hello", "Aa", "aa"), C = c(FALSE, TRUE))
sapply(my_list, range)
##      A   B       C
## [1,] "1" "aa"    "0"
## [2,] "4" "Hello" "1"
Why is each element a character string?
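One way to explore the question is to compare sapply() with lapply() on the same list. The sketch below (variable names are illustrative) relies on the fact that a matrix can hold only one type: when sapply() simplifies the three ranges into a matrix, the numeric and logical results are coerced to character to match column B.

```r
my_list <- list(A = c(4, 2, 1),
                B = c("hello", "Hello", "Aa", "aa"),
                C = c(FALSE, TRUE))

# lapply() keeps each result in its own list element, so types are preserved
ranges_list <- lapply(my_list, range)
class(ranges_list$A)  # "numeric"
class(ranges_list$B)  # "character"

# sapply() simplifies to a matrix; a matrix has a single type,
# so every entry is coerced to character
ranges_matrix <- sapply(my_list, range)
class(ranges_matrix[, "A"])  # "character"
```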
sapply()
Any data.frame is also a list, where each column is one list-element.
This means we can use sapply on data frames as well, which is often useful.
sapply(df, mean)
##         V1         V2         V3
## -0.4247221  5.6217075  5.3011502
lapply()
lapply() works like sapply(), but it always returns a list instead of simplifying the result to a vector or matrix.
lapply(df, class)
## $V1
## [1] "numeric"
##
## $V2
## [1] "numeric"
##
## $V3
## [1] "numeric"
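Two of the other family members listed earlier, vapply() and mapply(), can be sketched briefly. The data frame below is regenerated with an arbitrary seed (an assumption, so its values differ from the output shown in the slides). vapply() behaves like sapply() but requires you to declare the type and length of each result; mapply() walks over several vectors in parallel.

```r
set.seed(1)  # assumption: arbitrary seed, only to make the sketch reproducible
df <- data.frame(V1 = rnorm(5),
                 V2 = rnorm(5, mean = 5, sd = 2),
                 V3 = rnorm(5, mean = 6, sd = 1))

# vapply(): each result must be a single number, otherwise R raises an error
col_means <- vapply(df, mean, FUN.VALUE = numeric(1))

# mapply(): apply a function to the elements of two vectors, pair by pair
pair_sums <- mapply(function(a, b) a + b, 1:3, 4:6)
pair_sums  # 5 7 9
```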
Functions are reusable pieces of code that take input (arguments) and give back a return value.
We have been using a lot of functions: code of the form something() is usually a function.
mean(1:6)
## [1] 3.5
We can make our own functions as follows:
squared <- function(x) {
  x.square <- x * x
  return(x.square)
}
squared(4)
## [1] 16
x, the input, is called the (formal) argument of the function. x.square is called the return value.
If there is no return(), the value of the last evaluated expression is returned automatically, so we can also just write:
square <- function(x) {
  x * x
}
square(-2)
## [1] 4
I do not recommend this. Please always specify what you return, unless you have a one-line function.
# Python (np.percentile expects a percentage between 0 and 100)
df.apply(lambda x: np.percentile(x, 42))
# R
sapply(df, function(x) quantile(x, .42))
##     V1.42%     V2.42%     V3.42%
## -0.9169025  4.9447532  4.7733288
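An anonymous function is just a function that was never given a name. The sketch below shows the same computation with a named helper and with an anonymous function passed inline; the helper name quantile_42 and the seed are assumptions for illustration. In R 4.1 and later, \(x) can be written as shorthand for function(x).

```r
set.seed(1)  # assumption: arbitrary seed, only to make the sketch reproducible
df <- data.frame(V1 = rnorm(5),
                 V2 = rnorm(5, mean = 5, sd = 2))

# Named helper (hypothetical name, for illustration only)
quantile_42 <- function(x) {
  return(quantile(x, .42))
}

named_result <- sapply(df, quantile_42)

# Same computation with an anonymous function defined inline
anonymous_result <- sapply(df, function(x) quantile(x, .42))

identical(named_result, anonymous_result)  # TRUE
```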
is_contained <- function(str_1, str_2, print_input = TRUE) {
  if (print_input) {
    cat("Testing if", str_1, "contained in", str_2, "\n")
  }
  # %in% checks whether str_1 equals an element of str_2; it is not a substring search
  return(str_1 %in% str_2)
}
is_contained("R", "rstudio")
## Testing if R contained in rstudio
## [1] FALSE
is_contained("R", "rstudio", print_input = TRUE)
## Testing if R contained in rstudio
## [1] FALSE
is_contained("R", "rstudio", print_input = FALSE)
## [1] FALSE
## Python
def square(x):
    """
    Squares a number

    Parameters:
    x (float): Number (or vector)

    Returns:
    float: Squared numbers
    """
    return x**2
## R (more info at https://r-pkgs.org/man.html)
#' Squares a number
#'
#' @param x A number.
#' @returns A numeric vector.
#' @examples
#' square(3)
square <- function(x) {
  x * x
}
Your first self-written for-loop, or function, will probably not work.
Don’t panic! Just go line-by-line, keeping track of what is currently inside each variable.
Stack Overflow and LLMs are your friends.
R
When you write the name of a variable, R needs to find the value.
In interactive use (outside of functions, e.g., in your console), this happens in the following order:
search()
##  [1] ".GlobalEnv"        "package:lubridate" "package:forcats"
##  [4] "package:stringr"   "package:dplyr"     "package:purrr"
##  [7] "package:readr"     "package:tidyr"     "package:tibble"
## [10] "package:ggplot2"   "package:tidyverse" "package:stats"
## [13] "package:graphics"  "package:grDevices" "package:utils"
## [16] "package:datasets"  "package:methods"   "Autoloads"
## [19] "package:base"
The order of packages is important.
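The order matters because R uses the first match it finds along the search path, so an object earlier in the list masks one with the same name later in the list. A minimal sketch using only base R: defining our own sd in the global environment (a deliberately bad idea, shown for illustration) hides stats::sd until it is removed.

```r
# sd() normally comes from package:stats
environmentName(environment(sd))  # "stats"

# Our own sd in the global environment masks it, because
# .GlobalEnv comes first on the search path
sd <- function(x) {
  return("my own sd")
}
masked_result <- sd(c(1, 2, 3))           # "my own sd"

# :: skips the search path and goes straight to the package
explicit_result <- stats::sd(c(1, 2, 3))  # 1

# Remove our version so that stats::sd is found again
rm(sd)
restored_result <- sd(c(1, 2, 3))         # 1
```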
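R: Functions
Inside a function, R first looks at the variables defined in the function itself, and only then in the environment where the function was defined (here, the global environment):
y <- 3
test_t <- function() {
  print(y)
}
test_t()
## [1] 3
y <- 3
test_t <- function() {
  y <- 2
  print(y)
}
test_t()
## [1] 2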
R: Functions
What happens inside a function stays within the function (unless you specify it differently).
y <- 3
test_t <- function() {
  y <- 2
  print(y)
}
test_t()
## [1] 2
y
## [1] 3
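One way to "specify it differently" is the super-assignment operator <<-, which assigns to an existing variable outside the function instead of creating a local one. Whether this is what the slide had in mind is an assumption; the function names below are illustrative. The sketch contrasts it with ordinary local assignment.

```r
y <- 3

test_local <- function() {
  y <- 2    # creates a new y that exists only inside the function
  return(y)
}

test_global <- function() {
  y <<- 2   # assigns to the y found outside the function
  return(y)
}

test_local()
y_after_local <- y    # still 3

test_global()
y_after_global <- y   # now 2
```

Because <<- changes state outside the function, it makes code harder to follow; passing values in as arguments and returning results is usually easier to reason about.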
R: Packages
Packages are neatly contained/isolated, so they are not affected by your code.
They do so through namespaces. Use :: to call a function from a specific namespace:
dplyr::n_distinct(c(1, 2, 3, 4, 2))
## [1] 4
R: Packages (good practices)
BAD
shifted_mean <- function(numbers) {
  return(mean(numbers) + shift_by)
}
shift_by <- 3
shifted_mean(c(1, 2, 3))
GOOD
shifted_mean <- function(numbers, shift_by) {
  return(mean(numbers) + shift_by)
}
shift_by <- 3
shifted_mean(c(1, 2, 3), shift_by)
RStudio
(purrr::map for convenience)
Comments (#) to clarify what you are doing
Each project uses specific versions of the packages.
What happens if the function that you are using is deprecated in a new version?
We should separate the packages we use in each project.
mamba / poetry (Python): Use virtual environments to compartmentalize projects
renv (R): Load the right version of packages when you open the project
mamba create -n my_cool_project python=3
mamba activate my_cool_project
mamba install jupyter pandas scipy
mamba env export -n my_cool_project > my_cool_project.yml
mamba env remove -n my_cool_project
mamba env create --file my_cool_project.yml
renv::init()
install.packages("tidyverse")
renv::snapshot()
renv::restore()
I just messed up something and closed the file. How do I go back?
Solutions:
git (e.g. provided by GitHub)
Workflow (for one person, not for teams):