R Bootcamp Day 1: Introduction to R
Goals
By the end of this tutorial, you will know:
- Basic R usage (using R as a calculator, creating variables, indexing)
- Basic types of data in R
- How to get values out of the rows and columns in your data
- Basic iteration (for loops and functions)
The best way to do this tutorial is to walk through it slowly, executing each line and trying to understand what it does. You can execute a single line by placing your cursor somewhere in that line and hitting CTRL+enter on a PC or CMD+enter on a Mac; you can execute a whole chunk at a time by hitting CTRL+shift+enter on a PC or CMD+shift+enter on a Mac.
1. Basic R Use
1.1. Basic Computing
R can be used as a calculator.
You can do basic arithmetic:
# Add
2 + 3
## [1] 5
# subtract
2 - 3
## [1] -1
# multiplication
2 * 3
## [1] 6
# Division
10/2
## [1] 5
# Exponents
4^2
## [1] 16
# Roots
4^(1/2)
## [1] 2
Thankfully, R follows the order of operations (PEMDAS).
(2^3)+4*(5/3)
## [1] 14.66667
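To see how precedence plays out, here is a small sketch with a few made-up expressions (they are only for illustration). One case that trips people up is that the exponent is applied before a leading minus sign:

```r
2 + 3 * 4    # 14: multiplication happens before addition
(2 + 3) * 4  # 20: parentheses force the addition to happen first
-2^2         # -4: the exponent is applied first, so this is -(2^2)
(-2)^2       # 4: wrap the negative number in parentheses to square it
```

When in doubt, adding parentheses costs nothing and makes the intent obvious.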
These values aren’t stored anywhere though.
1.2 Storing values in objects
To keep a value in memory, we need to assign it to an object.
R has two assignment operators: = and <-. Let’s use these to create an object called x that is assigned the number 8.
x <- 8
x = 8
Now we can call x by name; if we do this, it will print the value.
x
## [1] 8
Although <- and = both work, we’ll strictly use <- from here on out. This brings up a general coding principle of the day:
Style is important! You want to use a consistent style so that others (including your future self) can easily and quickly read your code. I’ll be using Hadley Wickham’s Style Guide from here on out, part of which is assigning variables with <-. Remember, the keyboard shortcut Alt+- or Option+- can be used to insert a <-.
Now, let’s take a look at the variable y:
# Call y
y
D’oh! We didn’t assign anything to y. Calling a variable that doesn’t exist leads to an error, so we need to assign something to y first:
# Need to assign something to y!
y <- 2
y
## [1] 2
Now, we can use these variables in calculations:
# add em up:
x + y
## [1] 10
# multiply them
x * y
## [1] 16
But remember, if we want to save those calculations, we need to assign them:
# define the product as z
z <- x * y
1.3 Naming Variables
You may be wondering how to name variables in R. Well, there are just a few rules:
- Object names must start with a letter (or a dot that is not followed by a number).
- Object names can contain alphanumeric characters, "_", and ".".
- R is case sensitive, so R and r would be different variables.
This is another element of style. Using a consistent style when naming variables makes your life easier.
Some options include:
- snake_case
- SCREAMING_SNAKE (but why)
- camelCase
- you.can.use.periods (but don’t!)
The . should be avoided because it is used to define methods in R, and you could end up conflicting with a method name.
I’ll be using snake_case (again see Hadley Wickham’s Style Guide), but you can change to another option if you’d like.
this_is_snake_case <- 8 # a_fine_choice
thisIsCamelCase <- 8 #alsoTotallyWorks
some.maniacs.use.periods <- 8 # probably.best.avoided
And_someTimes.People_doSTRANGEtHiNgS <- 8 # PLEASE NEVER DO ANYTHING LIKE THIS!
Exercise 1.3a.
Create a variable called var_1 that is equal to 4. Then create a variable called var_2 that is var_1 raised to the power of 6. Print the results.
2. Types of Data
2.1. Types of Values
R has different types of data, and an object’s type affects how it interacts with functions and other objects. So far, we’ve just been working with whole numbers (which R stores as doubles unless you add an L suffix, as in 8L). Here are the main types of data in R:
Type | Definition | Example |
---|---|---|
Integer | whole numbers stored as 32-bit integers (roughly ±2.1 billion), written with an L suffix | 1L, 2L, -1L, -10928L |
Numeric / Double | fractions & decimals from -Inf to +Inf | 1.3, .293764, -.1239217, Inf , -Inf |
Character / String | quoted strings of letters, numbers, and allowed symbols | “1”, “one”, “o_n_e”, “o-n-e”, “o.n.e” |
Logical / Boolean | logical constants TRUE or FALSE | TRUE, FALSE, T, F |
Factor | categorical variable with labelled levels (optionally ordered) | variable for year in college labelled “Freshman”, “Sophomore”, etc. |
There are a few special types worth knowing about too:
Type | Definition | Example |
---|---|---|
NA | Missing value; technically represented as different types but displayed as NA | NA, NA_character_, NA_integer_ |
NaN | ‘not a number’ | 0/0 |
+Inf | positive infinity | Inf |
-Inf | negative infinity | -Inf |
NULL | The NULL object; an object that exists but is completely empty | NULL |
You can use typeof() to find out the type of a value or object:
typeof(1)
## [1] "double"
typeof(TRUE)
## [1] "logical"
typeof(1L)
## [1] "integer"
typeof(NA_character_)
## [1] "character"
typeof(NA)
## [1] "logical"
typeof(NULL)
## [1] "NULL"
typeof(NaN)
## [1] "double"
typeof(Inf)
## [1] "double"
typeof("one")
## [1] "character"
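As a complement to typeof(), base R also has a family of is.*() functions that answer a yes/no question about a value's type. The following is a minimal sketch using values chosen only for illustration:

```r
is.integer(8)         # FALSE: a bare 8 is stored as a double
is.integer(8L)        # TRUE: the L suffix asks for an integer
is.numeric(8L)        # TRUE: integers and doubles both count as numeric
is.character("8")     # TRUE: quotes make it a string
.Machine$integer.max  # the largest value an integer can hold
```

These checks are handy later on when a function complains about receiving the wrong kind of input.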
2.2 Vectors
Often, we’re not working with individual values, but with multiple related values or a vector of values.
2.2.1 Creating vectors
We can create a vector of ordered numbers using starting_number:ending_number. For example, we could make x a vector with the numbers between 1 and 5:
x <- 1:5 # sequence of the form starting_number:ending_number
You’ll notice that x has changed in the environment pane. Since it’s a vector, it tells us what type of vector it is and its length in addition to its contents (which can be abbreviated if the object is larger). In this case, it’s a vector of integers (hence int) and is of length 5 (hence [1:5]).
We can also create a sequence of numbers using the function seq(). Like all R functions, you can learn more about it by accessing the help documentation (?function_name). Let’s take a look:
?seq
Interpreting R help documentation is a skill in itself, but a very worthwhile one.
What happens if we run seq() with no arguments?
seq()
## [1] 1
2.2.1.Q1
Why did this happen?
A:
To make a sequence from 1 to 5 with this function, we have to set the from = and to = arguments to those boundaries:
seq(from = 1, to = 5)
## [1] 1 2 3 4 5
We can also set one or more of the other arguments. The by = argument allows us to change the increment of the sequence. Let’s say we wanted every other number between 1 and 5. We can do this by setting by = 2:
seq(from = 1, to = 5, by = 2)
## [1] 1 3 5
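The other arguments work the same way. As a small sketch (the object names three_values and countdown are made up for this example), length.out asks for a fixed number of evenly spaced values, and a negative by counts downward:

```r
# Ask for exactly 3 evenly spaced values between 1 and 5
three_values <- seq(from = 1, to = 5, length.out = 3)
three_values

# Count down from 10 to 0 by using a negative increment
countdown <- seq(from = 10, to = 0, by = -5)
countdown
```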
We can also create a vector of any numbers we want using the c() function. You do this by putting numbers separated by a , within the parentheses.
# We can also assign a vector of values this way
x <- c(2, 8.5, 1, 9)
Note that it became a numeric vector (hence num), which is different from a vector of integers in that it allows decimal points. This brings us to an important feature of vectors:
Vectors are just 1-dimensional sequences of a single type of data. Because 1:5 was just integers, R stored the vector as int. Since we have 8.5, all of the values will be stored as numeric, even the values that could be integers.
The general rule R uses is to set the vector to be the most permissive type necessary.
Note that vectors can also include strings or character values.
letters <- c("a", "b", "c", "d")
What happens if we put the vectors x and letters together? Let’s take a look:
mixed_vec <- c(x, letters)
mixed_vec
## [1] "2" "8.5" "1" "9" "a" "b" "c" "d"
Notice the quotes? R turned all of our numbers into strings, since strings are more permissive than numbers. You can see that by running typeof():
typeof(mixed_vec)
## [1] "character"
This is called coercion. R coerces a vector into whichever type will accommodate all of the values. We can also coerce a vector ourselves using the as. family of functions. Let’s coerce mixed_vec into a numeric:
as.numeric(mixed_vec)
## Warning: NAs introduced by coercion
## [1] 2.0 8.5 1.0 9.0 NA NA NA NA
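To make the "most permissive type" rule concrete, here is a short sketch with made-up values. It walks up the ordering from logical to integer to double to character, and then shows two more members of the as. family:

```r
typeof(c(TRUE, 2L))         # "integer": logicals become 1 and 0
typeof(c(2L, 8.5))          # "double": integers become doubles
typeof(c(8.5, "a"))         # "character": numbers become strings

as.integer(c(TRUE, FALSE))  # 1 0
as.character(8.5)           # "8.5"
```

Coercing "up" this ordering always works; coercing "down" (as with as.numeric() above) can produce NA when a value has no sensible translation.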
Exercise 2.2.1a
Create an object called x that is assigned the number 8. Create an object called y that is a sequence of numbers from 2 to 16, by 2. Add x and y. What happens?
Exercise 2.2.1b
Create an object called z that is a vector containing the numbers 5 and 10. Create an object called q that is a sequence of numbers from 5 to 25, by 5. Multiply q and z. What happens?
Exercise 2.2.1c
Create an object called a that is just the letter “a”. Add a to x (from above). What happens?
Exercise 2.2.1d
Create a vector called b that is just the number 8 in quotes. Add b to x (from above). What happens?
Exercise 2.2.1e
Find the type of each of the objects we just made (x, y, q, z, b).
Exercise 2.2.1f
Find some way to add b to x (hint: you may need to be coercive).
2.2.2 Indexing vectors
2.2.2.1 Numerical Indexing
Vectors are all indexed numerically in R, starting with 1. Since vectors are only 1 dimension, we index them with a single number in []. Let’s get just the first element of a new vector z:
z <- 6:10
z[1]
## [1] 6
What if we wanted the first two elements? We could do this by passing the sequence 1:2 as our index:
z[1:2]
## [1] 6 7
Or, we could get the first and third elements by passing those indexes as a vector using c():
z[c(1, 3)]
## [1] 6 8
We could also say which elements not to give us. Let’s get rid of the first element:
z[-1]
## [1] 7 8 9 10
Or the first and third:
z[-c(1, 3)]
## [1] 7 9 10
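Two details about numerical indexing are easy to miss, so here is a minimal sketch using a hypothetical vector called scores (it is not used elsewhere in this tutorial). Indexing returns a new vector and leaves the original alone, and asking for a position past the end gives NA rather than an error:

```r
scores <- c(90, 85, 70, 60)  # hypothetical data for illustration

scores[-1]      # 85 70 60: everything except the first element
scores          # 90 85 70 60: the original vector is unchanged
length(scores)  # 4
scores[10]      # NA: there is no tenth element
```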
2.2.2.2 Name Indexing
If the elements in the vector have names, we can refer to them by name instead of their numerical index. You can see the names of a vector using names():
names(z)
## NULL
Looks like the elements in z have no names. We can change that by assigning them names using a vector of names:
names(z) <- c("first", "second", "third", "fourth", "fifth")
And we can see the names:
names(z)
## [1] "first" "second" "third" "fourth" "fifth"
z
## first second third fourth fifth
## 6 7 8 9 10
And use them to subset certain elements:
z["first"]
## first
## 6
z[1]
## first
## 6
2.2.2.3 Manipulating elements in vectors
One thing that you can do with indexing is change elements within the vector. For example, we could change the first element of z to missing, or NA:
z[1] <- NA
z
## first second third fourth fifth
## NA 7 8 9 10
You can also use indexing to transform a single element in a vector. Let’s raise the second element to the power of 2:
z[2] <- z[2]^2
z
## first second third fourth fifth
## NA 49 8 9 10
Or use it to add to a vector. Let’s add the number 11 as the sixth element of the vector z:
z[6] <- 11
z
## first second third fourth fifth
## NA 49 8 9 10 11
And you can add a name for that element by referencing the index for the names attribute of z:
names(z)[6] <- "sixth"
z
## first second third fourth fifth sixth
## NA 49 8 9 10 11
Exercise 2.2.2a.
Create a vector called named that includes the numbers 1 to 5. Name each value a, b, c, d, and e (in order). Print the first element using numerical indexing and the last element using name indexes.
Exercise 2.2.2b.
Change the third element of named to the value 21 and then show your results.
2.3 Lists
Vectors are great for storing a single type of data, but what if we have a variety of different kinds of data we want to store together? For example, let’s say I want to store the year I am in the PhD program (a number), my name (a string), and my enrollment status (a logical) in a single object that preserves these different types. A vector won’t work, but a list will.
Lists are like vectors in that they are a 1-dimensional object, but they can contain heterogeneous data.
2.3.1 Creating Lists
We can create a list with the list() function:
cory <- list(6L,
"Cory Costello",
TRUE)
cory
## [[1]]
## [1] 6
##
## [[2]]
## [1] "Cory Costello"
##
## [[3]]
## [1] TRUE
And, we can give each of these entries names to make things a little easier to keep track of:
cory <- list(year = 6L,
name = "Cory Costello",
enrollment = TRUE)
cory
## $year
## [1] 6
##
## $name
## [1] "Cory Costello"
##
## $enrollment
## [1] TRUE
Notice that the element indices [[1]], [[2]], and [[3]] have been replaced by the names year, name, and enrollment. You can also see the names of a list by running names() on it:
names(cory)
## [1] "year" "name" "enrollment"
Lists are even more flexible than you’ve seen so far. In addition to being of heterogeneous type, each element of a list can be of different dimensions. Let’s add another element to the list about me that contains my favorite types of data, which will be a vector of length 2:
cory <- list(year = 6L,
name = "Cory Costello",
enrollment = TRUE,
fave_data = c("strings", "numbers")) # note c() to create a vector
cory
## $year
## [1] 6
##
## $name
## [1] "Cory Costello"
##
## $enrollment
## [1] TRUE
##
## $fave_data
## [1] "strings" "numbers"
2.3.2 Indexing Lists
Like vectors, lists can be indexed by their name or their position (numerically).
2.3.2.1 Indexing Lists Numerically
For example, if we wanted the year variable, we could get it out using its position as the first element of my list:
cory[1]
## $year
## [1] 6
Now let’s say we want to know my PhD year in dog years. Let’s see if we can get that by multiplying the year element by 7:
cory[1] * 7
Okay, we get a pretty confusing error that we supplied a non-numeric argument, which in this case is cory[1]. This happened because single-bracket indexing on a list produces a list, and lists can’t be multiplied by a number (at least not with just *). If we want the actual object stored at the first position instead of a list with that object, we have to use the double-bracket indexing list[[i]]:
cory[[1]]
## [1] 6
Notice it no longer has the $year; the $label is a hint that you’re looking at a list and not the object stored at that position. Now let’s see my PhD year in dog years:
cory[[1]] * 7
## [1] 42
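If the difference between the two bracket styles still feels slippery, you can ask R directly what each one returns. This sketch uses a hypothetical list called person so it doesn't touch the cory object:

```r
person <- list(year = 2L, name = "Ada")  # hypothetical list for illustration

is.list(person[1])    # TRUE: single brackets return a (smaller) list
is.list(person[[1]])  # FALSE: double brackets return the stored object
typeof(person[[1]])   # "integer"
length(person[1:2])   # 2: single brackets can pull out several elements
```

A rough rule of thumb: use [ ] when you want a sub-list and [[ ]] when you want the thing inside.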
2.3.2.2 Indexing Lists by Name
The same applies to name indexing. With lists, you can get a list containing the indexed object with []:
cory["year"]
## $year
## [1] 6
And double brackets [[]] can be used to get the object stored with that name:
cory[["year"]]
## [1] 6
You can also use list$name to get the object stored with a particular name:
cory$year
## [1] 6
2.3.2.3 Manipulating Lists with Indexes
Just like vectors, we can change or add elements to our list using indexing. For example, let’s save the dog years transformation of the year element as dog_years:
cory[["dog_years"]] <- cory[["year"]] * 7
cory
## $year
## [1] 6
##
## $name
## [1] "Cory Costello"
##
## $enrollment
## [1] TRUE
##
## $fave_data
## [1] "strings" "numbers"
##
## $dog_years
## [1] 42
This would also work:
cory$dog_years <- cory$year*7
cory
## $year
## [1] 6
##
## $name
## [1] "Cory Costello"
##
## $enrollment
## [1] TRUE
##
## $fave_data
## [1] "strings" "numbers"
##
## $dog_years
## [1] 42
Exercise 2.3.2a.
Create a list like mine that is made up of year, name, enrollment, and fave_data, but corresponds to you (professors or postdocs can use year in position, years since PhD, or make a number up). Make sure you enter two types of favorite data (because who could choose?!). Make it print your name.
2.3.3 Indexing Objects within Lists
As we saw with the object fave_data stored in the list cory, objects within lists can have different dimensions and lengths. What if we wanted just one element of an object in a list, such as just the second element of fave_data? We can use indexing on the fave_data vector stored within the cory list by chaining indexes.
We could do that with numbers:
cory[[4]][2]
## [1] "numbers"
Or with either type of list name indexing:
cory[["fave_data"]][2]
## [1] "numbers"
cory$fave_data[2]
## [1] "numbers"
Exercise 2.3.3a
Using the same list, get your PhD year in dog years (*7) and assign it to the 2nd position of the year object. Then show the new year object.
2.4. Matrices & Arrays
Matrices are 2-dimensional objects that store a single type of data, so you can think of them like a 2-dimensional vector.
2.4.1 Creating Matrices
We can create matrices using the matrix() function. For example, let’s create a 2 X 2 matrix of the numbers 1 to 4.
mat <- matrix(data = c(1, 2,   # data in either row-wise or column-wise order
                       3, 4),  # don't have to do two lines, but it looks nice
              nrow = 2,        # number of rows
              ncol = 2,        # number of columns
              byrow = TRUE)    # tell it the order of data above
mat
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
Or, let’s say we had two columns, c1 and c2. We could put them together into a matrix using cbind(), or column-bind:
c1 <- c(1, 3)
c2 <- c(2, 4)
mat <- cbind(c1, c2)
mat
## c1 c2
## [1,] 1 2
## [2,] 3 4
This accomplishes basically the same thing, except our two columns have names.
We could do something similar with two rows, called r1 and r2, using the rbind(), or row-bind, function:
r1 <- c(1, 2)
r2 <- c(3, 4)
mat <- rbind(r1, r2)
mat
## [,1] [,2]
## r1 1 2
## r2 3 4
This is essentially the same, but it has row names. We can see that by running rownames() on it:
rownames(mat)
## [1] "r1" "r2"
You’ll notice it doesn’t have column names, which we can check with colnames():
colnames(mat)
## NULL
But, we could assign column names using the colnames() function:
colnames(mat) <- c("c1", "c2")
mat
## c1 c2
## r1 1 2
## r2 3 4
Finally, matrices behave a lot like vectors; they can only accommodate homogeneous data, so R will coerce the data into whatever type is most accommodating:
mixed_mat <- cbind(mat,
c3 = c("a", "b"))
mixed_mat
## c1 c2 c3
## r1 "1" "2" "a"
## r2 "3" "4" "b"
Note that it has now turned everything into strings.
2.4.2 Indexing Matrices
Indexing matrices is just like indexing vectors, except we have two dimensions, which we use like so:
[row, column]
Let’s get the element at row 1, column 1 of the mat object:
mat[1, 1]
## [1] 1
Or we could get the whole 2nd column by leaving the row index blank:
mat[,2]
## r1 r2
## 2 4
And, we can use names instead of numbers if we like:
mat["r1", "c1"]
## [1] 1
A mix:
mat["r1", 2]
## [1] 2
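Notice that pulling out a single column gave us back a named vector rather than a matrix. If you want to keep the 2-dimensional shape, the [ operator accepts a drop = FALSE argument. Here is a minimal sketch on a stand-in matrix called demo_mat (a made-up name, built to look like mat):

```r
# A stand-in for mat, with the same values and names
demo_mat <- matrix(1:4, nrow = 2, byrow = TRUE,
                   dimnames = list(c("r1", "r2"), c("c1", "c2")))

second_col <- demo_mat[, 2]          # dimensions are dropped by default
is.matrix(second_col)                # FALSE: this is now a plain vector

kept <- demo_mat[, 2, drop = FALSE]  # keep the matrix structure
dim(kept)                            # 2 1: two rows, one column
```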
Exercise 2.4.2a
Make a 3 X 3 matrix called mat2 whose rows are 1, 2, 3; 4, 5, 6; and 7, 8, 9. Make the column names c1, c2, and c3, and make the row names r1, r2, and r3. Add the first column to the third column.
Exercise 2.4.2b
Multiply the 3rd row by the value in the 2nd row and 2nd column.
2.4.3 Arrays
Arrays are like matrices, but can have N-dimensions (more than 2).
array(c(mat, mat*2, mat*3), dim = c(2, 2, 3))
## , , 1
##
## [,1] [,2]
## [1,] 1 2
## [2,] 3 4
##
## , , 2
##
## [,1] [,2]
## [1,] 2 4
## [2,] 6 8
##
## , , 3
##
## [,1] [,2]
## [1,] 3 6
## [2,] 9 12
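Indexing an array follows the same pattern as a matrix, just with one index per dimension. The sketch below rebuilds the same array under the hypothetical names base_mat and demo_arr so it can run on its own:

```r
base_mat <- matrix(1:4, nrow = 2, byrow = TRUE)  # same values as mat
demo_arr <- array(c(base_mat, base_mat * 2, base_mat * 3), dim = c(2, 2, 3))

dim(demo_arr)      # 2 2 3: rows, columns, layers
demo_arr[1, 2, 3]  # 6: row 1, column 2 of the third layer
demo_arr[, , 2]    # the whole second layer, as a 2 X 2 matrix
```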
Most of us don’t have to use arrays that often, so we’ll leave it at that.
2.5. Data Frames
The data.frame object is like a matrix in that it has 2 dimensions, but it can contain heterogeneous data like a list. The indexing works like a sort of hybrid of the two, as you’ll soon see.
2.5.1 Creating Data Frames
Let’s start by making a simple dataframe out of the mixed_mat we created above:
df_1 <- data.frame(mixed_mat)
df_1
## c1 c2 c3
## r1 1 2 a
## r2 3 4 b
Note that we can also create a dataframe very similarly to how we made a list, but replacing list() with data.frame():
df_1 <- data.frame(c1 = c(1, 3),
c2 = c(2, 4),
c3 = c("a", "b"),
row.names = c("r1", "r2"))
df_1
## c1 c2 c3
## r1 1 2 a
## r2 3 4 b
2.5.2 Indexing Data Frames
Indexing dataframes is sort of a hybrid between matrices and lists. Like matrices, you can index using the [row, column] format, using either numerical indexing:
df_1[1, 3]
## [1] "a"
or names:
df_1["r1", "c3"]
## [1] "a"
You can also get an entire row or column by leaving an index blank. Let’s get all rows for column 2:
df_1[, "c2"]
## [1] 2 4
However, like a list, we can use the $ in the form data$column (similar to list$object used for lists). Let’s get the first column:
df_1$c1
## [1] 1 3
And, like with lists, we can index a column using vector indexing, since a single column is (by definition) a vector (1-dimensional, with a single type of data). Let’s get the first value in column 1:
df_1$c1[1]
## [1] 1
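Because a data frame is built on top of a list, the single- and double-bracket list indexing from earlier works on it too. Here is a small sketch with a hypothetical data frame called demo_df:

```r
demo_df <- data.frame(c1 = c(1, 3), c3 = c("a", "b"))  # hypothetical data

is.data.frame(demo_df["c1"])            # TRUE: single brackets keep a data frame
is.numeric(demo_df[["c1"]])             # TRUE: double brackets return the column itself
identical(demo_df[["c1"]], demo_df$c1)  # TRUE: $ is shorthand for [[ ]] by name
```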
2.5.2.1 Modifying a Dataframe with Indexes
Just like lists, vectors, and matrices, you can modify a dataframe and add new elements or change existing elements by referencing indexes.
We could create c4 as the sum of c1 and c2:
df_1$c4 <- df_1$c1 + df_1$c2
df_1
## c1 c2 c3 c4
## r1 1 2 a 3
## r2 3 4 b 7
Or we could replace an element using indexing too. Let’s replace c1 with c1^2:
df_1$c1 <- df_1$c1^2
df_1
## c1 c2 c3 c4
## r1 1 2 a 3
## r2 9 4 b 7
2.5.3 Structure, Types, & Coercion in Dfs
How can we tell what type of data each column stores? We can use the str() function to get the structure of the data:
str(df_1)
## 'data.frame': 2 obs. of 4 variables:
## $ c1: num 1 9
## $ c2: num 2 4
## $ c3: chr "a" "b"
## $ c4: num 3 7
You can also coerce individual columns into other types. For example, we could turn c1 into an integer (from a numeric/double):
df_1$c1 <- as.integer(df_1$c1)
df_1
## c1 c2 c3 c4
## r1 1 2 a 3
## r2 9 4 b 7
str(df_1)
## 'data.frame': 2 obs. of 4 variables:
## $ c1: int 1 9
## $ c2: num 2 4
## $ c3: chr "a" "b"
## $ c4: num 3 7
Exercise 2.5.2a
Make a data frame, called df_2, that has 3 columns:
- c1 = 1, 2, 3
- c2 = 2, 4, 6
- c3 = “a”, “b”, “c”
After you create it, check the structure.
Exercise 2.5.2b
Using df_2, add a fourth column, c4, which is the first and second columns multiplied together.
Exercise 2.5.2c
Using df_2, try to add the first and second rows together. What happens? Why?
2.6 Wrapping up Data Types
Now you know the basics of how R represents data.
This table from Advanced R summarizes them:
Dimensions | Homogeneous data | Heterogeneous data |
---|---|---|
1-Dimensional | Atomic Vector | List |
2-Dimensional | Matrix | Data frame |
N-Dimensional | Array | |
3. Functions & Iteration
3.1. Basic Functions
Data (and objects more generally) are one of the building blocks of R. The other is functions. We’ve already used a handful of functions, including seq(), arithmetic functions (+, *, etc.), matrix(), array(), etc.
Functions take some form of input, perform some operation, and then return some object(s) as output. The inputs to a function are called arguments. For example, let’s take a look at the help documentation for seq() by copying ?seq into the console.
You can see it has the arguments from, to, by, length.out, and along.with. You might also notice that each of the arguments has a value after the = in the documentation. These values are the defaults; they are what the arguments will be set to if you don’t specify them. In fact, since all of the arguments have defaults, we don’t have to specify any to run seq(), as we saw earlier today.
seq()
## [1] 1
Let’s take a look at a new function, mean(). This is used to get the mean of a set of numbers. What happens if we run it without any arguments?
mean()
We get an error telling us that the argument “x” is missing and has no default. Whenever you see this error, it means you are missing a required argument (i.e., an argument without a default). If we look at the documentation (?mean), you can see x is the data from which to calculate a mean.
Let’s see what happens when we run mean() on the matrix mat we created earlier:
mean(mat)
## [1] 2.5
Note that mean() has two more optional arguments listed: trim, which returns a trimmed mean, and na.rm, which takes a logical value indicating whether it should remove missing values before it calculates the mean (set to FALSE by default). What happens if we don’t remove NAs before calculating the mean? Let’s check it out.
vec_w_na <- c(1, 2, 3, 4, 5, 6, NA, 2, 4)
mean(vec_w_na)
## [1] NA
It returns NA. This is important to remember: NAs are contagious! One in a vector will cause many functions to return NA (unless they remove them by default). This sort of makes sense: the mean of vec_w_na in its entirety is unknown, since we don’t know what the NA value is. That’s why you have to remove them by setting na.rm = TRUE:
mean(vec_w_na, na.rm = TRUE)
## [1] 3.375
sd() can be used to get the standard deviation and is structured very similarly:
sd(vec_w_na, na.rm = TRUE)
## [1] 1.685018
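Before reaching for na.rm = TRUE, it is often worth checking how many values are actually missing. One way to sketch that, using a hypothetical vector demo_vec, is to combine is.na() with sum() (TRUE counts as 1 when summed):

```r
demo_vec <- c(1, 2, NA, 4)   # hypothetical data with one missing value

is.na(demo_vec)              # FALSE FALSE TRUE FALSE
sum(is.na(demo_vec))         # 1: number of missing values
sum(demo_vec)                # NA: the NA is contagious here too
sum(demo_vec, na.rm = TRUE)  # 7: the sum of the non-missing values
```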
You can get the length of many objects with length():
length(vec_w_na)
## [1] 9
nrow() and ncol() can be used to get the number of rows or columns in a matrix or data frame:
nrow(df_1)
## [1] 2
ncol(df_1)
## [1] 4
nrow(mat)
## [1] 2
ncol(mat)
## [1] 2
3.2. Arguments are Positional
Take a look at the sd() function documentation (?sd). Notice that there are two arguments and they are in order, x followed by na.rm = FALSE. You can set arguments by name:
sd(x = vec_w_na, na.rm = TRUE)
## [1] 1.685018
You can also set them positionally:
sd(vec_w_na, TRUE)
## [1] 1.685018
When using arguments positionally (without their names), you need to make sure the arguments are in the right order. If you name the arguments, you can actually put them in a different order:
sd(na.rm = TRUE, x = vec_w_na)
## [1] 1.685018
Now let’s try the mean() function:
mean(vec_w_na, TRUE)
3.2.Q1:
What happened?
A:
mean(vec_w_na, na.rm = TRUE) # or...
## [1] 3.375
mean(vec_w_na, 0, TRUE)
## [1] 3.375
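You can check the order of a function's arguments without opening the help page. The sketch below uses formals(), which returns a function's arguments and their defaults. For mean() it looks at mean.default, the method that runs on ordinary numeric vectors:

```r
names(formals(sd))            # "x" "na.rm": na.rm is in the second position
names(formals(mean.default))  # "x" "trim" "na.rm" "...": trim is in the second position
```

That is why a bare TRUE in the second position is read as na.rm by sd() but as trim by mean().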
3.3. User-defined functions
You can also make your own functions. In fact, that is where most of the R functions we use come from: someone made their own function and then shared it with everyone. For example, let’s make a function that calculates the mean of a vector, and prints it in the format: "The mean is _", where the blank is filled in with the mean.
To create a function, you use the function() function. Like anything you make in R, you need to save it to a name with <-.
Within the parentheses, you put the arguments. Then you define the body of the function in {}:
pretty_mean <- function(data){
  mean <- mean(data, na.rm = TRUE)
  x <- paste("The Mean is", mean)
  print(x)
}
Now let’s test it out on our vec_w_na object:
pretty_mean(vec_w_na)
## [1] "The Mean is 3.375"
Notice that it works how we expected it to, and also that it didn’t create an object in our global environment called mean. That’s because variables within a function are local; they only exist within that function (they don’t get created in the global environment).
We can also give a function more than one argument and set defaults for them. Here, excited defaults to TRUE:
pretty_mean <- function(data, excited = TRUE){
  mean <- mean(data, na.rm = TRUE)
  if(excited == FALSE){x <- paste("The Mean is", mean)}
  if(excited == TRUE){x <- paste("The Mean is", mean, "!")}
  print(x)
}
pretty_mean(vec_w_na)
## [1] "The Mean is 3.375 !"
pretty_mean(vec_w_na, excited = FALSE) # tone it down
## [1] "The Mean is 3.375"
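Printing is fine for showing a result, but often you want a function to hand back a value you can store and reuse. As a minimal sketch (mean_label, avg, and label are made-up names for this example), a function returns the value of the last expression in its body:

```r
# Hypothetical helper: builds the label and returns it instead of printing it
mean_label <- function(data) {
  avg <- mean(data, na.rm = TRUE)  # avg is local to the function
  paste("The mean is", avg)        # the last expression is the return value
}

label <- mean_label(c(1, 2, 3, NA))  # store the returned string
label
```

Because avg only lives inside the function, running exists("avg") afterwards should give FALSE unless you happen to have created an object with that name yourself.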
Obviously, we can make much more complicated functions, but we will leave it there for now.
Exercise 3.3a
Create a function called pretty_sd that has a single argument called data and prints out "The SD is _" with the blank filled in with the SD of the data. Run it on the vec_w_na object.
3.4 For loops
R, like many programming languages, can loop through objects. The most common type of loop is called a for loop, and it iterates through an object performing a specified operation. For example, let’s say we wanted to go through our vec_w_na and print each element and its index. We could do that with a for loop.
For loops work a little bit like a function. Within the parentheses, you specify what is being iterated through, and you specify the body of the loop in {}:
for(i in 1:length(vec_w_na)){
  print(paste("The value at index", i, "is", vec_w_na[i]))
}
## [1] "The value at index 1 is 1"
## [1] "The value at index 2 is 2"
## [1] "The value at index 3 is 3"
## [1] "The value at index 4 is 4"
## [1] "The value at index 5 is 5"
## [1] "The value at index 6 is 6"
## [1] "The value at index 7 is NA"
## [1] "The value at index 8 is 2"
## [1] "The value at index 9 is 4"
Note that with a for loop, we have to tell it to print the result; otherwise, it runs the commands within the scope of the loop and doesn’t display the results.
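Instead of printing, a loop can also save its results by assigning into a vector one position at a time. The sketch below uses hypothetical objects (values, doubled, empty). It also uses seq_along(), a base R alternative to 1:length() that behaves better when the vector is empty:

```r
values <- c(1, 2, NA, 4)             # hypothetical input
doubled <- numeric(length(values))   # make an empty results vector first

for (i in seq_along(values)) {
  doubled[i] <- values[i] * 2        # store each result at the matching index
}
doubled                              # 2 4 NA 8

# Why seq_along()? With an empty vector, 1:length() counts backwards
empty <- numeric(0)
1:length(empty)                      # 1 0: the loop body would run twice
seq_along(empty)                     # integer(0): the loop body would not run
```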
Exercise 3.4a.
Create a for loop that iterates through our vec_w_na vector, and prints out the result of dividing each value by 5 (hint: remember to print!).
3.5. Packages
So far, we’ve been working with functions that are already installed and loaded when we open R. However, many of the functions we want to use are not part of the basic R install. They come in packages that other R users make and share.
3.5.1 Installing Packages
The most common way to get a package is to download it from CRAN using the install.packages() function. Packages on CRAN have to pass certain tests, so any package on CRAN should be safe and basically work (there can be bugs).
One package we’re going to use is rio, which has really easy functions for importing and exporting data. To install a package, you want to run install.packages("package_name").
install.packages("rio")
3.5.2 Updating Packages
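You occasionally need to update packages, which you can do with update.packages(). Its first argument is a library location rather than a package name, so name the package with the oldPkgs argument:
update.packages(oldPkgs = "rio")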
3.5.3 Opening Packages
Installing a package puts a copy of it into our personal library that R has access to. If we want to use it, we have to open the package in our working session. We do this with the library() function:
library(rio)
library("rio") # also works
Now that we’ve opened rio, we have access to its functions like import() and export().
What happens when we open a package? Loading a package basically makes the contents of it searchable by R. This makes it so R can find the functions a particular package has. You can see that by running the search() command:
search()
## [1] ".GlobalEnv" "package:rio" "package:stats"
## [4] "package:graphics" "package:grDevices" "package:utils"
## [7] "package:datasets" "package:methods" "Autoloads"
## [10] "package:base"
Let’s install another package we’re going to use quite a bit, which is the tidyverse. The tidyverse isn’t exactly a package; it’s a set of 8 packages and a broader coding framework for R. We’ll get into that a bit at the end of today and for the next two sessions, but let’s get the package downloaded now:
install.packages("tidyverse")
And let’s load it:
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.1 v purrr 0.3.4
## v tibble 3.0.1 v dplyr 0.8.5
## v tidyr 1.1.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ---------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
3.5.4 Conflicts
You’ll notice that when we load the tidyverse, it tells us the 8 different packages it loads. It also tells us that we have conflicts with other packages.
Conflicts occur when the same name is used for different things. For example, the dplyr package and the stats package (a preloaded package) both have a function called filter(). When we call filter(), R will only call one of those functions and it might not be the one we want.
Which one will R choose? R has an order in which it searches. It starts with the Global Environment, then searches packages in the order that they were loaded, searching more recently loaded packages first. You can actually see the order by using the search() command:
search()
## [1] ".GlobalEnv" "package:forcats" "package:stringr"
## [4] "package:dplyr" "package:purrr" "package:readr"
## [7] "package:tidyr" "package:tibble" "package:ggplot2"
## [10] "package:tidyverse" "package:rio" "package:stats"
## [13] "package:graphics" "package:grDevices" "package:utils"
## [16] "package:datasets" "package:methods" "Autoloads"
## [19] "package:base"
As you can see, the tidyverse packages come before rio, which comes before the pre-loaded packages (stats, graphics, etc.).
The last thing I want to mention about packages is that you can tell R explicitly that you want an object from a particular package using the notation package::object, or more commonly package::function(). For example, run the two pieces of code below:
?stats::filter
?dplyr::filter
This can be useful if you have conflicting packages and want to make sure you’re calling the right function.
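The :: notation works for calling functions too, not just for looking up help. Here is a short sketch that sticks to packages that ship with R (stats and utils), so it should run without installing anything:

```r
stats::sd(c(1, 2, 3))                        # 1: explicitly the sd() from stats
utils::head(month.name, 2)                   # "January" "February"
environmentName(environment(stats::filter))  # "stats": confirms where this filter() lives
```

With the tidyverse loaded, a bare filter() would find the dplyr version first, while stats::filter() always gets the one from stats.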