rm(list = ls()) # clean-up workspace

R’s data structures

      Homogeneous     Heterogeneous
1d    Atomic vector   List
2d    Matrix          Data frame
nd    Array

Vectors

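Note: is.vector() does not test whether an object is a vector; it returns TRUE only if the object is a vector with no attributes other than names. Use is.atomic(x) || is.list(x) to test whether an object is actually a vector.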
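
As a minimal sketch of the note above (the object name x and the attribute name "unit" are made up for illustration), an atomic vector that carries an extra attribute no longer passes is.vector(), although it is still atomic:

```r
x <- 1:3
is.vector(x)             # TRUE: no attributes other than names
attr(x, "unit") <- "cm"  # attach an arbitrary (hypothetical) attribute
is.vector(x)             # FALSE, although x is still an integer vector
is.atomic(x)             # TRUE
is.list(x)               # FALSE
is.list(list(1, "a"))    # TRUE
```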

Atomic vectors

  • There are four common types of atomic vectors (remember Lab 2?)

    • logical

    • integer

    • numeric (actually double)

    • character
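
The four types can be checked with typeof(). The short sketch below uses throwaway values chosen only for illustration; it also shows that "numeric" is an umbrella term covering both integer and double:

```r
# Collect the type of one value of each kind
types <- sapply(list(TRUE, 1L, 1, "a"), typeof)
types            # "logical" "integer" "double" "character"

is.numeric(1L)   # TRUE: integers count as numeric
is.numeric(1)    # TRUE: doubles count as numeric
is.integer(1)    # FALSE: without the L suffix, 1 is a double
```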

Many commands in R generate a vector of output, rather than a single number.

The c() command combines its arguments into a vector containing the specified elements.

Example 1

c(7, 3, 6, 0)
## [1] 7 3 6 0
c(73:60)
##  [1] 73 72 71 70 69 68 67 66 65 64 63 62 61 60
c(7:3, 6:0)
##  [1] 7 6 5 4 3 6 5 4 3 2 1 0
c(rep(7:3, 6), 0)
##  [1] 7 6 5 4 3 7 6 5 4 3 7 6 5 4 3 7 6 5 4 3 7 6 5 4 3 7 6 5 4 3 0

Example 2

The seq() command creates a sequence of numbers.

seq(7)
## [1] 1 2 3 4 5 6 7
seq(3, 70, by = 6)
##  [1]  3  9 15 21 27 33 39 45 51 57 63 69
seq(3, 70, length.out = 6)
## [1]  3.0 16.4 29.8 43.2 56.6 70.0
  • Atomic vectors are always flat, even if you nest c()’s:

Example 3

c(1, c(2, c(3, 4)))
## [1] 1 2 3 4

Lists

  • Elements can be of any type, including lists.

  • Construct a list by using list() instead of c().

x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(x)
## List of 4
##  $ : int [1:3] 1 2 3
##  $ : chr "a"
##  $ : logi [1:3] TRUE FALSE TRUE
##  $ : num [1:2] 2.3 5.9
  • List elements can be named and then accessed by name with $.
x.named <- list(vector = 1:3, name = "a", logical = c(TRUE, FALSE, TRUE), range = c(2.3, 5.9))
str(x.named)
## List of 4
##  $ vector : int [1:3] 1 2 3
##  $ name   : chr "a"
##  $ logical: logi [1:3] TRUE FALSE TRUE
##  $ range  : num [1:2] 2.3 5.9
x.named$vector
## [1] 1 2 3
x.named$range
## [1] 2.3 5.9
  • Lists are used to build up many of the more complicated data structures in R.

  • For example, both data frames (another data structure in R) and linear model objects (as produced by lm()) are lists.
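
A quick way to confirm this is to ask is.list(). The sketch below assumes the built-in cars data set (columns speed and dist) and a hypothetical object name fit:

```r
fit <- lm(dist ~ speed, data = cars)  # cars ships with base R

is.list(cars)      # TRUE: a data frame is a list of columns
is.list(fit)       # TRUE: an lm object is a list with a class attribute
names(fit)         # the components stored in the fitted model
fit$coefficients   # components are accessed with $, like any named list
```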

Attributes

  • All objects can have arbitrary additional attributes to store metadata about the object.

  • Attributes can be thought of as a named list.

  • Use attr() to access an individual attribute or attributes() to access all attributes as a list.

  • By default, most attributes are lost when modifying a vector. Only the most important ones stay:

    • Names, a character vector giving each element a name.

    • Dimensions, used to turn vectors into matrices and arrays.

    • Class, used to implement the S3 object system.

y <- 1:10
attr(y, "my_attribute") <- "This is a vector"
attr(y, "my_attribute")
## [1] "This is a vector"
str(y)
##  int [1:10] 1 2 3 4 5 6 7 8 9 10
##  - attr(*, "my_attribute")= chr "This is a vector"
str(attributes(y))
## List of 1
##  $ my_attribute: chr "This is a vector"
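
To illustrate the point that most attributes are lost on modification, here is a small sketch with made-up object names (v, w): subsetting keeps the names but drops the custom attribute.

```r
v <- c(a = 1, b = 2, c = 3)                     # a named vector
attr(v, "my_attribute") <- "This is a vector"   # custom metadata

w <- v[1:2]               # subsetting returns a modified copy
names(w)                  # "a" "b": names survive
attr(w, "my_attribute")   # NULL: the custom attribute is gone
```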

Factors

  • A factor is a vector that can contain only predefined values and is used to store categorical data.

  • Built upon integer vectors using two attributes:

    • the class, “factor”: makes them behave differently from regular integer vectors

    • the levels: defines the set of allowed values

  • Sometimes when a data frame is read directly from a file, a column that should be numeric is stored as a factor because it contains a non-numeric value (e.g. a missing value encoded with a special symbol).

    • Possible remedy: coerce the vector from a factor to a character vector, and then from a character vector to a double vector.

    • Better: use the na.strings argument of read.csv() to declare the missing-value code.
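
Both remedies can be sketched on a toy column. The data, the "-" missing-value code, and the object names (f, num, d) are invented for illustration:

```r
# A column in which a missing value was recorded as "-"
f <- factor(c("1.5", "2.5", "-", "4"))
typeof(f)       # "integer": factors are built on integer vectors
levels(f)       # the set of allowed values
as.numeric(f)   # the integer codes, NOT the original numbers

# Remedy: factor -> character -> double ("-" becomes NA, with a warning)
num <- suppressWarnings(as.numeric(as.character(f)))
num             # 1.5 2.5 NA 4.0

# Better: declare the missing-value code when reading the data
d <- read.csv(text = "value\n1.5\n2.5\n-\n4", na.strings = "-")
typeof(d$value) # "double"
```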

Matrices and arrays

  • Adding a dim attribute to an atomic vector allows it to behave like a multi-dimensional array.

  • A matrix is a special case of an array: it has exactly two dimensions.

  • The matrix() command creates a matrix from the given set of values.

# Two scalar arguments to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)
# One vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))

# You can also modify an object in place by setting dim()
c <- 1:6
dim(c) <- c(3, 2)
c
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
dim(c) <- c(2, 3)
c
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Exercise Write a command to generate a random permutation of the numbers between 1 and 5 and save it to an object.

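set.seed(7360)  # the course seed number
perm1 <- order(runif(5))  # ranks of random draws form a permutation
perm1
## [1] 3 5 2 4 1
perm2 <- sample(1:5, 5)   # sampling without replacement
perm2
## [1] 2 1 5 3 4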

Data frames

  • Most common way of storing data in R

  • A list of equal-length vectors

  • 2-dimensional structure, shares properties of both matrix and list

    • has names(), colnames() and rownames(); for a data frame, names() and colnames() are the same thing

    • length() of a data frame is the length of the underlying list, same as ncol()

  • We will focus more on the tibble, which is a data frame with additional features.
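
The list-like and matrix-like behaviour can be seen side by side in a short sketch. The data frame dat and its columns are hypothetical:

```r
dat <- data.frame(id = 1:3, label = c("a", "b", "c"))

is.list(dat)      # TRUE: each column is one list element
length(dat)       # 2, the number of columns
ncol(dat)         # 2
nrow(dat)         # 3
names(dat)        # "id" "label", identical to colnames(dat)
dat$label         # list-style access to a column
dat[2, "label"]   # matrix-style access to a single cell
```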


Functions

Function components

  • All R functions have three parts:

    • the formals(), the list of arguments which controls how you can call the function

    • the body(), the code inside the function

    • the environment(), the “map” of the location of the function’s variables

f <- function(x) x^2
f
## function(x) x^2
formals(f)
## $x
body(f)
## x^2
environment(f)
## <environment: R_GlobalEnv>

Define a function

DoNothing <- function() {
  return(invisible(NULL))
}
DoNothing()

Invoke a function

mean(1:10, na.rm = TRUE)
## [1] 5.5
args <- list(1:10, na.rm = TRUE)
do.call(mean, args)
## [1] 5.5

Lexical scoping

Name masking

  • Names defined inside a function mask names defined outside a function
x <- 10
y <- 20

g02 <- function(){
   x <- 1  # a local variable to the function
   y <- 2
   c(x, y)
}
g02()
## [1] 1 2
  • If a name isn’t defined inside a function, R looks one level up.
x <- 2
g03 <- function() {
   y <- 1
   c(x, y)
}
g03()
## [1] 2 1
y
## [1] 20
  • R searches inside the current function, then looks where the function is defined and so on, all the way up to the global environment.

  • Finally, R looks in other loaded packages.

y <- 10

f <- function(x) {
   y <- 2
   y^2 + g(x)
}

g <- function(x) {
   x * y
}

What is the value of f(3)?
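
A related sketch, with deliberately different names and values so the question above stays open, shows the rule at work: a function looks up free variables where it was defined, not where it is called. The names multiplier, helper_fn and outer_fn are hypothetical.

```r
multiplier <- 10               # global variable

helper_fn <- function(x) {
  x * multiplier               # not local, so R looks where helper_fn was defined
}

outer_fn <- function(x) {
  multiplier <- 2              # local to outer_fn; helper_fn never sees it
  helper_fn(x)
}

res_scope <- outer_fn(3)
res_scope                      # 30, not 6
```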

Functions versus variables

  • In R, functions are ordinary objects. This means the scoping rules described above also apply to functions.

  • However, the rules get complicated when functions and non-functions share the same name.

  • It is best to avoid giving a function and a non-function object the same name.
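
To see how confusing a shared name can be, consider this sketch (the names add_100 and call_it are made up). When a name is used in a function call, R skips non-function objects while looking it up, so the same name resolves to two different objects on one line:

```r
add_100 <- function(x) x + 100   # a function in the global environment

call_it <- function() {
  add_100 <- 10                  # a non-function with the same name
  add_100(add_100)               # call position: the function; argument: 10
}

res_mask <- call_it()
res_mask                         # 110
```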

A fresh start

  • Every time a function is called a new environment is created to host its execution.
rm(a) # just in case...
g11 <- function() {
  if (!exists("a")) {
    a <- 1
  } else {
    a <- a + 1
  }
  a
}

g11()
## [1] 1
g11()
## [1] 1

What happens if we do

a <- 1:5
g11()
g11()

Dynamic lookup

  • Lexical scoping determines where to look for values.

  • R looks for values when the function is run, not when the function is created.

g12 <- function() x + 1
x <- 15
g12()
## [1] 16
x <- 20
g12()
## [1] 21
  • Depending on variables defined in the global environment can be bad!

  • codetools::findGlobals() can be helpful
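
As a brief sketch of that last point, findGlobals() lists the names a function takes from outside its own body. This assumes the codetools package is installed (it is a recommended package distributed with R); the variable name globals is hypothetical.

```r
g12 <- function() x + 1

globals <- codetools::findGlobals(g12)
globals   # includes "x", along with the function "+"
```

Any name in the result that is not a function you intend to call is a candidate for becoming an explicit argument.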

Default arguments

  • You can define default values for arguments

  • Default values can be in terms of other arguments, or even in terms of variables defined later in the function

  • This works because R uses lazy evaluation: function arguments are only evaluated if they are accessed.

h04 <- function(x = 1, y = x * 2, z = a + b) {
  a <- 10
  b <- 100
  
  c(x, y, z)
}

h04()
## [1]   1   2 110
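
Lazy evaluation can also be demonstrated directly. In this sketch (the function name first_only is made up), the second argument would raise an error if it were evaluated, but the function never touches it:

```r
first_only <- function(x, y) {
  x   # y is never accessed, so its expression is never evaluated
}

val_lazy <- first_only(1, stop("this is never evaluated"))
val_lazy   # 1
```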

... (dot-dot-dot)

  • Functions can have a special argument ...

  • With ..., a function can take any number of additional arguments

  • You can use ... to pass those additional arguments on to another function
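
A minimal sketch of forwarding with ... follows. The wrapper apply_fun is hypothetical; it passes any extra arguments straight through to the function it is given:

```r
apply_fun <- function(x, fun, ...) {
  fun(x, ...)   # everything captured by ... is handed to fun
}

m_dots <- apply_fun(c(1, 3, NA), mean, na.rm = TRUE)
m_dots          # 2

s_dots <- apply_fun(c("a", "b"), paste, collapse = "-")
s_dots          # "a-b"
```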

Pro

  • If your function takes a function as an argument, you want some way to pass additional arguments to that function.
x <- list(c(1, 3, NA), c(4, NA, 6))
str(lapply(x, mean, na.rm = TRUE))   
## List of 2
##  $ : num 2
##  $ : num 5

Con

  • A misspelled argument will not raise an error.
sum(1, 2, NA, na_rm = TRUE)
## [1] NA

Control flow

These are the basic control-flow constructs of the R language. They function in much the same way as control statements in any Algol-like language (Algol is short for “Algorithmic Language”). They are all reserved words.

keyword   usage
if        if(cond) expr
if-else   if(cond) cons.expr else alt.expr
for       for(var in seq) expr
while     while(cond) expr
break     breaks out of a for, while, or repeat loop
next      halts the processing of the current iteration and advances the looping index
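
The constructs in the table can be combined as in the following sketch (variable names are arbitrary). It sums the odd numbers up to 7 using next and break, then counts to 3 with while:

```r
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next   # skip even numbers
  if (i > 7) break        # leave the loop at the first odd number above 7
  total <- total + i      # reached for i = 1, 3, 5, 7
}
total                     # 16

n <- 0
while (n < 3) {
  n <- n + 1
}
n                         # 3
```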

Exiting a function

Most functions exit in one of two ways: they either return a value, indicating success, or they throw an error, indicating failure.

Implicit versus explicit returns

There are two ways that a function can return a value:

  • Implicitly, where the last evaluated expression is the return value:
j01 <- function(x) {
  if (x < 10) {
    0
  } else {
    10
  }
}
j01(5)
## [1] 0
j01(15)
## [1] 10
  • Explicitly, by calling return()
j02 <- function(x) {
  if (x < 10) {
    return(0)
  } else {
    return(10)
  }
}
  • You can hide the output from automatic printing by applying invisible() to the last value:
j04 <- function() invisible(1)
j04()
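
An invisible value is still returned; only the automatic printing is suppressed. The sketch below redefines j04 so it is self-contained and uses hypothetical names (vis, val_inv) to inspect the result:

```r
j04 <- function() invisible(1)

j04()                       # prints nothing
print(j04())                # explicit printing shows the value
(j04())                     # wrapping the call in parentheses also prints it

vis <- withVisible(j04())   # returns the value together with a visibility flag
vis$visible                 # FALSE
val_inv <- j04()            # the value can be assigned as usual
```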

Errors

If a function cannot complete its assigned task, it should throw an error with stop(), which immediately terminates the execution of the function.

j05 <- function() {
  stop("I'm an error")
  return(10)
}
j05()
## Error in j05(): I'm an error

Exit handlers

  • Use on.exit() to set up an exit handler that is run regardless of whether the function exits normally or with an error

  • Always set add = TRUE when using on.exit(). Otherwise, each call will overwrite the previous exit handler.

j06 <- function(x) {
  cat("Hello\n")
  on.exit(cat("Goodbye!\n"), add = TRUE)
  
  if (x) {
    return(10)
  } else {
    stop("Error")
  }
}

j06(TRUE)
## Hello
## Goodbye!
## [1] 10
j06(FALSE)
## Hello
## Error in j06(FALSE): Error
## Goodbye!
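
To see why add = TRUE matters, compare two hypothetical functions that each register two handlers. capture.output() is used here only to collect what the handlers print:

```r
with_add <- function() {
  on.exit(cat("first\n"), add = TRUE)
  on.exit(cat("second\n"), add = TRUE)   # appended after the first handler
  invisible(NULL)
}

without_add <- function() {
  on.exit(cat("first\n"))
  on.exit(cat("second\n"))   # add defaults to FALSE: replaces the handler above
  invisible(NULL)
}

out_add <- capture.output(with_add())
out_add       # "first" "second"
out_no_add <- capture.output(without_add())
out_no_add    # "second"
```
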
  • Exit handlers are useful for clean-up, such as restoring the working directory:
with_dir <- function(dir, code) {
  old <- setwd(dir)
  on.exit(setwd(old), add = TRUE)

  code
}

getwd()
## [1] "/Users/xji3/Dropbox/My_Files/Tulane/Teaching/tulane-math-7360-2023.github.io/lectures/06-Data_structure"
with_dir("~", getwd())
## [1] "/Users/xji3"
getwd()
## [1] "/Users/xji3/Dropbox/My_Files/Tulane/Teaching/tulane-math-7360-2023.github.io/lectures/06-Data_structure"