Data Structure

rm(list = ls()) # clean-up workspace

R’s data structures

	Homogeneous	Heterogeneous
1d	Atomic vector	List
2d	Matrix	Data frame
nd	Array

Homogeneous: all contents must be of the same type
Heterogeneous: the contents can be of different types

Vectors

The basic data structure in R.
Two flavors: atomic vectors and lists
Three common properties:
- Type, typeof(), what it is.
- Length, length(), how many elements it contains.
- Attributes, attributes(), additional arbitrary metadata.
No scalars in R. They are length 1 vectors.

Note: is.vector() does not test whether an object is a vector. It returns TRUE only if the object is a vector with no attributes apart from names. Use is.atomic() or is.list() to test.
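As a minimal sketch of the three properties and of the is.vector() caveat (the object names v and w are illustrative, not from the lecture):

```r
# A named double vector: the names are stored as an attribute
v <- c(a = 1.5, b = 2.5)
typeof(v)      # type: "double"
length(v)      # number of elements: 2
attributes(v)  # a list holding the names attribute

# An integer vector carrying an extra, non-name attribute
w <- 1:3
attr(w, "note") <- "illustrative metadata"
is.vector(w)   # FALSE, because of the extra attribute
is.atomic(w)   # TRUE, it is still an atomic vector
```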

Atomic vectors

There are four common types of atomic vectors (remember Lab 2?)
- logical
- integer
- numeric (actually double)
- character
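A quick way to see the four types is to ask typeof() about one value of each. The sketch below also shows, as an illustration of the "homogeneous" rule, that c() coerces mixed inputs to a single type (the name mixed is illustrative):

```r
typeof(TRUE)   # "logical"
typeof(1L)     # "integer" (the L suffix requests an integer)
typeof(1)      # "double"  (plain numbers are doubles)
typeof("a")    # "character"

# An atomic vector holds one type only, so mixed inputs are coerced
mixed <- c(TRUE, 2L, 3.5)
typeof(mixed)  # "double"
```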

Many commands in R generate a vector of output, rather than a single number.

The c() command: creates a vector by combining the specified elements.

Example 1

c(7, 3, 6, 0)

## [1] 7 3 6 0

c(73:60)

##  [1] 73 72 71 70 69 68 67 66 65 64 63 62 61 60

c(7:3, 6:0)

##  [1] 7 6 5 4 3 6 5 4 3 2 1 0

c(rep(7:3, 6), 0)

##  [1] 7 6 5 4 3 7 6 5 4 3 7 6 5 4 3 7 6 5 4 3 7 6 5 4 3 7 6 5 4 3 0

Example 2 The command seq() creates a sequence of numbers.

seq(7)

## [1] 1 2 3 4 5 6 7

seq(3, 70, by = 6)

##  [1]  3  9 15 21 27 33 39 45 51 57 63 69

seq(3, 70, length = 6)

## [1]  3.0 16.4 29.8 43.2 56.6 70.0

Atomic vectors are always flat, even if you nest c()’s:

Example 3

c(1, c(2, c(3, 4)))

## [1] 1 2 3 4

Lists

Elements can be of any type, including lists.
Construct a list with list() instead of c().

x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(x)

## List of 4
##  $ : int [1:3] 1 2 3
##  $ : chr "a"
##  $ : logi [1:3] TRUE FALSE TRUE
##  $ : num [1:2] 2.3 5.9

List elements can be named and then accessed by name with $.

x.named <- list(vector = 1:3, name = "a", logical = c(TRUE, FALSE, TRUE), range = c(2.3, 5.9))
str(x.named)

## List of 4
##  $ vector : int [1:3] 1 2 3
##  $ name   : chr "a"
##  $ logical: logi [1:3] TRUE FALSE TRUE
##  $ range  : num [1:2] 2.3 5.9

x.named$vector

## [1] 1 2 3

x.named$range

## [1] 2.3 5.9

Lists are used to build up many of the more complicated data structures in R.
For example, both data frames (another data structure in R) and linear model objects (as produced by lm()) are lists.
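The claim can be checked directly. This sketch assumes the built-in mtcars data set and an arbitrary illustrative model (mpg regressed on wt); neither comes from the lecture:

```r
# A data frame is a list under the hood
is.list(mtcars)   # TRUE
typeof(mtcars)    # "list"

# So is the object returned by lm()
fit <- lm(mpg ~ wt, data = mtcars)
is.list(fit)      # TRUE
typeof(fit)       # "list"

# Because it is a named list, components are reachable with $
fit$coefficients
```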

Attributes

All objects can have arbitrary additional attributes to store metadata about the object.
Attributes can be thought of as a named list.
Use attr() to access an individual attribute or attributes() to access all attributes as a list.
By default, most attributes are lost when modifying a vector. Only the most important ones stay:
- Names, a character vector giving each element a name.
- Dimensions, used to turn vectors into matrices and arrays.
- Class, used to implement S3 object system.

y <- 1:10
attr(y, "my_attribute") <- "This is a vector"
attr(y, "my_attribute")

## [1] "This is a vector"

str(y)

##  int [1:10] 1 2 3 4 5 6 7 8 9 10
##  - attr(*, "my_attribute")= chr "This is a vector"

str(attributes(y))

## List of 1
##  $ my_attribute: chr "This is a vector"
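To illustrate which attributes survive a modification, here is a small sketch with a named vector (z is an illustrative name). Subsetting keeps the names but drops the custom attribute, and a summary such as sum() keeps none:

```r
z <- c(first = 1, second = 2, third = 3)
attr(z, "my_attribute") <- "This is a vector"

names(attributes(z))        # both names and my_attribute are present
names(attributes(z[1:2]))   # only "names" survives subsetting
attributes(sum(z))          # NULL: no attributes left
```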

Factors

A factor is a vector that can contain only predefined values and is used to store categorical data.
Built upon integer vectors using two attributes:
- the class, “factor”: makes them behave differently from regular integer vectors
- the levels: defines the set of allowed values
Sometimes when a data frame is read directly from a file, a column you expect to be numeric comes back as a factor (or, since R 4.0.0, where stringsAsFactors defaults to FALSE, as a character vector) because the column contains a non-numeric value (e.g. a specially encoded missing value).
- Possible remedy: coerce the vector from a factor to a character vector, and then from a character vector to a double vector
- Better: use the na.strings argument of read.csv()
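A minimal sketch of both points, assuming a made-up column in which a missing value is encoded as "." (the names f, raw and d, and the inline CSV text, are illustrative):

```r
# A factor is an integer vector with class and levels attributes
f <- factor(c("low", "high", "low"), levels = c("low", "high"))
class(f)    # "factor"
levels(f)   # "low" "high"
typeof(f)   # "integer"

# A numeric column polluted by "." ends up as a factor of strings
raw <- factor(c("1.5", "2", "."))

# Remedy: factor -> character -> double ("." becomes NA, with a warning)
fixed <- suppressWarnings(as.numeric(as.character(raw)))
fixed

# Pitfall: as.numeric() on the factor itself returns the integer codes,
# not the original values
as.numeric(raw)

# Better: tell read.csv() how missing values are encoded
d <- read.csv(text = "x\n1.5\n2\n.\n", na.strings = ".")
str(d$x)
```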

Matrices and arrays

Adding a dim attribute to an atomic vector allows it to behave like a multi-dimensional array.
A matrix is a special case of an array, with two dimensions.
The matrix() command creates a matrix from the given set of values.

# Two scalar arguments to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)
# One vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))

# You can also modify an object in place by setting dim()
c <- 1:6
dim(c) <- c(3, 2)
c

##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6

dim(c) <- c(2, 3)
c

##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

Exercise Write a command to generate a random permutation of the numbers between 1 and 5 and save it to an object.

set.seed(7360)  # the course seed number
order(runif(5))

## [1] 3 5 2 4 1

sample(1:5, 5)

## [1] 2 1 5 3 4

Data frames

Most common way of storing data in R
A list of equal-length vectors
2-dimensional structure, shares properties of both matrix and list
- has attributes, names(), colnames() and rownames()
- length() of a data frame is the length of the underlying list, same as ncol()
We will focus more on the tibble, which is a data frame with some additional features.
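The list-like and matrix-like sides of a data frame can be seen in a small sketch (the data frame df and its columns are made up for illustration):

```r
df <- data.frame(id = 1:3, label = c("a", "b", "c"))

# List-like behaviour
typeof(df)    # "list"
length(df)    # 2: the number of columns, same as ncol(df)
names(df)     # "id" "label"
df$id         # access a column by name, as with any named list

# Matrix-like behaviour
dim(df)       # 3 rows, 2 columns
colnames(df)  # same as names(df)
df[2, "id"]   # row/column indexing: 2
```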

Functions

Functions are a fundamental building block of R
Functions are objects in their own right (so that they can have attributes())
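Because functions are objects, R supports a functional programming style: functions can be passed as arguments to, and returned from, other functions.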

Function components

All R functions have three parts:
- the formals(), the list of arguments which controls how you can call the function
- the body(), the code inside the function
- the environment(), the “map” of the location of the function’s variables

f <- function(x) x^2
f

## function(x) x^2

formals(f)

## $x

body(f)

## x^2

environment(f)

## <environment: R_GlobalEnv>

Define a function

There is no special syntax for defining and naming a function: simply create a function object (with function) and bind it to a name with <-.

DoNothing <- function() {
  return(invisible(NULL))
}
DoNothing()

Invoke a function

You normally call a function by placing its arguments, wrapped in parentheses, after its name:

mean(1:10, na.rm = TRUE)

## [1] 5.5

What if you have the arguments already in a data structure? You can use do.call():

args <- list(1:10, na.rm = TRUE)
do.call(mean, args)

## [1] 5.5

Lexical scoping

Now let’s discuss scoping
R uses lexical scoping that follows four primary rules:
- Name masking
- Functions versus variables
- A fresh start
- Dynamic lookup

Name masking

Names defined inside a function mask names defined outside a function

x <- 10
y <- 20

g02 <- function(){
   x <- 1  # a local variable to the function
   y <- 2
   c(x, y)
}
g02()

## [1] 1 2

If a name isn’t defined inside a function, R looks one level up.

x <- 2
g03 <- function() {
   y <- 1
   c(x, y)
}
g03()

## [1] 2 1

y  # the global y is unchanged by the call to g03()

## [1] 20

R searches inside the current function, then looks where the function is defined and so on, all the way up to the global environment.
Finally, R looks in other loaded packages.

y <- 10

f <- function(x) {
   y <- 2
   y^2 + g(x)
}

g <- function(x) {
   x * y
}

What is the value of f(3)?
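One way to check the answer to the question above is to run it. The sketch restates the same definitions so that it is self-contained. Because g() is defined in the global environment, it looks up y there (10) and not inside f() (2), so the result is 2^2 + 3 * 10:

```r
y <- 10

f <- function(x) {
  y <- 2          # local to f(); invisible to g()
  y^2 + g(x)
}

g <- function(x) {
  x * y           # y is looked up where g() was defined: the global y
}

f(3)  # 34
```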

Functions versus variables

In R, functions are ordinary objects. This means the scoping rules described above also apply to functions.
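However, the rules get more complicated when a function and a non-function share the same name: when a name is used in a function call, R ignores non-function objects while looking for it.
It is best to avoid giving a function and a non-function the same name.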
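A short sketch of why shared names are confusing (g09 and g10 are illustrative names). Inside g10(), the name g09 refers to two different objects depending on how it is used:

```r
g09 <- function(x) x + 100

g10 <- function() {
  g09 <- 10   # a local non-function with the same name
  g09(g09)    # call position: R skips the local 10 and finds the function;
              # argument position: the local value 10 is used
}

g10()  # 110
```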

A fresh start

Every time a function is called a new environment is created to host its execution.

rm(a) # just in case...
g11 <- function() {
  if (!exists("a")) {
    a <- 1
  } else {
    a <- a + 1
  }
  a
}

g11()

## [1] 1

g11()

## [1] 1

What happens if we do

a <- 1:5
g11()
g11()
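The question above can be answered by running the code. The sketch below repeats the definition of g11() so that it is self-contained. Since a now exists in the global environment, each call takes the else branch, creates a fresh local copy a + 1, and leaves the global a untouched, so both calls return the same value:

```r
g11 <- function() {
  if (!exists("a")) {
    a <- 1
  } else {
    a <- a + 1   # reads the global a, writes a new local a
  }
  a
}

a <- 1:5
g11()  # 2 3 4 5 6
g11()  # 2 3 4 5 6 again: the previous local a was discarded
a      # still 1 2 3 4 5
```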

Dynamic lookup

Lexical scoping determines where to look for values.
R looks for values when the function is run, not when the function is created.

g12 <- function() x + 1
x <- 15
g12()

## [1] 16

x <- 20
g12()

## [1] 21

Depending on variables defined in the global environment can be bad!
codetools::findGlobals() can help by listing a function's external dependencies.
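A minimal sketch of findGlobals() applied to g12(), assuming the codetools package is available (it is normally installed alongside R as a recommended package). The name globals is illustrative:

```r
g12 <- function() x + 1

# Lists the names g12() needs from outside itself.
# The result includes "x", and also "+", since operators are functions too.
globals <- codetools::findGlobals(g12)
globals
```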

Default arguments

You can define default values for arguments
Default values can be defined in terms of other arguments, or even in terms of variables defined later in the function.
This works because R uses lazy evaluation: function arguments are evaluated only when they are accessed.

h04 <- function(x = 1, y = x * 2, z = a + b) {
  a <- 10
  b <- 100
  
  c(x, y, z)
}

h04()

## [1]   1   2 110
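Two small sketches of lazy evaluation (h05 and h06 are illustrative names). In the first, the argument is never accessed, so the stop() call passed in is never run. In the second, the default for y is only computed when y is first used, by which time x has been reassigned:

```r
# The argument x is never used, so it is never evaluated
h05 <- function(x) {
  10
}
h05(stop("This argument is never evaluated"))  # 10

# The default y = x * 2 is evaluated inside the function, on first use
h06 <- function(x, y = x * 2) {
  x <- 100
  y
}
h06(1)  # 200, not 2
```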

`...` (dot-dot-dot)

Functions can have a special argument ...
With ..., a function can take any number of additional arguments
You can use ... to pass those additional arguments on to another function
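A minimal sketch of a function that forwards its extra arguments (i01, i02 and i03 are illustrative names). i02() keeps x for itself and hands everything else to i01(), while i03() shows that list(...) collects the extra arguments into a list:

```r
i01 <- function(y, z) {
  list(y = y, z = z)
}

# x is consumed here; y and z travel through ... to i01()
i02 <- function(x, ...) {
  i01(...)
}
str(i02(x = 1, y = 2, z = 3))

# list(...) captures whatever was passed
i03 <- function(...) {
  list(...)
}
str(i03(a = 1, b = "two"))
```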

Pro

If your function takes a function as an argument, you want some way to pass additional arguments to that function.

x <- list(c(1, 3, NA), c(4, NA, 6))
str(lapply(x, mean, na.rm = TRUE))

## List of 2
##  $ : num 2
##  $ : num 5

Con

A misspelled argument will not raise an error.

sum(1, 2, NA, na_rm = TRUE)

## [1] NA

Control flow

These are the basic control-flow constructs of the R language. They function in much the same way as control statements in any Algol-like (Algol short for “Algorithmic Language”) language. They are all reserved words.

keyword	usage
if	if(cond) expr
if-else	if(cond) cons.expr else alt.expr
for	for(var in seq) expr
while	while(cond) expr
break	breaks out of a for, while, or repeat loop
next	halts the processing of the current iteration and advances the looping index
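The constructs in the table can be combined as in the following sketch (total and n are illustrative names). The for loop sums the odd numbers up to 7, using next to skip even values and break to stop early:

```r
total <- 0
for (i in 1:10) {
  if (i %% 2 == 0) next   # skip even numbers
  if (i > 7) break        # leave the loop entirely at i = 9
  total <- total + i
}
total  # 1 + 3 + 5 + 7 = 16

# while repeats as long as its condition holds
n <- 0
while (n < 3) {
  n <- n + 1
}
n  # 3

# if / else chooses between two expressions
if (total > 10) "large" else "small"  # "large"
```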

Exiting a function

Most functions exit in one of two ways:

- return a value, indicating success
- throw an error, indicating failure

Implicit versus explicit returns

There are two ways that a function can return a value:

Implicitly, where the last evaluated expression is the return value:

j01 <- function(x) {
  if (x < 10) {
    0
  } else {
    10
  }
}
j01(5)

## [1] 0

j01(15)

## [1] 10

Explicitly, by calling return()

j02 <- function(x) {
  if (x < 10) {
    return(0)
  } else {
    return(10)
  }
}

You can hide the output from automatic printing by applying invisible() to the last value:

j04 <- function() invisible(1)
j04()
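An invisible value is still returned; only the automatic printing is suppressed. The sketch below repeats j04() and shows three ways to confirm this (value is an illustrative name):

```r
j04 <- function() invisible(1)

j04()            # prints nothing
(j04())          # wrapping in parentheses forces printing: [1] 1

value <- j04()   # the value can be assigned as usual
value            # 1

withVisible(j04())$visible  # FALSE: the visibility flag is off
```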

Errors

If a function cannot complete its assigned task, it should throw an error with stop(), which immediately terminates the execution of the function.

j05 <- function() {
  stop("I'm an error")
  return(10)
}
j05()

## Error in j05(): I'm an error
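To confirm that execution stops at stop() without halting a script, the error can be caught. This sketch uses tryCatch(), which is not covered in this section and appears here only as a convenience; msg is an illustrative name:

```r
j05 <- function() {
  stop("I'm an error")
  return(10)   # never reached
}

# Catch the error and extract its message instead of aborting
msg <- tryCatch(j05(), error = function(e) conditionMessage(e))
msg  # "I'm an error"
```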

Exit handlers

Use on.exit() to set up an exit handler that runs regardless of whether the function exits normally or with an error.
Always set add = TRUE when using on.exit(). Otherwise, each call will overwrite the previous exit handler.

j06 <- function(x) {
  cat("Hello\n")
  on.exit(cat("Goodbye!\n"), add = TRUE)
  
  if (x) {
    return(10)
  } else {
    stop("Error")
  }
}

j06(TRUE)

## Hello
## Goodbye!

## [1] 10

j06(FALSE)

## Hello

## Error in j06(FALSE): Error

## Goodbye!
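The effect of add = TRUE can be seen by registering two handlers with and without it (j07 and j08 are illustrative names). Without add = TRUE, the second call replaces the first handler:

```r
# Without add = TRUE: the second handler overwrites the first
j07 <- function() {
  on.exit(cat("a\n"))
  on.exit(cat("b\n"))
}
j07()  # prints only "b"

# With add = TRUE: both handlers run, in the order they were registered
j08 <- function() {
  on.exit(cat("a\n"), add = TRUE)
  on.exit(cat("b\n"), add = TRUE)
}
j08()  # prints "a" then "b"
```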

An exit handler can be used for clean-up, for example to restore the working directory:

with_dir <- function(dir, code) {
  old <- setwd(dir)
  on.exit(setwd(old), add = TRUE)

  code
}

getwd()

## [1] "/Users/xji3/Dropbox/My_Files/Tulane/Teaching/tulane-math-7360-2023.github.io/lectures/06-Data_structure"

with_dir("~", getwd())

## [1] "/Users/xji3"

getwd()

## [1] "/Users/xji3/Dropbox/My_Files/Tulane/Teaching/tulane-math-7360-2023.github.io/lectures/06-Data_structure"

Data Structure

MATH-7360 Data Analysis

Dr. Xiang Ji @ Tulane University

Sep 06, 2023
