broglar is an R package that helps browse over longitudinal data graphically and analytically in R.
individuals repeatedly measured through time
brolgar
Install from GitHub:
# install.packages("remotes")
remotes::install_github("njtierney/brolgar")
## Downloading GitHub repo njtierney/brolgar@HEAD
## withr (2.5.0 -> 2.5.1 ) [CRAN]
## vctrs (0.6.2 -> 0.6.3 ) [CRAN]
## cpp11 (0.4.3 -> 0.4.6 ) [CRAN]
## Rcpp (1.0.10 -> 1.0.11 ) [CRAN]
## labeling (0.4.2 -> 0.4.3 ) [CRAN]
## digest (0.6.31 -> 0.6.33 ) [CRAN]
## gtable (0.3.3 -> 0.3.4 ) [CRAN]
## ggplot2 (3.4.2 -> 3.4.3 ) [CRAN]
## numDeriv (NA -> 2016.8-1.1) [CRAN]
## purrr (1.0.1 -> 1.0.2 ) [CRAN]
## dplyr (1.1.2 -> 1.1.3 ) [CRAN]
## anytime (NA -> 0.3.9 ) [CRAN]
## progressr (NA -> 0.14.0 ) [CRAN]
## distribut... (NA -> 0.3.2 ) [CRAN]
## tsibble (NA -> 1.1.3 ) [CRAN]
## fabletools (NA -> 0.3.3 ) [CRAN]
## Installing 16 packages: withr, vctrs, cpp11, Rcpp, labeling, digest, gtable, ggplot2, numDeriv, purrr, dplyr, anytime, progressr, distributional, tsibble, fabletools
## Installing packages into '/Users/xji3/Library/R/arm64/4.3/library'
## (as 'lib' is unspecified)
##
## The downloaded binary packages are in
## /var/folders/1h/__k6zw4s15l3cb673l1cdkkc0000gn/T//RtmpA33OQ0/downloaded_packages
## ── R CMD build ─────────────────────────────────────────────────────────────────
##
checking for file ‘/private/var/folders/1h/__k6zw4s15l3cb673l1cdkkc0000gn/T/RtmpA33OQ0/remotesaab16ae6e562/njtierney-brolgar-dabc060/DESCRIPTION’ ...
✔ checking for file ‘/private/var/folders/1h/__k6zw4s15l3cb673l1cdkkc0000gn/T/RtmpA33OQ0/remotesaab16ae6e562/njtierney-brolgar-dabc060/DESCRIPTION’
##
─ preparing ‘brolgar’:
##
checking DESCRIPTION meta-information ...
✔ checking DESCRIPTION meta-information
##
─ checking for LF line-endings in source and make files and shell scripts
##
─ checking for empty or unneeded directories
##
─ building ‘brolgar_1.0.0.9000.tar.gz’
##
##
## Installing package into '/Users/xji3/Library/R/arm64/4.3/library'
## (as 'lib' is unspecified)
Load tidyverse
, brolgar
and
gghighlight
:
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.3 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.3 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(brolgar)
library(gghighlight)
The csv file wages_pp.csv
contains a data set from the
textbook Applied Longitudical Data Analysis (2003) by Singer
and Willett. This data contains measurements on hourly wages by years in
the workforce, with education and race as covariates.
head wages_pp.csv
## "id","lnw","exper","ged","postexp","black","hispanic","hgc","hgc.9","uerate","ue.7","ue.centert1","ue.mean","ue.person.cen","ue1"
## 31,1.491,0.015,1,0.015,0,1,8,-1,3.215,-3.785,0,3.215,0,3.215
## 31,1.433,0.715,1,0.715,0,1,8,-1,3.215,-3.785,0,3.215,0,3.215
## 31,1.469,1.734,1,1.734,0,1,8,-1,3.215,-3.785,0,3.215,0,3.215
## 31,1.749,2.773,1,2.773,0,1,8,-1,3.295,-3.705,0.08,3.215,0.08,3.215
## 31,1.931,3.927,1,3.927,0,1,8,-1,2.895,-4.105,-0.32,3.215,-0.32,3.215
## 31,1.709,4.946,1,4.946,0,1,8,-1,2.495,-4.505,-0.72,3.215,-0.72,3.215
## 31,2.086,5.965,1,5.965,0,1,8,-1,2.595,-4.405,-0.62,3.215,-0.62,3.215
## 31,2.129,6.984,1,6.984,0,1,8,-1,4.795,-2.205,1.58,3.215,1.58,3.215
## 36,1.982,0.315,1,0.315,0,0,9,0,4.895,-2.105,0,5.0965,-0.201500000000000,4.895
wages
## # A tsibble: 6,402 x 9 [!]
## # Key: id [888]
## id ln_wages xp ged xp_since_ged black hispanic high_grade
## <int> <dbl> <dbl> <int> <dbl> <int> <int> <int>
## 1 31 1.49 0.015 1 0.015 0 1 8
## 2 31 1.43 0.715 1 0.715 0 1 8
## 3 31 1.47 1.73 1 1.73 0 1 8
## 4 31 1.75 2.77 1 2.77 0 1 8
## 5 31 1.93 3.93 1 3.93 0 1 8
## 6 31 1.71 4.95 1 4.95 0 1 8
## 7 31 2.09 5.96 1 5.96 0 1 8
## 8 31 2.13 6.98 1 6.98 0 1 8
## 9 36 1.98 0.315 1 0.315 0 0 9
## 10 36 1.80 0.983 1 0.983 0 0 9
## # ℹ 6,392 more rows
## # ℹ 1 more variable: unemploy_rate <dbl>
Read following variables into a tibble:
- id
: 1-888, for each subject.
- lnw
: natural log of wages, adjusted for inflation, to
1990 dollars.
- exper
: Experience - the length of time in the workforce
(in years). The number of time points and values of time points for each
subject can differ.
- ged
: when/if a graduate equivalency diploma is
obtained.
- postexp
: change in experience since getting a ged (if
they get one).
- black
: categorical indicator of race = black.
- hispanic
: categorical indicator of race = hispanic.
- hgc
: highest grade completed.
- uerate
: unemployment rates in the local geographic region
at each measurement time.
wages <- read_csv("wages_pp.csv") %>% select(1:8, 10)
## Rows: 6402 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (15): id, lnw, exper, ged, postexp, black, hispanic, hgc, hgc.9, uerate,...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
wages %>% print(width = Inf)
## # A tibble: 6,402 × 9
## id lnw exper ged postexp black hispanic hgc uerate
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 31 1.49 0.015 1 0.015 0 1 8 3.22
## 2 31 1.43 0.715 1 0.715 0 1 8 3.22
## 3 31 1.47 1.73 1 1.73 0 1 8 3.22
## 4 31 1.75 2.77 1 2.77 0 1 8 3.30
## 5 31 1.93 3.93 1 3.93 0 1 8 2.90
## 6 31 1.71 4.95 1 4.95 0 1 8 2.50
## 7 31 2.09 5.96 1 5.96 0 1 8 2.60
## 8 31 2.13 6.98 1 6.98 0 1 8 4.80
## 9 36 1.98 0.315 1 0.315 0 0 9 4.89
## 10 36 1.80 0.983 1 0.983 0 0 9 7.4
## # ℹ 6,392 more rows
tsibble
We turn a regular data frame/tibble into a tidy time series data
frame tsibble
by identifying
1. The key variable is the identifier of
individuals.
2. The index variable is the time component of your
data.
3. The regularity of the time interveral (index).
Longitudinal data typically has irregular time periods between
measurements, but can have regular measurements.
wages <- as_tsibble(x = wages,
key = id,
index = exper,
regular = FALSE)
wages
## # A tsibble: 6,402 x 9 [!]
## # Key: id [888]
## id lnw exper ged postexp black hispanic hgc uerate
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 31 1.49 0.015 1 0.015 0 1 8 3.22
## 2 31 1.43 0.715 1 0.715 0 1 8 3.22
## 3 31 1.47 1.73 1 1.73 0 1 8 3.22
## 4 31 1.75 2.77 1 2.77 0 1 8 3.30
## 5 31 1.93 3.93 1 3.93 0 1 8 2.90
## 6 31 1.71 4.95 1 4.95 0 1 8 2.50
## 7 31 2.09 5.96 1 5.96 0 1 8 2.60
## 8 31 2.13 6.98 1 6.98 0 1 8 4.80
## 9 36 1.98 0.315 1 0.315 0 0 9 4.89
## 10 36 1.80 0.983 1 0.983 0 0 9 7.4
## # ℹ 6,392 more rows
sample_n_keys()
and sample_frac_keys()
to
take a random sample of keys.
set.seed(203)
wages %>%
sample_n_keys(size = 5) %>%
ggplot(aes(x = exper,
y = lnw,
group = id)) +
geom_line()
facet_sample()
allows you to specify the number of keys
per facet, and the number of facets with n_per_facet
and
n_facets
. By default, it splits the data into 12 facets
with 3 per facet:
set.seed(203)
ggplot(wages,
aes(x = exper,
y = lnw,
group = id)) +
geom_line() +
facet_sample()
Five number summaries of wages for each individual:
wages.summary <- wages %>%
features(lnw, feat_five_num)
wages.summary %>% print(width = Inf)
## # A tibble: 888 × 6
## id min q25 med q75 max
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 31 1.43 1.48 1.73 2.02 2.13
## 2 36 1.80 1.97 2.32 2.59 2.93
## 3 53 1.54 1.58 1.71 1.89 3.24
## 4 122 0.763 2.10 2.19 2.46 2.92
## 5 134 2.00 2.28 2.36 2.79 2.93
## 6 145 1.48 1.58 1.77 1.89 2.04
## 7 155 1.54 1.83 2.22 2.44 2.64
## 8 173 1.56 1.68 2.00 2.05 2.34
## 9 206 2.03 2.07 2.30 2.45 2.48
## 10 207 1.58 1.87 2.15 2.26 2.66
## # ℹ 878 more rows
Other feature functions include feat_monotonic
,
feat_ranges
, feat_spread
,
feat_three_num
, n_obs
. For example, find those
whose wages only increase or decrease with feat_monotonic:
wages %>%
features(lnw, feat_monotonic)
## # A tibble: 888 × 5
## id increase decrease unvary monotonic
## <dbl> <lgl> <lgl> <lgl> <lgl>
## 1 31 FALSE FALSE FALSE FALSE
## 2 36 FALSE FALSE FALSE FALSE
## 3 53 FALSE FALSE FALSE FALSE
## 4 122 FALSE FALSE FALSE FALSE
## 5 134 FALSE FALSE FALSE FALSE
## 6 145 FALSE FALSE FALSE FALSE
## 7 155 FALSE FALSE FALSE FALSE
## 8 173 FALSE FALSE FALSE FALSE
## 9 206 TRUE FALSE FALSE TRUE
## 10 207 FALSE FALSE FALSE FALSE
## # ℹ 878 more rows
Find the number of observations per individual and summarize by bar plot:
wages %>%
features(lnw, n_obs) %>%
ggplot(aes(x = n_obs)) +
geom_bar()
Once you have created these features, you can join them back to the
data with a left_join
:
wages %>%
features(lnw, feat_monotonic) %>%
left_join(wages, by = "id") %>%
ggplot(aes(x = exper,
y = lnw,
group = id)) +
geom_line() +
gghighlight(increase)
## Warning: Tried to calculate with group_by(), but the calculation failed.
## Falling back to ungrouped filter operation...
## label_key: id
## Too many data series, skip labeling