Longitudinal Data Structures
Source:vignettes/longitudinal-data-structures.Rmd
There are many ways to describe longitudinal data: panel data, cross-sectional data, time series, and more. We define longitudinal data as:
Information from the same individuals, recorded at multiple points in time.
To explore and model longitudinal data, it is important to understand which variables represent the individual components, which represent the time components, and how these together identify an individual moving through time. Identifying the individual and time components can sometimes be a challenge, so this vignette walks through how to do this.
Defining longitudinal data as a tsibble
The tools and workflows in brolgar are designed to work with a special tidy time series data frame called a tsibble. We can define our longitudinal data in terms of a time series to gain access to some really useful tools. To do so, we need to identify three components:
- The key variable in your data is the identifier of your individual.
- The index variable is the time component of your data.
- The regularity of the time interval (index). Longitudinal data typically has irregular time periods between measurements, but can have regular measurements.
Together, the time index and key uniquely identify an observation with repeated measurements.
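These three components can be sketched with a small toy dataset (invented here purely for illustration - `person`, `time`, and `score` are not part of brolgar):

```r
library(tsibble)

# A toy longitudinal dataset: two people, each measured at
# several irregularly spaced times.
df <- data.frame(
  person = c(1, 1, 1, 2, 2),
  time   = c(0.5, 1.2, 3.0, 0.9, 2.1),
  score  = c(10, 12, 15, 9, 11)
)

# key = who is measured; index = when they are measured;
# regular = FALSE because the gaps between measurements differ.
df_ts <- as_tsibble(df, key = person, index = time, regular = FALSE)
df_ts
```

Here `person` plays the role of the key, `time` the index, and the uneven gaps make the series irregular.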
The term key is used a lot in brolgar, so it is an important idea to internalise:
The key is the identifier of your individuals or series
Why care about defining longitudinal data as a time series? Once we account for this time series structure inherent in longitudinal data, we gain access to a suite of nice tools that simplify and accelerate how we work with time series data.
brolgar is built on top of the powerful tsibble package by Earo Wang. If you would like to learn more, see the official package documentation or read the paper.
Converting your longitudinal data to a time series
To convert longitudinal data into a “time series tibble”, a tsibble, we need to consider which variables identify:
- The individual, who would have repeated measurements. This is the key.
- The time component. This is the index.
- The regularity of the time interval (index).
Together, the time index and key uniquely identify an observation with repeated measurements.
The vignette now walks through some examples of converting longitudinal data into a tsibble.
example data: wages
Let’s look at the wages data analysed in Singer &amp; Willett (2003). This data contains measurements on hourly wages by years in the workforce, with education and race as covariates. The population measured was male high-school dropouts, aged between 14 and 17 years when first measured. Below are the first 10 rows of the data.
library(brolgar)
suppressPackageStartupMessages(library(dplyr))
slice(wages, 1:10) %>% knitr::kable()
id | ln_wages | xp | ged | xp_since_ged | black | hispanic | high_grade | unemploy_rate |
---|---|---|---|---|---|---|---|---|
31 | 1.491 | 0.015 | 1 | 0.015 | 0 | 1 | 8 | 3.21 |
31 | 1.433 | 0.715 | 1 | 0.715 | 0 | 1 | 8 | 3.21 |
31 | 1.469 | 1.734 | 1 | 1.734 | 0 | 1 | 8 | 3.21 |
31 | 1.749 | 2.773 | 1 | 2.773 | 0 | 1 | 8 | 3.30 |
31 | 1.931 | 3.927 | 1 | 3.927 | 0 | 1 | 8 | 2.89 |
31 | 1.709 | 4.946 | 1 | 4.946 | 0 | 1 | 8 | 2.49 |
31 | 2.086 | 5.965 | 1 | 5.965 | 0 | 1 | 8 | 2.60 |
31 | 2.129 | 6.984 | 1 | 6.984 | 0 | 1 | 8 | 4.80 |
36 | 1.982 | 0.315 | 1 | 0.315 | 0 | 0 | 9 | 4.89 |
36 | 1.798 | 0.983 | 1 | 0.983 | 0 | 0 | 9 | 7.40 |
To create a tsibble of the data we ask, “which variables identify…”:
- The key, the individual, who would have repeated measurements.
- The index, the time component.
- The regularity of the time interval (index).
Together, the time index and key uniquely identify an observation with repeated measurements.
From this, we can say that:
- The key is the variable id - the subject id, from 1-888.
- The index is the variable xp - the experience in years an individual has.
- The data is irregular, since experience is measured as a fraction of a year, not at whole-number intervals.
We can use this information to create a tsibble of this data using as_tsibble():
library(tsibble)
as_tsibble(x = wages,
key = id,
index = xp,
regular = FALSE)
#> # A tsibble: 6,402 x 9 [!]
#> # Key: id [888]
#> id ln_wages xp ged xp_since_ged black hispanic high_grade
#> <int> <dbl> <dbl> <int> <dbl> <int> <int> <int>
#> 1 31 1.49 0.015 1 0.015 0 1 8
#> 2 31 1.43 0.715 1 0.715 0 1 8
#> 3 31 1.47 1.73 1 1.73 0 1 8
#> 4 31 1.75 2.77 1 2.77 0 1 8
#> 5 31 1.93 3.93 1 3.93 0 1 8
#> 6 31 1.71 4.95 1 4.95 0 1 8
#> 7 31 2.09 5.96 1 5.96 0 1 8
#> 8 31 2.13 6.98 1 6.98 0 1 8
#> 9 36 1.98 0.315 1 0.315 0 0 9
#> 10 36 1.80 0.983 1 0.983 0 0 9
#> # ℹ 6,392 more rows
#> # ℹ 1 more variable: unemploy_rate <dbl>
Note that regular = FALSE, since we have an irregular time series.
Note the following information printed at the top of wages:
# A tsibble: 6,402 x 9 [!]
# Key: id [888]
...
This says:
- We have 6,402 rows,
- with 9 columns.
The ! at the top means that there is no regular spacing between series. The “key” variable is then listed - id, of which there are 888.
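The same information can be retrieved programmatically with tsibble’s accessor helpers. As a sketch, storing the converted data first (the name wages_ts is just for illustration):

```r
library(tsibble)
library(brolgar)

# convert wages, as above
wages_ts <- as_tsibble(wages,
                       key = id,
                       index = xp,
                       regular = FALSE)

n_keys(wages_ts)     # number of keys (individuals): 888
key_vars(wages_ts)   # the key variable(s): "id"
index_var(wages_ts)  # the index variable: "xp"
is_regular(wages_ts) # FALSE, matching regular = FALSE
```

These accessors are handy when writing functions that should work on any tsibble, whatever its key and index are called.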
example: heights data
The heights data is a little simpler than the wages data, and contains the average male heights in 144 countries from 1810-1989, with a smaller number of countries from 1500-1800.
It contains four variables:
- country
- continent
- year
- height_cm
To create a tsibble of the data we ask, “which variables identify…”:
- The key, the individual, who would have repeated measurements.
- The index, the time component.
- The regularity of the time interval (index).
In this case:
- The individual is not a person, but a country
- The time is year
- The year is not regular, because measurements are not taken at fixed year intervals.
This data is already a tsibble object; we could create the tsibble ourselves with the following code:
as_tsibble(x = heights,
key = country,
index = year,
regular = FALSE)
#> # A tsibble: 1,490 x 4 [!]
#> # Key: country [144]
#> country continent year height_cm
#> <chr> <chr> <dbl> <dbl>
#> 1 Afghanistan Asia 1870 168.
#> 2 Afghanistan Asia 1880 166.
#> 3 Afghanistan Asia 1930 167.
#> 4 Afghanistan Asia 1990 167.
#> 5 Afghanistan Asia 2000 161.
#> 6 Albania Europe 1880 170.
#> 7 Albania Europe 1890 170.
#> 8 Albania Europe 1900 169.
#> 9 Albania Europe 2000 168.
#> 10 Algeria Africa 1910 169.
#> # ℹ 1,480 more rows
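We can check the key count and the span of measurement years directly; n_keys() is a tsibble helper and index_summary() is a brolgar function (heights_ts here is just an illustrative name for the converted data):

```r
library(tsibble)
library(brolgar)

heights_ts <- as_tsibble(heights,
                         key = country,
                         index = year,
                         regular = FALSE)

n_keys(heights_ts)               # 144 countries
index_summary(heights_ts, year)  # distribution of measurement years
```

The uneven spread of years confirms why regular = FALSE is the right choice here.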
example: gapminder
The gapminder R package contains a dataset of a subset of the gapminder study (link). This contains data on life expectancy, GDP per capita, and population by country.
library(gapminder)
gapminder
#> # A tibble: 1,704 × 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779.
#> 2 Afghanistan Asia 1957 30.3 9240934 821.
#> 3 Afghanistan Asia 1962 32.0 10267083 853.
#> 4 Afghanistan Asia 1967 34.0 11537966 836.
#> 5 Afghanistan Asia 1972 36.1 13079460 740.
#> 6 Afghanistan Asia 1977 38.4 14880372 786.
#> 7 Afghanistan Asia 1982 39.9 12881816 978.
#> 8 Afghanistan Asia 1987 40.8 13867957 852.
#> 9 Afghanistan Asia 1992 41.7 16317921 649.
#> 10 Afghanistan Asia 1997 41.8 22227415 635.
#> # ℹ 1,694 more rows
Let’s identify:
- The key, the individual, who would have repeated measurements.
- The index, the time component.
- The regularity of the time interval (index).
This is in fact very similar to the heights dataset:
- The key is the country
- The index is the year
To identify if the year is regular, we can do a bit of data exploration using index_summary():
gapminder %>%
group_by(country) %>%
index_summary(year)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1952 1966 1980 1980 1993 2007
This shows us that the year is measured every five years - so now we know that this is a regular longitudinal dataset, and it can be encoded like so:
as_tsibble(gapminder,
key = country,
index = year,
regular = TRUE)
#> # A tsibble: 1,704 x 6 [5Y]
#> # Key: country [142]
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779.
#> 2 Afghanistan Asia 1957 30.3 9240934 821.
#> 3 Afghanistan Asia 1962 32.0 10267083 853.
#> 4 Afghanistan Asia 1967 34.0 11537966 836.
#> 5 Afghanistan Asia 1972 36.1 13079460 740.
#> 6 Afghanistan Asia 1977 38.4 14880372 786.
#> 7 Afghanistan Asia 1982 39.9 12881816 978.
#> 8 Afghanistan Asia 1987 40.8 13867957 852.
#> 9 Afghanistan Asia 1992 41.7 16317921 649.
#> 10 Afghanistan Asia 1997 41.8 22227415 635.
#> # ℹ 1,694 more rows
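The [5Y] in the printed header is the interval that tsibble detected between measurements. It can be inspected with interval(), a tsibble helper (gapminder_ts is an illustrative name for the converted data):

```r
library(tsibble)
library(gapminder)

gapminder_ts <- as_tsibble(gapminder,
                           key = country,
                           index = year,
                           regular = TRUE)

# the detected interval between measurements - printed as 5Y
interval(gapminder_ts)
```

Had we set regular = FALSE instead, the header would show [!] and no interval would be recorded.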
example: PISA data
The PISA study measures school students around the world on a series of math, reading, and science scores. A subset of the data looks like so:
pisa
#> # A tibble: 433 × 11
#> country year math_mean math_min math_max read_mean read_min read_max
#> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ALB 2000 395. 27.4 722. 354. 59.7 640.
#> 2 ALB 2009 377. 79.6 706. 385. 17.0 662.
#> 3 ALB 2012 395. 62.4 688. 394. 0.0834 742.
#> 4 ALB 2015 412. 122. 711. 405. 93.6 825.
#> 5 ALB 2018 437. 96.5 789. 405. 152. 693.
#> 6 ARE 2009 421. 57.8 768. 431. 48.1 772.
#> 7 ARE 2012 434. 138. 862. 442. 75.5 785.
#> 8 ARE 2015 427. 91.8 793. 432. 54.4 827.
#> 9 ARE 2018 437. 87.6 865. 431. 84.0 814.
#> 10 ARG 2000 385. 16.0 675. 417. 84.2 761.
#> # ℹ 423 more rows
#> # ℹ 3 more variables: science_mean <dbl>, science_min <dbl>, science_max <dbl>
Let’s identify:
- The key, the individual, who would have repeated measurements.
- The index, the time component.
- The regularity of the time interval (index).
In the full PISA study, students are nested within schools and countries, but this subset is summarised at the country level. So here the key is country, and the index is year, and we would write the following:
as_tsibble(pisa,
           key = country,
           index = year)
We can assess the regularity of the year like so:
index_regular(pisa, year)
#> [1] TRUE
index_summary(pisa, year)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 2000 2004 2009 2009 2014 2018
We can now convert this into a tsibble:
pisa_ts <- as_tsibble(pisa,
key = country,
index = year,
regular = TRUE)
pisa_ts
#> # A tsibble: 433 x 11 [3Y]
#> # Key: country [100]
#> country year math_mean math_min math_max read_mean read_min read_max
#> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 ALB 2000 395. 27.4 722. 354. 59.7 640.
#> 2 ALB 2009 377. 79.6 706. 385. 17.0 662.
#> 3 ALB 2012 395. 62.4 688. 394. 0.0834 742.
#> 4 ALB 2015 412. 122. 711. 405. 93.6 825.
#> 5 ALB 2018 437. 96.5 789. 405. 152. 693.
#> 6 ARE 2009 421. 57.8 768. 431. 48.1 772.
#> 7 ARE 2012 434. 138. 862. 442. 75.5 785.
#> 8 ARE 2015 427. 91.8 793. 432. 54.4 827.
#> 9 ARE 2018 437. 87.6 865. 431. 84.0 814.
#> 10 ARG 2000 385. 16.0 675. 417. 84.2 761.
#> # ℹ 423 more rows
#> # ℹ 3 more variables: science_mean <dbl>, science_min <dbl>, science_max <dbl>
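Once the data is a tsibble, brolgar’s series-aware tools apply directly. For example, sample_n_keys() (a brolgar function) samples whole series rather than random rows - a sketch:

```r
library(brolgar)
library(tsibble)

pisa_ts <- as_tsibble(pisa,
                      key = country,
                      index = year,
                      regular = TRUE)

# sample 3 whole countries, keeping every one of their measurements
sample_n_keys(pisa_ts, size = 3)
```

A plain dplyr::sample_n() would instead pick 3 arbitrary rows, breaking up the individual series - which is exactly what the key structure protects against.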
Conclusion
This idea of longitudinal data is core to brolgar. Understanding what longitudinal data is, and how this can be linked to a time series representation of data, helps us understand our data structure and gives us access to more flexible tools. Other vignettes in the package will further show why the time series tsibble is useful.