There are many ways to describe longitudinal data, ranging from panel data and cross-sectional data to time series. We define longitudinal data as:

> Information from the same individuals, recorded at multiple points in time.

To explore and model longitudinal data, it is important to understand which variables represent the individual component, which represent the time component, and how these together identify an individual moving through time. Identifying the individual and time components can sometimes be a challenge, so this vignette walks through how to do this.

# Defining longitudinal data as a tsibble

The tools and workflows in brolgar are designed to work with a special tidy time series data frame called a tsibble. We can define our longitudinal data in terms of a time series to gain access to some really useful tools. To do so, we need to identify three components:

1. The key variable in your data is the identifier of your individual.
2. The index variable is the time component of your data.
3. The regularity of the time interval (index). Longitudinal data typically has irregular time periods between measurements, but it can also have regular intervals.

Together, the time index and key uniquely identify an observation with repeated measurements.
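To make the idea of regularity concrete, here is a tiny base R sketch, using made-up time points (not brolgar data) and a hypothetical helper name, that checks whether the gaps between successive measurement times are all equal:

```r
# A time index is "regular" when the gaps between successive
# measurement times are all the same (toy data, not from brolgar).
is_regular_gaps <- function(times) {
  gaps <- diff(sort(times))
  length(unique(gaps)) == 1
}

is_regular_gaps(c(2000, 2005, 2010, 2015))      # fixed 5-year gaps: TRUE
is_regular_gaps(c(0.015, 0.715, 1.734, 2.773))  # varying gaps: FALSE
```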

The term key is used a lot in brolgar, so it is an important idea to internalise:

> The key is the identifier of your individuals or series.

Why care about defining longitudinal data as a time series? Once we account for this time series structure inherent in longitudinal data, we gain access to a suite of nice tools that simplify and accelerate how we work with time series data.

brolgar is built on top of the powerful tsibble package by Earo Wang. If you would like to learn more, see the official package documentation or read the paper.

## Converting your longitudinal data to a time series

To convert longitudinal data into a “time series tibble”, a tsibble, we need to consider which variables identify:

1. The individual, who would have repeated measurements. This is the key.
2. The time component. This is the index.
3. The regularity of the time interval (index).

Together, the time index and key uniquely identify an observation with repeated measurements.

The vignette now walks through some examples of converting longitudinal data into a tsibble.

# Example data: wages

Let’s look at the wages data analysed in Singer & Willett (2003). This data contains measurements of hourly wages by years in the workforce, with education and race as covariates. The population measured was male high-school dropouts, aged between 14 and 17 years when first measured. Below are the first 10 rows of the data.

```r
library(brolgar)
suppressPackageStartupMessages(library(dplyr))
slice(wages, 1:10) %>% knitr::kable()
```

| id | ln_wages |    xp | ged | xp_since_ged | black | hispanic | high_grade | unemploy_rate |
|---:|---------:|------:|----:|-------------:|------:|---------:|-----------:|--------------:|
| 31 |    1.491 | 0.015 |   1 |        0.015 |     0 |        1 |          8 |          3.21 |
| 31 |    1.433 | 0.715 |   1 |        0.715 |     0 |        1 |          8 |          3.21 |
| 31 |    1.469 | 1.734 |   1 |        1.734 |     0 |        1 |          8 |          3.21 |
| 31 |    1.749 | 2.773 |   1 |        2.773 |     0 |        1 |          8 |          3.30 |
| 31 |    1.931 | 3.927 |   1 |        3.927 |     0 |        1 |          8 |          2.89 |
| 31 |    1.709 | 4.946 |   1 |        4.946 |     0 |        1 |          8 |          2.49 |
| 31 |    2.086 | 5.965 |   1 |        5.965 |     0 |        1 |          8 |          2.60 |
| 31 |    2.129 | 6.984 |   1 |        6.984 |     0 |        1 |          8 |          4.80 |
| 36 |    1.982 | 0.315 |   1 |        0.315 |     0 |        0 |          9 |          4.89 |
| 36 |    1.798 | 0.983 |   1 |        0.983 |     0 |        0 |          9 |          7.40 |

To create a tsibble of the data we ask, “which variables identify…”:

1. The key, the individual, who would have repeated measurements.
2. The index, the time component.
3. The regularity of the time interval (index).

Together, the time index and key uniquely identify an observation with repeated measurements.

From this, we can say that:

1. The key is the variable id - the subject id, which ranges from 1 to 888.
2. The index is the variable xp, the experience in years an individual has.
3. The data is irregular, since experience is measured as a fraction of a year rather than at fixed intervals.
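We can check this irregularity directly with brolgar’s index_regular() (which appears again later in this vignette); since the gaps in xp vary between measurements, we expect it to return FALSE:

```r
library(brolgar)

# index_regular() reports whether the index increases at a fixed
# interval; the gaps in xp vary between measurements, so this
# should be FALSE for the wages data
index_regular(wages, xp)
```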

We can use this information to create a tsibble of this data using as_tsibble():

```r
library(tsibble)
as_tsibble(x = wages,
           key = id,
           index = xp,
           regular = FALSE)
#> # A tsibble: 6,402 x 9 [!]
#> # Key:       id [888]
#>       id ln_wages    xp   ged xp_since_ged black hispanic high_grade
#>    <int>    <dbl> <dbl> <int>        <dbl> <int>    <int>      <int>
#>  1    31     1.49 0.015     1        0.015     0        1          8
#>  2    31     1.43 0.715     1        0.715     0        1          8
#>  3    31     1.47 1.73      1        1.73      0        1          8
#>  4    31     1.75 2.77      1        2.77      0        1          8
#>  5    31     1.93 3.93      1        3.93      0        1          8
#>  6    31     1.71 4.95      1        4.95      0        1          8
#>  7    31     2.09 5.96      1        5.96      0        1          8
#>  8    31     2.13 6.98      1        6.98      0        1          8
#>  9    36     1.98 0.315     1        0.315     0        0          9
#> 10    36     1.80 0.983     1        0.983     0        0          9
#> # … with 6,392 more rows, and 1 more variable: unemploy_rate <dbl>
```

Note that regular = FALSE, since we have an irregular time series.

Note the following information printed at the top of the wages output:

```
# A tsibble: 6,402 x 9 [!]
# Key:       id [888]
...
```

This says:

• We have 6,402 rows,
• with 9 columns.

The ! at the top means that there is no regular spacing between series.

The “key” variable is then listed - id, of which there are 888.
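This key information can also be recovered programmatically. For example, tsibble’s n_keys() counts the distinct keys; recreating the wages tsibble from above, it should report the 888 individuals:

```r
library(tsibble)
library(brolgar)

# Recreate the wages tsibble from above, then count its keys
wages_ts <- as_tsibble(x = wages,
                       key = id,
                       index = xp,
                       regular = FALSE)

n_keys(wages_ts)
#> [1] 888
```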

# Example: heights data

The heights data is a little simpler than the wages data. It contains the average male height in 144 countries from 1810 to 1989, with a smaller number of countries measured from 1500 to 1800.

It contains four variables:

• country
• continent
• year
• height_cm

To create a tsibble of the data we ask, “which variables identify…”:

1. The key, the individual, who would have repeated measurements.
2. The index, the time component.
3. The regularity of the time interval (index).

In this case:

• The individual is not a person, but a country.
• The time component is the year.
• The index is not regular, because measurements are not taken at a fixed time interval.
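One way to see this irregularity (a quick dplyr sketch, not a brolgar function) is to count, within each country, how many distinct gaps occur between successive measurement years - a regularly spaced series would have exactly one distinct gap:

```r
library(brolgar)
suppressPackageStartupMessages(library(dplyr))

# Countries with more than one distinct gap between successive
# measurement years are not regularly spaced
heights %>%
  as_tibble() %>%
  group_by(country) %>%
  summarise(n_gaps = n_distinct(diff(year))) %>%
  filter(n_gaps > 1)
```

Many countries show up here (Afghanistan alone has gaps of 10, 50, and 60 years in the rows printed above), which is why we set regular = FALSE.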

In brolgar, this data is already stored as a tsibble object; it was created with the following code:

```r
as_tsibble(x = heights,
           key = country,
           index = year,
           regular = FALSE)
#> # A tsibble: 1,490 x 4 [!]
#> # Key:       country [144]
#>    country     continent  year height_cm
#>    <chr>       <chr>     <dbl>     <dbl>
#>  1 Afghanistan Asia       1870      168.
#>  2 Afghanistan Asia       1880      166.
#>  3 Afghanistan Asia       1930      167.
#>  4 Afghanistan Asia       1990      167.
#>  5 Afghanistan Asia       2000      161.
#>  6 Albania     Europe     1880      170.
#>  7 Albania     Europe     1890      170.
#>  8 Albania     Europe     1900      169.
#>  9 Albania     Europe     2000      168.
#> 10 Algeria     Africa     1910      169.
#> # … with 1,480 more rows
```

# Example: gapminder

The gapminder R package contains a subset of the data from the Gapminder study. This contains data on life expectancy, GDP per capita, and population by country.

```r
library(gapminder)
gapminder
#> # A tibble: 1,704 x 6
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.
#>  3 Afghanistan Asia       1962    32.0 10267083      853.
#>  4 Afghanistan Asia       1967    34.0 11537966      836.
#>  5 Afghanistan Asia       1972    36.1 13079460      740.
#>  6 Afghanistan Asia       1977    38.4 14880372      786.
#>  7 Afghanistan Asia       1982    39.9 12881816      978.
#>  8 Afghanistan Asia       1987    40.8 13867957      852.
#>  9 Afghanistan Asia       1992    41.7 16317921      649.
#> 10 Afghanistan Asia       1997    41.8 22227415      635.
#> # … with 1,694 more rows
```

Let’s identify:

1. The key, the individual, who would have repeated measurements.
2. The index, the time component.
3. The regularity of the time interval (index).

This is in fact very similar to the heights dataset:

1. The key is the country
2. The index is the year

To identify whether the year is regular, we can do a bit of data exploration using index_summary():

```r
gapminder %>%
  group_by(country) %>%
  index_summary(year)
#>     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
#> -55.0000   5.0000   5.0000   0.0323   5.0000   5.0000
```

This shows us that measurements are taken every five years - so now we know that this is a regular longitudinal dataset, and it can be encoded like so:

```r
as_tsibble(gapminder,
           key = country,
           index = year,
           regular = TRUE)
#> # A tsibble: 1,704 x 6 [5Y]
#> # Key:       country [142]
#>    country     continent  year lifeExp      pop gdpPercap
#>    <fct>       <fct>     <int>   <dbl>    <int>     <dbl>
#>  1 Afghanistan Asia       1952    28.8  8425333      779.
#>  2 Afghanistan Asia       1957    30.3  9240934      821.
#>  3 Afghanistan Asia       1962    32.0 10267083      853.
#>  4 Afghanistan Asia       1967    34.0 11537966      836.
#>  5 Afghanistan Asia       1972    36.1 13079460      740.
#>  6 Afghanistan Asia       1977    38.4 14880372      786.
#>  7 Afghanistan Asia       1982    39.9 12881816      978.
#>  8 Afghanistan Asia       1987    40.8 13867957      852.
#>  9 Afghanistan Asia       1992    41.7 16317921      649.
#> 10 Afghanistan Asia       1997    41.8 22227415      635.
#> # … with 1,694 more rows
```

# Example: PISA data

The PISA study measures school students around the world on a series of math, reading, and science scores. A subset of the data looks like so:

```r
pisa
#> # A tibble: 172,959 x 9
#>     year country school_id student_id gender  math science  read stu_wgt
#>    <int> <fct>       <int>      <int>  <int> <dbl>   <dbl> <dbl>   <dbl>
#>  1  2000 AUS          1006          1      1   NA     469.  526     5.60
#>  2  2000 AUS          1006          2      2   NA     607.  559.    5.60
#>  3  2000 AUS          1006          3      2   NA     335.  383.    5.60
#>  4  2000 AUS          1006          4      1   NA     576   595     5.60
#>  5  2000 AUS          1006          7      2  550.    585.  497.    5.60
#>  6  2000 AUS          1006          9      2   NA     516.  532.    5.60
#>  7  2000 AUS          1006         10      2   NA     514.  575.    5.60
#>  8  2000 AUS          1006         12      2  377     384.  325.    5.60
#>  9  2000 AUS          1006         13      2   NA     458.  452.    5.60
#> 10  2000 AUS          1006         14      2  588.    499   507.    5.60
#> # … with 172,949 more rows
```

Let’s identify:

1. The key, the individual, who would have repeated measurements.
2. The index, the time component.
3. The regularity of the time interval (index).

Here it looks like the key is the student_id, which is nested within school_id and country, and the index is year. So we would write the following:

```r
as_tsibble(pisa,
           key = c(country, school_id, student_id),
           index = year)
```

Unfortunately, we get this error:

```
Error: A valid tsibble must have distinct rows identified by key and index.
Please use duplicates() to check the duplicated rows.
Run rlang::last_error() to see where the error occurred.
```

This is a somewhat confusing error - we can check duplicates like so:

```r
library(tsibble)
#>
#> Attaching package: 'tsibble'
#> The following object is masked from 'package:dplyr':
#>
#>     id
duplicates(pisa, key = c(country, school_id, student_id), index = year)
#> # A tibble: 58,107 x 9
#>     year country school_id student_id gender  math science  read stu_wgt
#>    <int> <fct>       <int>      <int>  <int> <dbl>   <dbl> <dbl>   <dbl>
#>  1  2015 AUS       3600000    3611000      3  548.    567.  584.    28.2
#>  2  2015 AUS       3600000    3612000      1  533.    557.  551.    28.2
#>  3  2015 AUS       3600000    3602000      2  501.    494.  549.    28.2
#>  4  2015 AUS       3600000    3606000      1  504.    503   531.    28.2
#>  5  2015 AUS       3600000    3608000      1  519.    497.  488.    33.4
#>  6  2015 AUS       3600000    3610000      1  389.    434.  454.    33.4
#>  7  2015 AUS       3600000    3605000     NA  415.    465.  482     33.4
#>  8  2015 AUS       3600000    3606000     NA  314.    299.  358.    33.4
#>  9  2015 AUS       3600000    3607000      3  546.    594.  574.    42.5
#> 10  2015 AUS       3600000    3606000      1  574.    659.  642.    42.5
#> # … with 58,097 more rows
```

One thing to keep in mind here is that individual students are not measured repeatedly, but schools are. This means that we really shouldn’t include student_id in the tsibble, since they have no repeated measurements.

Understanding a bit more about the PISA data, the school_id and student_id are not unique across time. The id codes represent unique schools and students for a given year in a country and school. This is still interesting information, but illustrates the importance of understanding what is the longitudinal element in the data.

In this case, the longitudinal element is the country within a given year.

We can cast this as a tsibble, but first we need to aggregate the data to each year and country. In doing so, it is important that we provide some summary statistics of each of the scores - we want to include the mean, minimum, and maximum of the math, reading, and science scores, so that we do not lose too much information about the individuals.

The code below does this, first grouping by year and country, and then calculating the weighted mean for math, reading, and science. This can be done using the student weight variable stu_wgt, to get the survey weighted mean. The minimum and maximum are then calculated.

```r
pisa_country <- pisa %>%
  group_by(year, country) %>%
  summarise(math_mean = weighted.mean(math, stu_wgt, na.rm = TRUE),
            read_mean = weighted.mean(read, stu_wgt, na.rm = TRUE),
            science_mean = weighted.mean(science, stu_wgt, na.rm = TRUE),
            math_max = max(math, na.rm = TRUE),
            read_max = max(read, na.rm = TRUE),
            science_max = max(science, na.rm = TRUE),
            math_min = min(math, na.rm = TRUE),
            read_min = min(read, na.rm = TRUE),
            science_min = min(science, na.rm = TRUE)) %>%
  ungroup()
```

```r
pisa_country
#> # A tibble: 21 x 11
#>     year country math_mean read_mean science_mean math_max read_max science_max
#>    <int> <fct>       <dbl>     <dbl>        <dbl>    <dbl>    <dbl>       <dbl>
#>  1  2000 AUS          530.      529.         527.     784.     828.        800.
#>  2  2000 IDN          371.      376.         394.     646.     610.        680.
#>  3  2000 NZL          537.      529.         528.     796      834.        830.
#>  4  2003 AUS          524.      525.         526.     833.     868.        848.
#>  5  2003 IDN          360.      395.         381.     674.     666.        671.
#>  6  2003 NZL          524.      522.         522.     844.     857.        868.
#>  7  2006 AUS          520.      527.         512.     814.     869         808.
#>  8  2006 IDN          391.      393.         393.     720.     671.        641
#>  9  2006 NZL          521.      530.         521.     831.     887.        890.
#> 10  2009 AUS          515.      528.         515.     890.     884.        803.
#> # … with 11 more rows, and 3 more variables: math_min <dbl>, read_min <dbl>,
#> #   science_min <dbl>
```

We can assess the regularity of the year like so:

```r
index_regular(pisa, year)
#> [1] TRUE
index_summary(pisa, year)
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
#>       3       3       3       3       3       3
```

We can now convert this into a tsibble:

```r
pisa_ts <- as_tsibble(pisa_country,
                      key = country,
                      index = year,
                      regular = TRUE)
```

```r
pisa_ts
#> # A tsibble: 21 x 11 [3Y]
#> # Key:       country [3]
#>     year country math_mean read_mean science_mean math_max read_max science_max
#>    <int> <fct>       <dbl>     <dbl>        <dbl>    <dbl>    <dbl>       <dbl>
#>  1  2000 AUS          530.      529.         527.     784.     828.        800.
#>  2  2003 AUS          524.      525.         526.     833.     868.        848.
#>  3  2006 AUS          520.      527.         512.     814.     869         808.
#>  4  2009 AUS          515.      528.         515.     890.     884.        803.
#>  5  2012 AUS          504.      521.         512.     849      843.        818.
#>  6  2015 AUS          494.      510.         503.     808.     877.        851.
#>  7  2018 AUS          492.      502.         502.     863.     879.        888.
#>  8  2000 IDN          371.      376.         394.     646.     610.        680.
#>  9  2003 IDN          360.      395.         381.     674.     666.        671.
#> 10  2006 IDN          391.      393.         393.     720.     671.        641
#> # … with 11 more rows, and 3 more variables: math_min <dbl>, read_min <dbl>,
#> #   science_min <dbl>
```

# Conclusion

This idea of longitudinal data is core to brolgar. Understanding what longitudinal data is, and how it can be linked to a time series representation, helps us understand our data structure and gives us access to more flexible tools. Other vignettes in the package further show why the time series tsibble is useful.