All data has structure. We can declare some of that structure up front, and doing so gives us useful features later on for free. This vignette discusses how to structure your longitudinal data as a time series, and what that means.

This idea, that longitudinal data is a time series, is Big Idea #1 behind the brolgar package.

Anything that is observed sequentially over time is a time series. – Professors Rob Hyndman and George Athanasopoulos

Longitudinal data has a few other names, such as “panel data”. I used to think that a “time series” was by definition “regular”, with equal spacing between observations. This is not the case: a time series can be either “regular” or “irregular”.
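To make the regular versus irregular distinction concrete, here is a minimal sketch using the tsibble package. The two small data sets and their column names (`year`, `time`, `value`) are made up for illustration and are not part of brolgar.

```r
library(tsibble)

# Hypothetical toy data: one series measured once every year
regular_ts <- tsibble(
  year = 2001:2005,
  value = c(1.2, 1.5, 1.1, 1.8, 2.0),
  index = year
)

# Hypothetical toy data: one series measured at uneven times
irregular_ts <- tsibble(
  time = c(0.1, 0.7, 1.9, 2.2, 4.5),
  value = c(1.2, 1.5, 1.1, 1.8, 2.0),
  index = time,
  regular = FALSE
)

# is_regular() reports whether a tsibble was declared as regularly spaced
is_regular(regular_ts)
is_regular(irregular_ts)
```

Both objects are time series in the sense of the quote above. They differ only in whether the spacing between observations is fixed.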

Why care about using a time series? Once we account for the time series structure inherent in longitudinal data, we gain access to a suite of nice tools that simplify and accelerate how we work with it. brolgar is built on top of the powerful tsibble package by Earo Wang. If you would like to learn more, see the official package documentation or read the paper.

## Converting your longitudinal data to a time series

To convert longitudinal data into a “time series tibble”, a tsibble, we need to consider:

• What identifies the time component of the data? This is the index.
• What is the unique identifier of an individual/series? This is the key.

Together, the index and key uniquely identify an observation.
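One way to see what "uniquely identify" means is to check a data set for repeated key and index pairs before converting it. The sketch below uses tsibble's duplicate helpers on a made-up data frame. The names `toy`, `id`, `time`, and `y` are assumptions for illustration only.

```r
library(tsibble)

# Hypothetical toy data: individual 1 has two rows at time 1.0,
# so the (id, time) pair does not identify a single observation
toy <- data.frame(
  id = c(1, 1, 1, 2, 2),
  time = c(0.5, 1.0, 1.0, 0.3, 0.9),
  y = c(10, 11, 12, 20, 21)
)

# Does any key-index combination appear more than once?
is_duplicated(toy, key = id, index = time)

# Show the rows involved in the duplication
duplicates(toy, key = id, index = time)

# Drop the repeated rows; the conversion to a tsibble can then go ahead
toy_unique <- toy[!are_duplicated(toy, key = id, index = time), ]
toy_ts <- as_tsibble(toy_unique, key = id, index = time, regular = FALSE)
```

Dropping the repeat is only one possible fix. In real data you would first need to decide which of the duplicated rows is correct.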

What do we mean by this? Let’s look at the first section of the wages data analysed in Singer & Willett (2003):

library(brolgar)
suppressPackageStartupMessages(library(dplyr))
slice(wages, 1:10)
#> # A tsibble: 10 x 9 [!]
#> # Key:       id [2]
#>       id ln_wages    xp   ged xp_since_ged black hispanic high_grade
#>    <int>    <dbl> <dbl> <int>        <dbl> <int>    <int>      <int>
#>  1    31     1.49 0.015     1        0.015     0        1          8
#>  2    31     1.43 0.715     1        0.715     0        1          8
#>  3    31     1.47 1.73      1        1.73      0        1          8
#>  4    31     1.75 2.77      1        2.77      0        1          8
#>  5    31     1.93 3.93      1        3.93      0        1          8
#>  6    31     1.71 4.95      1        4.95      0        1          8
#>  7    31     2.09 5.96      1        5.96      0        1          8
#>  8    31     2.13 6.98      1        6.98      0        1          8
#>  9    36     1.98 0.315     1        0.315     0        0          9
#> 10    36     1.80 0.983     1        0.983     0        0          9
#> # … with 1 more variable: unemploy_rate <dbl>

We have the id column, which identifies an individual.

We also have the xp column, which records the experience an individual has and so plays the role of time.

So:

• key: id
• index: xp

We could create a tsibble of this data by using the as_tsibble function from tsibble. We also state regular = FALSE, since we have an irregular time series (the measurements are not taken at regular intervals, so the distance between measurements varies):

library(tsibble)
as_tsibble(x = wages,
           key = id,
           index = xp,
           regular = FALSE)
#> # A tsibble: 6,402 x 9 [!]
#> # Key:       id [888]
#>       id ln_wages    xp   ged xp_since_ged black hispanic high_grade
#>    <int>    <dbl> <dbl> <int>        <dbl> <int>    <int>      <int>
#>  1    31     1.49 0.015     1        0.015     0        1          8
#>  2    31     1.43 0.715     1        0.715     0        1          8
#>  3    31     1.47 1.73      1        1.73      0        1          8
#>  4    31     1.75 2.77      1        2.77      0        1          8
#>  5    31     1.93 3.93      1        3.93      0        1          8
#>  6    31     1.71 4.95      1        4.95      0        1          8
#>  7    31     2.09 5.96      1        5.96      0        1          8
#>  8    31     2.13 6.98      1        6.98      0        1          8
#>  9    36     1.98 0.315     1        0.315     0        0          9
#> 10    36     1.80 0.983     1        0.983     0        0          9
#> # … with 6,392 more rows, and 1 more variable: unemploy_rate <dbl>

Note the following information printed at the top of wages:

# A tsibble: 6,402 x 9 [!]
# Key:       id [888]
...

This says:

• We have 6402 rows,
• with 9 columns.

The ! at the top means that the series are irregular: there is no regular spacing between observations.

The “key” variable is then listed - id, of which there are 888.
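The same information can also be queried programmatically rather than read off the printed header. This is a small sketch using accessor functions from tsibble, assuming the wages data is loaded from brolgar as above. The results should agree with the header shown earlier.

```r
library(brolgar)
library(tsibble)

dim(wages)         # number of rows and columns
key_vars(wages)    # name of the key column(s)
index_var(wages)   # name of the index column
n_keys(wages)      # number of unique keys, i.e. individuals
is_regular(wages)  # FALSE when the series are irregular, shown as [!]
```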

## Using time series data in brolgar

This idea of longitudinal data as a time series is core to brolgar. Other vignettes in the package will further show why the time series tsibble is useful.