Finding Features in data

When you are presented with longitudinal data, it is useful to summarise the data into a format where you have one row per key. For example, take the wages data:

library(brolgar)
wages
#> # A tsibble: 6,402 x 9 [!]
#> # Key:       id [888]
#>       id ln_wages    xp   ged xp_since_ged black hispanic high_grade
#>    <int>    <dbl> <dbl> <int>        <dbl> <int>    <int>      <int>
#>  1    31     1.49 0.015     1        0.015     0        1          8
#>  2    31     1.43 0.715     1        0.715     0        1          8
#>  3    31     1.47 1.73      1        1.73      0        1          8
#>  4    31     1.75 2.77      1        2.77      0        1          8
#>  5    31     1.93 3.93      1        3.93      0        1          8
#>  6    31     1.71 4.95      1        4.95      0        1          8
#>  7    31     2.09 5.96      1        5.96      0        1          8
#>  8    31     2.13 6.98      1        6.98      0        1          8
#>  9    36     1.98 0.315     1        0.315     0        0          9
#> 10    36     1.80 0.983     1        0.983     0        0          9
#> # … with 6,392 more rows, and 1 more variable: unemploy_rate <dbl>

We can then return one row per key containing, say, the minimum value of ln_wages:

#> # A tibble: 888 x 2
#>       id   min
#>    <int> <dbl>
#>  1    31 1.43 
#>  2    36 1.80 
#>  3    53 1.54 
#>  4   122 0.763
#>  5   134 2.00 
#>  6   145 1.48 
#>  7   155 1.54 
#>  8   173 1.56 
#>  9   206 2.03 
#> 10   207 1.58 
#> # … with 878 more rows
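
The call that produces this summary is not shown above. A minimal sketch, assuming the minimum is supplied to features as a one-element named list (the wages_min name is taken from the plotting code that uses it), might look like this:

```r
library(brolgar)

# Assumption: a named list with a single function yields one `min`
# column per key, alongside the key column `id`.
wages_min <- wages %>%
  features(ln_wages, list(min = min))

wages_min
```

Naming the list element controls the name of the resulting column, so list(lowest = min) would give a column called lowest instead.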

This lets us summarise the data further, for example by plotting the distribution of the minimum values:

library(ggplot2)
ggplot(wages_min,
       aes(x = min)) + 
  geom_density()

We call these summaries features of the data.

This vignette discusses how to calculate these features of the data.

Calculating features

We can calculate features of longitudinal data using the features function (from fabletools, made available in brolgar).

features works by specifying the data, the variable to summarise, and the feature to calculate:

features(<DATA>, <VARIABLE>, <FEATURE>)

or with the pipe:

<DATA> %>% features(<VARIABLE>, <FEATURE>)

As an example, we can calculate a five number summary (minimum, 25th quantile, median, 75th quantile, and maximum) of the data like so:

wages_five <- wages %>%
  features(ln_wages, feat_five_num)

wages_five
#> # A tibble: 888 x 6
#>       id   min   q25   med   q75   max
#>    <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1    31 1.43   1.48  1.73  2.02  2.13
#>  2    36 1.80   1.97  2.32  2.59  2.93
#>  3    53 1.54   1.58  1.71  1.89  3.24
#>  4   122 0.763  2.10  2.19  2.46  2.92
#>  5   134 2.00   2.28  2.36  2.79  2.93
#>  6   145 1.48   1.58  1.77  1.89  2.04
#>  7   155 1.54   1.83  2.22  2.44  2.64
#>  8   173 1.56   1.68  2.00  2.05  2.34
#>  9   206 2.03   2.07  2.30  2.45  2.48
#> 10   207 1.58   1.87  2.15  2.26  2.66
#> # … with 878 more rows

Here we take the wages data, pipe it to features, and tell it to summarise the ln_wages variable using feat_five_num. brolgar provides a set of features, all of which start with feat_.

You can, for example, find those whose values only increase or decrease with feat_monotonic:

wages_mono <- wages %>%
  features(ln_wages, feat_monotonic)

wages_mono
#> # A tibble: 888 x 5
#>       id increase decrease unvary monotonic
#>    <int> <lgl>    <lgl>    <lgl>  <lgl>    
#>  1    31 FALSE    FALSE    FALSE  FALSE    
#>  2    36 FALSE    FALSE    FALSE  FALSE    
#>  3    53 FALSE    FALSE    FALSE  FALSE    
#>  4   122 FALSE    FALSE    FALSE  FALSE    
#>  5   134 FALSE    FALSE    FALSE  FALSE    
#>  6   145 FALSE    FALSE    FALSE  FALSE    
#>  7   155 FALSE    FALSE    FALSE  FALSE    
#>  8   173 FALSE    FALSE    FALSE  FALSE    
#>  9   206 TRUE     FALSE    FALSE  TRUE     
#> 10   207 FALSE    FALSE    FALSE  FALSE    
#> # … with 878 more rows
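
To make the idea of "only increase or decrease" concrete, here is a minimal base R sketch of such a check on a single series. The helper names are hypothetical and the logic is an assumption about what a monotonicity feature does, not brolgar's source:

```r
# Hypothetical helpers (not brolgar's implementation): classify a numeric
# vector by the sign of its successive differences.
is_increasing <- function(x) all(diff(x) > 0)
is_decreasing <- function(x) all(diff(x) < 0)
is_unvarying  <- function(x) all(diff(x) == 0)

x <- c(2.03, 2.30, 2.48)  # illustrative values that rise at every step

is_increasing(x)
is_decreasing(x)
is_unvarying(x)
```

A series with a single dip fails is_increasing, which is why most rows in the output above are FALSE in every column.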

These could then be used to identify individuals who only increase like so:

library(dplyr)
wages_mono %>%
  filter(increase)
#> # A tibble: 50 x 5
#>       id increase decrease unvary monotonic
#>    <int> <lgl>    <lgl>    <lgl>  <lgl>    
#>  1   206 TRUE     FALSE    FALSE  TRUE     
#>  2   295 TRUE     FALSE    FALSE  TRUE     
#>  3   518 TRUE     FALSE    FALSE  TRUE     
#>  4  1508 TRUE     FALSE    FALSE  TRUE     
#>  5  2178 TRUE     FALSE    FALSE  TRUE     
#>  6  2194 TRUE     FALSE    FALSE  TRUE     
#>  7  2330 TRUE     FALSE    FALSE  TRUE     
#>  8  2456 TRUE     FALSE    FALSE  TRUE     
#>  9  2612 TRUE     FALSE    FALSE  TRUE     
#> 10  2890 TRUE     FALSE    FALSE  TRUE     
#> # … with 40 more rows

They could then be joined back to the data:

wages_mono_join <- wages_mono %>%
  filter(increase) %>%
  left_join(wages, by = "id")

wages_mono_join
#> # A tibble: 164 x 13
#>       id increase decrease unvary monotonic ln_wages    xp   ged
#>    <int> <lgl>    <lgl>    <lgl>  <lgl>        <dbl> <dbl> <int>
#>  1   206 TRUE     FALSE    FALSE  TRUE          2.03 1.87      0
#>  2   206 TRUE     FALSE    FALSE  TRUE          2.30 2.81      0
#>  3   206 TRUE     FALSE    FALSE  TRUE          2.48 4.31      0
#>  4   295 TRUE     FALSE    FALSE  TRUE          1.79 2.03      0
#>  5   295 TRUE     FALSE    FALSE  TRUE          1.81 3.12      0
#>  6   295 TRUE     FALSE    FALSE  TRUE          2.11 4.16      0
#>  7   295 TRUE     FALSE    FALSE  TRUE          2.13 5.08      0
#>  8   295 TRUE     FALSE    FALSE  TRUE          2.31 6.58      0
#>  9   518 TRUE     FALSE    FALSE  TRUE          1.27 0.525     1
#> 10   518 TRUE     FALSE    FALSE  TRUE          1.61 1.93      1
#> # … with 154 more rows, and 5 more variables: xp_since_ged <dbl>,
#> #   black <int>, hispanic <int>, high_grade <int>, unemploy_rate <dbl>

And these could be plotted:

ggplot(wages_mono_join,
       aes(x = xp,
           y = ln_wages,
           group = id)) + 
  geom_line()

To get a sense of the data and where it came from, we could create a plot with gghighlight that highlights those that only increase, using gghighlight(increase). Since increase is a logical, this tells gghighlight to highlight those that are TRUE.

library(gghighlight)
wages_mono %>%
  left_join(wages, by = "id") %>%
  ggplot(aes(x = xp,
             y = ln_wages,
             group = id)) +
  geom_line() + 
  gghighlight(increase)

To explore the available features, see the function reference.

Creating your own Features

To create your own features or summaries to pass to features, you provide a named list of functions. For example:

library(brolgar)
library(feasts)
feat_three <- list(min = min,
                   med = median,
                   max = max)

feat_three
#> $min
#> function (..., na.rm = FALSE)  .Primitive("min")
#> 
#> $med
#> function (x, na.rm = FALSE, ...) 
#> UseMethod("median")
#> <bytecode: 0x7f957d0ab070>
#> <environment: namespace:stats>
#> 
#> $max
#> function (..., na.rm = FALSE)  .Primitive("max")

These are then passed to features like so:

wages %>%
  features(ln_wages, feat_three)
#> # A tibble: 888 x 4
#>       id   min   med   max
#>    <int> <dbl> <dbl> <dbl>
#>  1    31 1.43   1.73  2.13
#>  2    36 1.80   2.32  2.93
#>  3    53 1.54   1.71  3.24
#>  4   122 0.763  2.19  2.92
#>  5   134 2.00   2.36  2.93
#>  6   145 1.48   1.77  2.04
#>  7   155 1.54   2.22  2.64
#>  8   173 1.56   2.00  2.34
#>  9   206 2.03   2.30  2.48
#> 10   207 1.58   2.15  2.66
#> # … with 878 more rows

The same list of features can be applied to other data, such as heights:

heights %>%
  features(height_cm, feat_three)
#> # A tibble: 153 x 4
#>    country       min   med   max
#>    <chr>       <dbl> <dbl> <dbl>
#>  1 Afghanistan  161.  167.  168.
#>  2 Albania      168.  170.  170.
#>  3 Algeria      166.  169   171.
#>  4 Angola       159.  167.  169.
#>  5 Argentina    167.  168.  174.
#>  6 Armenia      164.  169.  172.
#>  7 Australia    170   172.  178.
#>  8 Austria      162.  167.  179.
#>  9 Azerbaijan   170.  172.  172.
#> 10 Bahrain      161.  164.  164 
#> # … with 143 more rows
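
To see what a named list of functions produces for a single key, one can apply each function to one series by hand. The sketch below is a base R illustration of that idea, not how features is implemented; it uses the eight ln_wages values printed for id 31 at the top of this vignette:

```r
# Named list of summary functions, as defined above.
feat_three <- list(min = min,
                   med = median,
                   max = max)

# ln_wages for id 31, copied from the printed wages data.
x <- c(1.49, 1.43, 1.47, 1.75, 1.93, 1.71, 2.09, 2.13)

# Apply every function to the same vector; the list names become
# the names of the result.
one_row <- vapply(feat_three, function(f) f(x), numeric(1))
one_row
```

The result holds 1.43, 1.73, and 2.13 under the names min, med, and max, matching the first row of the wages output above.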

Inside brolgar, the features are created with the following syntax:

feat_five_num <- function(x, ...) {
  list(
    min = b_min(x, ...),
    q25 = b_q25(x, ...),
    med = b_median(x, ...),
    q75 = b_q75(x, ...),
    max = b_max(x, ...)
  )
}

Here the b_ functions are wrappers with a default of na.rm = TRUE. The quantile wrappers also use type = 8 and names = FALSE.
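
As a rough illustration of what such wrappers could look like, here is a sketch under the defaults just described. The my_ names and function bodies are assumptions made for this example, not brolgar's source:

```r
# Hypothetical stand-ins for the b_ helpers: each fixes na.rm = TRUE,
# and the quantile wrapper also fixes type = 8 and names = FALSE.
my_min <- function(x, ...) {
  min(x, na.rm = TRUE, ...)
}

my_median <- function(x, ...) {
  stats::median(x, na.rm = TRUE, ...)
}

my_q25 <- function(x, ...) {
  stats::quantile(x, probs = 0.25, type = 8, names = FALSE, na.rm = TRUE, ...)
}

x <- c(3, 1, NA, 2)

my_min(x)     # the missing value is dropped before taking the minimum
my_median(x)  # likewise for the median
my_q25(x)     # an unnamed numeric, because names = FALSE
```

Fixing na.rm = TRUE in the wrapper means a single missing observation does not turn a whole key's summary into NA.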

Accessing sets of features

If you want to run many or all of the features from a package on your data, you can collect them with feature_set. For example:

library(fabletools)
feat_brolgar <- feature_set(pkgs = "brolgar")
length(feat_brolgar)
#> [1] 6

You could then run these like so:

wages %>%
  features(ln_wages, feat_brolgar)
#> # A tibble: 888 x 37
#>       id   min   med   max  min1   q25  med1   q75  max1  min2  max2
#>    <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1    31 1.43   1.73  2.13 1.43   1.48  1.73  2.02  2.13 1.43   2.13
#>  2    36 1.80   2.32  2.93 1.80   1.97  2.32  2.59  2.93 1.80   2.93
#>  3    53 1.54   1.71  3.24 1.54   1.58  1.71  1.89  3.24 1.54   3.24
#>  4   122 0.763  2.19  2.92 0.763  2.10  2.19  2.46  2.92 0.763  2.92
#>  5   134 2.00   2.36  2.93 2.00   2.28  2.36  2.79  2.93 2.00   2.93
#>  6   145 1.48   1.77  2.04 1.48   1.58  1.77  1.89  2.04 1.48   2.04
#>  7   155 1.54   2.22  2.64 1.54   1.83  2.22  2.44  2.64 1.54   2.64
#>  8   173 1.56   2.00  2.34 1.56   1.68  2.00  2.05  2.34 1.56   2.34
#>  9   206 2.03   2.30  2.48 2.03   2.07  2.30  2.45  2.48 2.03   2.48
#> 10   207 1.58   2.15  2.66 1.58   1.87  2.15  2.26  2.66 1.58   2.66
#> # … with 878 more rows, and 26 more variables: range_diff <dbl>,
#> #   iqr <dbl>, var <dbl>, sd <dbl>, mad <dbl>, iqr1 <dbl>, min3 <dbl>,
#> #   max3 <dbl>, median <dbl>, mean <dbl>, q251 <dbl>, q751 <dbl>,
#> #   range1 <dbl>, range2 <dbl>, range_diff1 <dbl>, sd1 <dbl>, var1 <dbl>,
#> #   mad1 <dbl>, iqr2 <dbl>, increase <dbl>, decrease <dbl>, unvary <dbl>,
#> #   increase1 <lgl>, decrease1 <lgl>, unvary1 <lgl>, monotonic <lgl>

For more information see ?fabletools::feature_set

Registering a feature in a package

If you create features in your own package and want to make them accessible with feature_set, do the following.

Functions can be registered via fabletools::register_feature(). To register features in a package, I create a file called zzz.R, and use the .onLoad(...) function to set this up on loading the package:

.onLoad <- function(...) {
  fabletools::register_feature(feat_three_num, c("summary"))
  # ... and as many as you want here!
}
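
Once features are registered with tags, they can be looked up by tag as well as by package. The sketch below assumes brolgar's features carry the "summary" tag, as the registration call above suggests; if they were registered under a different tag, the returned list would simply be shorter or empty:

```r
library(brolgar)     # loading the package runs its .onLoad registration
library(fabletools)

# Assumption: the features of interest were registered with tag "summary".
feat_summary <- feature_set(pkgs = "brolgar", tags = "summary")

length(feat_summary)
```

The resulting list can be passed to features in the same way as feat_brolgar above.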