11 Separating variables

11 Separating variables🔗ℹ

One of the principles of tidy data (as discussed in Pivoting) is that every column represents a single variable. So when a column represents two or more variables, or when a column has half of a variable (which is more rare), that data is not tidy regardless of its pivoting.

These operations are designed to separate string variables into different columns, or unite two variables into one column.

The following example data will be used in this section:

> (define to-separate
    (row-df [col]
            #f
            "a-b"
            "a-d"
            "b-c"
            "d-e"))

procedure
(separate df
column-name
#:into into
[ #:separator separator
#:remove? remove?
#:fill fill]) → data-frame?
  df : data-frame?
  column-name : string?
  into : (non-empty-listof (or/c string? #f))
   separator :
(or/c string?
      regexp?
      exact-nonnegative-integer?
      (listof exact-nonnegative-integer?))
= #px"[^[:alnum:]]+"
  remove? : boolean? = #t
  fill : (or/c 'left 'right) = 'right

Returns a new data-frame with the same data as df, except the contents of column-name are split into separate variables according to separator, with the names specified by into for each split of the result.

The variable stored in column-name must be a string variable (alongside NA values).

into specifies what variables (index-wise) the result should be placed into. If #f is placed in into, that result of the split is skipped.

separator is either a string, a regular expression, a number, or a list of numbers.

If separator is a string or regular expression, regexp-split is used to split up the values of column-name.
If separator is a number, the string is split into two pieces along that index into the string.
If separator is a list of numbers, the string is split into multiple pieces along those indices.

The default separator splits along all non-alphanumeric characters.

If there are more results of a split than there are variables in into, the excess variables are dropped.

If there are less results of a split than there are variables in into, NAs are inserted to the rightmost variables in the list if fill is 'right, and to the leftmost if it is 'left.

If remove? is true, column-name will not be present in the output data-frame.

This function does not worked on grouped data frames, due to its potential to destroy group invariants.

Examples:

> (~> to-separate
      (separate "col" #:into '("A" "B"))
      show)
data-frame: 5 rows x 2 columns
┌──┬──┐
│A │B │
├──┼──┤
│#f│#f│
├──┼──┤
│a │b │
├──┼──┤
│a │d │
├──┼──┤
│b │c │
├──┼──┤
│d │e │
└──┴──┘
> (~> to-separate
      (separate "col" #:into '("A" #f) #:remove? #f)
      show)
data-frame: 5 rows x 2 columns
┌───┬──┐
│col│A │
├───┼──┤
│#f │#f│
├───┼──┤
│a-b│a │
├───┼──┤
│a-d│a │
├───┼──┤
│b-c│b │
├───┼──┤
│d-e│d │
└───┴──┘
> (~> to-separate
      (separate "col" #:into '("A" "B") #:separator 1)
      show)
data-frame: 5 rows x 2 columns
┌──┬──┐
│A │B │
├──┼──┤
│#f│#f│
├──┼──┤
│a │-b│
├──┼──┤
│a │-d│
├──┼──┤
│b │-c│
├──┼──┤
│d │-e│
└──┴──┘

procedure
(extract df
column-name
#:into into
[ #:regex regex
#:remove? remove?]) → data-frame?
  df : data-frame?
  column-name : string?
  into : (non-empty-listof (or/c string? #f))
  regex : regexp? = #px"([[:alnum:]]+)"
  remove? : boolean? = #t

Like separate, but uses regular expression capturing groups, rather than splitting the contents of column-name.

The variable stored in column-name must be a string variable (alongside NA values).

regex is a regular expression with capturing groups. regex is expected to have as many or less capturing groups (excluding group 0) as into has variables. By default, this captures the first alphanumeric sequence.

If the regex has no match, NA will appear in the output columns.

If remove? is true, column-name will not be present in the output data-frame.

This function does not work on grouped data frames, due to its potential to destroy group invariants.

Examples:

> (~> to-separate
      (extract "col" #:into '("A"))
      show)
data-frame: 5 rows x 1 columns
┌──┐
│A │
├──┤
│#f│
├──┤
│a │
├──┤
│a │
├──┤
│b │
├──┤
│d │
└──┘
> (~> to-separate
      (extract "col"
               #:into '("A" "B")
               #:regex #px"([a-d]+)-([a-d]+)")
      show)
data-frame: 5 rows x 2 columns
┌──┬──┐
│A │B │
├──┼──┤
│#f│#f│
├──┼──┤
│a │b │
├──┼──┤
│a │d │
├──┼──┤
│b │c │
├──┼──┤
│#f│#f│
└──┴──┘

procedure
(unite df
column-name
#:from from
[ #:combine combine-fn
#:remove? remove?]) → data-frame?
  df : data-frame?
  column-name : string?
  from : (non-empty-listof string?)
   combine-fn : (-> any/c ... any/c)
= (λ args (string-join (filter (λ (x) x) args) "_"))
  remove? : boolean? = #t

The inverse transformation to separate. Takes the variables in from in df, and combines them using combine-fn into a new column column-name.

By default, combine-fn takes each present variable, and inserts an underscore between them.

If remove? is true, all columns in from will not be present in the output data-frame.

This function does not work on grouped data frames, due to its potential to destroy group invariants.

Examples:

> (define to-unite
    (row-df [str-a str-b num-a num-b]
            #f     "c"   13    90
            "a"    "b"   19    20
            "b"    #f    50    90
            "e"    "d"   59    25))
> (~> to-unite
      (unite "str" #:from '("str-a" "str-b"))
      (unite "num" #:from '("num-a" "num-b") #:combine +)
      show)
data-frame: 4 rows x 2 columns
┌───┬───┐
│str│num│
├───┼───┤
│c  │103│
├───┼───┤
│a_b│39 │
├───┼───┤
│b  │140│
├───┼───┤
│e_d│84 │
└───┴───┘

1	Constructing data-frames
2	Displaying data
3	Grouping and splitting
4	Filtering
5	Slicing
6	Creating and modifying columns
7	Summarizing
8	Joining
9	Sorting
10	Pivoting
11	Separating variables
12	Unnesting
13	Missing values