8 Joining

8 Joining🔗ℹ

These operations join two tables in varying ways, along some common column or set of columns (the "spine"). When multiple columns are specified, the data-set is joined along their combinations. When none are specified, the spine defaults to all shared variables.

All joining operations ignore grouping. The grouping of the first argument will be preserved, but grouping does not play a factor in how the operation performs (compared to a plain table).

The following example data-frames are used in this section:

> (define woodland1
    (row-df [site habitat]
             "b"  "grassland"
             "a"  "meadow"
             "c"  "woodland"))
> (define woodland2
    (row-df [site day catch]
             "c"  1   10
             "b"  1   12
             "c"  2   20
             "b"  2   24))

8.1 Combining joins🔗ℹ

These joins combine variables from the two input data-frames.

procedure
(left-join df1 df2 by ...)
→ (or/c data-frame? grouped-data-frame?)
  df1 : (or/c data-frame? grouped-data-frame?)
  df2 : (or/c data-frame? grouped-data-frame?)
  by : string?

Returns a new data-frame, with all rows from df1, and all columns from df1 and df2. Rows are compared by the value of the variable(s) by.

If by is not specified, it defaults to all the common columns in df1 and df2, in an unspecified order.

Rows in df1 that have no corresponding value(s) of by in df2 will have NA values in the new columns.

Rows in df1 that have multiple corresponding values of by in df2 will have all combinations of the df1 and df2 values in the result.

Examples:

> (~> woodland1
      (left-join woodland2 "site")
      show)
data-frame: 5 rows x 4 columns
┌─────┬────┬─────────┬───┐
│catch│site│habitat  │day│
├─────┼────┼─────────┼───┤
│#f   │a   │meadow   │#f │
├─────┼────┼─────────┼───┤
│12   │b   │grassland│1  │
├─────┼────┼─────────┼───┤
│24   │b   │grassland│2  │
├─────┼────┼─────────┼───┤
│10   │c   │woodland │1  │
├─────┼────┼─────────┼───┤
│20   │c   │woodland │2  │
└─────┴────┴─────────┴───┘
> (~> woodland2
      (left-join woodland1 "site")
      show)
data-frame: 4 rows x 4 columns
┌─────┬────┬───┬─────────┐
│catch│site│day│habitat  │
├─────┼────┼───┼─────────┤
│12   │b   │1  │grassland│
├─────┼────┼───┼─────────┤
│24   │b   │2  │grassland│
├─────┼────┼───┼─────────┤
│10   │c   │1  │woodland │
├─────┼────┼───┼─────────┤
│20   │c   │2  │woodland │
└─────┴────┴───┴─────────┘

procedure
(right-join df1 df2 by ...)
→ (or/c data-frame? grouped-data-frame?)
  df1 : (or/c data-frame? grouped-data-frame?)
  df2 : (or/c data-frame? grouped-data-frame?)
  by : string?

Returns a new data-frame, with all rows from df2, and all columns from df1 and df2. Rows are compared by the value of the variable(s) by.

If by is not specified, it defaults to all the common columns in df1 and df2, in an unspecified order.

This is equivalent to (left-join df2 df1 by #:cmp? cmp?).

procedure
(inner-join df1 df2 by ...)
→ (or/c data-frame? grouped-data-frame?)
  df1 : (or/c data-frame? grouped-data-frame?)
  df2 : (or/c data-frame? grouped-data-frame?)
  by : string?

Returns a new data-frame, with all rows from df1 with matching rows in df2, and columns of both df1 and df2. Rows are compared by the value of the variable by.

If by is not specified, it defaults to all the common columns in df1 and df2, in an unspecified order.

If there are multiple matches between the rows of df1 and df2, all combinations of the matches are returned.

Example:

> (~> woodland1
      (inner-join woodland2 "site")
      show)
data-frame: 4 rows x 4 columns
┌─────┬────┬─────────┬───┐
│catch│site│habitat  │day│
├─────┼────┼─────────┼───┤
│12   │b   │grassland│1  │
├─────┼────┼─────────┼───┤
│24   │b   │grassland│2  │
├─────┼────┼─────────┼───┤
│10   │c   │woodland │1  │
├─────┼────┼─────────┼───┤
│20   │c   │woodland │2  │
└─────┴────┴─────────┴───┘

procedure
(full-join df1 df2 by ...)
→ (or/c data-frame? grouped-data-frame?)
  df1 : (or/c data-frame? grouped-data-frame?)
  df2 : (or/c data-frame? grouped-data-frame?)
  by : string?

Returns a new data-frame, with all rows and columns from df1 and df2. Rows are compared by the value of the variable by.

If by is not specified, it defaults to all the common columns in df1 and df2, in an unspecified order.

If there is a row in df1 that does not have a value in df2, or vice versa, they will have NA values in the new columns.

Example:

> (~> woodland2
      (full-join woodland1 "site")
      show)
data-frame: 5 rows x 4 columns
┌─────┬────┬─────────┬───┐
│catch│site│habitat  │day│
├─────┼────┼─────────┼───┤
│#f   │a   │meadow   │#f │
├─────┼────┼─────────┼───┤
│12   │b   │grassland│1  │
├─────┼────┼─────────┼───┤
│24   │b   │grassland│2  │
├─────┼────┼─────────┼───┤
│10   │c   │woodland │1  │
├─────┼────┼─────────┼───┤
│20   │c   │woodland │2  │
└─────┴────┴─────────┴───┘

8.2 Filtering joins🔗ℹ

These joins take columns from the first argument that match some condition on the second argument.

procedure
(semi-join df1 df2 by ...)
→ (or/c data-frame? grouped-data-frame?)
  df1 : (or/c data-frame? grouped-data-frame?)
  df2 : (or/c data-frame? grouped-data-frame?)
  by : string?

Returns a new data-frame, consisting of rows in df1 which have matching rows in df2, keeping only columns from df1.

Unlike an inner join, a semi join will never duplicate rows of df1, whereas an inner join will duplicate rows of df1 for each matching row in df2.

If by is not specified, it defaults to all the common columns in df1 and df2, in an unspecified order.

If there are no matches between the rows of df1 and df2, semi-join will error.

Example:

> (~> woodland1
      (semi-join woodland2)
      show)
data-frame: 2 rows x 2 columns
┌────┬─────────┐
│site│habitat  │
├────┼─────────┤
│b   │grassland│
├────┼─────────┤
│c   │woodland │
└────┴─────────┘

procedure
(anti-join df1 df2 by ...)
→ (or/c data-frame? grouped-data-frame?)
  df1 : (or/c data-frame? grouped-data-frame?)
  df2 : (or/c data-frame? grouped-data-frame?)
  by : string?

Returns a new data-frame, consiting of rows in df1 which do not have matching rows in df2, keeping only columns from df1.

If by is not specified, it defaults to all the common columns in df1 and df2, in an unspecified order.

If all the rows of df1 and df2 match, anti-join will error.

Example:

> (~> woodland1
      (anti-join woodland2)
      show)
data-frame: 1 rows x 2 columns
┌────┬───────┐
│site│habitat│
├────┼───────┤
│a   │meadow │
└────┴───────┘

1	Constructing data-frames
2	Displaying data
3	Grouping and splitting
4	Filtering
5	Slicing
6	Creating and modifying columns
7	Summarizing
8	Joining
9	Sorting
10	Pivoting
11	Separating variables
12	Unnesting
13	Missing values