3 Gapminder

All data visualization starts with data to visualize, and we begin with excerpts of data from Gapminder: more specifically, we begin with a CSV dump of the data in the Gapminder library for R. This data is already tidy and in the format we want, so we merely read it in as a CSV using df-read/csv from the data-frame library:

> (define gapminder (df-read/csv "data/all_gapminder.csv"))

We can take a quick look at our data using the show form from the sawzall library (more on that later!), to determine what data we actually have:

> (show gapminder)
data-frame: 1704 rows x 7 columns
┌──┬────┬───────┬───────────┬───────────┬─────────┐
│#f│year│lifeExp│gdpPercap  │country    │continent│
├──┼────┼───────┼───────────┼───────────┼─────────┤
│1 │1952│28.801 │779.4453145│Afghanistan│Asia     │
├──┼────┼───────┼───────────┼───────────┼─────────┤
│2 │1957│30.332 │820.8530296│Afghanistan│Asia     │
├──┼────┼───────┼───────────┼───────────┼─────────┤
│3 │1962│31.997 │853.10071  │Afghanistan│Asia     │
├──┼────┼───────┼───────────┼───────────┼─────────┤
│4 │1967│34.02  │836.1971382│Afghanistan│Asia     │
├──┼────┼───────┼───────────┼───────────┼─────────┤
│5 │1972│36.088 │739.9811058│Afghanistan│Asia     │
├──┼────┼───────┼───────────┼───────────┼─────────┤
│6 │1977│38.438 │786.11336  │Afghanistan│Asia     │
└──┴────┴───────┴───────────┴───────────┴─────────┘
1698 rows, 1 cols elided
(use (show df everything #:n-rows 'all) for full frame)

So, we know that we have GDP per capita and life expectancy in our dataset. Let’s say we wanted to make a scatterplot of these two variables against each other. Using Graphite, that would look like this:

> (graph #:data gapminder
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")
         (points))

Let’s break down this code. The main form is graph, which takes a number of keyword arguments. The #:data keyword argument specifies the data-frame that we want to plot.

The #:mapping keyword argument specifies our aes (standing for aesthetics), which dictates how we actually want the data to be shown on the plot. In this case, our mapping states that we want to map the x-axis to the variable gdpPercap, and the y-axis to the variable lifeExp.

Finally, the rest of our arguments dictate our renderers. In this case, the points renderer states that we want each data point to be drawn as a single point.
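The same three parts (data, mapping, renderers) can be seen in isolation on a much smaller frame. The following is a minimal sketch, assuming the data-frame library's make-data-frame, make-series, and df-add-series! forms; the name afghanistan is hypothetical, and the values are just the six rows printed by show above:

```racket
#lang racket
(require data-frame graphite)

;; A tiny stand-in data-frame (hypothetical name) holding the six
;; Afghanistan rows shown in the output of (show gapminder).
(define afghanistan (make-data-frame))
(df-add-series! afghanistan
                (make-series "year"
                             #:data (vector 1952 1957 1962 1967 1972 1977)))
(df-add-series! afghanistan
                (make-series "lifeExp"
                             #:data (vector 28.801 30.332 31.997
                                            34.02 36.088 38.438)))

;; Data, a mapping from axes to column names, and one renderer.
(graph #:data afghanistan
       #:mapping (aes #:x "year" #:y "lifeExp")
       (points))
```

Swapping the column names in aes is all it takes to plot a different pair of variables from the same frame.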

This plot is fine, but it’s rather unenlightening: we have a lot of blank space towards the bottom. Adding a logarithmic transform on the x-axis remedies this. We could specify the transform manually, but Graphite already has a log transform predefined:

> (graph #:data gapminder
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")
         #:x-transform logarithmic-transform
         (points))

The #:x-transform keyword argument takes a transform?, which combines a plot transform with matching axis ticks. Here we pass logarithmic-transform, which Graphite predefines.
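To see why a logarithmic axis spreads the points out, it can help to look at the numbers directly. This sketch uses plain Racket only; the GDP values are hypothetical, chosen to span several orders of magnitude, and base 10 is an assumption made for readability rather than a statement about what Graphite uses internally:

```racket
#lang racket

;; Hypothetical GDP-per-capita values, each ten times the previous one.
(define gdp-values '(100 1000 10000 100000))

;; The same values after a base-10 logarithm (roughly 2, 3, 4, 5).
(define log-values
  (for/list ([g (in-list gdp-values)])
    (log g 10)))

;; Distance between each pair of neighbouring values in a list.
(define (gaps xs)
  (for/list ([a (in-list xs)]
             [b (in-list (rest xs))])
    (- b a)))

(gaps gdp-values) ; gaps grow tenfold on a linear axis: '(900 9000 90000)
(gaps log-values) ; gaps are all roughly 1 on a logarithmic axis
```

On a linear axis the low-GDP values are squeezed into a sliver next to the origin; after the logarithm they are evenly spaced.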

This plot is starting to look nicer, but it’s still pretty unenlightening. We don’t know anything about each country or how they’re stratified, we can’t figure out how many countries are present at any given point, we can’t extrapolate a meaningful relationship aside from "probably logarithmic-ish" (given our transform), and we haven’t labeled our axes. We can start by adding labels, and setting the alpha value of the renderer to see where more countries are present:

> (graph #:data gapminder
         #:title "GDP per capita vs life expectancy"
         #:x-label "GDP per capita (USD)"
         #:y-label "Life expectancy (years)"
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")
         #:x-transform logarithmic-transform
         (points #:alpha 0.4))

All we’ve done here is add a title and axis labels via their eponymous keyword arguments, and pass the #:alpha keyword to the points renderer.

We can then start thinking about relationships between the data points. Let’s say we wanted to add a linear fit to our plot. For that, we can use the fit renderer:

> (graph #:data gapminder
         #:title "GDP per capita vs life expectancy"
         #:x-label "GDP per capita (USD)"
         #:y-label "Life expectancy (years)"
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")
         #:x-transform logarithmic-transform
         (points #:alpha 0.4)
         (fit #:width 3))

Note, crucially, that fit takes into account our transform: despite the fit looking linear here, it is actually a logarithmic fit, since it fits on the transformed data.
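What "fitting on the transformed data" means can be sketched by hand. The code below is an illustration in plain Racket, not Graphite's implementation: least-squares is a hypothetical helper, the data is made up, and the base-10 logarithm is an assumption.

```racket
#lang racket

;; Ordinary least squares for y = slope * x + intercept.
;; Returns two values: the slope and the intercept.
(define (least-squares xs ys)
  (define n (length xs))
  (define mean-x (/ (apply + xs) n))
  (define mean-y (/ (apply + ys) n))
  (define sxy (for/sum ([x (in-list xs)] [y (in-list ys)])
                (* (- x mean-x) (- y mean-y))))
  (define sxx (for/sum ([x (in-list xs)])
                (sqr (- x mean-x))))
  (define slope (/ sxy sxx))
  (values slope (- mean-y (* slope mean-x))))

;; Hypothetical data: life expectancy rises by 10 years for every
;; tenfold increase in GDP per capita.
(define gdp  '(100 1000 10000 100000))
(define life '(40 50 60 70))

;; Fit a straight line against log10(gdp) instead of gdp itself.
(define-values (slope intercept)
  (least-squares (map (λ (g) (log g 10)) gdp) life))
```

The resulting model is life ≈ slope · log10(gdp) + intercept: a straight line when the x-axis is logarithmic, and a logarithmic curve when it is linear.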

fit defaults to a linear fit (of degree 1), but you can instead do a fit using a higher-degree polynomial with the optional #:degree argument:

> (graph #:data gapminder
         #:title "GDP per capita vs life expectancy"
         #:x-label "GDP per capita (USD)"
         #:y-label "Life expectancy (years)"
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")
         #:x-transform logarithmic-transform
         (points #:alpha 0.4)
         (fit #:degree 3 #:width 3))

but this is ill-advised for the relationship we see here.

Finally, let’s try to tease out different relationships for each continent. Each renderer takes its own #:mapping, which can map an aesthetic to a variable for that renderer alone. To stratify only the points, we pass the #:discrete-color aesthetic to points; it picks a categorical variable to color by, in this case continent:

> (graph #:data gapminder
         #:title "GDP per capita vs life expectancy"
         #:x-label "GDP per capita (USD)"
         #:y-label "Life expectancy (years)"
         #:mapping (aes #:x "gdpPercap" #:y "lifeExp")
         #:x-transform logarithmic-transform
         (points #:alpha 0.4 #:mapping (aes #:discrete-color "continent"))
         (fit #:width 3))

Now we’re seeing some notable differences from where we started! We made a scatter plot, transformed its x-axis, labeled it, added a fit, and added aesthetics to make it more readable.
