Intuition for "direction" of mapping in ggplot2

When I teach about ggplot2 I struggle to say the mapping in the right "direction." That is, it seems more natural to me to say,

"we map mpg to x and disp to y"

rather than

"we map x to mpg and y to disp"

Is there some intuition/explanation for why it is the other way around? To me, it seems like I have the data, and then I'm trying to put it on (virtual) paper. I just went and looked in the Grammar of Graphics, and it seems like Wilkinson was talking about mapping in this way, with "A map f: S -> P" where S is mathematical space and P is physical space.

Is the R implementation of the grammar of graphics the other way around because of the named arguments? I understand that the arguments go x=mpg, so maybe the phrasing of the mapping is to make sense with the "directionality" of the assignment? But, in other places where we do assignment, we phrase it differently, as in x <- 5 or "x gets 5."

Maybe Hadley knows?

I often think of it - and teach it - as kind of a "call and answer" format. For example, if I'm prompting a student,

"What do we want on the x-axis?" [type in x = and wait for response]

I agree with your first sentence sounding more natural, so maybe something like

"What do we want to map to x? mpg"

FWIW, I also tend to read assignment "backwards" when teaching, as in, "We save the value 5 under the variable name x". This also works with call-and-response, as in

"What value are we saving as x?" [Type x <- and wait]

(I realize this is not the answer to your question about philosophy of ggplot convention, just a related thought!)

Ah, interesting. I will also say that "What do we want to map to x?" sentence sometimes. But... I will also sometimes say, "What do we want to map mpg to?" Uh oh.

Maybe we're now getting into the bijective/surjective/injective nature of maps.

This is interesting... I guess I never got hung up on it because I think about it as passing parameters to a function: fun(x, y) becomes fun(x = mpg, y = disp) when we call it, and I would say it as, "We tell ggplot that x is the column mpg and y is the column disp."

However I can see that this could be confusing to learners translating from spoken word to code...
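For what it's worth, here is a minimal sketch of that function-call reading, assuming ggplot2 is installed and using the built-in mtcars data. The aesthetic is the argument name and the data column is the value passed to it; layer_data() then shows the column values sitting under the aesthetic names.

```r
library(ggplot2)

# aes() is an ordinary function call: the aesthetic is the argument name,
# the data column is the value supplied for it.
p <- ggplot(mtcars, aes(x = mpg, y = disp)) +
  geom_point()

# The built layer data has columns named after the aesthetics (x, y),
# filled with the values of the mapped columns (mpg, disp).
built <- layer_data(p)
head(built[, c("x", "y")])
```

Read from the code, that is "x is mpg"; read from the built data, it is "mpg ended up in x", so both spoken directions describe the same call.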

Yeah, I've been thinking more about what I verbalize after Felienne's rstudio::conf keynote, and this is one of those things where I know what comes out of my mouth is often inconsistent with what is written in books, like r4ds. Consistency is key for learning! And as a professor, I'm supposed to say the "right" thing.

In mathematics we often denote maps as

$$f: X \rightarrow Y$$

but write

$$y = f(x)$$

It's historical convention, but IMHO also intuitive if one is used to reading and writing from left to right.

The first notation reads more easily when one composes maps,

$$X \xrightarrow{f} Y \xrightarrow{g} Z \quad \text{vs.} \quad Z \xleftarrow{g} Y \xleftarrow{f} X$$

while the second reads more easily when one writes out an explicit formula:

$$y = \sin(x)\frac{x+1}{x^2+1} \quad \text{vs.} \quad \sin(x)\frac{x+1}{x^2+1} = y$$

In computer science it's similar: most programming languages put the variable on the left-hand side and the expression on the right-hand side:

z = a + b * i
d = z^2 + 2*z - 1

or

plot(data, xlab = "The x label", xlim = c(min(data$x), max(data$x)))

Therefore I consider the ggplot notation

aes(
 x = log(t),
 y = sqrt(y),
 colour = as.factor(paste(height, weight))
)

better than

aes(
 log(t) = x,
 sqrt(y) = y,
 as.factor(paste(height, weight)) = colour
)

By the way: in R one can write x <- 3 + 4 as well as 3 + 4 -> x :wink:
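A tiny base-R sketch of that last point: both arrows bind the same value, and only the reading direction differs.

```r
x <- 3 + 4      # leftward assignment: "x gets 3 + 4"
3 + 4 -> y      # rightward assignment: "3 + 4 goes into y"

identical(x, y) # TRUE
```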

I love this question! I prefer to map data to the plot, but I think that early in learning visualization the key realization is understanding that the data and visualization are linked. I think the early hurdle is to make explicit this linking and that at this stage the direction of the mapping isn't as important as learning to recognize how visualizations manifest the underlying data.

My working hypothesis is that people have (at least) heuristics or intuition for approaching a visualization when presented with a plot, but may not have ever considered how the data behind the plot would be structured.

I tried playing with an idea of this the last time I presented ggplot2 by starting with a bare, minimal plot and asking students to try to guess which aspects of the data were generating the plot (original slides here). Then, over the next few slides, I slowly reveal annotations that provide hints until eventually it's clear which variables from the data are used for the x- and y-axes, and for shape and color of the points.


My goal was to use visual intuition to bootstrap understanding the connection between data and the plot, which is then reinforced when we start from code and data to create a plot.

To get back to the original question, though, one of the main reasons I prefer vocalizing the connection as mapping data $\to$ plot is summarizing plots, like box plots, histograms, etc., where a transformation is applied to the data on its way to the plot. I'm not sure how to best vocalize those mappings, but it does seem conceptually cleaner to say something like "mpg is summarized and mapped to y" than "y is mapped to a summary of mpg".
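A rough sketch of what "summarized on its way to the plot" looks like in practice, assuming ggplot2 is installed; mtcars and the choice of cyl as the grouping variable are stand-ins of mine, not from the slides.

```r
library(ggplot2)

# Assumption: mtcars stands in for the data; cyl is treated as a grouping variable.
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot()

# What reaches the plot is not mpg itself but per-group summaries of it:
# one row per box, with the five numbers that define the box and whiskers.
summaries <- layer_data(p)
summaries[, c("x", "ymin", "lower", "middle", "upper", "ymax")]
```

There is no single y column left in the built data here, which is part of why "y is mapped to mpg" feels awkward for summarizing geoms.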

I also say

"we map mpg to x and disp to y"

instead of the other way around. And demonstrate it as grabbing the data frame column and placing it on the axes -- similar to how one would do it in drag-and-drop software like Fathom.

I hadn't thought about how this might be counter-intuitive to the order in which things show up in x = mpg, that's an interesting view. Though I think I (and I imagine students) would find it weirder if the order was mpg = x since it's so unlike how we do assignment elsewhere in R.

Thanks to everyone for their thoughts! It sounds like I'm not the only one who says it "backwards" sometimes. In fact, Lucy pointed out on twitter that even Hadley writes it this way in the ggplot2 book!

I really appreciate the mathematical approach from g.k about the way we write maps in mathematics. That will help me when I want to get really technical with it.

I agree with grrrck that the important thing is the connection in students' brains between the data and the plot. My guess is that the particular way I'm saying these sentences probably isn't the thing that's confusing students (or, it isn't the only thing :joy:).

But again, it makes me think of mapping in mathematics. I think that if you have overplotting, or repeated data elements in the variables you're plotting, the mapping is not injective or surjective. That is, if you have two rows with the same mpg and disp values, it's not clear which of the two (x,y) points go with which row, and if you pick an (x,y) point it's not clear which row it goes with. If we ignore overplotting, I think mappings are bijective, so perhaps it doesn't matter which order we say it in.
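Here is a quick base-R check of the repeated-points idea, as a sketch on the built-in mtcars data: it looks for rows that share an (mpg, disp) pair with some other row.

```r
# Keep only the two columns that would be mapped to x and y.
pairs <- mtcars[, c("mpg", "disp")]

# Flag every row whose (mpg, disp) pair also appears in another row.
shared <- duplicated(pairs) | duplicated(pairs, fromLast = TRUE)
pairs[shared, ]

# Fewer distinct points than rows means row -> (x, y) is not one-to-one.
nrow(unique(pairs)) < nrow(pairs)
```

Any rows flagged here land on the same (x, y) point, so going from a plotted point back to a single row is ambiguous for them.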

So, the question becomes, is there a right way to say the mapping?

If not, I'll just keep saying the sentence that makes sense in my brain. If there is a right way (and it's not the way I say it), I'd like to get some intuition about it so it sticks in my head, and I can explain it to students. Again, Hadley said on twitter that he says "we map colour to continent" but didn't provide any further reasoning. PirateGrunt got the closest to an intuition on twitter, which is that the plot is blank everywhere or filled with defaults until we map some data into it (my interpretation of his words). Anyone else with a mental model they could verbalize for me?

This is one of those where I'm reluctant to say that an answer is "backwards". The implication is that backwards is either counterintuitive or wrong. It may be consistent with the way ideas are expressed in another context, and that may aid understanding. But that's only useful if the learner prioritizes that context. My math skills are largely self-taught, which means I more or less never use the term "mapping" when thinking of functions and I've probably never written $x \mapsto y$.

Especially with R, my context is programming where x <- 5 is something that's probably the opposite of how I would describe the expression, i.e. "assign 5 to x".

For me, one of the great epiphanies of working with ggplot was grasping the idea that all of the visual elements are like empty boxes. As designers, we may fill them with whatever we wish. This was an approach that was very different from the Excel (and others) notion of starting with a blank space and then "adding" things to it (like secondary axes).

Final thought: for any single plot, I think the mapping is bidirectional. (Someone mathier than me will likely show how that's not so.) That is, I can take a plot and construct a data frame, or take a data frame and construct a plot.

Just two thoughts:

There is sometimes a tension between the ease of learning and the ease of use. Think of a typewriter's keyboard: it would be much more intuitive to learn typewriting if all letters were in one row in alphabetical order. But for fast typewriting it's for sure better to have them arranged in multiple rows, and non-alphabetic sorting (QWERTY) may let you type even faster.

Although I'm pleased by intuitive and easy-to-learn things, I learned the value of mental obstacles. If something is intuitive, I can use it without having to think about it. But while I'm not forced to think about it, I also may not understand it. If something goes wrong or if I have a non-standard problem, I will be stuck. If I was forced to learn it the hard way, I acquired a deeper understanding of the problem and can pull myself out by my own bootstraps.

Back to ggplot: once I realized that there is a mapping from, let's call it, data space to a visual space - and that it is indeed a (mathematical) mapping (seldom surjective, often not injective and almost never bijective) and not just a phrase "map continent to colour" that one has to translate (unintuitively) to colour=continent - it seemed natural (and intuitive) to me to use algebraic expressions like y = -SoilLayer or x = OrganicNitrogen + InorganicNitrogen, or to use constant expressions outside of aes() like colour="red".
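As a sketch of those expressions in a runnable form, assuming ggplot2 is installed: the soil data frame below is hypothetical, with column names taken from the paragraph above and values invented purely for illustration.

```r
library(ggplot2)

# Hypothetical soil data; column names follow the post, values are invented.
soil <- data.frame(
  SoilLayer = 1:5,
  OrganicNitrogen = c(2.0, 1.5, 1.1, 0.8, 0.5),
  InorganicNitrogen = c(0.6, 0.5, 0.5, 0.3, 0.2)
)

# Inside aes(): expressions evaluated on the data, then mapped to x and y.
# Outside aes(): colour is a constant, set directly rather than mapped.
p <- ggplot(soil, aes(x = OrganicNitrogen + InorganicNitrogen, y = -SoilLayer)) +
  geom_point(colour = "red")

# The built data holds the evaluated expressions under the aesthetic names.
built <- layer_data(p)
built[, c("x", "y", "colour")]
```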

To avoid misunderstandings: I don't advocate for making things extra complicated, I prefer to make things simple, but not too simple. :wink:

This is a great (and helpful) thread.

In teaching the tidyverse suite to beginners, the biggest problems I encounter with students are (1) rename, and (2) case_when with mutate. The rename issue is obvious: they keep assuming (despite repeated instructions otherwise) that the column/variable name to be renamed will occur right after the first open paren, and the new name after the = sign. The problems I encounter teaching case_when are more multidimensional, and frankly, I have more than once trashed initial explanations of it myself.

I'm getting better, though, and posts and discussions like this help a lot!

Interesting. Thinking about it I realized that I read x and y inside a ggplot command as shorthand for x-axis and y-axis, as in "my x-axis is mpg and my y-axis is disp", and not as a variable assignment as such. That may be from many years of plotting things by hand.

I don't think I use the word "map" when talking about this, more something about "the x of the plot is...". I'm also usually thinking (when teaching this stuff) about what kind of variables I am going to make a plot of, like "I have one quantitative variable height and one categorical one gender, so I need to make a boxplot, and on a boxplot the groups go this way (gestures side to side)", which leads into the x of this plot being gender and y being height.

I find rename confusing myself also, and have to bash "new name equals old name" into my head. I suspect this and case_when mostly require practice at using and teaching.
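For reference, a minimal sketch of "new name equals old name", assuming dplyr is installed and using the built-in mtcars data; the new column names are just examples.

```r
library(dplyr)

# rename() takes new_name = old_name pairs; several can go in one call.
renamed <- rename(mtcars, miles_per_gallon = mpg, displacement = disp)

# First three column names after renaming; cyl is untouched.
names(renamed)[1:3]
```

This matches the x = mpg pattern in aes(): the name you are creating goes on the left and the existing column on the right.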

I see we're on the same wavelength here. (See my reply to jdlong.)

Yes, I always want to read it "rename x to y". So I have to look it up every time to get the order right. It might be clearer (to me) if it was rename(oldname, newname), except that then you couldn't chain several renames in one command. There is an interesting section in "The Design of Everyday Things" by Don Norman, where he talks about early debates for mapping arrow keys, mouse movements, and finger swipes to which way a screen will scroll. In that case the mental model can be "am I moving a viewport across the text, or am I moving the text inside a viewport". Of course, both models get used.

I get confused when students come to me with Mac laptops, because the direction of the two-finger scroll on the touchpad is the opposite to what I'm used to.
