Plotting a scatter plot with categorical data.

Hey R users: a newbie here.

I'm trying to make a plot in R that looks something like plot A.

The Y axis is just names, which are not important at the moment. I guess you could call it a kind of density plot, but with all points visible.

I used regular plot

plot(data.frame(x,y))

and instead I get the numeric positions of Y plotted (plot B).

How do I get this organized so it looks like the first plot? Or is there a simpler way?

Thank you!

# there are several ways to approach this
# let's use the penguins data to illustrate

# install penguins data
remotes::install_github("allisonhorst/palmerpenguins")
#> Using github PAT from envvar GITHUB_PAT
#> Skipping install of 'palmerpenguins' from a github remote, the SHA1 (95e62697) has not changed since last install.
#>   Use `force = TRUE` to force installation

# load packages
library(tidyverse)
library(palmerpenguins)
library(ggbeeswarm)
library(ggforce)

# peek at penguins data
glimpse(penguins)
#> Rows: 344
#> Columns: 7
#> $ species           <chr> "Adelie", "Adelie", "Adelie", "…
#> $ island            <chr> "Torgersen", "Torgersen", "Torg…
#> $ culmen_length_mm  <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.…
#> $ culmen_depth_mm   <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.…
#> $ flipper_length_mm <dbl> 181, 186, 195, NA, 193, 190, 18…
#> $ body_mass_g       <dbl> 3750, 3800, 3250, NA, 3450, 365…
#> $ sex               <chr> "MALE", "FEMALE", "FEMALE", NA,…

# clunky jitter version
ggplot(data = penguins) +
  aes(x = body_mass_g, y = species) +
  geom_jitter()
#> Warning: Removed 2 rows containing missing values
#> (geom_point).


# lined up beeswarm version
ggplot(data = penguins) +
  aes(y = body_mass_g, x = species) +
  geom_beeswarm() +
  coord_flip()
#> Warning: Removed 2 rows containing missing values
#> (position_beeswarm).


# geom_sina version: points are spread out to follow the shape of a violin plot
ggplot(data = penguins) +
  aes(y = body_mass_g, x = species) +
  geom_sina() +
  coord_flip()
#> Warning: Removed 2 rows containing non-finite values
#> (stat_sina).


# geom_sina with geom_violin
ggplot(data = penguins) +
  aes(y = body_mass_g, x = species) +
  geom_violin() +
  geom_sina() +
  coord_flip()
#> Warning: Removed 2 rows containing non-finite values
#> (stat_ydensity).
#> Warning: Removed 2 rows containing non-finite values
#> (stat_sina).

Created on 2020-06-11 by the reprex package (v0.3.0)
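
For anyone who would rather stay close to the original plot() call, here is a minimal base R sketch using stripchart(). It is only an illustration on the same penguins data, not part of the reprex above, and the object name mass_by_species is made up for the example.

```r
library(palmerpenguins)

# Split the continuous variable into one numeric vector per category
# (mass_by_species is a hypothetical name used only in this sketch)
mass_by_species <- split(penguins$body_mass_g, penguins$species)

# stripchart() draws one row of points per list element; with the default
# vertical = FALSE the categories end up on the y axis.
# method = "jitter" spreads overlapping points; method = "stack" is an alternative.
stripchart(mass_by_species,
           method = "jitter",
           pch = 16,
           xlab = "body_mass_g",
           ylab = "species")
```

The category names are taken from the names of the list, so no numeric recoding of the Y values is needed.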

Sweet! Thank you so much!

Note that you can adjust how tightly the ggbeeswarm points are packed with the cex argument (the default is cex = 1); larger values spread the points further apart.

library(tidyverse)
library(palmerpenguins)
library(ggbeeswarm)
library(ggforce)

ggplot(data = penguins) +
  aes(y = body_mass_g, x = species) +
  geom_beeswarm(cex = 0.5) +
  coord_flip()
#> Warning: Removed 2 rows containing missing values
#> (position_beeswarm).


ggplot(data = penguins) +
  aes(y = body_mass_g, x = species) +
  geom_beeswarm(cex = 1.5) +
  coord_flip()
#> Warning: Removed 2 rows containing missing values
#> (position_beeswarm).


ggplot(data = penguins) +
  aes(y = body_mass_g, x = species) +
  geom_beeswarm(cex = 2.5) +
  coord_flip()
#> Warning: Removed 2 rows containing missing values
#> (position_beeswarm).

Created on 2020-06-11 by the reprex package (v0.3.0)

Good tip!
I went with a mixture of geom_sina and a sunflower plot.

A quick question: why do we need to use coord_flip? I know what it does, but it seems you can't do without it here by simply swapping the axis mappings.

It appears to be a quirk of ggbeeswarm, which assumes that your categories are on the x axis and the continuous variable is on the y axis. This behavior was probably inherited from ggplot2, and it will probably change in the future, since ggplot2 3.3.0 (as of March 5, 2020) now has bi-directional geoms and stats. See https://www.tidyverse.org/blog/2020/03/ggplot2-3-3-0/

I would guess that the next version of ggbeeswarm will no longer make this assumption, as it is an extension of ggplot2.
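
To illustrate what bi-directional means on the ggplot2 side, here is a small sketch, assuming ggplot2 3.3.0 or later is installed. It uses only ggplot2's own geoms (geom_violin and geom_jitter), not ggbeeswarm or ggforce, and the jitter height of 0.1 is an arbitrary value picked for the example.

```r
library(ggplot2)
library(palmerpenguins)

# Species is mapped directly to y and body mass to x.
# With ggplot2 >= 3.3.0, geom_violin() works out the horizontal
# orientation from the mappings, so coord_flip() is not used.
# (If the guess ever goes wrong, orientation = "y" can be set explicitly.)
p <- ggplot(data = penguins) +
  aes(x = body_mass_g, y = species) +
  geom_violin() +
  # jitter only along the categorical axis so body mass values are not distorted
  geom_jitter(height = 0.1, width = 0)

p
```

This still prints the usual warnings about the 2 rows with missing body mass.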
