Why do some points in my scatter plot not appear in the legend or in the graph?

Hi!
I'm plotting my data frame (9 columns and approximately 264,000 rows):

ID1	         ID2	     dN	         dS	         Omega     Label_ID1    Label_ID2	Group
ABD75601 	 ABD75577 	 0.0003 	 0.0022 	 0.1428 	 HKU1-CoV 	 HKU1-CoV 	 Intra
ABD75601 	 ABD75585 	 0.0003 	 0.0011 	 0.2859 	 HKU1-CoV 	 HKU1-CoV 	 Intra
ABD75601 	 ABD75593 	 0.0006 	 0.0022 	 0.2858 	 HKU1-CoV 	 HKU1-CoV 	 Intra
ABD75609 	 NP_073551 	 1.0011 	 1.2589 	 0.7952 	 HKU1-CoV 	 229E-CoV 	 Inter
ABD75609 	 QJY77946 	 1.0025 	 1.1785 	 0.8507 	 HKU1-CoV 	 229E-CoV 	 Inter

My script to plot this:

df_S_Cold %>%
  ggplot(aes(x = dN, y = dS)) + 
  geom_point(aes(color = Label_ID2), size = 2, alpha=0.5) +
  scale_y_continuous(trans='log10') +
  scale_x_continuous(trans='log10') +
  labs(title = "S Protein",
       subtitle = "Cold variants",
       x = "dN rate",
       y = "dS rate",
       color = "Comparison") +
  scale_color_manual(labels=c(
    "OC43-CoV vs NL63-CoV", 
    "OC43-CoV vs 229E-CoV", 
    "OC43-CoV vs HKU1-CoV",
    "OC43-CoV vs OC43-CoV",
    "HKU1-CoV vs 229E-CoV",
    "HKU1-CoV vs NL63-CoV",
    "HKU1-CoV vs HKU1-CoV",
    "NL63-CoV vs 229E-CoV",
    "NL63-CoV vs NL63-CoV",
    "229E-CoV vs 229E-CoV"), 
    values=c(
      "#dd6d5d", # Red
      "#ddad5d", # Yellow
      "#ad5ddd", # Purple 
      "#5ddd6d", # Green
      "#dd8d5d", # Orange
      "#5d6ddd", # Blue
      "#5dddcd", # Light blue
      "#703818", # Brown
      "#ffd38b", # Light Yellow and Pink
      "#ffa5c7")) +
  theme_gray() + 
  theme(axis.title = element_text()) +
  theme(legend.position = "bottom") 

This results in this plot:

My question is: why are there only 4 comparisons when there should be 10?
Is there an explanation for that?
Any suggestions or comments on the possible problem?
Thanks!

From the small portion of the data that you have shown, I can't guarantee that my idea is correct, but it seems that you have made a mistake with your scales.

In geom_point() you map color = Label_ID2, and later you rename the legend to "Comparison" and assign labels like "OC43-CoV vs NL63-CoV" etc. But your color scale is still built from Label_ID2, which, as far as I can judge from your comparisons, has only 4 values! You should avoid using unnamed vectors to generate labels in scale_*_manual() because it is error-prone.

You need to either use interaction() when mapping the color, or create a separate "comparison" column in your data frame. If you go with interaction(), use the argument sep = " vs " to generate names in the correct format; that avoids the unsafe reassignment of labels in scale_color_manual() (see Option 3).

library("tidyverse")

df <- mpg

# Option 1
df %>%
  filter(cyl %in% c(4, 6)) %>%
  ggplot(aes(x = cty,
             y = hwy,
             color = interaction(drv, cyl)))+
  geom_point() +
  scale_color_manual(labels = c("4.4" = "4WD x 4 cyl",
                                "f.4" = "FWD x 4 cyl",
                                "4.6" = "4WD x 6 cyl",
                                "f.6" = "FWD x 6 cyl",
                                "r.6" = "RWD x 6 cyl"),
                     values = c("red", "green", "magenta",
                                "blue", "black"))

# Option 2
df %>%
  mutate(comparison = str_c(str_to_upper(drv), "WD x ",
                            cyl, " cyl", sep = "")) %>%
  filter(cyl %in% c(4, 6)) %>%
  ggplot(aes(x = cty,
             y = hwy,
             color = comparison))+
  geom_point()

# Option 3
df %>%
  filter(cyl %in% c(4, 6)) %>%
  ggplot(aes(x = cty,
             y = hwy,
             color = interaction(drv, cyl, sep = " vs ")))+
  geom_point()
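
Applied to the column names from the question, the same idea can be checked on a tiny data frame before plotting. This is only a sketch: the rows of `toy` are invented for illustration and are not the real df_S_Cold data; only the column names Label_ID1 and Label_ID2 come from the question.

```r
library(dplyr)

# Invented rows; only the column names come from the question
toy <- data.frame(
  Label_ID1 = c("HKU1-CoV", "HKU1-CoV", "OC43-CoV", "OC43-CoV"),
  Label_ID2 = c("HKU1-CoV", "229E-CoV", "HKU1-CoV", "229E-CoV")
)

# Build the comparison from BOTH label columns
toy <- toy %>%
  mutate(Comparison = interaction(Label_ID1, Label_ID2, sep = " vs "))

# Label_ID2 alone distinguishes fewer groups than the pair of labels
n_distinct(toy$Label_ID2)   # groups you get with color = Label_ID2
n_distinct(toy$Comparison)  # groups you get with color = Comparison
```

If the two counts differ on the real data as well, that would explain why the legend shows fewer entries than expected.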

Thanks for the answer, this is awesome.
It works fine!

I have only one question: I suppose I should delete the scale_color_manual() call from the script, since I changed the color mapping to color = interaction(..., sep = " vs ")?
For example, here I did not delete scale_color_manual().
The script:

df_S_Cold %>%
  ggplot(aes(x = dN, y = dS, color = interaction(Label_ID1, Label_ID2, sep = "vs"))) + 
  geom_point() +
  scale_y_continuous(trans='log10') +
  scale_x_continuous(trans='log10') +
  labs(title = "S Protein",
       subtitle = "Cold variants",
       x = "dN rate",
       y = "dS rate",
       color = "Comparison") +
  scale_color_manual(labels=c(
    "OC43-CoV vs NL63-CoV", 
    "OC43-CoV vs 229E-CoV", 
    "OC43-CoV vs HKU1-CoV",
    "OC43-CoV vs OC43-CoV",
    "HKU1-CoV vs 229E-CoV",
    "HKU1-CoV vs NL63-CoV",
    "HKU1-CoV vs HKU1-CoV",
    "NL63-CoV vs 229E-CoV",
    "NL63-CoV vs NL63-CoV",
    "229E-CoV vs 229E-CoV"), 
    values=c(
      "#dd6d5d", # Red
      "#ddad5d", # Yellow
      "#ad5ddd", # Purple 
      "#5ddd6d", # Green
      "#dd8d5d", # Orange
      "#5d6ddd", # Blue
      "#5dddcd", # Light blue
      "#703818", # Brown
      "#ffd38b", # Light Yellow and Pink
      "#ffa5c7")) +
  theme_gray() + 
  theme(axis.title = element_text()) +
  theme(legend.position = "bottom") 

This results in this plot:

Any comments on deleting scale_color_manual()?
Thanks!

For example, this works fine too:

df_S_Cold %>%
  ggplot(aes(x = dN, y = dS, color = interaction(Label_ID1, Label_ID2, sep = "vs"))) + 
  geom_point() +
  scale_y_continuous(trans='log10') +
  scale_x_continuous(trans='log10') +
  labs(title = "S Protein",
       subtitle = "Cold variants",
       x = "dN rate",
       y = "dS rate",
       color = "Comparison") +
  theme_gray() + 
  theme(axis.title = element_text()) +
  theme(legend.position = "bottom") 

Plot:

Once again,

When you pass an unnamed vector to scale_*_manual(), you risk assigning the wrong label to your groups, because ggplot has no way to match the old values to the new labels and simply pairs them by position, in the order of the scale's breaks (for a discrete scale, the factor levels, or the sorted unique values of a character column). Here is an example to show what I mean (read the comments in the code):

suppressMessages(library(tidyverse))

# Simple data frame with two observations and 
# the grouping variable 
(df <- tribble(
  ~Num1, ~Num2, ~gr,
  1, 1, "one",
  10, 10, "ten"
))
#> # A tibble: 2 x 3
#>    Num1  Num2 gr   
#>   <dbl> <dbl> <chr>
#> 1     1     1 one  
#> 2    10    10 ten

# If we use scale_color_manual() just to change colors
# we will get everything right
ggplot(df, aes(x = Num1, y = Num2, color = gr))+
  geom_point()+
  scale_color_manual(values = c("red", "blue"))

image

# Now let's try to change the labels so they are shown as numbers
ggplot(df, aes(x = Num1, y = Num2, color = gr))+
  geom_point()+
  scale_color_manual(values = c("red", "blue"),
                     labels = c("1", "10"))

image

# But if we put our labels in the wrong order we get a nonsensical result!
# Now our (1, 1) point has the label "10", while the (10, 10) point has the label "1"
ggplot(df, aes(x = Num1, y = Num2, color = gr))+
  geom_point()+
  scale_color_manual(values = c("red", "blue"),
                     labels = c("10", "1"))

image

# If you want to change labels with scale_*_manual(),
# use a named vector like in this example. Then you will be safe,
# because the order doesn't matter anymore
ggplot(df, aes(x = Num1, y = Num2, color = gr))+
  geom_point()+
  scale_color_manual(values = c("red", "blue"),
                     labels = c("ten" = "10", "one" = "1"))

image

This is exactly what happened in your original example and it is also repeated in your new version when you leave scale_color_manual(labels = c(...)).

Here are your plots. You can see that the same groups of points have different labels in the two versions. Obviously, if the labels in the plot are mismatched, any further analysis of the results will lead you to wrong conclusions.
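
One way to see which order unnamed labels would be paired against is to look at the levels directly. This is a small sketch with invented values ("A", "B", "X", "Y" are placeholders, not data from the question); it relies only on base R.

```r
# A character column is mapped in sorted order, not in order of appearance
gr <- c("ten", "one", "ten")
sort(unique(gr))
#> [1] "one" "ten"

# interaction() returns a factor; its levels give the order of the legend.
# With the default lex.order = FALSE the first factor varies fastest,
# and combinations that never occur are still listed as levels.
pair <- interaction(c("A", "B"), c("X", "Y"), sep = " vs ")
levels(pair)
#> [1] "A vs X" "B vs X" "A vs Y" "B vs Y"
```

Running levels() on your own interaction(Label_ID1, Label_ID2, sep = " vs ") should show the order that an unnamed labels vector would have to follow exactly, which is why named vectors are the safer choice.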

Hence, in your specific example, I would keep scale_color_manual() but leave only the argument values = c(...) (because some of the default colors are difficult to distinguish). Also, note the changes in theme().

df_S_Cold %>%
  ggplot(aes(x = dN, y = dS, color = interaction(Label_ID1, Label_ID2, sep = " vs "))) + 
  geom_point() +
  scale_y_continuous(trans='log10') +
  scale_x_continuous(trans='log10') +
  labs(title = "S Protein",
       subtitle = "Cold variants",
       x = "dN rate",
       y = "dS rate",
       color = "Comparison") +
  scale_color_manual(values=c(
      "#dd6d5d", # Red
      "#ddad5d", # Yellow
      "#ad5ddd", # Purple 
      "#5ddd6d", # Green
      "#dd8d5d", # Orange
      "#5d6ddd", # Blue
      "#5dddcd", # Light blue
      "#703818", # Brown
      "#ffd38b", # Light Yellow and Pink
      "#ffa5c7")) +
  theme_gray() + # Perhaps you don't need this, because theme_gray() is default
  theme(axis.title = element_text(),
        legend.position = "bottom") 
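
If you also want each color tied to a specific comparison, the values vector can be named in the same way as the labels. A minimal sketch, assuming the level names are produced with sep = " vs "; the two rows of `toy` and the color assignments are invented for illustration.

```r
library(ggplot2)

# Invented rows; only the column names come from the question
toy <- data.frame(
  dN = c(0.0003, 1.0011),
  dS = c(0.0022, 1.2589),
  Label_ID1 = c("HKU1-CoV", "HKU1-CoV"),
  Label_ID2 = c("HKU1-CoV", "229E-CoV")
)

# Names must match the levels produced by interaction(..., sep = " vs ")
comparison_colors <- c(
  "HKU1-CoV vs 229E-CoV" = "#dd8d5d", # Orange
  "HKU1-CoV vs HKU1-CoV" = "#5dddcd"  # Light blue
)

p <- ggplot(toy, aes(x = dN, y = dS,
                     color = interaction(Label_ID1, Label_ID2, sep = " vs "))) +
  geom_point() +
  scale_color_manual(values = comparison_colors) +
  labs(color = "Comparison")

p
```

With named values, the order in which you list the colors no longer matters.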

Please, let me know if you have any further questions.


I got it!
Thank you for correcting my mistakes.
I made the changes in the script.
The only specific change (keeping only values = c(...)):

scale_color_manual(values=c("#dd6d5d, ...

and I deleted the line:

theme_gray()

Again, thank you for your help and for your time!
