Problems with ggplot and the removal of outliers.

Hello everybody,
This is a separate question regarding my data. I used the formula that mister andresrcs suggested, and it worked wonders with the box plots. However, due to my lack of experience, I am stuck yet again.
I tried to make a scatter plot of Alcohol Units per Week against Individual/Couple Annual Income. I tried to colour the points by the variable 'Sex', but the console reports that the object 'Sex' was not found.
Again, I apologise for asking a rather trivial question, and I thank you in advance for all your help.

library(ggplot2)
library(dplyr)

df<-data.frame(drating = c(0, -2, -2, 18.2125, 3.587, 0, -2, 0, 0, 0, -2, -2, 1.7155, 0.116, -2,  -2, 0.058, 4.5, 0.808, 0.145), JntInc = c(3L, 3L, 3L, 19L, 19L, 19L, 19L, 8L, 18L, 18L, 18L, 18L, 21L, 21L, 21L, 21L, 6L, 6L, 19L, 19L))

is_outlier <- function(x){return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))}

incomeplot<-survey%>%select(drating, JntInc)%>%mutate(outlier=is_outlier(drating))%>%filter(drating>=0 & outlier==FALSE & JntInc<=31 & JntInc>0)%>%ggplot(aes(x=drating,y=JntInc,colour=Sex))+geom_point()+labs(title="Alcohol Consumption and Income",x="Alcohol Units", y="Annual Income")

incomeplot

Your sample data doesn't have a Sex variable, so your example is not reproducible. Try to include all the relevant variables in your sample data.

df<-data.frame(drating = c(0, -2, -2, 18.2125, 3.587, 0, -2, 0, 0, 0, -2, -2, 1.7155, 0.116, -2,  -2, 0.058, 4.5, 0.808, 0.145), JntInc = c(3L, 3L, 3L, 19L, 19L, 19L, 19L, 8L, 18L, 18L, 18L, 18L, 21L, 21L, 21L, 21L, 6L, 6L, 19L, 19L))
names(df)
#> [1] "drating" "JntInc"

I am trying, I really am, but I keep failing. Here is the relevant screenshot with yet another problem.

I think the problem is that you cannot have a line break between ggplot and the following parenthesis. This is indicated by the wavy red line just before

(aes(x=drating, ...

Put a line break between the pipe %>% and the call to ggplot so that the new line begins with

ggplot(aes(x=drating, ...

I suggest you put a line break after every pipe to make the code easier to read.
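
As a rough sketch of that layout, each pipe ends its line and ggplot( stays attached to its opening parenthesis. The data frame toy below is made up for illustration and is not the survey data from the question:

```r
library(ggplot2)
library(dplyr)

# Made-up data, only for illustration (not the poster's survey data)
toy <- data.frame(drating = c(0, 1.5, 3, 4.5),
                  JntInc  = c(3L, 8L, 18L, 21L))

# One step per line: every %>% and + sits at the END of a line,
# and "ggplot(" is never separated from its opening parenthesis
p <- toy %>%
  filter(drating >= 0) %>%
  ggplot(aes(x = drating, y = JntInc)) +
  geom_point()
```

Ending each line with the operator tells R that the expression continues on the next line, so the whole chain is parsed as one statement.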

Many thanks mister FJCC.
So here is the reproducible example. The one which says: Object 'Sex' not found.
What am I doing wrong? Is there another way to code this in order to remove the extreme values?
Best regards,
M

library(ggplot2)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df<-data.frame(drating = c(0, -2, -2, 18.2125, 3.587, 0, -2, 0, 0, 0, -2, -2, 1.7155, 0.116, -2, -2, 0.058, 4.5, 0.808, 0.145), JntInc = c(3L, 3L, 3L, 19L, 19L, 19L, 19L, 8L, 18L, 18L, 18L, 18L, 21L, 21L, 21L, 21L, 6L, 6L, 19L, 19L), Sex = as.factor(c("Female", "Male", "Female", "Male", "Female", "Female", "Female", "Male", "Female", "Male", "Male", "Female", "Male", "Female", "Female", "Male", "Female", "Male", "Male", "Female")))

is_outlier<-function(x){return(x<quantile(x,0.25)-1.5*IQR(x)|x>quantile(x,0.75)+1.5*IQR(x))}

incomeplot<-df%>%select(drating,JntInc)%>%mutate(outlier=is_outlier(drating))%>%filter(drating>=0 & outlier==FALSE &JntInc<=31 & JntInc>0)%>%ggplot(aes(x=drating,y=JntInc,colour=Sex)+geom_point()+labs(title="Alcohol Consumption Based on Income", "Alcohol Units", "Annual Income"))
#> Error in aes(x = drating, y = JntInc, colour = Sex) + geom_point() + labs(title = "Alcohol Consumption Based on Income", : non-numeric argument to binary operator

Created on 2019-04-08 by the reprex package (v0.2.1)

This title sounds like clickbait, BTW

Hello,
I changed the title to something more appropriate.
Best regards,
M

The two major things I changed:

  1. In one step of making incomeplot you used a select() function to choose only the columns drating and JntInc. After that step your data frame no longer has a Sex column.
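  2. The parentheses in your ggplot call were incorrect, so geom_point() and labs() ended up inside ggplot() and were added to aes() instead of to the plot. I put a closing parenthesis at the end of ggplot() and removed one at the end of labs(). I also named the x and y arguments in labs().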
library(ggplot2)
library(dplyr)

df <- data.frame(drating = c(0, -2, -2, 18.2125, 3.587, 0, -2, 0, 0, 0, -2, -2, 1.7155, 
                           0.116, -2, -2, 0.058, 4.5, 0.808, 0.145), 
               JntInc = c(3L, 3L, 3L, 19L, 19L, 19L, 19L, 8L, 18L, 18L, 18L, 18L, 21L, 
                          21L, 21L, 21L, 6L, 6L, 19L, 19L), 
               Sex = as.factor(c("Female", "Male", "Female", "Male", "Female", "Female", 
                                 "Female", "Male", "Female", "Male", "Male", "Female", 
                                 "Male", "Female", "Female", "Male", "Female", "Male", 
                                 "Male", "Female")))

is_outlier <- function(x) {
  # TRUE for values more than 1.5 * IQR below the first or above the third quartile
  return(x < quantile(x, 0.25) - 1.5 * IQR(x) | x > quantile(x, 0.75) + 1.5 * IQR(x))
}

incomeplot <- df %>% # select(drating, JntInc) %>%  (removed: it dropped the Sex column)
  mutate(outlier = is_outlier(drating)) %>%
  filter(drating >= 0 & outlier == FALSE & JntInc <= 31 & JntInc > 0) %>%
  ggplot(aes(x = drating, y = JntInc, colour = Sex)) +
  geom_point() +
  labs(title = "Alcohol Consumption Based on Income",
       x = "Alcohol Units",
       y = "Annual Income")
incomeplot

Created on 2019-04-08 by the reprex package (v0.2.1)
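
If you would rather keep a select() step, one option is to list Sex in it as well. A minimal sketch, assuming a small made-up data frame toy that only shares its column names with df:

```r
library(dplyr)

# Made-up rows for illustration; only the column names mirror df
toy <- data.frame(drating = c(0, 1.5, 3),
                  JntInc  = c(3L, 8L, 18L),
                  Sex     = factor(c("Female", "Male", "Female")))

# select() keeps only the columns it is given, so Sex disappears here
without_sex <- toy %>%
  select(drating, JntInc)
names(without_sex)
#> [1] "drating" "JntInc"

# Listing Sex keeps it available for aes(colour = Sex) later in the pipe
with_sex <- toy %>%
  select(drating, JntInc, Sex)
names(with_sex)
#> [1] "drating" "JntInc"  "Sex"
```

Calling names() on an intermediate result like this is a quick way to check which columns are still present before they reach ggplot().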
