setdiff function requires x,y inputs of the same class. If x is a dataframe and y is a vector how do I use this function?

mmarion · April 6, 2023, 12:42am

I worked out something that is running ok. If there is another way please let me know!


zf <- df2
id_list <- list()

for (i in 1:11) {
  v <- paste0("x", i)
  print(v)

  # rows of zf whose z-score in df3 exceeds the cutoff (df3 must be row-aligned with zf)
  data_outliers <- subset(zf, abs(df3[[v]]) > zstat)
  numOutliers <- nrow(data_outliers)
  cat("Number of outliers is ", numOutliers, "\n")
  cat("outlier ids are", "\n")
  print(data_outliers$id)

  id_list[[i]] <- unlist(data_outliers$id)

  # <= rather than < so rows exactly at the cutoff are counted and the totals add up
  data_nooutliers <- subset(zf, abs(df3[[v]]) <= zstat)
  numNooutliers <- nrow(data_nooutliers)
  cat("\n", "Number of nooutliers is ", numNooutliers, "\n")
  cat("first 25 nooutlier ids are", "\n")
  print(head(data_nooutliers$id, 25))

  cat("\n", "numOutliers+numNooutliers=", numOutliers + numNooutliers, "\n", "\n")
}

# a is a vector of unique id values that are outliers
# outlier.df is an extraction of df2 containing all outlier rows
a <- unique(unlist(id_list))
class(a) #integer vector
head(a,10) # 4 46 57 114 198 206 207 210 213 242 ...
length(a)  # 2173
outlier.df <- df2[match(a, df2$id), ] # select rows by id value, not by row position
dim(outlier.df) # 2173 x 17
names(outlier.df)
head(outlier.df,10) #id column: 4 46 57 114 198 206 207 210 213 242

# keep.df is a dataframe consisting of ids not in a
# keep.df is a dataframe where all predictors x1:x11 are not outliers
keep.df <- subset(df2,!id %in% a)
class(keep.df) #dataframe
dim(keep.df) # 4324 x 17
head(keep.df,10) # id column: 7 10 11 12 17 19 21 22 24 25

# in summary, we have outlier.df and keep.df dataframes
# df7 is a dataframe where predictors x1:x11 have no outliers
df7 <- keep.df
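
On the question in the title: setdiff() compares two vectors, so one option is to pass it the id column of the data frame rather than the data frame itself, then filter rows with %in%. The sketch below is only an illustration on made-up data. The small df2 and df3, their values, and the cutoff zstat = 3 are assumptions standing in for the poster's objects. It also shows that the per-column loop can be replaced by a single vectorised test over all z-score columns.

```r
# Hypothetical stand-ins for the poster's objects (not the real data)
df2 <- data.frame(id = c(4L, 7L, 10L, 46L, 57L),
                  y  = c(1.2, 0.8, 1.1, 0.9, 1.0))
df3 <- data.frame(x1 = c(3.4, 0.2, -0.4, 0.1, -3.3),   # z-scores, row-aligned with df2
                  x2 = c(0.5, -0.1, 0.3, 3.8, 0.2))
zstat <- 3                                              # assumed cutoff

# TRUE for a row if any z-score column exceeds the cutoff in absolute value
is_outlier <- rowSums(abs(as.matrix(df3)) > zstat) > 0

# a: ids of rows with at least one outlying predictor
a <- df2$id[is_outlier]

# setdiff() on two vectors: ids in df2 that are not in a
keep_ids <- setdiff(df2$id, a)

# Filter by id value, not by row position
keep.df    <- df2[df2$id %in% keep_ids, ]
outlier.df <- df2[df2$id %in% a, ]
```

If the z-score columns can contain NA, rowSums() would need na.rm = TRUE (or the NAs handled beforehand), since a single NA otherwise makes the whole row's flag NA.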

technocrat · April 7, 2023, 10:10am

The lack of the data spoils what would otherwise be a reprex (see the FAQ). Many more people are willing to help with concrete code if they don't have to reverse engineer the question.

mmarion · April 7, 2023, 1:20pm

OK. It is just that it takes time.

mmarion · April 9, 2023, 5:53pm

The main reason I don't do reprex is the dataset. I just submitted a question where I used the upload button to upload the data as a jpg image. How would I do that using reprex? I work with csv files mostly and print out the data set when I can.

jrkrideau · April 9, 2023, 6:33pm

A handy way to supply some sample data is the dput() function. Just run dput(mydata), where mydata is your data, then copy the output and paste it here. For a large dataset, something like dput(head(mydata, 100)) should supply the data we need.
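
As a rough illustration of why dput() output is convenient to paste: it is plain R code that rebuilds the object. The sketch below uses a made-up mydata (an assumption, not anyone's real data), captures the dput() text for the first few rows, and evaluates it again the way a reader pasting it into a session would.

```r
# Hypothetical data frame standing in for a CSV that was read in
mydata <- data.frame(id = 1:200, x1 = (1:200) / 4)

# This is what would be copied into the forum post
dput(head(mydata, 5))

# Simulate a reader pasting that text into their own session
sample_text <- capture.output(dput(head(mydata, 5)))
restored <- eval(parse(text = sample_text))

# The rebuilt object should match the original sample
all.equal(restored, head(mydata, 5))
```

Note that dput() prints doubles to about 15 significant digits by default, so values with long decimal expansions may come back equal only up to a small tolerance.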

technocrat · April 9, 2023, 6:37pm

The data only has to be representative enough to illustrate the problem. It doesn't have to be all the data or even your data; it can be fake data, so long as the code produces the same problem for which help is needed. Ideally, a reprex (see the FAQ) can be cut and pasted into a session and lets everyone get to the same sticking point.

The simplest way to include representative data is to use

dput(my_data_object)

Here's an example with mtcars

# with dput(mtcars)
the_data <- structure(list(
  mpg = c(
    21, 21, 22.8, 21.4, 18.7, 18.1, 14.3,
    24.4, 22.8, 19.2, 17.8, 16.4, 17.3, 15.2, 10.4, 10.4, 14.7, 32.4,
    30.4, 33.9, 21.5, 15.5, 15.2, 13.3, 19.2, 27.3, 26, 30.4, 15.8,
    19.7, 15, 21.4
  ), cyl = c(
    6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8,
    8, 8, 8, 8, 8, 4, 4, 4, 4, 8, 8, 8, 8, 4, 4, 4, 8, 6, 8, 4
  ),
  disp = c(
    160, 160, 108, 258, 360, 225, 360, 146.7, 140.8,
    167.6, 167.6, 275.8, 275.8, 275.8, 472, 460, 440, 78.7, 75.7,
    71.1, 120.1, 318, 304, 350, 400, 79, 120.3, 95.1, 351, 145,
    301, 121
  ), hp = c(
    110, 110, 93, 110, 175, 105, 245, 62, 95,
    123, 123, 180, 180, 180, 205, 215, 230, 66, 52, 65, 97, 150,
    150, 245, 175, 66, 91, 113, 264, 175, 335, 109
  ), drat = c(
    3.9,
    3.9, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,
    3.07, 3.07, 3.07, 2.93, 3, 3.23, 4.08, 4.93, 4.22, 3.7, 2.76,
    3.15, 3.73, 3.08, 4.08, 4.43, 3.77, 4.22, 3.62, 3.54, 4.11
  ), wt = c(
    2.62, 2.875, 2.32, 3.215, 3.44, 3.46, 3.57, 3.19,
    3.15, 3.44, 3.44, 4.07, 3.73, 3.78, 5.25, 5.424, 5.345, 2.2,
    1.615, 1.835, 2.465, 3.52, 3.435, 3.84, 3.845, 1.935, 2.14,
    1.513, 3.17, 2.77, 3.57, 2.78
  ), qsec = c(
    16.46, 17.02, 18.61,
    19.44, 17.02, 20.22, 15.84, 20, 22.9, 18.3, 18.9, 17.4, 17.6,
    18, 17.98, 17.82, 17.42, 19.47, 18.52, 19.9, 20.01, 16.87,
    17.3, 15.41, 17.05, 18.9, 16.7, 16.9, 14.5, 15.5, 14.6, 18.6
  ), vs = c(
    0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0,
    0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1
  ), am = c(
    1,
    1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
    0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1
  ), gear = c(
    4, 4, 4, 3,
    3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3,
    3, 3, 4, 5, 5, 5, 5, 5, 4
  ), carb = c(
    4, 4, 1, 1, 2, 1, 4,
    2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2, 2, 4, 2, 1,
    2, 2, 4, 6, 8, 2
  )
), row.names = c(
  "Mazda RX4", "Mazda RX4 Wag",
  "Datsun 710", "Hornet 4 Drive", "Hornet Sportabout", "Valiant",
  "Duster 360", "Merc 240D", "Merc 230", "Merc 280", "Merc 280C",
  "Merc 450SE", "Merc 450SL", "Merc 450SLC", "Cadillac Fleetwood",
  "Lincoln Continental", "Chrysler Imperial", "Fiat 128", "Honda Civic",
  "Toyota Corolla", "Toyota Corona", "Dodge Challenger", "AMC Javelin",
  "Camaro Z28", "Pontiac Firebird", "Fiat X1-9", "Porsche 914-2",
  "Lotus Europa", "Ford Pantera L", "Ferrari Dino", "Maserati Bora",
  "Volvo 142E"
), class = "data.frame")
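
When the real data can't be shared, simulated data can serve the same purpose. The following is a minimal sketch, assuming a layout similar to the one in the question (an id column plus numeric predictors). The name fake, the column names, the planted values, and the cutoff of 3 are all made up for illustration.

```r
set.seed(42)   # so every reader gets the same simulated values
n <- 50

# Hypothetical data shaped like the question: an id plus numeric predictors
fake <- data.frame(
  id = seq_len(n),
  x1 = rnorm(n),
  x2 = rnorm(n)
)

# Plant two obvious outliers so the problem is guaranteed to show up
fake$x1[c(3, 17)] <- c(8, -9)

# Column-wise z-scores, then the ids of rows beyond an assumed cutoff of 3
z <- scale(fake[, c("x1", "x2")])
flagged <- fake$id[rowSums(abs(z) > 3) > 0]
flagged
```

Because the data are generated in the code itself, nothing needs to be uploaded; the whole block can be pasted into a post as is.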
