Warning message: In dist(Data) : NAs introduced by coercion

I have fixed the function error by calling the library (cluster), however I still have the first error:

Warning message:
In dist(Data) : NAs introduced by coercion

Can I ignore this? Or do I have to address it? If so, how?

Well, it's a warning, not an error, so you should understand it. Look at your input data, then look at your output data and see where the NAs were created. Then look at the help for dist() and see if you can determine why NAs are being produced. Your input data may have missing values, or it may have character values. It's hard to guess without seeing your data.


There are none. Here's a sample of the output:

               1            2            3            4
1   0.000000e+00    951.77066   1351.64209    166.89518
               5            6         7           8            9
1   1.352908e+02    258.11690  51088.23   2212.0103    515.47992
              10           11           12           13
1     2367.57026 1.217035e+02    340.10330    330.23676
             14           15           16           17
1     4405.2212 3.313772e+02    193.60513 3.168459e+02
              18           19           20           21
1      422.42520 3.301973e+02    174.90726   1972.93727
              22           23           24           25
1      160.52936   2084.11335    137.17138 9.516972e+01
              26           27           28           29
1      168.36518 3.319189e+02    335.60344 3.331150e+02
             30           31           32           33
1    16337.1944     60.76974 3.310742e+02   8171.72976
              34           35           36          37
1     6540.31174    972.40425    383.52597   1320.2678
             38           39           40           41
1      914.1683    821.99025    566.86831    489.54051
              42           43           44           45
1     3742.81118    174.71192    217.55584    345.37481
              46          47           48           49
1      570.73085   2060.6133    521.34475     29.07055
              50           51           52          53

There's something causing NAs in your data. Absence of evidence in a few lines of data is not evidence of absence in all your data.

If some of your data has NAs, you might get that warning. So you can look for rows containing NA with something like this:

 library(dplyr)  # provides %>%, filter_all() and any_vars()

 Data %>% 
   filter_all(any_vars(is.na(.))) 

You should probably also run str(Data) to make sure every column in your data is of the expected type.
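As a minimal sketch of that type check in base R: the data frame `toy` below is a made-up stand-in for Data (its columns and values are assumptions, not the real data). It lists the class of each column, counts missing values per column, and names the columns that are not numeric.

```r
# Hypothetical stand-in for Data: one factor column and two numeric columns
toy <- data.frame(
  Gender = factor(c("a", "b", "b")),
  Age = c(30.8, NA, 24.5),
  MonthlyExpenses = c(0L, 4L, 0L)
)

# Class of each column; anything that is not numeric or integer
# is a candidate for causing trouble in dist()
col_classes <- sapply(toy, class)
col_classes

# Number of missing values in each column
na_counts <- colSums(is.na(toy))
na_counts

# Names of the columns that are not numeric
non_numeric <- names(toy)[!sapply(toy, is.numeric)]
non_numeric
```

On the real data, replacing `toy` with `Data` should show at a glance which columns are factors and whether any column already contains NA.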

Thanks. Is there a difference between having NA's in your output and NA's introduced? I ask because when I run the following code it didn't list any:

#Check for NAs
str(Data)

Sample result:

'data.frame':	653 obs. of  16 variables:
 $ Gender            : Factor w/ 2 levels "a","b": 2 1 1 2 2 2 2 1 2 2 ...
 $ Age               : num  30.8 58.7 24.5 27.8 20.2 ...
 $ MonthlyExpenses   : int  0 4 0 1 5 4 1 11 0 4 ...
 $ MaritalStatus     : Factor w/ 3 levels "l","u","y": 2 2 2 2 2 2 2 2 3 3 ...
 $ HomeStatus        : Factor w/ 3 levels "g","gg","p": 1 1 1 1 1 1 1 1 3 3 ...

This is really confusing me....

Running str() does not check for NAs. It tells you the structure of your Data object.

The recipe for checking for NAs is in my response directly following the sentence "So you can look for rows containing NA with something like this:"
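If the dplyr version gives an error (for example because the package is not loaded), a base R check does the same job. This is only a sketch on a made-up data frame `toy`, which is an assumption standing in for Data.

```r
# Hypothetical stand-in for Data, with a missing value in rows 2 and 3
toy <- data.frame(
  Gender = factor(c("a", "b", NA)),
  Age = c(30.8, NA, 24.5)
)

# Is there any NA anywhere in the data frame?
has_na <- any(is.na(toy))
has_na

# Keep only the rows that contain at least one NA
rows_with_na <- toy[!complete.cases(toy), ]
rows_with_na
```

If `has_na` is FALSE for the real data, the NAs are not in the input and must be created during the conversion that dist() performs.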

I made that comment because I thought it would display NA in the data.

I tried the code: Data %>%
filter_all(any_vars(is.na(.)))

I got an error..... so that's why I tried the str(Data), to see why, which helped to identify the error as the user "rensa" has pointed out by highlighting the output.

Now I just need to figure out what to do to resolve this.....

Generally speaking, it sounds like dist() had to coerce some of the values you gave it into compatible data types, and when it did that the result was to (somehow) create NAs.
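The wording of the warning is the same one R gives whenever text that is not a number is converted to numeric. The sketch below shows that on a plain character vector, and then shows that a data frame mixing a factor with a numeric column becomes a character matrix under as.matrix(). That dist() goes through a conversion like this internally is my assumption about the mechanism; the object `toy` is made up.

```r
# Converting non-numeric text to numbers produces NA plus a warning;
# here the warning message is captured instead of printed
msg <- tryCatch(
  as.numeric(c("1", "a", "b")),
  warning = function(w) conditionMessage(w)
)
msg
#> [1] "NAs introduced by coercion"

# The converted values themselves (warning silenced for display)
vals <- suppressWarnings(as.numeric(c("1", "a", "b")))
vals
#> [1]  1 NA NA

# A data frame with a factor column turns into a character matrix
toy <- data.frame(g = factor(c("a", "b")), x = c(1, 2))
m <- as.matrix(toy)
typeof(m)
#> [1] "character"
```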

You might take a good look at the documentation for dist() and see if what you’re asking it to do makes sense. The data frame you’re feeding it has a lot of variables of a lot of different types (you don’t seem to have included all of the str() output above, but already I see both factors and numerics).

dist() uses one of several possible distance measures to “compute the distances between the rows of a data matrix”. It expects to get a matrix of values, but it will try to work with a data frame if that’s what you give it. However, dist() doesn’t know what to do with factors (= categorical data) — I strongly suspect this is the source of your NAs:

# Create a data frame with some categorical (factor) data
# and some numeric data
dfr <- data.frame(
  lc = letters[1:4],
  uc = LETTERS[1:4],
  num1 = c(1, 1, 1, 1),
  num2 = c(0, 1, 0, 1),
  stringsAsFactors = TRUE  # needed in R >= 4.0 to get factor columns
)

dfr
#>   lc uc num1 num2
#> 1  a  A    1    0
#> 2  b  B    1    1
#> 3  c  C    1    0
#> 4  d  D    1    1
str(dfr)
#> 'data.frame':    4 obs. of  4 variables:
#>  $ lc  : Factor w/ 4 levels "a","b","c","d": 1 2 3 4
#>  $ uc  : Factor w/ 4 levels "A","B","C","D": 1 2 3 4
#>  $ num1: num  1 1 1 1
#>  $ num2: num  0 1 0 1

# Compute Euclidean distance of the rows, looking only at
# the numeric columns
dist(dfr[3:4])
#>   1 2 3
#> 2 1    
#> 3 0 1  
#> 4 1 0 1

# Compute Euclidean distance of the rows, looking only at
# the factor columns
dist(dfr[1:2])
#> Warning in dist(dfr[1:2]): NAs introduced by coercion
#>    1  2  3
#> 2 NA      
#> 3 NA NA   
#> 4 NA NA NA

# For the whole data frame...
dist(dfr)
#> Warning in dist(dfr): NAs introduced by coercion
#>          1        2        3
#> 2 1.414214                  
#> 3 0.000000 1.414214         
#> 4 1.414214 0.000000 1.414214

Created on 2018-10-01 by the reprex package (v0.2.1)

Compare to what you get if you convert the factors into numeric values yourself, first:

dfr2 <- as.data.frame(lapply(dfr, as.numeric))
dfr2
#>   lc uc num1 num2
#> 1  1  1    1    0
#> 2  2  2    1    1
#> 3  3  3    1    0
#> 4  4  4    1    1

dist(dfr2)
#>          1        2        3
#> 2 1.732051                  
#> 3 2.828427 1.732051         
#> 4 4.358899 2.828427 1.732051

But that result isn’t terribly meaningful!
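One way to see why: the integer codes behind a factor depend only on the order of its levels, which is alphabetical by default. In this small sketch (the colour values are made up for illustration), the same three observations give different distances just because the levels are listed in a different order.

```r
colours <- c("red", "green", "blue")

# Default levels are alphabetical: blue, green, red
f_default <- factor(colours)
as.numeric(f_default)
#> [1] 3 2 1

# Same data, levels listed in a different (equally arbitrary) order
f_custom <- factor(colours, levels = c("red", "blue", "green"))
as.numeric(f_custom)
#> [1] 1 3 2

# The "distances" between the same three observations now disagree
d_default <- dist(as.numeric(f_default))
d_custom <- dist(as.numeric(f_custom))
d_default
d_custom
```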

Thanks for the feedback, it's been very helpful. I have converted the data frame to a matrix:

#convert dataframe to a matrix
data.matrix(Data, rownames.force = NA)

And this is the output:

    Gender   Age MonthlyExpenses MaritalStatus HomeStatus
 1       2 30.83               0             2          1
 2       1 58.67               4             2          1
 3       1 24.50               0             2          1
 4       2 27.83               1             2          1
 5       2 20.17               5             2          1
 6       2 32.08               4             2          1
 7       2 33.17               1             2          1
 8       1 22.92              11             2          1
 9       2 54.42               0             3          3
10       2 42.50               4             3          3

I only included the first 10 rows of 62 as a sample, didn't think it would be useful to list the whole table. However, now that I have converted it to a matrix, I received the following:

Warning message:
 In dist(Data, method = "euclidean", diag = FALSE, upper = FALSE) :
   NAs introduced by coercion

So should I use names(Data) to assign a variable name? Or is there another option?

You didn't assign your converted data frame to a variable name, so no changes were actually made to Data. I suspect that's why you're getting the same warning.

I can't go any further without first making sure that you've seen our community guidelines on asking questions related to class assignments: FAQ: Homework Policy. The most important parts are:

  • Please don't post verbatim text from your assignment
  • Please clearly state when a question you post here is related to an assignment

Can you please edit the post where you included verbatim assignment text, changing it to a paraphrase, instead? It's definitely useful to understand what you're trying to do (that's part of the reason for the second rule!), but it needs to be written in your own words.

Thanks for the feedback. I tried to avoid posting verbatim text and to use my own words, but we seemed to be going around in circles. I figured if I wrote what was required we could resolve this once and for all. Your feedback so far has been very helpful, I really appreciate it.

R's funky arrow thing is the assignment operator!

# This just prints the result to the console, but Data hasn't changed
data.matrix(Data)

# This assigns the result to a new name
Data_mat <- data.matrix(Data)

# This assigns the result to the same name as before,
# replacing the old Data data frame with the new Data matrix
Data <- data.matrix(Data)

Thanks for removing the verbatim text from your problem set. I think the paraphrase I might try is: you've been instructed to use the daisy() function from the cluster package to build a Gower dissimilarity object, which you are then supposed to convert into a matrix. Does that sound right?

So, um, given that you've already been pointed in the direction of daisy() — why try to make dist() work in the first place? Your stumbling block has been the non-numeric data in your data set. If you take a look at the documentation for daisy(), you'll read the following:

Compared to dist whose input must be numeric variables, the main feature of daisy is its ability to handle other variable types as well (e.g. nominal, ordinal, (a)symmetric binary) even when different types occur in the same data set.

Thanks for the explanation, it makes more sense... and yes your paraphrase is correct.

I did try to make dist() work in the first place:

#convert the Gower dissimilarity object into a distance matrix
Data<-dist(Data, method = "euclidean", diag = FALSE, upper = FALSE)
dist(Data, method = "euclidean", diag = FALSE, upper = FALSE)
as.matrix(dist(Data))

And this works, it only prints numeric data. However, the next step required doesn't work:

Dist <- as.matrix(Dist)

And I get the following error:

Error in as.matrix(Dist) : object 'Dist' not found

I realise that I need another approach as I have converted the data frame a step too early. And I am obviously missing a qualifier to declare Dist <- as.matrix(Dist).

It's syntactically acceptable to write the output of one step back to the same object you started with, but it will make debugging, testing, and learning incredibly hard. You should really be assigning each step to a new object. Like:

#convert the Gower dissimilarity object into a distance matrix
data_dist_matrix <- dist(Data, method = "euclidean", diag = FALSE, upper = FALSE)

# do something with the dist matrix
# (new_process() is a placeholder for whatever your next step is)
data_dist_matrix_new_process <- new_process(data_dist_matrix)

As for your error, it looks like you are applying the as.matrix() function to an object called Dist... where did Dist come from? You show no example of creating that object.


This code isn't actually creating a Gower dissimilarity object, because dist() can't do that. Maybe take another look at the daisy() documentation page? It explains the algorithmic differences between the Gower method and other methods.

Here's a simple example of why it doesn't necessarily make sense to compute a distance matrix of categorical data without specifically adjusting the algorithm:

# How different are the rows of this data frame from each other? 
dfr <- data.frame(
  x = c(0, 0, 0),
  y = letters[1:3],
  stringsAsFactors = TRUE  # needed in R >= 4.0 to get a factor column
)
dfr
#>   x y
#> 1 0 a
#> 2 0 b
#> 3 0 c

# Converting the categorical data to a numeric
# category count doesn't necessarily yield a sensible result
dfr_num <- data.matrix(dfr)
dfr_num
#>      x y
#> [1,] 0 1
#> [2,] 0 2
#> [3,] 0 3

# Is Row 3 really twice as different from Row 1 as Row 2 is?
dist(dfr_num)
#>   1 2
#> 2 1  
#> 3 2 1

library(cluster)

# daisy handles categorical data more sensibly
daisy(dfr, metric = "gower")
#> Dissimilarities :
#>     1   2
#> 2 0.5    
#> 3 0.5 0.5
#> 
#> Metric :  mixed ;  Types = I, N 
#> Number of objects : 3

Created on 2018-10-02 by the reprex package (v0.2.1)

Thanks..... this is my issue, I have been trying to create the object, but all I get is an error or a warning. I am running out of ideas and time... that's why I have come here for advice. I don't know what to do! I have spent days researching to figure this out and everything I try fails.

Thanks for the reference.... I've had another shot, but I seem to be making it worse:

#Use R function daisy() from package cluster
#compute a Gower dissimilarity (distance) matrix between the data records
#refer to the result as “Dist”.
daisy(Data,metric = c("gower"),stand = FALSE, type = list(data.frame(Data)), weights = rep.int(1, p),
warnBin = warnType, warnAsym = warnType, warnConst = warnType,
warnType = TRUE)

Result:

Error in daisy(Data, metric = c("gower"), stand = FALSE, type = list(), :
x is not a dataframe or a numeric matrix.

I thought that I had to reference "data.frame(Data)" ..... so what am I missing? I don't understand what I need to reference to get this code to work!

Based on the error message, I think you need to check two things:

  1. Is Data still a data frame? I suspect you may have run a lot of different things and maybe have overwritten your original Data with something else. Check by running str(Data). It should give you the same output as in your previous message. If Data is not still a data frame, you may need to restart your R session, clear your Global environment, and re-run the code that originally created Data to get back to your starting place.

  2. I think you are maybe misunderstanding what the type argument is for — it's likely that you don't need it at all, and if you do need it, it needs to be specified differently.
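For the first check, a quick way to see whether an object is still a data frame is shown below. This is a sketch on made-up objects: `toy` stands in for the original Data, and `toy_overwritten` stands in for what Data becomes if the result of dist() is assigned back to it.

```r
# Hypothetical stand-in for the original Data
toy <- data.frame(g = factor(c("a", "b", "a")), x = c(1, 2, 3))
is.data.frame(toy)
#> [1] TRUE

# What you end up with if a dist() result is assigned back to the same name
toy_overwritten <- dist(toy["x"])
class(toy_overwritten)
#> [1] "dist"
is.data.frame(toy_overwritten)
#> [1] FALSE
```

If `is.data.frame(Data)` returns FALSE, the original data frame has been overwritten and needs to be re-created.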

You don't need to specify arguments that already have default values, unless you want to change those defaults. Required arguments are ones that don't have any value pre-specified in the Usage section of the documentation. The only required argument for daisy() is x (the argument where you supply the data).

I think you've gotten yourself turned around a bit with the type argument. This argument is only needed when you have columns in your data frame needing special treatment. From the documentation page for daisy(), type is described as:

list for specifying some (or all) of the types of the variables (columns) in x. The list may contain the following components: "ordratio" (ratio scaled variables to be treated as ordinal variables), "logratio" (ratio scaled variables that must be logarithmically transformed), "asymm" (asymmetric binary) and "symm" (symmetric binary variables). Each component's value is a vector, containing the names or the numbers of the corresponding columns of x. Variables not mentioned in the type list are interpreted as usual (see argument x).

And when we follow the instruction to "see argument x", we find:

numeric matrix or data frame, of dimension n×p, say. Dissimilarities will be computed between the rows of x. Columns of mode numeric (i.e. all columns when x is a matrix) will be recognized as interval scaled variables, columns of class factor will be recognized as nominal variables, and columns of class ordered will be recognized as ordinal variables. Other variable types should be specified with the type argument. Missing values (NAs) are allowed.

It takes some work to parse through all of this, but the upshot is:

  • You only need to specify anything for type if there are columns in your data frame that are not your basic numeric, factor, or ordered factor variables.
  • The only other types of columns you can specify are: ratio-scaled variables to be treated as ordinals, ratio-scaled variables to be log transformed, asymmetric binary variables, and symmetric binary variables.

If you don't know for sure that you have columns needing this sort of special treatment, don't specify the type argument at all. (If you did need to specify it, you'd do so like:
type = list(ordratio = c("col_name1", "col_name2"), symm = c("col_name3", "col_name4"))
where "col_nameX" would be an actual column name from your data frame).
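To make the type syntax concrete, here is a small sketch on a made-up data frame. The column names (`age`, `status`, `smoker`) and the choice to treat `smoker` as an asymmetric binary variable are assumptions for illustration only; nothing in the thread says the real data has such a column. The binary column is stored as a factor with levels "0" and "1", following the pattern of the `flower` example data in the cluster documentation.

```r
library(cluster)

# Hypothetical data: a numeric column, a nominal factor,
# and a binary factor coded with levels "0" and "1"
toy <- data.frame(
  age = c(30, 58, 24, 27),
  status = factor(c("u", "u", "y", "l")),
  smoker = factor(c(0, 1, 0, 0))
)

# Default handling: no type argument at all
d_default <- daisy(toy, metric = "gower")

# Same data, but declaring smoker as an asymmetric binary variable
d_asymm <- daisy(toy, metric = "gower", type = list(asymm = "smoker"))

d_default
d_asymm
```

Comparing the two printed results shows whether declaring the column changes the dissimilarities for your data. If you cannot say why a column should be asymmetric binary, the default call is the safer choice.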

Given all of that — what happens if you restart your session, clear your environment, re-create your Data object, and then run the following?

daisy(Data, metric = "gower")
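
Putting the pieces of this thread together, the whole task might look like the sketch below. The data frame `toy` is a made-up stand-in for Data (its values are assumptions), and `gower_diss` is just a name chosen here for the intermediate object. Each step is assigned to its own name, and Dist only exists after the line that creates it.

```r
library(cluster)

# Hypothetical stand-in for Data: a factor plus two numeric columns
toy <- data.frame(
  Gender = factor(c("b", "a", "a", "b")),
  Age = c(30.83, 58.67, 24.50, 27.83),
  MonthlyExpenses = c(0L, 4L, 0L, 1L)
)

# Step 1: Gower dissimilarities between the rows
gower_diss <- daisy(toy, metric = "gower")

# Step 2: convert the dissimilarity object into an ordinary matrix
Dist <- as.matrix(gower_diss)

# One row and one column per record in the data
dim(Dist)
#> [1] 4 4
```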