Identify is not accurate

Here is the problem:
When I used Cook's distance to check for influential points in a simple linear regression (SLR), I tried two methods.
The first one:

plot(mortality.model, which = 4)

This one gives me the correct answer. Second one:

plot(cooks.distance(mortality.model), type = 'p')
identify(cooks.distance(mortality.model))

This one gives me the wrong answer, but very close to the correct answer.
Read the data set:

df.mortality <- read.csv("mortality.csv", header = TRUE)

Build the model:

mortality.model <- lm(log(infant) ~ log(income), data = df.mortality)

By the way, the dataset has NA values. The dput result:

structure(list(X = structure(c(4L, 5L, 7L, 15L, 23L, 29L, 30L, 101L,
41L,43L, 46L, 61L, 62L, 66L, 73L, 79L, 86L, 87L, 10L, 97L, 2L, 25L, 38L,
39L, 40L, 52L, 65L, 75L, 100L, 3L, 9L, 18L, 19L, 21L, 24L, 32L, 33L, 42L,
45L, 50L, 55L, 58L, 63L, 68L, 71L, 77L, 83L, 89L, 93L, 94L, 99L, 103L,
105L, 8L, 14L, 20L, 26L, 27L, 31L, 36L, 44L, 47L, 80L, 51L, 59L, 69L, 70L,
72L, 88L, 91L, 95L, 81L, 1L, 6L,11L, 12L, 13L, 16L, 17L, 22L, 28L, 34L,
35L, 37L, 48L, 49L, 53L, 54L, 56L, 57L, 60L, 64L, 67L, 74L, 76L, 78L, 84L,
85L, 90L, 92L, 96L, 98L, 82L, 102L, 104L), .Label = c("Afganistan",
"Algeria", "Argentina", "Australia", "Austria", "Bangladesh","Belgium",
"Bolivia", "Brazil", "Britain", "Burma","Burundi","Cambodia","Cameroon",
"Canada", "Central.African.Republic", "Chad","Chile", "Colombia","Congo",
"Costa.Rica", "Dahomey", "Denmark", "Dominican.Republic", "Ecuador",
"Egypt", "El.Salvador", "Ethiopia", "Finland", "France", "Ghana",
"Greece", "Guatemala", "Guinea", "Haiti", "Honduras", "India",
"Indonesia", "Iran", "Iraq", "Ireland", "Israel", "Italy", "Ivory.Coast",
"Jamaica", "Japan", "Jordan", "Kenya", "Laos", "Lebanon", "Liberia",
"Libya", "Madagascar", "Malawi", "Malaysia", "Mali", "Mauritania",
"Mexico", "Moroco", "Nepal", "Netherlands", "New.Zealand", "Nicaragua",
"Niger", "Nigeria", "Norway", "Pakistan", "Panama", "Papua.New.Guinea",
"Paraguay", "Peru", "Philippines", "Portugal", "Rwanda", "Saudi.Arabia",
"Sierra.Leone", "Singapore", "Somalia", "South.Africa", "South.Korea",
"South.Vietnam", "Southern.Yemen", "Spain", "Sri.Lanka", "Sudan",
"Sweden", "Switzerland", "Syria", "Taiwan", "Tanzania", "Thailand",
"Togo", "Trinidad.and.Tobago", "Tunisia", "Turkey", "Uganda",
"United.States", "Upper.Volta", "Uruguay", "Venezuela", "West.Germany",
"Yemen", "Yugoslavia", "Zaire", "Zambia"), class = "factor"),
income = c(3426L, 3350L, 3346L, 4751L, 5029L, 3312L, 3403L,
5040L, 2009L, 2298L, 3292L, 4103L, 3723L, 4102L, 956L, 1000L,
5596L, 2963L, 2503L, 5523L, 400L, 250L, 110L, 1280L, 560L,
3010L, 220L, 1530L, 1240L, 1191L, 425L, 590L, 426L, 725L,
406L, 1760L, 302L, 2526L, 727L, 631L, 295L, 684L, 507L, 754L,
335L, 1268L, 1256L, 261L, 732L, 434L, 799L, 406L, 310L, 200L,
100L, 281L, 210L, 319L, 217L, 284L, 387L, 334L, 344L, 197L,
279L, 477L, 347L, 230L, 334L, 210L, 435L, 130L, 75L, 100L,
73L, 68L, 123L, 122L, 70L, 81L, 79L, 79L, 100L, 93L, 169L,
71L, 120L, 130L, 50L, 174L, 90L, 70L, 102L, 61L, 148L, 85L,
162L, 125L, 120L, 160L, 134L, 82L, 96L, 77L, 118L), infant = c(26.7,
23.7, 17, 16.8, 13.5, 10.1, 12.9, 20.4, 17.8, 25.7, 11.7,
11.6, 16.2, 11.3, 44.8, 71.5, 9.6, 12.8, 17.5, 17.6, 86.3,
78.5, 125, NA, 28.1, 300, 58, 650, 51.7, 59.6, 170, 78, 62.8,
54.4, 48.8, 27.8, 79.1, 22.1, 26.2, 13.6, 32, 60.9, 46, 34.1,
65.1, 20.4, 15.1, 19.1, 26.2, 76.3, 40.4, 43.3, 259, 60.4,
137, 180, 114, 58.2, 63.7, 39.3, 138, 21.3, 58, 159.2, 149,
10.2, 38.6, 67.9, 21.7, 27, 153, 100, 400, 124.3, 200, 150,
100, 190, 160, 109.6, 84.2, 216, NA, 60.6, 55, NA, 102, 148.3,
120, 187, NA, 200, 124.3, 132.9, 170, 158, 45.1, 129.4, 162.5,
127, 160, 180, 80, 50, 104), region = structure(c(3L, 4L,
4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 1L, 4L,
4L, 4L, 2L, 1L, 2L, 3L, 3L, 3L, 1L, 1L, 3L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 4L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 4L,
3L, 2L, 1L, 2L, 4L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 3L,
3L, 1L, 1L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L,
1L, 1L, 1L, 1L, 1L, 2L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 3L, 1L,
3L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 1L), .Label = c("Africa",
"Americas", "Asia", "Europe"), class = "factor"), oil = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("no",
"yes"), class = "factor")), class = "data.frame", row.names = c(NA,
-105L))

Here are the results:
Correct: 26th and 28th. Wrong: 25th and 27th
Could anyone explain why it happened?

Thanks

In fact, the 24th observation in the dataset has an NA value for infant, so it cannot be supplied to the model.
Why not filter out observations with NAs in your important variables as a first step in your workflow?
Or try imputing the missing values.


Hi @Harry2908,
It's complicated and is due to the presence of NAs. By using this command:

plot(cooks.distance(mortality.model)) 

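you bypass the plot method for lm objects and plot a plain numeric vector instead. Rows with NA are dropped when the model is fitted, so that vector is shorter than the original data frame. plot() and identify() then label each point by its position in the vector rather than by its original row number, so every point after a dropped row is labelled too low.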
If you stick with the default plot method which is employed when you just use:

plot(mortality.model, which = 4)

the correct indexing is shown on the graph.
So, the take-home message is that you could get identify() to work by manipulating the input data frame and the output, but is it necessary?
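As a minimal sketch of the mismatch, the toy data below (made up for illustration, not the mortality data) has one NA in row 3. It assumes the default na.action of na.omit; the names toy, fit, cd and pos are hypothetical. The Cook's distance vector keeps the original row numbers as its names, so passing those names to identify() as labels should make the plot agree with plot(fit, which = 4).

```r
# Toy data (illustrative only): row 3 has a missing response,
# and row 6 is a deliberate outlier.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(1.1, 2.0, NA, 4.2, 4.9, 12.0, 7.1, 8.0)
)

fit <- lm(y ~ x, data = toy)   # row 3 is dropped (default na.action = na.omit)
cd  <- cooks.distance(fit)

length(cd)   # one value fewer than nrow(toy)
names(cd)    # original row numbers, with "3" missing

# The most influential point: position in the vector vs. original row number
pos <- which.max(cd)
unname(pos)  # position within cd
names(pos)   # row number in toy

# identify() labels points by position unless told otherwise,
# so hand it the row names explicitly. It needs an interactive device.
if (interactive()) {
  plot(cd, type = "p")
  identify(cd, labels = names(cd))
}
```

Here the outlier sits at position 5 of the vector but comes from row 6 of the data, the same off-by-one shift as in the question.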
HTH

Please consider my code.

df.mortality_with_NA <- structure(list(
  X = structure(c(
    4L, 5L, 7L, 15L, 23L, 29L, 30L, 101L,
    41L, 43L, 46L, 61L, 62L, 66L, 73L, 79L, 86L, 87L, 10L, 97L, 2L, 25L, 38L,
    39L, 40L, 52L, 65L, 75L, 100L, 3L, 9L, 18L, 19L, 21L, 24L, 32L, 33L, 42L,
    45L, 50L, 55L, 58L, 63L, 68L, 71L, 77L, 83L, 89L, 93L, 94L, 99L, 103L,
    105L, 8L, 14L, 20L, 26L, 27L, 31L, 36L, 44L, 47L, 80L, 51L, 59L, 69L, 70L,
    72L, 88L, 91L, 95L, 81L, 1L, 6L, 11L, 12L, 13L, 16L, 17L, 22L, 28L, 34L,
    35L, 37L, 48L, 49L, 53L, 54L, 56L, 57L, 60L, 64L, 67L, 74L, 76L, 78L, 84L,
    85L, 90L, 92L, 96L, 98L, 82L, 102L, 104L
  ), .Label = c(
    "Afganistan",
    "Algeria", "Argentina", "Australia", "Austria", "Bangladesh", "Belgium",
    "Bolivia", "Brazil", "Britain", "Burma", "Burundi", "Cambodia", "Cameroon",
    "Canada", "Central.African.Republic", "Chad", "Chile", "Colombia", "Congo",
    "Costa.Rica", "Dahomey", "Denmark", "Dominican.Republic", "Ecuador",
    "Egypt", "El.Salvador", "Ethiopia", "Finland", "France", "Ghana",
    "Greece", "Guatemala", "Guinea", "Haiti", "Honduras", "India",
    "Indonesia", "Iran", "Iraq", "Ireland", "Israel", "Italy", "Ivory.Coast",
    "Jamaica", "Japan", "Jordan", "Kenya", "Laos", "Lebanon", "Liberia",
    "Libya", "Madagascar", "Malawi", "Malaysia", "Mali", "Mauritania",
    "Mexico", "Moroco", "Nepal", "Netherlands", "New.Zealand", "Nicaragua",
    "Niger", "Nigeria", "Norway", "Pakistan", "Panama", "Papua.New.Guinea",
    "Paraguay", "Peru", "Philippines", "Portugal", "Rwanda", "Saudi.Arabia",
    "Sierra.Leone", "Singapore", "Somalia", "South.Africa", "South.Korea",
    "South.Vietnam", "Southern.Yemen", "Spain", "Sri.Lanka", "Sudan",
    "Sweden", "Switzerland", "Syria", "Taiwan", "Tanzania", "Thailand",
    "Togo", "Trinidad.and.Tobago", "Tunisia", "Turkey", "Uganda",
    "United.States", "Upper.Volta", "Uruguay", "Venezuela", "West.Germany",
    "Yemen", "Yugoslavia", "Zaire", "Zambia"
  ), class = "factor"),
  income = c(
    3426L, 3350L, 3346L, 4751L, 5029L, 3312L, 3403L,
    5040L, 2009L, 2298L, 3292L, 4103L, 3723L, 4102L, 956L, 1000L,
    5596L, 2963L, 2503L, 5523L, 400L, 250L, 110L, 1280L, 560L,
    3010L, 220L, 1530L, 1240L, 1191L, 425L, 590L, 426L, 725L,
    406L, 1760L, 302L, 2526L, 727L, 631L, 295L, 684L, 507L, 754L,
    335L, 1268L, 1256L, 261L, 732L, 434L, 799L, 406L, 310L, 200L,
    100L, 281L, 210L, 319L, 217L, 284L, 387L, 334L, 344L, 197L,
    279L, 477L, 347L, 230L, 334L, 210L, 435L, 130L, 75L, 100L,
    73L, 68L, 123L, 122L, 70L, 81L, 79L, 79L, 100L, 93L, 169L,
    71L, 120L, 130L, 50L, 174L, 90L, 70L, 102L, 61L, 148L, 85L,
    162L, 125L, 120L, 160L, 134L, 82L, 96L, 77L, 118L
  ), infant = c(
    26.7,
    23.7, 17, 16.8, 13.5, 10.1, 12.9, 20.4, 17.8, 25.7, 11.7,
    11.6, 16.2, 11.3, 44.8, 71.5, 9.6, 12.8, 17.5, 17.6, 86.3,
    78.5, 125, NA, 28.1, 300, 58, 650, 51.7, 59.6, 170, 78, 62.8,
    54.4, 48.8, 27.8, 79.1, 22.1, 26.2, 13.6, 32, 60.9, 46, 34.1,
    65.1, 20.4, 15.1, 19.1, 26.2, 76.3, 40.4, 43.3, 259, 60.4,
    137, 180, 114, 58.2, 63.7, 39.3, 138, 21.3, 58, 159.2, 149,
    10.2, 38.6, 67.9, 21.7, 27, 153, 100, 400, 124.3, 200, 150,
    100, 190, 160, 109.6, 84.2, 216, NA, 60.6, 55, NA, 102, 148.3,
    120, 187, NA, 200, 124.3, 132.9, 170, 158, 45.1, 129.4, 162.5,
    127, 160, 180, 80, 50, 104
  ), region = structure(c(
    3L, 4L,
    4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 1L, 4L,
    4L, 4L, 2L, 1L, 2L, 3L, 3L, 3L, 1L, 1L, 3L, 2L, 2L, 2L, 2L,
    2L, 2L, 2L, 4L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 4L,
    3L, 2L, 1L, 2L, 4L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 3L,
    3L, 1L, 1L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L,
    1L, 1L, 1L, 1L, 1L, 2L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 3L, 1L,
    3L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 1L
  ), .Label = c(
    "Africa",
    "Americas", "Asia", "Europe"
  ), class = "factor"), oil = structure(c(
    1L,
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
    1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
  ), .Label = c(
    "no",
    "yes"
  ), class = "factor")
), class = "data.frame", row.names = c(
  NA,
  -105L
))


length(df.mortality_with_NA$income)
length(df.mortality_with_NA$infant)

df.mortality <- na.omit(df.mortality_with_NA)

length(df.mortality$income)
length(df.mortality$infant)

Hi
The dataset was uploaded by the lecturer, so we cannot modify it. I used na.omit at first to filter out the NA values. However, I found that the x and y variables then had different lengths, which caused problems when building the SLR. That's why I kept the NA values in the dataset.
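One possible cause of the length mismatch, and this is only a guess, is that na.omit() was applied to a single column rather than to the whole data frame. The sketch below uses made-up values and hypothetical names (toy, x, y, clean) to show the difference. Filtering a copy in memory does not change the lecturer's file.

```r
# Toy data (illustrative only): two missing responses.
toy <- data.frame(
  income = c(100, 200, 300, 400, 500, 600),
  infant = c(50, NA, 30, 25, NA, 12)
)

# Applying na.omit() to one column shortens only that vector.
x <- toy$income
y <- na.omit(toy$infant)
length(x)  # all rows
length(y)  # only the non-missing responses

# Dropping whole rows keeps the two variables aligned.
clean <- toy[complete.cases(toy[, c("income", "infant")]), ]
nrow(clean)
rownames(clean)  # original row numbers are preserved

# The model can then be fitted on the filtered copy.
fit <- lm(log(infant) ~ log(income), data = clean)
```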

Hi @DavoWW

I think I got the answer. It is because the rows with NA values are skipped, so identify() numbers the remaining points by position rather than by original row number.
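Another way to keep positions and row numbers in step, sketched here on made-up data with hypothetical names (toy, fit_excl, cd_excl), is to fit the model with na.action = na.exclude. The Cook's distances are then padded with NA for the dropped rows, so the vector has one entry per row of the data frame.

```r
# Toy data (illustrative only): row 3 has a missing response,
# and row 6 is a deliberate outlier.
toy <- data.frame(
  x = c(1, 2, 3, 4, 5, 6, 7, 8),
  y = c(1.1, 2.0, NA, 4.2, 4.9, 12.0, 7.1, 8.0)
)

# na.exclude drops row 3 for the fit but pads per-case results with NA.
fit_excl <- lm(y ~ x, data = toy, na.action = na.exclude)
cd_excl  <- cooks.distance(fit_excl)

length(cd_excl) == nrow(toy)  # one entry per original row
is.na(cd_excl[3])             # the dropped row shows up as NA

# Position now equals the original row number.
unname(which.max(cd_excl))

# With matching positions, the default identify() labels should be usable.
if (interactive()) {
  plot(cd_excl, type = "p")
  identify(cd_excl)
}
```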

Thanks