# Identify is not accurate

Here is the problem:
When I use cook's distance to check influential points in SLR, I used two methods.
First one:

plot(mortality.model, which = 4)

This one gives me the correct answer. Second one:

plot(cooks.distance(mortality.model), type = 'p')
identify(cooks.distance(mortality.model))

This one gives me the wrong answer, but very close to the correct answer.

Build the model:

mortality.model <- lm(log(infant) ~ log(income))

By the way, the dataset has NA values. The dput result:

structure(list(X = structure(c(4L, 5L, 7L, 15L, 23L, 29L, 30L, 101L,
41L,43L, 46L, 61L, 62L, 66L, 73L, 79L, 86L, 87L, 10L, 97L, 2L, 25L, 38L,
39L, 40L, 52L, 65L, 75L, 100L, 3L, 9L, 18L, 19L, 21L, 24L, 32L, 33L, 42L,
45L, 50L, 55L, 58L, 63L, 68L, 71L, 77L, 83L, 89L, 93L, 94L, 99L, 103L,
105L, 8L, 14L, 20L, 26L, 27L, 31L, 36L, 44L, 47L, 80L, 51L, 59L, 69L, 70L,
72L, 88L, 91L, 95L, 81L, 1L, 6L,11L, 12L, 13L, 16L, 17L, 22L, 28L, 34L,
35L, 37L, 48L, 49L, 53L, 54L, 56L, 57L, 60L, 64L, 67L, 74L, 76L, 78L, 84L,
85L, 90L, 92L, 96L, 98L, 82L, 102L, 104L), .Label = c("Afganistan",
"Bolivia", "Brazil", "Britain", "Burma","Burundi","Cambodia","Cameroon",
"Egypt", "El.Salvador", "Ethiopia", "Finland", "France", "Ghana",
"Greece", "Guatemala", "Guinea", "Haiti", "Honduras", "India",
"Indonesia", "Iran", "Iraq", "Ireland", "Israel", "Italy", "Ivory.Coast",
"Jamaica", "Japan", "Jordan", "Kenya", "Laos", "Lebanon", "Liberia",
"Libya", "Madagascar", "Malawi", "Malaysia", "Mali", "Mauritania",
"Mexico", "Moroco", "Nepal", "Netherlands", "New.Zealand", "Nicaragua",
"Niger", "Nigeria", "Norway", "Pakistan", "Panama", "Papua.New.Guinea",
"Paraguay", "Peru", "Philippines", "Portugal", "Rwanda", "Saudi.Arabia",
"Sierra.Leone", "Singapore", "Somalia", "South.Africa", "South.Korea",
"South.Vietnam", "Southern.Yemen", "Spain", "Sri.Lanka", "Sudan",
"Sweden", "Switzerland", "Syria", "Taiwan", "Tanzania", "Thailand",
"United.States", "Upper.Volta", "Uruguay", "Venezuela", "West.Germany",
"Yemen", "Yugoslavia", "Zaire", "Zambia"), class = "factor"),
income = c(3426L, 3350L, 3346L, 4751L, 5029L, 3312L, 3403L,
5040L, 2009L, 2298L, 3292L, 4103L, 3723L, 4102L, 956L, 1000L,
5596L, 2963L, 2503L, 5523L, 400L, 250L, 110L, 1280L, 560L,
3010L, 220L, 1530L, 1240L, 1191L, 425L, 590L, 426L, 725L,
406L, 1760L, 302L, 2526L, 727L, 631L, 295L, 684L, 507L, 754L,
335L, 1268L, 1256L, 261L, 732L, 434L, 799L, 406L, 310L, 200L,
100L, 281L, 210L, 319L, 217L, 284L, 387L, 334L, 344L, 197L,
279L, 477L, 347L, 230L, 334L, 210L, 435L, 130L, 75L, 100L,
73L, 68L, 123L, 122L, 70L, 81L, 79L, 79L, 100L, 93L, 169L,
71L, 120L, 130L, 50L, 174L, 90L, 70L, 102L, 61L, 148L, 85L,
162L, 125L, 120L, 160L, 134L, 82L, 96L, 77L, 118L), infant = c(26.7,
23.7, 17, 16.8, 13.5, 10.1, 12.9, 20.4, 17.8, 25.7, 11.7,
11.6, 16.2, 11.3, 44.8, 71.5, 9.6, 12.8, 17.5, 17.6, 86.3,
78.5, 125, NA, 28.1, 300, 58, 650, 51.7, 59.6, 170, 78, 62.8,
54.4, 48.8, 27.8, 79.1, 22.1, 26.2, 13.6, 32, 60.9, 46, 34.1,
65.1, 20.4, 15.1, 19.1, 26.2, 76.3, 40.4, 43.3, 259, 60.4,
137, 180, 114, 58.2, 63.7, 39.3, 138, 21.3, 58, 159.2, 149,
10.2, 38.6, 67.9, 21.7, 27, 153, 100, 400, 124.3, 200, 150,
100, 190, 160, 109.6, 84.2, 216, NA, 60.6, 55, NA, 102, 148.3,
120, 187, NA, 200, 124.3, 132.9, 170, 158, 45.1, 129.4, 162.5,
127, 160, 180, 80, 50, 104), region = structure(c(3L, 4L,
4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 1L, 4L,
4L, 4L, 2L, 1L, 2L, 3L, 3L, 3L, 1L, 1L, 3L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 4L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 4L,
3L, 2L, 1L, 2L, 4L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 3L,
3L, 1L, 1L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L,
1L, 1L, 1L, 1L, 1L, 2L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 3L, 1L,
3L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 1L), .Label = c("Africa",
"Americas", "Asia", "Europe"), class = "factor"), oil = structure(c(1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("no",
"yes"), class = "factor")), class = "data.frame", row.names = c(NA,
-105L))

Here are the results:
Correct: 26th and 28th. Wrong: 25th and 27th
Could anyone explain why it happened?

Thanks

and in fact the 24th observation in the dataset has a null value to supply to the model.
Why not filter out observations with NA's in your important variables as a first step in your workflow ?
or try impute missing values, ...

Hi
The dataset was uploaded by the lecturer so we cannot modify the dataset. I did na.omit at first to filter out the NA values in the dataset. However, I found that the length of the x and y variables will be different, which caused problems in building the SLR. That's why I keep the NA values in the dataset.

Hi @Harry2908,
It's complicated and is due to the presence of NAs. By using this command:

``````plot(cooks.distance(mortality.model))
``````

you have over-ridden the default plot method for Cooks Distance data. As a result the indexing of the original dataframe gets scrambled and then `identify()` gets it wrong.
If you stick with the default plot method which is employed when you just use:

``````plot(mortality.model, which = 4)
``````

the correct indexing is shown on the graph.
So, the take-home message is that you could get `identify()` to work by manipulating the input data frame and the output but is it necessary?
HTH

``````df.mortality_with_NA <- structure(list(
X = structure(c(
4L, 5L, 7L, 15L, 23L, 29L, 30L, 101L,
41L, 43L, 46L, 61L, 62L, 66L, 73L, 79L, 86L, 87L, 10L, 97L, 2L, 25L, 38L,
39L, 40L, 52L, 65L, 75L, 100L, 3L, 9L, 18L, 19L, 21L, 24L, 32L, 33L, 42L,
45L, 50L, 55L, 58L, 63L, 68L, 71L, 77L, 83L, 89L, 93L, 94L, 99L, 103L,
105L, 8L, 14L, 20L, 26L, 27L, 31L, 36L, 44L, 47L, 80L, 51L, 59L, 69L, 70L,
72L, 88L, 91L, 95L, 81L, 1L, 6L, 11L, 12L, 13L, 16L, 17L, 22L, 28L, 34L,
35L, 37L, 48L, 49L, 53L, 54L, 56L, 57L, 60L, 64L, 67L, 74L, 76L, 78L, 84L,
85L, 90L, 92L, 96L, 98L, 82L, 102L, 104L
), .Label = c(
"Afganistan",
"Algeria", "Argentina", "Australia", "Austria", "Bangladesh", "Belgium",
"Bolivia", "Brazil", "Britain", "Burma", "Burundi", "Cambodia", "Cameroon",
"Egypt", "El.Salvador", "Ethiopia", "Finland", "France", "Ghana",
"Greece", "Guatemala", "Guinea", "Haiti", "Honduras", "India",
"Indonesia", "Iran", "Iraq", "Ireland", "Israel", "Italy", "Ivory.Coast",
"Jamaica", "Japan", "Jordan", "Kenya", "Laos", "Lebanon", "Liberia",
"Libya", "Madagascar", "Malawi", "Malaysia", "Mali", "Mauritania",
"Mexico", "Moroco", "Nepal", "Netherlands", "New.Zealand", "Nicaragua",
"Niger", "Nigeria", "Norway", "Pakistan", "Panama", "Papua.New.Guinea",
"Paraguay", "Peru", "Philippines", "Portugal", "Rwanda", "Saudi.Arabia",
"Sierra.Leone", "Singapore", "Somalia", "South.Africa", "South.Korea",
"South.Vietnam", "Southern.Yemen", "Spain", "Sri.Lanka", "Sudan",
"Sweden", "Switzerland", "Syria", "Taiwan", "Tanzania", "Thailand",
"United.States", "Upper.Volta", "Uruguay", "Venezuela", "West.Germany",
"Yemen", "Yugoslavia", "Zaire", "Zambia"
), class = "factor"),
income = c(
3426L, 3350L, 3346L, 4751L, 5029L, 3312L, 3403L,
5040L, 2009L, 2298L, 3292L, 4103L, 3723L, 4102L, 956L, 1000L,
5596L, 2963L, 2503L, 5523L, 400L, 250L, 110L, 1280L, 560L,
3010L, 220L, 1530L, 1240L, 1191L, 425L, 590L, 426L, 725L,
406L, 1760L, 302L, 2526L, 727L, 631L, 295L, 684L, 507L, 754L,
335L, 1268L, 1256L, 261L, 732L, 434L, 799L, 406L, 310L, 200L,
100L, 281L, 210L, 319L, 217L, 284L, 387L, 334L, 344L, 197L,
279L, 477L, 347L, 230L, 334L, 210L, 435L, 130L, 75L, 100L,
73L, 68L, 123L, 122L, 70L, 81L, 79L, 79L, 100L, 93L, 169L,
71L, 120L, 130L, 50L, 174L, 90L, 70L, 102L, 61L, 148L, 85L,
162L, 125L, 120L, 160L, 134L, 82L, 96L, 77L, 118L
), infant = c(
26.7,
23.7, 17, 16.8, 13.5, 10.1, 12.9, 20.4, 17.8, 25.7, 11.7,
11.6, 16.2, 11.3, 44.8, 71.5, 9.6, 12.8, 17.5, 17.6, 86.3,
78.5, 125, NA, 28.1, 300, 58, 650, 51.7, 59.6, 170, 78, 62.8,
54.4, 48.8, 27.8, 79.1, 22.1, 26.2, 13.6, 32, 60.9, 46, 34.1,
65.1, 20.4, 15.1, 19.1, 26.2, 76.3, 40.4, 43.3, 259, 60.4,
137, 180, 114, 58.2, 63.7, 39.3, 138, 21.3, 58, 159.2, 149,
10.2, 38.6, 67.9, 21.7, 27, 153, 100, 400, 124.3, 200, 150,
100, 190, 160, 109.6, 84.2, 216, NA, 60.6, 55, NA, 102, 148.3,
120, 187, NA, 200, 124.3, 132.9, 170, 158, 45.1, 129.4, 162.5,
127, 160, 180, 80, 50, 104
), region = structure(c(
3L, 4L,
4L, 2L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 4L, 4L, 1L, 4L,
4L, 4L, 2L, 1L, 2L, 3L, 3L, 3L, 1L, 1L, 3L, 2L, 2L, 2L, 2L,
2L, 2L, 2L, 4L, 2L, 3L, 2L, 3L, 3L, 2L, 2L, 2L, 2L, 3L, 4L,
3L, 2L, 1L, 2L, 4L, 1L, 2L, 1L, 1L, 1L, 2L, 1L, 2L, 1L, 3L,
3L, 1L, 1L, 3L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 1L, 3L,
1L, 1L, 1L, 1L, 1L, 2L, 3L, 1L, 3L, 1L, 1L, 1L, 1L, 3L, 1L,
3L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 1L
), .Label = c(
"Africa",
"Americas", "Asia", "Europe"
), class = "factor"), oil = structure(c(
1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L
), .Label = c(
"no",
"yes"
), class = "factor")
), class = "data.frame", row.names = c(
NA,
-105L
))

length(df.mortality_with_NA\$income)
length(df.mortality_with_NA\$infant)

df.mortality <- na.omit(df.mortality_with_NA)

length(df.mortality\$income)
length(df.mortality\$infant)``````

Hi @DavoWW

I think I got the answer. it is because identify function will skip NA values.

Thanks

This topic was automatically closed 21 days after the last reply. New replies are no longer allowed.