for loop gives output one element greater than input data length

I have the following code, which populates a vector by looking up values from one data frame in another (like a VLOOKUP or XLOOKUP in Microsoft Excel). The data is, however, too large for a spreadsheet. The problem is that the resulting vector is one element longer than the number of rows in the input data, i.e. 2,000,000 rows (data) versus 2,000,001 elements (vector).
raw_data contains the values that I want to look up, and IntdUniGroup contains the data that raw_data will be looked up against (I hope that makes sense).


intd_uni_group <- c() #creates an empty vector

for(i in 1:nrow(raw_data)){
  intd_uni_group <- c(intd_uni_group, 
                      if(is.na(raw_data$PHDuniv[i])){
                        NA
                      } else if(tolower(raw_data$PHDuniv[i]) %in% tolower(IntdUniGroup$PHDuniv)){
                        IntdUniGroup$Group[which(tolower(raw_data$PHDuniv[i]) == tolower(IntdUniGroup$PHDuniv))]
                      } else{
                        NA
                      }
                      )
}

I also tried this using the apply function below. While the resulting list has a length equal to that of the input dataframe, when I unlist it, I end up with the same scenario as above with the length being one element larger than the number of rows in the input dataframe.

intd_uni_group2 <- apply(raw_data[,"PHDuniv"],
                            1,
                            function(x) if(is.na(x)){
                                          NA
                                        } else if(tolower(x) %in% tolower(IntdUniGroup$PHDuniv)){ 
                                          IntdUniGroup$Group[which(tolower(x) == tolower(IntdUniGroup$PHDuniv))]
                                        } else{
                                          NA
                                        }
                         )

I invented data similar to what you are using and ran it with your code. I get a vector whose length matches the number of rows in raw_data. Can you post some data that illustrates your problem?

You can also accomplish the same thing with a left_join. It requires less code and I expect it will be faster.

raw_data <- data.frame(PHDuniv=c("A",NA,"C","E"))
IntdUniGroup <- data.frame(PHDuniv=c("D","C","B","A"),
                           Group=c("Y","Y","Z","Z"))

intd_uni_group <- c() #creates an empty vector

for(i in 1:nrow(raw_data)){
  intd_uni_group <- c(intd_uni_group, 
                      if(is.na(raw_data$PHDuniv[i])){
                        NA
                      } else if(tolower(raw_data$PHDuniv[i]) %in% tolower(IntdUniGroup$PHDuniv)){
                        IntdUniGroup$Group[which(tolower(raw_data$PHDuniv[i]) == tolower(IntdUniGroup$PHDuniv))]
                      } else{
                        NA
                      }
  )
}
intd_uni_group
#> [1] "Z" NA  "Y" NA

#Using a left_join
library(dplyr)

NewDF <- left_join(raw_data, IntdUniGroup, by = "PHDuniv")
NewDF$Group
#> [1] "Z" NA  "Y" NA

Created on 2022-09-05 with reprex v2.0.2

@FJCC, thanks for the response. I should also have mentioned that I had previously run this code on about three separate datasets, on separate occasions, without encountering this issue.
I also tried the left join, and the resulting dataset has the same issue.
I then checked which values differed between the raw data and the result of the left join, and it appears I've found the problem (I have you to thank for that).
One of the values gets duplicated during the lookup because the source lookup data contains duplicate values. I had checked it for duplicates, but they weren't flagged because of trailing spaces in the duplicated values. It's sorted now. Thanks again.
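
For anyone who runs into the same symptom, here is a minimal sketch of how a hidden duplicate in the lookup table can be found and removed. The toy data is invented and reuses the column names from the example above; the trailing space in Group is only an assumption about how such a duplicate might escape a duplicate check, since the real data was not posted. The helper names (key, keep, lookup_key, lookup_group, raw_key) are made up for illustration.

```r
# Invented toy data: the key "A" appears twice, but the two rows differ
# by a trailing space in Group ("Z" vs "Z "), so they are not exact copies.
raw_data <- data.frame(PHDuniv = c("A", NA, "C", "E"))
IntdUniGroup <- data.frame(PHDuniv = c("D", "C", "B", "A", "A"),
                           Group   = c("Y", "Y", "Z", "Z", "Z "))

# A whole-row duplicate check misses the repeated key
anyDuplicated(IntdUniGroup)
#> [1] 0

# Checking the normalised key (trimmed, lower case) on its own reveals it
key <- tolower(trimws(IntdUniGroup$PHDuniv))
anyDuplicated(key)
#> [1] 5

# Keep the first row for each key, then do a vectorised lookup with match(),
# which returns at most one position per input value
keep         <- !duplicated(key)
lookup_key   <- key[keep]
lookup_group <- trimws(IntdUniGroup$Group[keep])

raw_key        <- tolower(trimws(raw_data$PHDuniv))
intd_uni_group <- lookup_group[match(raw_key, lookup_key)]

intd_uni_group
#> [1] "Z" NA  "Y" NA
length(intd_uni_group) == nrow(raw_data)
#> [1] TRUE
```

Because match() returns exactly one position (or NA) per input value, the result always has the same length as the input, even if a duplicate key slips through. Keeping the first row per key is a design choice for this sketch; if duplicated keys can carry different groups, they should be inspected before any rows are dropped.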
