It looks like all the visible variables in that screenshot are factors with the same number of levels as rows. How about the other 39 variables (`str()` output could be useful here)? I can't imagine a factor with one level per row making much sense for any of these variables.
Usually the case where I see this happen is when someone imports data using `read.csv()` (or another `read.table()` variant) without setting the `stringsAsFactors` parameter to `FALSE`. This causes most anything that isn't easily parseable as a number to be imported as text, and then immediately converted to a factor. But in those cases, you usually only get the same number of levels as rows if there were no repeated values, since the default factor levels will be `sort(unique(x))` (where `x` is the vector of values).
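To make that concrete, here is a small self-contained illustration (the data are made up; note that since R 4.0.0 the default for `stringsAsFactors` is `FALSE`, so it is set explicitly here to reproduce the older behavior):

```r
# With stringsAsFactors = TRUE (the pre-4.0.0 default), every non-numeric
# column becomes a factor on import.
csv_text <- "Brand,Price\nAcme,10\nZenith,12\nOrion,9\n"
df <- read.csv(text = csv_text, stringsAsFactors = TRUE)

str(df$Brand)      # Factor w/ 3 levels: one per row, because no value repeats
nlevels(df$Brand)  # equals nrow(df) only when every value is unique
```

If `Brand` had repeated values, `nlevels()` would be smaller than `nrow()`, which is why one-level-per-row across many columns is suspicious.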
What's really weird here is that according to the screenshot, `Brand` has identical values in the first three rows, but levels are supposed to be unique. You can force R to make duplicate levels (you will get a warning), but it shouldn't happen under normal circumstances. That makes me suspect that something beyond just `stringsAsFactors` happened here, so that either there are duplicate levels (bad!) or there are a bunch of unused levels (also probably bad, or at least unintended).
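The "unused levels" case is easy to reproduce (values here are made up): subsetting a factor keeps the full level set even when some values no longer appear in the data.

```r
# A factor whose levels are the sorted unique values of the data.
f <- factor(c("Acme", "Acme", "Zenith"))

# Subsetting keeps ALL original levels, including ones no longer present.
sub <- f[f == "Acme"]
levels(sub)             # still "Acme" and "Zenith": "Zenith" is unused

# droplevels() removes the levels that are not in use.
levels(droplevels(sub)) # just "Acme"
```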
What do you get when you run the following?
```r
## (replace df with the name of your data frame)
## These should be the same! How different are they?
nlevels(df$Brand)                       # All the levels
length(unique(levels(df$Brand)))        # Just the unique levels
nlevels(droplevels(df$Brand))           # Just the levels that are in use
length(unique(as.character(df$Brand)))  # Just the unique levels that are in use

## Preview the result of re-creating a factor out of Brand with default levels
str(factor(as.character(df$Brand)))
```
All that said, it's possible that these weirdly formatted data have nothing to do with your slowdown (though they do seem problematic for analysis). Before you burn too much time in this direction, have you tried profiling your slow code? For some sensible and accessible advice on the subject, see:
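As a starting point, a minimal sketch of base-R profiling (here `slow_step()` is just a stand-in for whatever part of your code is slow):

```r
# Stand-in workload: a deliberately slow R-level loop.
slow_step <- function() {
  out <- 0
  for (i in seq_len(5e5)) out <- out + sqrt(i)
  out
}

# Coarse timing: CPU vs. elapsed time for one call.
system.time(slow_step())

# Sampling profiler: records the call stack at regular intervals.
prof_file <- tempfile()
Rprof(prof_file)   # start profiling
slow_step()
Rprof(NULL)        # stop profiling
# summaryRprof(prof_file)$by.self  # per-function breakdown (needs >= 1 sample)
```

`summaryRprof()` then tells you which functions actually consumed the time, which is far more reliable than guessing from the data's appearance.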