Finding variable types

How do I find the type of each variable in a dataset?

Call str() on the dataset.

So that didn't work. I mean the variable type as in ordinal, non-ordinal, discrete, continuous.

Use str() to determine if you have numbers or factors. But R doesn't know if a number is an infinite decimal, which is what I think you mean by continuous, because it stores every number with a finite number of significant digits.

Okay, so I am working with the titanic_train dataset from the titanic package. I am asked to designate the variable type of each variable in the data table, and I am unable to decide for each of them, as there is no clear distinction.

I think str() should be fine to identify numeric and character variables. Perhaps some variables should be converted to factors.

str(your_dataset) will report to the console what the variables are but the str() function has a NULL return value.

If you want a vector which contains the class of each variable, you can use:

vapply(your_dataset, class, character(1))

Alternatively, you could use:

sapply(your_dataset, class)

But vapply() should almost always be considered preferable when the output of each function call shares a fixed structure.
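As a small illustration of the difference between the two approaches, here is a sketch using a made-up data frame (toy and its columns are invented for this example and are not taken from titanic_train):

```r
# Hypothetical data frame, invented for illustration only
toy <- data.frame(
  id    = 1:3,
  score = c(2.5, 3.1, 4.0),
  name  = c("a", "b", "c"),
  stringsAsFactors = FALSE
)

# str() prints a summary to the console and returns NULL invisibly,
# so there is nothing useful to capture from it
res <- str(toy)
is.null(res)

# vapply() returns a named character vector, one class per column
classes <- vapply(toy, class, character(1))
classes
```

One caveat: with character(1) as the template, vapply() stops with an error if any column has more than one class (an ordered factor, for example), whereas sapply() would quietly return a list in that case.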

No, no. You see, I know the values that str() provides, but I need their classification as in ordinal, non-ordinal, discrete, continuous. So is there a function in R that can help me with that?

I don't think so. I think it is up to you to look at the definitions of the variables and decide what kind of data they represent. For example, Survived needs to be converted to a factor, which is not ordinal. Age and Fare should be treated as numeric. R does not know these things.
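A minimal sketch of recording that kind of decision in R is below. The passengers data frame is a hypothetical stand-in with invented values, and treating passenger class as ordinal is an assumption made for the example, not something R or the dataset dictates:

```r
# Hypothetical stand-in for a few titanic_train columns (values invented)
passengers <- data.frame(
  Survived = c(0L, 1L, 1L, 0L),
  Pclass   = c(3L, 1L, 2L, 3L),
  Age      = c(22, 38, 26, 35)
)

# Nominal: the 0/1 codes are just labels, so use an unordered factor
passengers$Survived <- factor(passengers$Survived,
                              levels = c(0, 1),
                              labels = c("no", "yes"))

# Ordinal (an assumption about how you read passenger class):
# 1 < 2 < 3, so use an ordered factor
passengers$Pclass <- factor(passengers$Pclass,
                            levels = c(1, 2, 3),
                            ordered = TRUE)

# str() now reflects the choices made above
str(passengers)
```

After the conversion, is.ordered() tells the two kinds of factor apart, which is as close as R gets to knowing ordinal from non-ordinal.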

There is no function which can do that because it's not a solvable problem.

Take even just the difference between discrete and continuous variables...

Maybe your data has heights, which are measured in inches. Is this variable discrete or continuous? Well, it should be continuous since, as a physical measurement, height could theoretically be measured to arbitrary precision (within the bounds of physics). But... if the data was only measured to the nearest inch, and is stored in the data.frame as an integer value, the data is discrete (there are no fractional inches).
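To make the height example concrete, here is a short sketch with invented measurements. The same physical quantity ends up stored two different ways, and R only ever sees the storage:

```r
# Invented heights in inches, recorded with a fractional part
height_measured <- c(64.2, 70.8, 68.5)

# The same heights recorded to the nearest inch and stored as integers
height_rounded <- as.integer(round(height_measured))

typeof(height_measured)  # "double"
typeof(height_rounded)   # "integer"
```

Neither result says anything about whether height itself is continuous; it only says how these particular values were recorded.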

So, to answer this kind of question you need to have an understanding of what the data is measuring (as well as what you intend to do with it).

An example which I usually present to my Intro Stats students is this...

What type of variable is grade in school?

Well... It could be numeric. For instance, you might want to perform a regression on vocabulary size as a function of number of completed years of formal education.

Or... It could be categorical. For instance, if you wanted to analyze student views on a proposed school policy by freshmen, sophomores, juniors, and seniors.
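Both readings of grade can be sketched in R. The students data frame and its values are invented for illustration, and mapping grades 9 to 12 onto freshman through senior is an assumption of the example:

```r
# Invented data: grade in school and vocabulary size
students <- data.frame(
  grade = c(9, 10, 11, 12, 9, 12),
  vocab = c(8000, 9500, 11000, 12500, 8200, 12000)
)

# Numeric view: a single slope, vocabulary as a linear function of grade
fit_numeric <- lm(vocab ~ grade, data = students)
coef(fit_numeric)

# Categorical view: one group per class year, no arithmetic on grade
students$class_year <- factor(students$grade,
                              levels = 9:12,
                              labels = c("freshman", "sophomore",
                                         "junior", "senior"))
table(students$class_year)
```

The underlying column is the same in both cases; only the intended analysis decides which representation is appropriate.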

The point is, classifying variables in this way is not always so clear-cut, and it's this ambiguity which makes it impossible for software to answer accurately.

If you're hell-bent on it, you could always establish some assumptions and write your own function to do it.

For example, if a variable consists of only integer-valued elements in a regular sequence, e.g.

all(sort(df$x) == seq_along(df$x))

You might be comfortable classifying that as an ordinal variable.
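Along the same lines, here is a rough sketch of what such a home-made function might look like. The name guess_type and the rules inside it are assumptions invented for this example; it encodes one possible set of conventions and will misclassify plenty of real variables (a 0/1 survival code comes out as "discrete", for instance):

```r
# Rough heuristic only; the rules below are assumptions, not a standard
guess_type <- function(x) {
  if (is.ordered(x)) return("ordinal")
  if (is.factor(x) || is.character(x) || is.logical(x)) return("nominal")
  if (is.numeric(x)) {
    vals <- x[!is.na(x)]
    # Whole-number values are treated as discrete, anything else as continuous
    if (all(vals == round(vals))) return("discrete")
    return("continuous")
  }
  "unknown"
}

# Invented data frame to try it on
toy <- data.frame(
  survived = c(0L, 1L, 1L),
  age      = c(22.5, 38, 26),
  sex      = c("male", "female", "female")
)

guesses <- vapply(toy, guess_type, character(1))
guesses
```

Treat the output as a starting point for your own judgment about each variable, not as an answer.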

Thank you, that was sooo helpful. Your students are really lucky to have you!!
