An acquaintance of mine is teaching an introductory R session in a couple of weeks, and to give them some ideas, I wrote out my standard patter as I read the data in with read.csv. In case it helps, I'll post it here too:
Having set R to pay attention to the correct folder, I read in the data file from the folder using the function read.csv, open parentheses, and then the name of the file as text so it’s in quote marks, then close parentheses.
If you want a metaphor for what's going on, think of it as a kitchen sausage machine (or spice grinder). The parentheses are the funnel where the ingredients go in; with different ingredients you'll still get a sausage, but of a different flavour. The sausage machine, read.csv, grinds up the ingredients and out comes the finished sausage.
If I run the read.csv command as it is, it just spills all over the console. If you think about it, the output is just pouring out of the side of the sausage machine and over the kitchen table. We want to put the output of read.csv somewhere, so we build an arrow with a less than sign and a hyphen (<-), pointing at a named box that catches the output. The name could be anything, and when we want to refer to the data in the future we use that name.
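As a rough sketch of the difference between letting the output spill and catching it in a box: the file contents, the temporary file, and the box name mydata below are all made up for illustration, not taken from the session's real data.

```r
# Hypothetical example data, written to a temporary file so the sketch runs anywhere
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age,group",
             "Ana,34,a",
             "Ben,29,b",
             "Cat,41,a"), csv_path)

# Run on its own at the console, the result is printed and then lost
read.csv(csv_path)

# With the arrow, the result is caught in a named box instead
mydata <- read.csv(csv_path)

# Using the name later refers to the stored data
nrow(mydata)
```

In a real session the file name would normally be typed directly as quoted text, for example read.csv("myfile.csv"), rather than held in a variable like csv_path.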
Now that I have read in the data, it appears over in the environment tab. But if I want to check it out, one way of investigating it is having a look at some basic summaries of each column. To do this I can run the summary command, feeding in the named box where I've stored the data. The box of data isn’t a piece of text, so I don't need to put it in double quotes. As I am not storing the answer, it just spills over the console, but that is fine with me in this case.
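A minimal sketch of that summary step, again using a made-up file and the hypothetical box name mydata:

```r
# Hypothetical example data in a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age,group",
             "Ana,34,a",
             "Ben,29,b",
             "Cat,41,a"), csv_path)
mydata <- read.csv(csv_path)

# mydata is the name of a box, not a piece of text, so no quote marks
summary(mydata)

# The result could be stored in its own box if it were needed later
data_summary <- summary(mydata)
```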
But summary is not actually the most useful command for finding out about your data. The most useful command is str, short for structure, because it tells you what kind of information is in each column. You can only answer questions if it is the right kind of data: for example, you can only do maths on numbers. And because a column can only contain one kind of thing, if you have a mix of numbers and text in a column, it will be declared a column of text.
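Here is a small sketch of str and of the mixed-column rule. The data are invented: the score column deliberately contains the word "unknown" among its numbers, so the whole column is read as text.

```r
# Hypothetical example data; the score column mixes numbers and text
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age,score",
             "Ana,34,7.5",
             "Ben,29,unknown",
             "Cat,41,9"), csv_path)
mydata <- read.csv(csv_path)

# str lists each column along with the kind of thing it holds
str(mydata)

# age is all numbers, so maths works on it
mean(mydata$age)

# score is not a column of numbers, because of the one text entry
is.numeric(mydata$score)
```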
In the example data, I'm seeing numbers and I'm seeing a thing called a factor. In R, there are two kinds of text. Factors are organised text, like formally declared group labels. These are easy to analyse and graph. The alternative is disorganised text, called characters, which is not as easy to graph. But characters, being disorganised, are much easier to change than very formal categories. So if your data needs fixing before analysing, it is much easier if it starts as characters. We could change the kind of column after the fact, but we can also go back to where we read in the data, and use a slightly different set of ingredients (you can see what the possible ingredients are by checking the help for a function). In this case, I am adding a setting, stringsAsFactors = FALSE, to stop it turning characters into factors.
When I re-run the line reading in the data, not much obviously changes, but when I repeat the str step the entries that were factors are now characters.
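A sketch of the two sets of ingredients side by side, on made-up data. One thing to be aware of: from R 4.0.0 onwards, read.csv already defaults to stringsAsFactors = FALSE, so the sketch sets TRUE explicitly to reproduce the factor behaviour described here. The box names as_factors, as_text and converted are hypothetical.

```r
# Hypothetical example data in a temporary file
csv_path <- tempfile(fileext = ".csv")
writeLines(c("name,age,group",
             "Ana,34,a",
             "Ben,29,b",
             "Cat,41,a"), csv_path)

# Text columns read in as factors (the default before R 4.0.0)
as_factors <- read.csv(csv_path, stringsAsFactors = TRUE)
str(as_factors)

# The same file with the extra ingredient: text columns stay as characters
as_text <- read.csv(csv_path, stringsAsFactors = FALSE)
str(as_text)

# The after-the-fact alternative: convert a factor column to characters
converted <- as.character(as_factors$group)
```

You can see the full list of ingredients read.csv accepts by running ?read.csv.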
For those used to working in Excel, this highlights a couple of things:
One is that by keeping my instructions in a script, I can go back to an earlier step and repeat things from there. So it doesn't matter if you've mucked things up early on, because you can fix it, and then just repeat all the work in a matter of moments.
You can also imagine that if I were getting the same reports on a regular basis, with updated data organised in the same way each time, I could replace my original data file with the latest one and repeat all my commands without doing any more work.
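One way to picture that, as a sketch only: the function run_report, the sales column and both monthly files below are invented for illustration. Wrapping the commands in a function is a slight variation on swapping the file and re-running the script, but the idea is the same: the commands do not change, only the data does.

```r
# Hypothetical helper standing in for "all my commands"
run_report <- function(path) {
  report_data <- read.csv(path, stringsAsFactors = FALSE)
  mean(report_data$sales)
}

# Two made-up reports, organised in the same way
january <- tempfile(fileext = ".csv")
writeLines(c("region,sales", "north,10", "south,20"), january)

february <- tempfile(fileext = ".csv")
writeLines(c("region,sales", "north,30", "south,50"), february)

# The same commands, repeated on each file with no extra work
run_report(january)
run_report(february)
```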