Sunday, January 29, 2017

Dataframes vs matrices, and ggplot

We learned about the differences between matrices and data frames in week 3. I was already familiar with data frames from my work, and did not realize the two were different. I am so used to just importing csv files made in Microsoft Excel that I never stopped to think about the inner workings of how R processes these things.
Our assignment this week was to enter a small data set of polling results from two news agencies for seven of the major US presidential candidates from last year, then convert it from a matrix to a data frame to generate some numbers and graphs.
name <- c("Jeb", "Donald", "Ted", "Marco", "Carly", "Hillary", "Bernie")
abc <- c(4,62,51,21,2,14,15)
nbc <- c(12,75,43,19,1,21,19)


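I used the cbind function to combine the two result columns into a matrix, with the candidate names as row names. A matrix can hold only one data type, so binding the name column in as well would have coerced every number to a character string.
poll.matrix <- cbind(abc, nbc)
rownames(poll.matrix) <- name
I used the data.frame function to change it from a matrix to a data frame, with the names as their own column:
poll.df <- data.frame(name, poll.matrix)
We can then run summary statistics by calling the apply function on the numeric columns:
apply(poll.df[,2:3], 2, mean)
     abc      nbc
24.14286 27.14286
apply(poll.matrix, 2, sd)
     abc      nbc
23.26248 24.57932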
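
The practical difference between the two structures shows up as soon as text and numbers are mixed. The following is a minimal sketch using the same polling vectors; the object names mixed.matrix and mixed.df are made up for the illustration.

```r
# Same candidate names and ABC results as entered above
name <- c("Jeb", "Donald", "Ted", "Marco", "Carly", "Hillary", "Bernie")
abc  <- c(4, 62, 51, 21, 2, 14, 15)

# A matrix holds a single type: mixing text and numbers
# coerces everything to character
mixed.matrix <- cbind(name, abc)      # hypothetical name
class(mixed.matrix[, "abc"])          # "character"

# A data frame stores each column separately, so abc stays numeric
mixed.df <- data.frame(name, abc)     # hypothetical name
class(mixed.df$abc)                   # "numeric"

# The mean only works on the numeric column of the data frame;
# mean(mixed.matrix[, "abc"]) would return NA with a warning
mean(mixed.df$abc)
```

This is why a matrix is fine for purely numeric results, while a data frame is the safer choice once a label column is involved.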
I used the supplied ggplot function but had some errors. I played around with it on my own to generate this.

ggplot2 has a lot more functionality than the built-in plot and barplot functions, which I am used to. Those two work well for my purposes in biology, since graphs there are often very barebones. I am excited to learn more about data visualization.
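
For reference, here is a minimal sketch of one way to draw a grouped bar chart of these polling numbers with ggplot2. It is an illustration only, not the supplied course code or the exact call behind the plot in this post. The long-format object poll.long and its column names are assumptions made for the example, and the block skips plotting if ggplot2 is not installed.

```r
name <- c("Jeb", "Donald", "Ted", "Marco", "Carly", "Hillary", "Bernie")
abc  <- c(4, 62, 51, 21, 2, 14, 15)
nbc  <- c(12, 75, 43, 19, 1, 21, 19)

# ggplot2 works best with one row per observation, so stack the
# two polls into "long" form (poll.long is a hypothetical name)
poll.long <- data.frame(
  name   = rep(name, times = 2),
  source = rep(c("abc", "nbc"), each = length(name)),
  result = c(abc, nbc)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # One bar per candidate and poll, placed side by side
  p <- ggplot(poll.long, aes(x = name, y = result, fill = source)) +
    geom_col(position = "dodge") +
    labs(x = "Candidate", y = "Poll result", fill = "Poll")
  print(p)
}
```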

Wednesday, January 18, 2017

Objects, functions, and vectors

The second week of the R programming course introduced some of the basic workings of the R language, such as data types, and importing data into R. As it is a program designed for mathematical purposes, the manipulation of vectors and use of functions are vital. 

I first imported the dataset for the example. I have used read.csv often for csv files on my computer, but had not done it from a URL. It was easy enough (pretty much the same).

>acs <- read.csv(url("http://stat511.cwick.co.nz/homeworks/acs_or.csv")) 

The dataset was saved to an OBJECT called "acs". An object can be anything defined by the user. It can be a large dataset such as this with multiple variables, a string of dates, a series of names, etc. Within acs are a series of variables, such as "age_husband", "age_wife", "income_husband", etc. These are all strings of values. In R, we call these VECTORS. 

I always like to make sure the data imports correctly, and for this I use the "head()" function. I like it because it shows only the top few rows, which is good for large datasets like this. Simply calling the object would print too many rows, and the top would scroll out of view.


>head(acs) 

FUNCTIONS tell R how to process data. We can write a simple function which tells R to take the square of some values. 

>squares <- function(x){
+ x * x
+}

We can then apply this to a single number

>squares(4)
[1] 16

or a VECTOR of numbers stored in the object "a"

> a <- c(1,3,6)

> squares(a)

[1]  1  9 36

This is a very basic function; functions run the gamut from simple to complicated. Most of the simple ones are built into the R language already. This could have been accomplished by telling R to "^2" any vector/number.
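
As a quick check of that last point, this short sketch compares the hand-written function with the built-in operator on the same vector:

```r
squares <- function(x) {
  x * x
}

a <- c(1, 3, 6)

a^2                          # built-in operator, applied element by element
all.equal(squares(a), a^2)   # TRUE: both give 1, 9, 36
```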


Anyway, back to our example:

"View(acs)" was presented in the module. This was my first exposure to the function, and I think I like it a bit better than "head" since I can see everything.

We then learned about viewing individual columns.

"acs$age_husband" shows the column named "age_husband". It's good for scanning certain data to find outliers (though it is best to plot histograms, see below), etc.

Using brackets allows the user to pull out a specific value from a matrix or data frame. The first value [r,] tells R which row and the second [,c] tells R which column. "acs[1,3]" pulls the value in the first row of the third column.
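
A small sketch of the bracket notation follows. The data frame toy is a hypothetical stand-in for acs with made-up values, so it runs without downloading anything.

```r
# Hypothetical stand-in for acs; the values are made up
toy <- data.frame(
  age_husband = c(52, 34, 61),
  age_wife    = c(50, 36, 58),
  bedrooms    = c(3, 2, 4)
)

toy[1, 3]           # row 1, column 3: the single value 3
toy[2, ]            # all of row 2, returned as a one-row data frame
toy[, 1]            # all of column 1, returned as a vector
toy[, "age_wife"]   # columns can also be picked by name
```

Leaving the row or column position empty means "everything along that dimension".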

It is also possible to query data which meet certain criteria using the subset function.
"a <- subset(acs , age_husband > age_wife)" creates a new dataset named "a" with only those rows in which the age of the husband is greater than that of the wife. This is good for scenarios where we already know one factor influences another.

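To illustrate the subset call, here is a minimal sketch on a hypothetical stand-in for acs (the toy data frame and its values are made up). It also shows the equivalent bracket form with a logical condition.

```r
# Hypothetical stand-in for acs; the values are made up
toy <- data.frame(
  age_husband = c(52, 34, 61, 45),
  age_wife    = c(50, 36, 58, 45)
)

# Keep only the rows where the husband is older
older <- subset(toy, age_husband > age_wife)

# The same selection written with brackets and a logical condition
same <- toy[toy$age_husband > toy$age_wife, ]

nrow(older)   # 2 rows remain (rows 1 and 3 of toy)
```

With subset the column names can be used directly, while the bracket form needs the "toy$" prefix each time.
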
Rudimentary statistical functions are built into R, similar to Excel, which most students are probably familiar with.

>mean(acs$age_husband)
>median(acs$age_husband)
>quantile(acs$age_wife)
>var(acs$age_wife)
>summary(acs)
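
To see what each of these returns, here is a small sketch on a made-up vector (x is a hypothetical example, not a column of acs), chosen so the results are easy to check by hand:

```r
x <- c(2, 4, 4, 6, 9)   # made-up values

mean(x)       # 5
median(x)     # 4
var(x)        # 7 (sample variance, dividing by n - 1)
quantile(x)   # 0%: 2, 25%: 4, 50%: 4, 75%: 6, 100%: 9
summary(x)    # minimum, quartiles, mean, and maximum in one call
```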


And of course, no statistical software would be complete without plotting. Basic plotting in R is very simple, just call the "plot" function. 

>plot(x = acs$age_husband , y = acs$age_wife, type = 'p')

The plot above specifies the x and y axes, and type 'p' is for points. This plot is bare bones; we could add more parameters to show color gradients for other variables, etc. Plots can get quite busy if one chooses.

We can also make histograms, with this intuitively named function:


>hist(acs$number_children) 

Histograms are useful for finding underlying distributions, which helps in selecting an appropriate statistical test.

We can also count how often each value occurs using "table"

>counts <- table(acs$bedrooms)
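
A short sketch of what table returns, using a made-up vector in place of acs$bedrooms (the bedrooms vector here is hypothetical):

```r
bedrooms <- c(2, 3, 3, 4, 3, 2)   # made-up values

counts <- table(bedrooms)
counts             # one count per distinct value
names(counts)      # the distinct values, as text: "2" "3" "4"
counts[["3"]]      # 3 homes have three bedrooms
```

The result is a named set of counts, which is the form barplot expects.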


And finally, barplot generation has a slightly different code, just slap a "bar" in front of the aforementioned "plot"

>barplot(counts, main="Bedrooms Distribution", xlab="Number of Bedrooms") 


We could even run this as one line by putting the table call inside the barplot call.

>barplot(table(acs$bedrooms), main="Bedrooms Distribution", xlab="Number of Bedrooms") 

This walkthrough was luckily all review for me. However, I was unaware of the formal names of things like objects and vectors. I am glad to be receiving some formal training in the program.

Monday, January 9, 2017

R course : Week 1

Blog is all set up. I am not new to R but certainly in need of a refresher and some formal training.