Our assignment this week was to find a bug in a function with deliberate bugs.
The function was
tukey_multiple <- function(x) {
outliers <- array(TRUE,dim=dim(x))
for (j in 1:ncol(x))
{
outliers[,j] <- outliers[,j] && tukey.outlier(x[,j])
}
outlier.vec <- vector(length=nrow(x))
for (i in 1:nrow(x)) { outlier.vec[i] <- all(outliers[i,]) } return(outlier.vec) }
Running it as written reports a syntax error:
Error: unexpected symbol in:
"for (i in 1:nrow(x))
{ outlier.vec[i] <- all(outliers[i,]) } return"
I restructured the last few lines so they more closely matched the top:
tukey_multiple <- function(x) {
  outliers <- array(TRUE, dim = dim(x))
  for (j in 1:ncol(x))
  {
    outliers[, j] <- outliers[, j] && tukey.outlier(x[, j])
  }
  outlier.vec <- vector(length = nrow(x))
  for (i in 1:nrow(x))
  {
    outlier.vec[i] <- all(outliers[i, ])
  }
  return(outlier.vec)
}
The function was then created without errors. However, applying it to a matrix failed:
tukey_multiple(matrix( rnorm(5*5,mean=0,sd=1), 5, 5))
Error: could not find function "tukey.outlier"
tukey.outlier is not defined in my R session. I'm not sure whether this was deliberate or whether it is part of a package I don't have. A Google search for "tukey.outlier R" didn't reveal anything useful, so I proceeded to try some of R's debugging tools.
> debug(tukey_multiple)
> tukey_multiple(matrix( rnorm(5*5,mean=0,sd=1), 5, 5))
debugging in: tukey_multiple(matrix(rnorm(5 * 5, mean = 0, sd = 1), 5, 5))
debug at #1: {
outliers <- array(TRUE, dim = dim(x))
for (j in 1:ncol(x)) {
outliers[, j] <- outliers[, j] && tukey.outlier(x[, j])
}
outlier.vec <- vector(length = nrow(x))
for (i in 1:nrow(x)) {
outlier.vec[i] <- all(outliers[i, ])
}
return(outlier.vec)
}
Browse[2]>
I'm not sure why mine says Browse[2] while the lecture shows Browse[1], and I'm not totally certain how to proceed from here. However, it seems that if the missing function were supplied, the function should run.
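For anyone following along, here is a minimal sketch of how the function could be made to run. The lecture's tukey.outlier() was not available to me, so the version below is a hypothetical stand-in that flags values more than 1.5 IQRs outside the quartiles; the real one may well differ. The sketch also swaps && for the element-wise &, since && is meant for single TRUE/FALSE values and does not combine a whole column. The test matrix is made up so that row 3 is extreme in every column.

```r
# Hypothetical stand-in for the missing tukey.outlier(); the course version may differ.
# Flags values more than 1.5 * IQR below the first or above the third quartile.
tukey.outlier <- function(v) {
  q <- unname(quantile(v, probs = c(0.25, 0.75), na.rm = TRUE))
  fence <- 1.5 * (q[2] - q[1])
  v < q[1] - fence | v > q[2] + fence
}

tukey_multiple <- function(x) {
  outliers <- array(TRUE, dim = dim(x))
  for (j in seq_len(ncol(x))) {
    # & works element by element; && expects single logical values
    outliers[, j] <- outliers[, j] & tukey.outlier(x[, j])
  }
  outlier.vec <- vector(length = nrow(x))
  for (i in seq_len(nrow(x))) {
    # a row counts only if it is an outlier in every column
    outlier.vec[i] <- all(outliers[i, ])
  }
  return(outlier.vec)
}

# Made-up test matrix: every column is 1, 2, 100, 4, 5
m <- matrix(rep(1:5, 5), 5, 5)
m[3, ] <- 100
tukey_multiple(m)
# [1] FALSE FALSE  TRUE FALSE FALSE
```

Inside the debugger, typing n steps to the next line, c continues to the end, and Q quits, which is a reasonable way to move on from the Browse prompt.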
Wednesday, March 22, 2017
Friday, March 10, 2017
Visualization and graphics
This is a topic with which I have struggled in the past. I can do the basics, but I know there is a lot more to it that I have yet to unlock. It seems many in biology who favor R like it for its statistical capabilities, but not so much for the graphical side. I hope not to fall into that rut.
For these examples I used the "biomass" dataset on https://vincentarelbundock.github.io/Rdatasets/datasets.html
I plotted the same two variables, "Tas" and "year", with each plotting system so I could really explore the basic functions of each. The code is on GitHub: https://github.com/jcrichardson617/R_class/blob/master/module9
Using R's built in plot() function, the graph looks fairly simple.
Using the lattice package:
And ggplot2:
Keeping each plot very basic like this posed little difficulty, and there is of course room for improvement on each. As the instructor of the course is big into visualization and talks about ggplot a lot, I am inclined to lean that way. Even from these basic plots, though, it seems like ggplot takes more code than lattice and plot, so it probably has a steeper learning curve.
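To make the comparison concrete, here is a rough sketch of the same scatterplot in all three systems. The data frame is a made-up stand-in (the real "biomass" values are not reproduced here), and putting "year" on the x axis and "Tas" on the y axis is my assumption. The lattice and ggplot2 parts only run if those packages are installed.

```r
# Hypothetical stand-in data; the real "biomass" dataset has its own values
set.seed(42)
df <- data.frame(year = 1990:2009)
df$Tas <- 10 + 0.05 * (df$year - 1990) + rnorm(nrow(df), sd = 0.3)

# Base graphics: one call draws straight to the current device
plot(df$year, df$Tas, xlab = "year", ylab = "Tas")

# lattice: formula interface; the plot is an object that is drawn when printed
if (requireNamespace("lattice", quietly = TRUE)) {
  p_lattice <- lattice::xyplot(Tas ~ year, data = df)
  print(p_lattice)
}

# ggplot2: data, then an aesthetic mapping, then one or more layers
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p_gg <- ggplot2::ggplot(df, ggplot2::aes(x = year, y = Tas)) +
    ggplot2::geom_point()
  print(p_gg)
}
```

The extra typing in ggplot2 comes from spelling out the mapping and the layer separately, which is also what makes it easy to add more layers later.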
Monday, February 27, 2017
Inputs and outputs
Inputting data might be one of the most important things to do in R. Without it, no analyses or visualizations can be performed. Yet many introductory tutorials online seem to skip this step in favor of randomly generated data.
Here is my full code for this week's assignment; my thoughts follow it.
install.packages("plyr")
library("plyr", lib.loc="C:/Program Files/R/R-3.2.3/library")
x <- read.table(file.choose(),header=TRUE,sep=",") #AWESOME!
x #make sure it works
y <- ddply(x, "Sex", transform, Grade.Average = mean(Grade))
y #check it
write.table(y,"Sorted_Average") #where does it put the file? Documents
write.table(y,"Sorted_Average", sep=",")
newx <- subset(x,grepl("[iI]",x$Name))
write.table(newx,"DataSubset",sep=",")
I first installed the package plyr, from which we'll be using some functions.
The read.table and file.choose combination was pretty awesome. I usually set a directory (via the button in RStudio) and then do read.csv or read.table. This code eliminates a step, which is pretty cool.
I found the ddply function interesting. I liked grouping the data by a certain factor, but I'm not sure how I feel about the average for that factor getting added to every line; that felt a little bit "busy". I was confused at first about how it computed an average for a single value, then I realized the value is the same for every row within male and within female.
I wasn't sure at first if my write.table worked. I didn't get an error, but nothing seemed to happen. Then I noticed a file with the name Sorted_Average in the Files tab, and had to track down where it went by searching for the name in my computer's directory. write.table puts a bare file name in the current working directory (which getwd() reports), so maybe "change directory" still has some use for this purpose.
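As a small sketch of the same steps without the file chooser, the block below uses a made-up data frame with the same column names as my file (Name, Sex, Grade). It swaps ddply for base R's ave(), which also repeats the group mean on every row but keeps the original row order, and it writes to an explicit path so the output is easy to find. The names, grades, and the file name with a .csv extension are all hypothetical.

```r
# Hypothetical data with the same column names as the assignment file
x <- data.frame(
  Name  = c("Ana", "Ben", "Cid", "Dov", "Eli"),
  Sex   = c("F", "M", "M", "M", "F"),
  Grade = c(90, 80, 70, 60, 100)
)

# Base-R counterpart of ddply(x, "Sex", transform, Grade.Average = mean(Grade)):
# ave() returns the group mean repeated for every row of that group
x$Grade.Average <- ave(x$Grade, x$Sex, FUN = mean)
x

# Keep rows whose Name contains an "i" in either case
newx <- subset(x, grepl("[iI]", Name))

# An explicit path avoids hunting for the file afterwards
out_file <- file.path(tempdir(), "Sorted_Average.csv")
write.table(x, out_file, sep = ",", row.names = FALSE)

getwd()                # where a bare file name would have been written
file.exists(out_file)  # confirms the file was created
```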
Friday, February 24, 2017
S3 and S4 classes
Greetings world (or just Dr. Friedman),
In R class this week we learned some more about R objects and programming with objects. We took a look at S3 and S4 classes, which were new terms to me. S3 is informal and has been in R since the beginning of the R language. S3 is simpler and less rigorous, and so is "not as safe" as S4. Class reassignment is more difficult in S4 as well. Most of the material this week was new to me. Although the concepts are not trivial, I think I have a hold on them (mostly!). I have enjoyed learning about classes and generic functions because I feel that I better understand how R works behind the scenes. I learned from the lecture and the textbook how to use methods() and make my own generic function. However, I think I need a lot of practice on the latter. Hadley Wickham has an informative post in Advanced R, “OO R”.
I'll use a dataset on invasive iguanas to answer the question, which includes an arbitrarily assigned ID, collection date (M/D/Y format), sex, length, mass, and various other morphological features.
iguana <- read.csv("iguana.csv")
Using some functions, we can classify the data:
> isS4(iguana)
[1] FALSE
> mode(iguana)
[1] "list"
> typeof(iguana)
[1] "list"
> class(iguana)
[1] "data.frame"
> is.data.frame(iguana)
[1] TRUE
> is.matrix(iguana)
[1] FALSE
"iguana" is not an S4 object. mode() and typeof() return "list", whereas class() returns "data.frame". As it is a data frame, generic functions that have methods for this class will work on it.
I do not think I have seen an S4 object, as every dataset I have used is accessed with "$" rather than "@". I will explore further.
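To see the two systems side by side, here is a minimal sketch. The generic describe(), the S4 class "Lizard", and the small data frame are all made up for illustration; they are not part of the assignment or of the iguana dataset.

```r
library(methods)  # usually attached already; provides setClass() and new()

# Hypothetical stand-in for the iguana data
iguana_demo <- data.frame(id = 1:3, mass = c(1.2, 2.5, 0.9))

# S3: a generic dispatches on class(obj) through UseMethod()
describe <- function(obj, ...) UseMethod("describe")
describe.default <- function(obj, ...) "some other object"
describe.data.frame <- function(obj, ...) {
  sprintf("data frame with %d rows and %d columns", nrow(obj), ncol(obj))
}
describe(iguana_demo)  # "data frame with 3 rows and 2 columns"
describe(1:10)         # "some other object"

# S4: the class is declared up front and slots are reached with @
setClass("Lizard", representation(id = "numeric", mass = "numeric"))
liz <- new("Lizard", id = 1, mass = 1.2)
isS4(liz)          # TRUE
isS4(iguana_demo)  # FALSE
liz@mass           # 1.2
```

Trying new("Lizard", id = "one", mass = 1.2) fails because the slot type is checked, which is the kind of safety S3 does not provide.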
Monday, February 13, 2017
Matrices
This week went very smoothly, working with matrices and doing some simple algebra with them. These are all built-in functions in R, most of which I had seen. I made up my own matrix values, as those suggested on Canvas do not work (since 6 does not divide evenly into 100 or 1000).
> A <- matrix(runif(100,1, 400), nrow = 10, ncol = 10)
> B <- matrix(runif(2500, 1, 500), nrow = 50, ncol = 50)
Taking the transpose requires only one letter...
> t(A)
The output will not format well here, so I will skip pasting it in. Calling t(B) returns the transpose of matrix B.
Determinants are easy as well, following the same idea as the transpose:
> det(A)
[1] -8.837604e+23
> det(B)
[1] 8.667763e+139
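The inverse uses a rather oddly named function called "solve". It seems to do a lot: with a single matrix argument it returns the inverse, and with a second argument it solves a linear system. I will have to explore it more to get a better feel for it.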
> solve(A)
Again, this is a large output which will not appear well here.
Multiplying a matrix by a vector looks as simple as any other multiplication. First we need a vector with as many elements as A has columns:
> a <- seq(11, 20, 1)
> a*A
Note that * multiplies element by element; the matrix product proper is A %*% a.
> a*B
still works, even though "a" has 10 values and B has 50 columns. R automatically recycles the "a" vector to match the size of the "B" matrix.
That looks like all the calculation needed of us this week. Off to learn more about "solve"!
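Since the big random matrices do not print well, here is a small sketch of the same operations on a 2 x 2 matrix that can be checked by hand. The matrix and vector values are made up for illustration.

```r
A <- matrix(c(2, 1, 1, 3), nrow = 2)  # rows are (2, 1) and (1, 3)
a <- c(1, 2)

t(A)               # transpose (A is symmetric, so it looks the same)
det(A)             # determinant: 2*3 - 1*1 = 5
A_inv <- solve(A)  # inverse
A %*% A_inv        # identity matrix, up to rounding

a * A    # element-wise: 'a' is recycled down the columns, so row i is scaled by a[i]
A %*% a  # matrix-vector product: a 2 x 1 matrix holding 4 and 7

# solve() with a second argument solves A %*% x = b
b <- c(4, 7)
solve(A, b)  # recovers c(1, 2)
```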
Monday, February 6, 2017
Boxplots and Histograms
This was a lucky week for me in R class. I have been doing some analyses on my own work lately, and as such have been looking at a lot of histograms and boxplots. The assignment was to create exactly these for blood pressures and doctors' ratings of them.
Blood pressures were given as numbers, but ratings were presented as good/bad or low/high. I don't like working with words in R (it's terrifying), so I changed these to numbers instead. Highs and goods became 1's, while lows and bads became 0's.
Boxplots and histograms provide a simple but powerful way to explore data without delving into statistics too much. It is important to remember to perform these basic procedures even as our datasets become more advanced.
Figure 1. Frequency of hospital visitations for 10 patients.
Figure 2. Histogram of blood pressure for 10 patients.
Figure 3. Boxplot of blood pressures assessed by one doctor. 0 is "bad" and 1 is "good".
Figure 4. Boxplot of blood pressures assessed by a second doctor. 0 is "low" and 1 is "high".
Figure 5. Boxplot of blood pressures assessed by a third doctor. 0 is "low" and 1 is "high".
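For reference, the recoding and plotting steps can be sketched roughly as follows. The blood pressures and ratings here are made-up values, not the assignment's data, and the variable names are my own.

```r
# Hypothetical values, not the assignment's data
bp     <- c(110, 95, 140, 160, 120, 135, 100, 150, 125, 145)
rating <- c("good", "good", "bad", "bad", "good",
            "bad", "good", "bad", "good", "bad")

# Recode words to numbers: "good" becomes 1, "bad" becomes 0
rating_num <- ifelse(rating == "good", 1, 0)

hist(bp, main = "Blood pressure", xlab = "Blood pressure")
boxplot(bp ~ rating_num,
        xlab = "Rating (0 = bad, 1 = good)", ylab = "Blood pressure")

# Alternative: a factor keeps the labels, so no 0/1 key is needed in the caption
boxplot(bp ~ factor(rating), xlab = "Rating", ylab = "Blood pressure")
```

Words turn out to be less terrifying than I thought: boxplot() accepts a factor on the right-hand side of the formula and labels the boxes for you.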
Friday, February 3, 2017
Functions
Functions are something I love to hate. They can be extremely frustrating, but it's a good feeling when they work well. Our assignment this week was to write a simple function. I wrote one to assess the pass/fail rate on a test and return the percentage of passing students. Below, I present my walkthrough to the final product, which can be found on GitHub HERE
First we need some data.
x <- rnorm(250,75,22)
250 students took a test, the average was a 75, and the standard deviation was 22.
I then wrote a simple function to show which students passed and which failed.
passfail <- function(x) {
ifelse(x>=60,'pass','fail')
}
Then I tested it out:
passfail(x)
[1] "pass" "pass" "pass" "pass" "fail" "fail" "fail" "pass" "fail" "pass" "fail" "pass" "fail" "pass"
[15] "pass" "pass" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "fail"
[29] "fail" "pass" "pass" "pass" "pass" "fail" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "fail"
[43] "pass" "pass" "pass" "fail" "fail" "pass" "pass" "pass" "pass" "fail" "pass" "fail" "pass" "fail"
[57] "fail" "fail" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass"
[71] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "fail" "pass" "fail" "pass" "pass"
[85] "fail" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "fail" "pass" "pass"
[99] "fail" "pass" "pass" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass"
[113] "pass" "fail" "pass" "fail" "fail" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass"
[127] "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "fail" "pass" "fail" "fail" "fail" "pass"
[141] "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "fail" "pass"
[155] "pass" "pass" "pass" "pass" "pass" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "fail" "pass"
[169] "pass" "fail" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass"
[183] "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "fail" "fail" "pass" "pass" "fail" "pass"
[197] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass"
[211] "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[225] "pass" "pass" "pass" "fail" "pass" "pass" "fail" "pass" "pass" "fail" "pass" "fail" "pass" "pass"
[239] "fail" "fail" "fail" "fail" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
Cool! It worked. However, when looking at examples I noticed a lot of functions had a "return" call written within. Let's do something similar and have the function output its result explicitly, here with print().
passfail <- function(x) {
score <- ifelse(x>=60,'pass','fail')
print("score")
}
passfail(x)
[1] "score"
Oops! That's not what we want. Turns out "score" should not be in quotation marks.
passfail <- function(x) {
score <- ifelse(x>=60,'pass','fail')
print(score)
}
passfail(x)
[1] "pass" "pass" "pass" "pass" "fail" "fail" "fail" "pass" "fail" "pass" "fail" "pass" "fail" "pass"
[15] "pass" "pass" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "fail"
[29] "fail" "pass" "pass" "pass" "pass" "fail" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "fail"
[43] "pass" "pass" "pass" "fail" "fail" "pass" "pass" "pass" "pass" "fail" "pass" "fail" "pass" "fail"
[57] "fail" "fail" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass"
[71] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "fail" "pass" "fail" "pass" "pass"
[85] "fail" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "fail" "pass" "pass"
[99] "fail" "pass" "pass" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass"
[113] "pass" "fail" "pass" "fail" "fail" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass"
[127] "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "fail" "pass" "fail" "fail" "fail" "pass"
[141] "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "fail" "pass"
[155] "pass" "pass" "pass" "pass" "pass" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "fail" "pass"
[169] "pass" "fail" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass"
[183] "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "fail" "fail" "pass" "pass" "fail" "pass"
[197] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass"
[211] "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[225] "pass" "pass" "pass" "fail" "pass" "pass" "fail" "pass" "pass" "fail" "pass" "fail" "pass" "pass"
[239] "fail" "fail" "fail" "fail" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
Working again, good. It's printing the value of score rather than the word "score", which is what the quotation marks made it do.
This by itself is good, but why not see what percentage actually passes? I worked "percentages" in as a printed value as well:
passfail <- function(x) {
score <- ifelse(x>=60,'pass','fail')
percentages <- ((length(which(score=='pass')))/(length(score)))*100
print(score)
print(percentages)
}
passfail(x)
[1] "pass" "pass" "pass" "pass" "fail" "fail" "fail" "pass" "fail" "pass" "fail" "pass" "fail" "pass"
[15] "pass" "pass" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "fail"
[29] "fail" "pass" "pass" "pass" "pass" "fail" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "fail"
[43] "pass" "pass" "pass" "fail" "fail" "pass" "pass" "pass" "pass" "fail" "pass" "fail" "pass" "fail"
[57] "fail" "fail" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass"
[71] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "fail" "pass" "fail" "pass" "pass"
[85] "fail" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "fail" "pass" "pass"
[99] "fail" "pass" "pass" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass"
[113] "pass" "fail" "pass" "fail" "fail" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass"
[127] "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "fail" "pass" "fail" "fail" "fail" "pass"
[141] "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "fail" "pass"
[155] "pass" "pass" "pass" "pass" "pass" "pass" "fail" "fail" "pass" "pass" "pass" "fail" "fail" "pass"
[169] "pass" "fail" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass"
[183] "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "fail" "fail" "pass" "pass" "fail" "pass"
[197] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass"
[211] "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[225] "pass" "pass" "pass" "fail" "pass" "pass" "fail" "pass" "pass" "fail" "pass" "fail" "pass" "pass"
[239] "fail" "fail" "fail" "fail" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[1] 72.8
We can now see the full list of individual passes and fails, and the percentage of students who passed.
The way it is now is fine perhaps, but it's a bit rigid. 60 points might be a good cutoff for most tests out of 100 points, but maybe a professor gives a very difficult test in which only 40 points are needed to pass. This next step marks the final product, and has an additional argument set by the user, "y", which tells the function how many points equal a "pass". Let's try it out.
passfail <- function(x, y) {
score <- ifelse(x>=y,'pass','fail')
percentages <- ((length(which(score=='pass')))/(length(score)))*100
print(score)
print(percentages)
}
passfail(x, 40)
[1] "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "fail" "pass" "pass" "pass"
[15] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[29] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[43] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[57] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[71] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[85] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[99] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass"
[113] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[127] "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[141] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[155] "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[169] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[183] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[197] "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "fail" "pass" "pass" "pass" "pass"
[211] "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass"
[225] "pass" "pass" "pass" "pass" "pass" "pass" "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[239] "fail" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass" "pass"
[1] 95.6
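Looking back, a variant that returns its results instead of printing them would be easier to reuse. The sketch below is one way to do that; the name passfail2, the argument name cutoff, and its default of 60 are my own choices for illustration, not part of the assignment.

```r
# Hypothetical variant: returns a list rather than printing
passfail2 <- function(x, cutoff = 60) {
  score <- ifelse(x >= cutoff, "pass", "fail")
  # the mean of a logical vector is the proportion of TRUE values
  percent <- mean(x >= cutoff) * 100
  list(score = score, percent = percent)
}

set.seed(123)  # makes the random scores reproducible
x <- rnorm(250, 75, 22)

res <- passfail2(x, 40)
res$percent      # percentage passing at a cutoff of 40
head(res$score)  # first few pass/fail labels

passfail2(c(50, 60, 70, 30))$percent  # 50, using the default cutoff of 60
```

Because the result is a list, the percentage can be stored or compared across cutoffs without 250 labels scrolling past each time.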
Sunday, January 29, 2017
Dataframes vs matrices, and ggplot
We learned about the differences between matrices and data frames in week 3. I was already familiar with data frames from my work, and did not realize the two were different. I am so used to just importing csv files made in Microsoft Excel that I never stopped to think about the inner workings of how R processes these things.
Our assignment this week was to enter a small data set of polling results from two news agencies for seven of the major US presidential candidates from last year, and to do some converting from matrix to data frame to generate some numbers and graphs.
name <- c("Jeb", "Donald", "Ted", "Marco", "Carly", "Hillary", "Bernie")
I used the cbind function to combine the individual name and result columns.
results <- cbind(name, abc, nbc)
I used the as.data.frame function to change it from a matrix to a data frame:
results <- as.data.frame(results)
We can then compute summary statistics (here the column means and standard deviations) by calling the apply function as such:
results <- apply(poll.df[,2:3], 2, mean)
abc nbc
24.14286 27.14286
apply(poll.matrix,2,sd)
abc nbc
23.26248 24.57932
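One thing I did not appreciate at first: because name is a character vector, cbind() on name and the two result columns gives a character matrix, so the poll numbers get stored as text. Below is a minimal sketch of the difference. The poll numbers are made up for illustration (they are NOT the values from the assignment), and the object names mixed and polls are my own, hypothetical ones.

```r
# Made-up percentages for illustration only (not the assignment's data)
name <- c("Jeb", "Donald", "Ted", "Marco", "Carly", "Hillary", "Bernie")
abc <- c(5, 60, 50, 20, 3, 15, 16)
nbc <- c(10, 70, 45, 20, 2, 20, 30)

# cbind() coerces everything to one type, so the numbers become text
mixed <- cbind(name, abc, nbc)
is.character(mixed)

# data.frame() keeps each column's own type, so abc and nbc stay numeric
polls <- data.frame(name, abc, nbc)
is.numeric(polls$abc)

# Column means and standard deviations of the two numeric columns
col.means <- apply(polls[, 2:3], 2, mean)
col.sds <- apply(polls[, 2:3], 2, sd)
```

Building the data frame directly with data.frame() avoids having to convert the text back to numbers before running apply.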
I used the supplied ggplot function but had some errors. I played around with it on my own to generate this.
ggplot2 has a lot more functionality than the built-in plot/barplot, which I am used to. These two work well for figures in biology, since those graphs are often very bare-bones. I am excited to learn more about data visualization.
Wednesday, January 18, 2017
Objects, functions, and vectors
The second week of the R programming course introduced some of the basic workings of the R language, such as data types, and importing data into R. As it is a program designed for mathematical purposes, the manipulation of vectors and use of functions are vital.
I first imported the dataset for the example. I have used read.csv often for csv files on my computer, but have not done it from a URL. It was easy enough (pretty much the same).
>acs <- read.csv(url("http://stat511.cwick.co.nz/homeworks/acs_or.csv"))
The dataset was saved to an OBJECT called "acs". An object can be anything defined by the user. It can be a large dataset such as this with multiple variables, a string of dates, a series of names, etc. Within acs is a series of variables, such as "age_husband", "age_wife", "income_husband", etc. Each of these is a sequence of values. In R, we call these VECTORS.
I always like to make sure the data imports correctly, and for this I use the "head()" function. I like it because it shows only the top few rows, which is good for large datasets like this. Simply calling the object would print too many rows and scroll the top out of view.
>head(acs)
FUNCTIONS tell R how to process data. We can write a simple function which tells R to take the square of some values.
>squares <- function(x){
+ x * x
+}
We can then apply this to a single number
>squares(4)
[1] 16
or a VECTOR of numbers stored in the object "a"
> a <- c(1,3,6)
> squares(a)
[1] 1 9 36
This is a very basic function; they run the gamut from simple to complicated. Most simple ones are built into the R language already. This could have been accomplished by telling R to "^2" any vector/number.
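To check that claim, here is a quick sketch comparing the hand-written function with the built-in operator. The names by.function and by.operator are just mine for this comparison.

```r
# Same function as above
squares <- function(x) {
  x * x
}

a <- c(1, 3, 6)

# Both approaches work element by element on the whole vector
by.function <- squares(a)
by.operator <- a^2

# TRUE when the two results match
all.equal(by.function, by.operator)
```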
Anyway, back to our example:
"View(acs)" was presented in the module. This was my first exposure to the function, and I think I like it a bit better than "head" since I can see everything.
We then learned about viewing individual columns.
"acs$age_husband" shows the column named "age_husband". It's good for scanning certain data to find outliers (though it is best to plot histograms, see below), etc.
Using brackets allows the user to pull out a specific value from a matrix or data frame. The first value [r,] tells R which row and the second [,c] tells R which column. "acs[1,3]" pulls the value in the first row of the third column.
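Here is a small sketch of the bracket notation. Since acs has to be downloaded, I am using a hypothetical stand-in called toy with made-up values and only three columns.

```r
# Hypothetical stand-in for acs (made-up values, only three columns)
toy <- data.frame(age_husband = c(40, 35, 52),
                  age_wife = c(38, 36, 50),
                  bedrooms = c(3, 2, 4))

first.cell <- toy[1, 3]   # row 1, column 3: a single value
first.row <- toy[1, ]     # column index left blank: the whole first row
third.col <- toy[, 3]     # row index left blank: the whole third column
by.name <- toy$bedrooms   # same column, pulled out by name instead
```

Leaving one side of the comma blank is what selects an entire row or column.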
It is also possible to query data which meet certain criteria using the subset function.
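A quick sketch of how subset filters rows, again on a hypothetical toy data frame with made-up ages rather than acs. The names older and same are mine.

```r
# Made-up ages; only rows 1 and 3 have an older husband
toy <- data.frame(age_husband = c(40, 35, 52),
                  age_wife = c(38, 36, 50))

# subset() evaluates the condition using the column names directly
older <- subset(toy, age_husband > age_wife)

# The same filter written with brackets and a logical vector
same <- toy[toy$age_husband > toy$age_wife, ]

nrow(older)
```

The two forms differ when the condition is NA for some rows: subset() drops those rows, while bracket indexing returns them as rows of NA.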
"a <- subset(acs , age_husband > age_wife)" creates a new dataset named "a" with only those rows in which the age of the husband is greater than that of the wife. This is good for certain scenarios where we know already one factor influences another.
Rudimentary statistical functions are built into R, similar to Excel, which most students are probably familiar with.
>mean(acs$age_husband)
>median(acs$age_husband)
>quantile(acs$age_wife)
>var(acs$age_wife)
>summary(acs)
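These functions can be tried on any numeric vector. Below is a small sketch on a made-up vector called ages (not taken from acs). I have not checked whether acs contains missing values, so the NA here is only an assumption, included to show the na.rm argument.

```r
# Made-up values with one missing entry
ages <- c(34, 51, NA, 45)

# A single NA makes the plain result NA
plain.mean <- mean(ages)

# na.rm = TRUE drops missing values before computing
clean.mean <- mean(ages, na.rm = TRUE)
clean.median <- median(ages, na.rm = TRUE)
clean.var <- var(ages, na.rm = TRUE)
clean.quantiles <- quantile(ages, na.rm = TRUE)
```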
And of course, no statistical software would be complete without plotting. Basic plotting in R is very simple: just call the "plot" function.
>plot(x = acs$age_husband , y = acs$age_wife, type = 'p')
The plot above specifies the x and y axes, and type 'p' is for points. This plot is bare-bones; we could add more parameters to show color gradients for other variables, etc. Plots can get quite busy if one chooses.
We can also make histograms, with this intuitively named function
>hist(acs$number_children)
Histograms are useful for finding the underlying distribution, which helps in selecting an appropriate statistical test.
We can also tabulate counts using "table"
>counts <- table(acs$bedrooms)
And finally, barplot generation has slightly different code: just slap a "bar" in front of the aforementioned "plot"
>barplot(counts, main="Bedrooms Distribution", xlab="Number of Bedrooms")
We could even run this as one line by putting the table function inside the barplot call.
>barplot(table(acs$bedrooms), main="Bedrooms Distribution", xlab="Number of Bedrooms")
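To see what barplot is actually being handed, here is a sketch of what table returns, using a made-up bedrooms vector rather than the acs column.

```r
# Made-up bedroom counts for six households
bedrooms <- c(2, 3, 3, 4, 3, 2)

# table() counts how many times each distinct value occurs
counts <- table(bedrooms)

# The distinct values become the names (the bar labels),
# and the counts become the bar heights
bar.labels <- names(counts)
bar.heights <- as.vector(counts)
```

Here bar.labels is "2" "3" "4" and bar.heights is 2 3 1, which is why the barplot gets one bar per distinct number of bedrooms.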
This walkthrough was luckily all review for me. However, I was unaware of the formal names of things like objects and vectors. I am glad to be receiving some formal training in the program.
Monday, January 9, 2017
R course : Week 1
Blog is all set up. I am not new to R but certainly in need of a refresher and some formal training.