Jason Bryer

Visualizing Likert Items [Updated]

This post has been updated to reflect several changes to how these functions work. Specifically, an intermediate function likert is used to perform the item summary. The plot functions have been renamed to utilize S3 class naming convention. Lastly, the pisa dataset that is included with irutils has been renamed to pisana to reflect the fact that it includes only Canada, Mexico, and the United States. If you are interested in the complete PISA dataset in R, see the pisa R package hosted on github at https://github.com/jbryer/pisa.

I have become quite a big fan of graphics that combine the features of traditional figures (e.g. bar charts, histograms, etc.) with tables. That is, the combination of numerical results with a visual representation has been quite useful for exploring descriptive statistics. I have wrapped two of my favorites (build around ggplot2) and included them as part of my irutils R package (currently under development). Here is the code and results utilizing two item from the 2009 Programme of International Student Assessment (PISA).

NOTE: This R script can be downloaded here: https://github.com/jbryer/irutils/blob/master/irutils.test.R

library(irutils)
library(ggplot2)

theme_update(panel.background=theme_blank(), panel.grid.major=theme_blank(), panel.border=theme_blank())

#How much do you agree or disagree with these statements about reading?
data(pisana)

items28 = pisana[,substr(names(pisana), 1,5) == 'ST24Q']
head(items28); ncol(items28)
names(items28) = c("I read only if I have to.",
				   "Reading is one of my favorite hobbies.",
				   "I like talking about books with other people.",
				   "I find it hard to finish books.",
				   "I feel happy if I receive a book as a present.",
				   "For me, reading is a waste of time.",
				   "I enjoy going to a bookstore or a library.",
				   "I read only to get information that I need.",
				   "I cannot sit still and read for more than a few minutes.",
				   "I like to express my opinions about books I have read.",
				   "I like to exchange books with my friends")
for(i in 1:ncol(items28)) {
	items28[,i] = factor(items28[,i], levels=1:4,
		labels=c('Strongly disagree', 'Disagree', 'Agree', 'Strongly Agree'), ordered=TRUE)
}

l28 = likert(items28)
summary(l28)
                                                       Item      low     high     mean        sd
1                                 I read only if I have to. 58.72868 41.27132 2.291811 0.9369023
2                    Reading is one of my favorite hobbies. 56.64470 43.35530 2.344530 0.9277495
3             I like talking about books with other people. 54.99129 45.00871 2.328049 0.9090326
4                           I find it hard to finish books. 65.35125 34.64875 2.178299 0.8991628
5            I feel happy if I receive a book as a present. 46.93475 53.06525 2.466751 0.9446590
6                       For me, reading is a waste of time. 82.88729 17.11271 1.810093 0.8611554
7                I enjoy going to a bookstore or a library. 51.21231 48.78769 2.428508 0.9164136
8               I read only to get information that I need. 50.39874 49.60126 2.484616 0.9089688
9  I cannot sit still and read for more than a few minutes. 76.24524 23.75476 1.974736 0.8793028
10   I like to express my opinions about books I have read. 41.07516 58.92484 2.604913 0.9009968
11                 I like to exchange books with my friends 55.54115 44.45885 2.343193 0.9609234

plot(l28)
plot(l28, type='heat')
#Group by country
l28g = likert(items28, grouping = pisana$CNT)
plot(l28g, low.color='maroon', high.color='burlywood4')
#How often do you read these materials because you want to?
items29 = pisana[,substr(names(pisana), 1,5) == 'ST25Q']
names(items29) = c("Magazines", "Comic books", "Fiction", "Non-fiction books", "Newspapers")
for(i in 1:ncol(items29)) {
	items29[,i] = factor(items29[,i], levels=1:5,
		labels=c('Never or almost never', 'A few times a year', 'About once a month',
		'Several times a month', 'Several times a week'), ordered=TRUE)
}

l29 = likert(items29)
summary(l29)
               Item      low     high     mean       sd
1         Magazines 30.21689 48.45219 3.254813 1.245086
2       Comic books 62.43096 21.78536 2.298768 1.292631
3           Fiction 41.77380 38.60882 2.961111 1.342667
4 Non-fiction books 61.42466 19.02042 2.322898 1.199176
5        Newspapers 37.29377 46.97935 3.140282 1.442299

plot(l29, low.color='maroon', high.color='burlywood4') +
	opts(title="How often do you read these materials because you want to?")
plot(l29, type='heat') + opts(title="How often do you read these materials because you want to?")
l29g = likert(items29, grouping=pisana$CNT)
summary(l29g)
   Group              Item      low     high     mean       sd
1    CAN         Magazines 27.20968 47.90406 3.264809 1.232627
2    CAN       Comic books 73.01833 13.77646 1.977548 1.193580
3    CAN           Fiction 38.41963 41.32524 3.081012 1.359320
4    CAN Non-fiction books 59.74854 19.47494 2.379494 1.195189
5    CAN        Newspapers 36.00548 45.35285 3.112619 1.404332
6    MEX         Magazines 32.27957 48.99341 3.248538 1.255127
7    MEX       Comic books 53.45235 28.17407 2.573364 1.300123
8    MEX           Fiction 43.60494 37.30958 2.897222 1.329113
9    MEX Non-fiction books 62.91327 18.53804 2.274610 1.201369
10   MEX        Newspapers 37.00852 49.24856 3.198391 1.460371
11   USA         Magazines 28.30335 46.89495 3.256916 1.225441
12   USA       Comic books 81.66083 10.19059 1.699339 1.116374
13   USA           Fiction 43.10412 36.18881 2.902098 1.332053
14   USA Non-fiction books 57.91010 20.54874 2.427126 1.183237
15   USA        Newspapers 45.01160 37.50967 2.836620 1.431809

plot(l29g, low.color='maroon', high.color='burlywood4') +
	opts(title="How often do you read these materials because you want to?")

Graphic Parameters (symbols, line types, and colors) for ggplot2

Following up on John Mount’s post on remembering symbol parameters in ggplot2, I decided to give it a try and included symbols, line types, and colors (based upon Earl Glynn’s wonderful color chart).  Code follows below.

 

This last one is only the first 100 elements in colors(). Use the script file to generate the remaining plots if you like.

require(ggplot2)
require(grid)

theme_update(panel.background=theme_blank(),
panel.grid.major=theme_blank(),
panel.border=theme_blank())

#Borrowed (i.e. stollen) from http://research.stowers-institute.org/efg/R/Color/Chart/ColorChart.R
getColorHexAndDecimal <- function(color) {
if(is.na(color)) {
return(NA)
} else {
c <- col2rgb(color)
return(sprintf("#%02X%02X%02X %3d %3d %3d", c[1],c[2],c[3], c[1], c[2], c[3]))
}
}

#Symbols
ggplot(data=data.frame(x=c(0:25))) + geom_point(size=10, aes(x=x,y=x,shape=x)) +
facet_wrap(~ x, scales='free') + xlab('') + ylab('') +
scale_shape_identity() +
opts(axis.text.x=theme_blank(), axis.text.y=theme_blank(),
axis.ticks=theme_blank(), legend.position='none')
ggsave('symbols.png')

#Line types
ggplot(data=data.frame(x=c(1:6))) + geom_hline(size=2, aes(yintercept=x, linetype=x)) +
scale_linetype_identity() +
xlab(NULL) + ylab(NULL) + xlim(c(0,100)) +
opts(axis.text.x=theme_blank(), axis.ticks=theme_blank(), legend.position='none')
ggsave('linetypes.png', width=6.5, height=2)

#Colors
df = data.frame(x=rep(1:26, 26), y=rep(1:26, each=26))
df$c = NA
df[1:length(colors()),'c'] = colors()
df$n = NA
df[1:length(colors()),'n'] = 1:length(colors())
df$r = df$g = df$b = NA
df[1:length(colors()),c('r','g','b')] = t(col2rgb(colors()))
df$text = ifelse(apply(df[,c('r','g','b')], 1, sum) > (255*3/2), 'black', 'white')
df$hex = lapply(df$c, getColorHexAndDecimal)
df$hex2 = paste(format(df$n, width=3), format(df$c, width=(max(nchar(df$c))+1)), format(df$hex, width=(max(nchar(df$hex))+1)))

ggplot(df, aes(x=x, y=y, fill=c, label=n)) + geom_tile() + geom_text(aes(colour=text), size=3) +
scale_fill_identity() +
scale_colour_identity() +
xlab(NULL) + ylab(NULL) +
opts(axis.text.x=theme_blank(), axis.ticks=theme_blank(), plot.margin=unit(c(0,0,0,0), "cm"),
axis.text.y=theme_blank(), axis.ticks=theme_blank(), legend.position='none')
ggsave('colors.png')

ggplot(df[1:100,], aes(x=1, y=n, fill=c, label=hex2, colour=text)) +
geom_tile() + geom_text(family = 'mono') +
scale_fill_identity() +
scale_colour_identity() +
xlab(NULL) + ylab(NULL) +
opts(axis.text.x=theme_blank(), axis.ticks=theme_blank(), plot.margin=unit(c(0,0,0,0), "cm"),
axis.text.y=theme_blank(), axis.ticks=theme_blank(), legend.position='none')



Setting Up and Customizing R

For the longest time I resisted customizing R for my particular environment. My philosophy has been that each R script for each separate analysis I do should be self contained such that I can rerun the script from top to bottom on any machine and get the same results. This being said, I have now encountered a situation where there are multiple researchers using multiple R instances on multiple platforms. Briefly, our setup involves a shared network drive that is accessed using Rstudio on Windows as well as Rstudio Server running on Linux. I have created a setup R script file that sets the library location to the shared drive (using the oDrive variable also set globally), install the collection of R packages we regularly use, or update them, and setup the .Rprofile script appropriately for the platform the script is running on. I have also included the Rprofile script we are using.

Please provide comments about how you customize R for your needs.

My setup R script.

#List of most used R packages that we wish to install.
libraries = c('cacheSweave', 'Deducer', 'devtools', 'doBy', 'foreign', 'gdata',
'ggplot2', 'Hmisc', 'JGR', 'lubridate', 'maps', 'mapdata', 'mapproj',
'maptools', 'proto', 'psych', 'R2wd', 'RCurl', 'reshape',
'RODBC', 'roxygen2', 'seqinr', 'sm', 'sp', 'sqldf', 'survey',
'WriteXLS', 'XML', 'xtable')

#We will install packages from the main CRAN site
repos = 'http://cran.r-project.org'
#Site provides some prebuilt binaries for Windows
repos.win = 'http://www.stats.ox.ac.uk/pub/RWin'
#URL for the R-Forge repository.
repos.rforge = 'http://r-forge.r-project.org'

#Type of download for install.packages. Default to source
.type = 'source'
.libLoc = ''

if(Sys.info()['sysname'] == 'Windows') {
#TODO: Install Oracle client
#path = Sys.getenv('PATH')
#path = paste('c:\\instantclient_11_1;',path, sep='')
#system(paste('Setx PATH "', path, '"', sep=''))

oDrive = 'o:/'
memory.limit(size=4095) #Set to 4GB (for 64-bit Windows, this can be much larger)
#Set the R_LIBS environment varaible. This assumes the Setx program is installed
#(this is included with the Service Pack 2 Support Tools)
system(paste('Setx R_LIBS ', oDrive, 'R/library/Windows', sep=''))
#We will manually add to libPaths since the system variable will not take
#effect until R is restarted.
.libLoc = paste(oDrive, 'R/library/Windows', sep='')
repos = c(repos, repos.win) #Add the Windows repos for use below
# Set system environment varaible R_PROFILE
system(paste('Setx R_PROFILE ', oDrive, 'R/Rprofile', sep=''))
.type = 'win.binary'
} else if(Sys.info()['sysname'] == 'Linux') {
oDrive = '/n01/OutcomesAssessment/'
file.copy(paste(oDrive, 'R/Rprofile', sep=''), '~/.Rprofile', overwrite=TRUE)
#.libPaths(paste(oDrive, 'R/library/Linux', sep=''))
.libLoc = paste(oDrive, 'R/library/Linux', sep='')
} else {
stop('Unsupported operating system!')
}

.libPaths(.libLoc)

is_installed <- function(mypkg) is.element(mypkg, installed.packages()[,1])

#Only load packages that are not already installed.
for(l in libraries) {
if(!is_installed(l)) {
install.packages(l, repos=repos, dep=TRUE, lib=.libLoc, type=.type,
destdir=paste(oDrive, 'R/library/download', sep=''))
}
}

#Make sure we have the latest versions of the packages
update.packages(repos=repos, ask=FALSE, lib.loc=.libLoc,
destdir=paste(oDrive, 'R/library/download', sep=''))

#Load some packages from Github. We won't bother checking if they are already
#installed since these tend to change more often.
library(devtools)
install_github('irutils', 'jbryer', lib.loc=.libLoc)
install_github('ruca', 'jbryer', lib.loc=.libLoc)
install_github('ipeds', 'jbryer', lib.loc=.libLoc)
install_github('makeR', 'jbryer', lib.loc=.libLoc)
install_github('qualtrics', 'jbryer', lib.loc=.libLoc)

#Install ecir package
install(paste(oDrive, 'R/ecir', sep=''), lib.loc=.libLoc)

message("Installation of R packages complete. Please restart R now...")

view raw setup.r This Gist brought to you by GitHub.

My custom .Rprofile file.

# .Rprofile -- commands in this file will be executed at the beginning of
# each R session. On Windows, the R_PROFILE environment variable must have value
# with the full path to this file. On Linux (or other Unix like systems) this file
# must be in the user's home directory.

# Set the default repository to the main CRAN site
options(repos=c(CRAN='http://cran.r-project.org'))

# Set the oDrive varaible and library path
if(Sys.info()['sysname'] == 'Windows') {
oDrive = 'o:/'
.libPaths(paste(oDrive, 'R/library/Windows', sep=''))
require(utils, quietly=TRUE)
print("Setting memory limit for R to 4GB...")
memory.limit(size=4095) #Set to 4GB (for 64-bit Windows, this can be much larger)
message(paste('R_LIBS: ', Sys.getenv('R_LIBS'), sep=''))
message(paste('R_PROFILE: ', Sys.getenv('R_PROFILE'), sep=''))
} else if(Sys.info()['sysname'] == 'Linux') {
oDrive = '/n01/OutcomesAssessment/'
.libPaths(paste(oDrive, 'R/library/Linux', sep=''))
} else if(Sys.info()['sysname'] == 'Darwin') {
oDrive = NULL
}

# Customize the default look and feel of ggplot2
if(require(ggplot2, quietly=TRUE)) {
theme_update(panel.background=theme_blank(),
panel.grid.major=theme_blank(),
panel.border=theme_blank())
}

# On Linux we will alter the default behavior of the makeR package.
if(Sys.info()['sysname'] == 'Linux') {
if(require(makeR, quietly=TRUE)) {
setAutoOpen(FALSE)
setDefaultBuilder(builder.rnw.native)
}
}

# Change the location of the SQL repository
if(require(irutils, quietly=TRUE)) {
setSQLRepos(paste(oDrive, 'R/ecir/data', sep=''))
}

view raw Rprofile.R This Gist brought to you by GitHub.


makeR: An R Package for Managing Document Building and Versioning

If you are reading this vis-à-vis R-Bloggers, then you know how good R, LaTeX, and Sweave are for generating reports and/or conducting reproducible research. It has been particularly valuable for me in Institutional Research where there are many reports that I need to prepare on a regular basis (some monthly, some quarterly, some annually). However, one issue I have had is how to do I track these different reports. I have tried manual processes such as copying and pasting R scripts to more automated approaches such as GNU Make. Though the latter is a robust system, I have had two issues with it recently, namely it is more complex and it requires access to the shell. My goal is to manage the documents entirely in R because our office is moving to using RStudio Server which means we don’t have access to the shell. The makeR package is my attempt to solve this problem.

I should start by stating what the makeR package IS NOT. It is not a system for package development, use devtools for that. It is also not really meant to handle very complex research projects, use ProjectTemplate for that. Instead, makeR attempts to solve the problem of generating a series of documents that are the same with the exception of easily extractable variables. The package provides methods for handling LaTeX, Sweave, cacheSweave, and knitr documents. The package vignette describes how to create custom builders including one for R script files that generate PNG files. I have reduced the build process to four R statements:

  1. Create or load a project. The package uses an XML file to store project information.
  2. Add a new version.
  3. Build the version.
  4. Release the version

By default, a project will have three directories: source, build, and release. Upon each build, source files are copied from the source directory to a subdirectory named build. A builder function will then execute the appropriate code. Execution occurs (by default) in a separate R environment, saves output to a log file, and will also save the R environment to the build directory to assist with debugging. Releasing will copy the built file(s) to the release directory renaming the files to include the version information (i.e. version name/number and minor version number).

Version 1.0 has been released to CRAN. The package vignette contains more detail. There are also three demos

  1. rbloggers demonstrates a simple Sweave (Rnw) document that summarizes posts within a particular month to the R-Bloggers site.
  2. stocks shows how to use an R script file that generates PNG files as output.
  3. makeR-knitr shows how to use the knitr package for document building.

makeR is hosted on Github and there is a project page with more documentation at jbryer.github.com/makeR.

 


Given a room with n people in it, what is the probability any two will have the same birthday?

Revisiting a fun puzzle I remember first encountering as an undergraduate. Nice example of creating a plot in R using ggplot2. I also plot the probability of someone in the room having the same birthday as you.

## See http://en.wikipedia.org/wiki/Birthday_problem for an explanation of the problem
require(ggplot2)
require(reshape)

theme_update(panel.background=theme_blank(),
panel.grid.major=theme_blank(),
panel.border=theme_blank())

birthday <- function(n) {
1 - exp( - n^2 / (2 * 365) )
}

myBirthday <- function(n) {
1 - ( (365 - 1) / 365 ) ^ n
}

d = 200
df = data.frame(n=1:d, AnyTwoSame=birthday(1:d), SameAsMine=myBirthday(1:d))
df = melt(df, id.vars='n')

ggplot(df, aes(x=n, y=value, colour=variable)) + geom_line() + scale_colour_hue('') +
xlab('Number of People in Group') + ylab('Probability')