One issue I continually encounter when starting to work with a new dataset is that of the codebook. In general, I prefer to load a codebook into R like any other data source, specifically as a data frame. Ideally, one data frame provides the variable names with descriptions and any other metadata available, and a separate list of named vectors can be used to recode factors. Although there is no standard format for codebooks, most follow a similar format. This post outlines the parse.codebook function, which reads codebooks that have the following features:
- Each line in the file provides information about either a variable (which I refer to as a variable row) or the mapping of a factor level (which I refer to as a level row).
- Variable rows start on the left edge (that is, there is a non-whitespace character at position 1 of the row).
- Level rows do not start on the left edge (that is, there is a whitespace character at position 1 of the row, for example a tab or space).
- Rows are either fixed width (see ?read.fwf for specifics) or character delimited (e.g. comma, colon, etc.).
Although not all codebooks strictly adhere to these rules, it is often trivial, if a bit tedious, to reformat the file so that it does. Blank lines are permissible and are simply ignored.
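To make the first three rules concrete, here is a minimal sketch of how lines could be sorted into variable rows, level rows, and blank lines. The classify.rows helper and the sample lines are hypothetical illustrations, not the code inside parse.codebook, which may make this decision differently (for example through its level.indent parameter).

```r
# Hypothetical helper (not part of parse.codebook): label each line of a
# codebook as a variable row, a level row, or a blank line.
classify.rows <- function(lines) {
  type <- ifelse(grepl('^[[:space:]]*$', lines), 'blank',      # empty or whitespace only
          ifelse(grepl('^[[:space:]]', lines), 'level',        # starts with whitespace
                 'variable'))                                  # starts at the left edge
  data.frame(linenum = seq_along(lines),
             type = type,
             text = trimws(lines),
             stringsAsFactors = FALSE)
}

# Made-up sample lines that follow the rules above.
example.lines <- c('FIPST  State code',
                   '\t01 = Alabama',
                   '',
                   '  02 = Alaska',
                   'LEAID  Agency ID')
rows <- classify.rows(example.lines)
rows$type
```

Keeping the original line number next to each row makes it easier to trace a parsing problem back to the file.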
If the codebook file adheres to these rules, the parse.codebook function will parse the file and return an object of class codebook that inherits from data.frame, so all the data frame functions are valid (e.g. head, nrow, names, etc.). This data frame contains all the information about the variables taken from the variable rows. Information about factor levels is stored in a list as an attribute of the returned object, which can be retrieved using attr(mycodebook, 'levels'). Examples from the Common Core of Data and the American Community Survey are provided below.
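The shape of the returned object can be pictured with a toy object built by hand. This sketch assumes only what is described above (a data.frame with an additional codebook class and a levels attribute); the column names and values are made up, and the real function may attach further columns or attributes.

```r
# Toy stand-in for a parsed codebook; all values are made up for illustration.
vars <- data.frame(variable = c('FIPST', 'LEAID'),
                   description = c('State code', 'Agency ID'),
                   isfactor = c(TRUE, FALSE),
                   stringsAsFactors = FALSE)
lvls <- list(FIPST = data.frame(level = c('01', '02'),
                                label = c('Alabama', 'Alaska'),
                                stringsAsFactors = FALSE))

# Attach the levels as an attribute and put the 'codebook' class first.
toy.codebook <- structure(vars, levels = lvls,
                          class = c('codebook', 'data.frame'))

nrow(toy.codebook)                       # data frame functions still apply
attr(toy.codebook, 'levels')[['FIPST']]  # factor levels for one variable
```

Because data.frame remains in the class vector, any function without a codebook-specific method falls back to the data frame method.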
Installation
The parse.codebook function is currently provided on Gist. You can either download the R script file or source it directly from Gist using the devtools package.
# Load devtools, then source parse.codebook directly from the Gist
library(devtools)
source_gist(4497585)
Parameters
The parse.codebook function has a number of parameters to indicate the format of variable and level rows. The function will handle both character delimited rows and fixed width rows. Therefore, either var.sep or var.widths must be specified, as well as either level.sep or level.widths. The available parameters are:
- file: codebook file name.
- var.names: the names of the columns for variable rows.
- level.names: the names of the columns for level rows.
- var.sep: the separator for variable rows.
- level.sep: the separator for level rows.
- level.indent: character vector providing the character(s) at the beginning of a line that indicate the line represents a factor level. Each element should have 1 character, as only the first character of the line is compared.
- var.name: the name in var.names that represents the variable name. This should be a valid R variable name, as it will be the column name in the corresponding data file as well as the name used in the list of levels stored as an attribute of the returned object.
Example One: Common Core of Data
The Common Core of Data (CCD) is a dataset provided by the National Center for Education Statistics that provides information about K-12 schools in the United States. The codebook provided is in plain text and required two modifications: first, general file information at the top of the file was deleted; second, any descriptions that spanned multiple lines were joined so that each is on a single line. Here are the first 15 lines of the modified file (the full file is also available for download):
SURVYEAR 1 AN Year corresponding to survey record.
NCESSCH 2 AN Unique NCES public school ID (7-digit NCES agency ID (LEAID) + 5-digit NCES school ID (SCHNO).
FIPST 3 AN American National Standards Institute (ANSI) state code..
    01 = Alabama
    02 = Alaska
    04 = Arizona
    05 = Arkansas
    06 = California
    08 = Colorado
    09 = Connecticut
    10 = Delaware
    11 = District of Columbia
This codebook uses fixed widths for variable rows and a separator (the equal sign) for level rows, although it is also possible to use fixed widths for level rows. First, we will parse the file:
ccd.codebook <- parse.codebook('ccdCodebook.txt',
    var.names=c('variable','order','type','description'),
    level.names=c('level','label'),
    level.sep='=',
    var.widths=c(13, 7, 7, Inf) )
Here are the first six rows of the returned data frame.
> head(ccd.codebook)
linenum variable order type description isfactor
1 1 SURVYEAR 1 AN Year corresponding to survey record. FALSE
2 3 NCESSCH 2 AN Unique NCES public school ID (7-digit NCES agency ID (LEAID) + 5-digit NCES school ID (SCHNO). FALSE
3 5 FIPST 3 AN American National Standards Institute (ANSI) state code.. TRUE
4 67 LEAID 4 AN NCES local education agency (LEA) ID. FALSE
5 69 SCHNO 5 AN NCES school ID. FALSE
6 71 STID 6 AN State's own ID for the education agency. FALSE
In addition to the columns corresponding to var.names, the function also returns linenum and isfactor columns. The former is an integer giving the line number in the original file from which the row was parsed, which is useful for tracking down issues in the parsing or text formatting. The latter is a logical column indicating whether factor levels are specified for that variable. Factor levels can be retrieved as follows:
> ccd.var.levels <- attr(ccd.codebook, 'levels')
> names(ccd.var.levels)
[1] "FIPST" "TYPE" "STATUS" "TITLEI" "STITLI" "MAGNET" "CHARTR" "SHARED"
> ccd.var.levels[['TYPE']]
linenum level label
1 103 1 Regular school
2 105 2 Special education school
3 107 3 Vocational school
4 109 4 Other/alternative school
5 111 5 Reportable program
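As an aside on the var.widths argument used above, the following sketch shows one way a fixed-width row could be cut into fields, with a final width of Inf taken to mean "the rest of the line". The fixed.fields helper and the sample line are hypothetical, and the reading of Inf is an assumption based on the call above; parse.codebook may handle this differently (base R also provides read.fwf for fixed-width files).

```r
# Hypothetical helper: split one line into fields of the given widths.
# A width of Inf is assumed to mean "everything up to the end of the line".
fixed.fields <- function(line, widths) {
  finite <- ifelse(is.finite(widths), widths, nchar(line))
  ends <- cumsum(finite)               # last character of each field
  starts <- c(1, head(ends, -1) + 1)   # first character of each field
  trimws(substring(line, starts, ends))
}

# Build a made-up row padded to widths 13, 7, and 7, followed by free text.
line <- sprintf('%-13s%-7s%-7s%s', 'FIPST', '3', 'AN', 'State code')
fixed.fields(line, c(13, 7, 7, Inf))
```

Trimming each field removes the padding, so the description column comes back without leading or trailing spaces.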
Example Two: American Community Survey
The American Community Survey is the current version of the Census Long Form. The codebook provided by the United States Census Bureau is in PDF format, but is easily converted to a plain text file. This file required more modification than the CCD file described above, mostly removing line numbers that were pasted over from the PDF and ensuring that descriptions did not span lines. The final modified version can be downloaded from http://jason.bryer.org/codebook/acsPersonCodebook.txt. Here are the first 10 lines of the file:
SPORDER .Person number
ST .State Code
    01 .Alabama/AL
    02 .Alaska/AK
    04 .Arizona/AZ
    05 .Arkansas/AR
    06 .California/CA
    08 .Colorado/CO
    09 .Connecticut/CT
    10 .Delaware/DE
For this codebook file, all rows are character delimited on ' .' (a space followed by a period). We parse the file as follows:
acs.codebook <- parse.codebook('acsPersonCodebook.txt',
    var.names=c('var','desc'),
    level.names=c('level','label'),
    var.sep=' .', level.sep=' .')
The first six rows of the returned data frame are:
> head(acs.codebook)
var desc linenum isfactor
1 SPORDER Person number 1 FALSE
2 ST State Code 2 TRUE
3 ADJINC Adjustment factor for income and earnings dollar amounts (6 implied decimal places) 55 FALSE
4 PWGTP Person's weight 56 FALSE
5 AGEP Age 57 FALSE
6 CIT Citizenship status 58 TRUE
And factor levels:
> var.levels <- attr(acs.codebook, 'levels')
> names(var.levels)
[1] "ST" "CIT" "COW" "DRAT" "ENG" "GCM" "JWRIP" "JWTR" "MAR" "MARHM"
[11] "MARHT" "MARHW" "MIG" "MIL" "NWAV" "RELP" "SCH" "SCHG" "SCHL" "SEX"
[21] "WKL" "WKW" "WRK" "ANC" "ANC1P" "ANC2P" "DECADE" "DIS" "DRIVESP" "ESP"
[31] "ESR" "FOD1P" "6402" "FOD2P" "HICOV" "HISP" "INDP" "JWAP" "JWDP" "LANP"
[41] "MIGSP" "MSP" "NAICSP" "NOP" "OCCP02" "OCCP10" "PAOC" "POBP" "POWSP" "PRIVCOV"
[51] "PUBCOV" "QTRBIR" "RAC1P" "RAC2P" "RAC3P" "SFN" "SFR" "SOCP00" "SOCP10" "VPS"
[61] "WAOB" "FHINS3C" "FHINS4C" "FHINS5C"
> var.levels[['CIT']]
linenum level label
1 59 1 Born in the U.S.
2 60 2 Born in Puerto Rico, Guam, the U.S. Virgin Islands, or the Northern Marianas
3 61 3 Born abroad of American parent(s)
4 62 4 U.S. citizen by naturalization
5 63 5 Not a citizen of the U.S.
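The ' .' separator used in this example can be illustrated with a small sketch that splits a row into two fields at the first literal occurrence of the separator. The sep.fields helper is hypothetical, and splitting only at the first occurrence is an assumption made here so that a description containing ' .' stays intact; parse.codebook may split differently.

```r
# Hypothetical helper: split a line into two fields at the first occurrence
# of a literal separator (fixed = TRUE disables regular expressions, so the
# period is not treated as a wildcard).
sep.fields <- function(line, sep) {
  pos <- regexpr(sep, line, fixed = TRUE)[1]
  if (pos < 0) return(c(trimws(line), NA_character_))  # separator not found
  c(trimws(substr(line, 1, pos - 1)),
    trimws(substr(line, pos + nchar(sep), nchar(line))))
}

sep.fields('ST .State Code', ' .')      # a variable row
sep.fields('\t01 .Alabama/AL', ' .')    # a level row, indented with a tab
sep.fields('X .Made-up text with .5 inside', ' .')
```

The last call uses a made-up description to show that a later ' .' in the text does not produce a third field.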
Conclusion
Although a standard codebook format doesn’t exist, most adopt a similar format. I have outlined the parse.codebook function that, with minimal reformatting of the original codebook file, can be used to read a codebook into R. This is tremendously useful because we can now merge in variable descriptions when creating tables and figures, as well as recode factors with their longer descriptions in an automated fashion.
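A minimal sketch of that last point follows. It uses hand-built stand-ins shaped like the output shown earlier rather than a real parse.codebook result: the CIT labels are copied from the example above, the survey data frame is made up, and storing the level codes as character strings is an assumption.

```r
# Stand-ins shaped like the parsed ACS codebook (labels from the CIT example).
vars <- data.frame(var = c('AGEP', 'CIT'),
                   desc = c('Age', 'Citizenship status'),
                   stringsAsFactors = FALSE)
cit.levels <- data.frame(
  level = c('1', '2', '3', '4', '5'),
  label = c('Born in the U.S.',
            'Born in Puerto Rico, Guam, the U.S. Virgin Islands, or the Northern Marianas',
            'Born abroad of American parent(s)',
            'U.S. citizen by naturalization',
            'Not a citizen of the U.S.'),
  stringsAsFactors = FALSE)

# Made-up survey records with numeric codes.
survey <- data.frame(AGEP = c(34, 61, 19), CIT = c(1, 4, 1))

# Recode the codes into a factor labelled with the longer descriptions.
survey$CIT <- factor(survey$CIT, levels = cit.levels$level,
                     labels = cit.levels$label)

# Look up a variable description, e.g. for a table heading or axis title.
cit.title <- vars$desc[match('CIT', vars$var)]
table(survey$CIT)
```

With a real codebook, the same recoding could be applied in a loop over the names of the levels attribute.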