Reading fixed-width text files might be challenging, specially when we don’t have a dictionary file. In this post, I show steps to read CDC files in a more systematic way. In this example, I import a compress mortality file (CMF 1979-1988) available here and whose codebook (or layout) is here.
To read this file, usually with extension
.dat, I first need to know where each column starts and finishes. What I get from the pdf file is something like this:
The layout is usually a codebook in Word/PDF or just plain text file. Here, I copy the PDF text and put it in a plain text file. I use a text editor (e.g., Sublime Text) and regular expressions to extract the information I need.
I have to select every row with this pattern:
1-2 2 FIPS State code Numeric. That is, a number followed by a hyphen (although not always, particularly when the width of the column is one), spaces, another number, spaces, and then any text. I use the following regular expression to get that pattern:
(^[0-9]+).([0-9]+)\s+([0-9])\s+(.+). Using the Sublime package Filter Lines I get something like this (you can also just copy the selected lines):
1-2 2 FIPS State code Numeric 3-5 FIPS county code Numeric 6-9 4 Year of death Numeric 11-12 2 Age at death Numeric 13-16 4 ICD code for underlying cause-of-death 3 digits: Numeric 17-19 3 Cause-of-Death Recode Numeric 20-23 4 Number of deaths Numeric
This approach might be particularly useful when you have a long PDF/Word file and you want to extract most of the variables. You would need to adapt the regular expressions I’m using to the particular patterns of your codebook.
To simplify, I format this text as a comma-separated values file (csv). Replacing this regular expression
\1,\3,\5,\7,\9 I get:
1,2,2,FIPS State code,Numeric 3,5,3,FIPS county code,Numeric 6,9,4,Year of death,Numeric 11,12,2,Age at death,Numeric 13,16,4,ICD code for underlying cause-of-death 3 digits:,Numeric 17,19,3,Cause-of-Death Recode,Numeric 20,23,4,Number of deaths,Numeric
Then, I read the layout file:
Now, I can read the fixed-width data file. I use the readr package (in my experience relatively fast for big datasets ~ 1 GB).
Hopefully, this might save you some time!
Last Update: 06/29/2017
blog comments powered by Disqus