Reading CDC mortality files using R

Reading fixed-width text files might be challenging, specially when we don’t have a dictionary file. In this post, I show steps to read CDC files in a more systematic way. In this example, I import a compress mortality file (CMF 1979-1988) available here and whose codebook (or layout) is here.

To read this file, usually with extension .txt or .dat, I first need to know where each column starts and finishes. What I get from the pdf file is something like this:

The layout is usually a codebook in Word/PDF or just plain text file. Here, I copy the PDF text and put it in a plain text file. I use a text editor (e.g., Sublime Text) and regular expressions to extract the information I need.

I have to select every row with this pattern: 1-2 2 FIPS State code Numeric. That is, a number followed by a hyphen (although not always, particularly when the width of the column is one), spaces, another number, spaces, and then any text. I use the following regular expression to get that pattern: (^[0-9]+).([0-9]+)\s+([0-9])\s+(.+). Using the Sublime package Filter Lines I get something like this (you can also just copy the selected lines):

1-2 2 FIPS State code Numeric
3-5 FIPS county code Numeric
6-9 4 Year of death Numeric
11-12 2 Age at death Numeric
13-16 4 ICD code for underlying cause-of-death 3 digits: Numeric
17-19 3 Cause-of-Death Recode Numeric
20-23 4 Number of deaths Numeric

This approach might be particularly useful when you have a long PDF/Word file and you want to extract most of the variables. You would need to adapt the regular expressions I’m using to the particular patterns of your codebook.

To simplify, I format this text as a comma-separated values file (csv). Replacing this regular expression ([0-9]+)(-)([0-9]+)(\s)([0-9]+)(\s)(.+)(\s)(Numeric) by \1,\3,\5,\7,\9 I get:

1,2,2,FIPS State code,Numeric
3,5,3,FIPS county code,Numeric
6,9,4,Year of death,Numeric
11,12,2,Age at death,Numeric
13,16,4,ICD code for underlying cause-of-death 3 digits:,Numeric
17,19,3,Cause-of-Death Recode,Numeric
20,23,4,Number of deaths,Numeric

Then, I read the layout file:

    # define names of columns
    colnames <- c("start", "end", "width", "name", "type")
    dict <- read.csv("data/dictMortality.csv", col.names = colnames, header = FALSE)
    
    ##   start end width                                             name    type
    ## 1     1   2     2                                  FIPS State code Numeric
    ## 2     3   5     3                                 FIPS county code Numeric
    ## 3     6   9     4                                    Year of death Numeric
    ## 4    11  12     2                                     Age at death Numeric
    ## 5    13  16     4 ICD code for underlying cause-of-death 3 digits: Numeric
    ## 6    17  19     3                            Cause-of-Death Recode Numeric
    ## 7    20  23     4                                 Number of deaths Numeric
    

Now, I can read the fixed-width data file. I use the readr package (in my experience relatively fast for big datasets ~ 1 GB).

   library(readr)
   
   # create name of variables
   cnames <- c("state", "county", "year", "age", "icd", "cause", "deaths")
   
   # read mortality file
   mort <- read_fwf("data/mort7988.txt", fwf_positions(dict$start, dict$end, cnames))
   
   ## # A tibble: 8,776,385 x 7
   ##    state county  year   age   icd cause deaths
   ##    <chr>  <chr> <int> <chr> <chr> <chr>  <int>
   ##  1    01    001  1979    04  5789   780      1
   ##  2    01    001  1979    04  7980   770      1
   ##  3    01    001  1979    08  8121   800      1
   ##  4    01    001  1979    09  3439   780      1
   ##  5    01    001  1979    09  8120   800      2
   ##  6    01    001  1979    09  8189   800      1
   ##  7    01    001  1979    10  1629   180      1
   ##  8    01    001  1979    10  2396   250      1
   ##  9    01    001  1979    10  4289   410      1
   ## 10    01    001  1979    10  8070   810      1
   ## # ... with 8,776,375 more rows
   
       # year distribution
       table(mort$year)
    
    ## 
    ##   1979   1980   1981   1982   1983   1984   1985   1986   1987   1988 
    ## 831605 854860 854198 850505 867280 875607 894176 905736 912551 929867
    
       # number of deaths
       sum(mort$deaths)
    
    ## [1] 20398153
    

Hopefully, this might save you some time!

Last Update: 06/29/2017




blog comments powered by Disqus