For the final project in Prof. Jake Porway’s Data Without Borders: Data Science in the Service of Humanity class at NYU-ITP, I chose to work with the US veterans and beneficiaries gravesite datasets that have been published at data.gov. The Department of Veterans Affairs started working on a nationwide gravesite locator in 2004, which is what made this data available.
Gathering the Datasets
Unfortunately, the data is aggregated only at the state level, but at least it is updated regularly. I ended up pulling 51 .csv (comma-separated values) files from the site, all with a publish date of October 2012. The columns found in the data:
d_first_name, d_mid_name, d_last_name, d_suffix, d_birth_date, d_death_date, section_id, row_num, site_num, cem_name, cem_addr_one, cem_addr_two, city, state, zip, cem_url, cem_phone, relationship, v_first_name, v_mid_name, v_last_name, v_suffix, branch, rank, war
Since we had to apply what we’d learned about the R language to our datasets, I hoped I could use the gravesite data, which goes back to the 1800s or even earlier, to see how the places where veterans end up being buried correlate with national population trends over time. In other words, if many Americans are buried in California, does that mean more veterans are buried there too?
I figured that, since there were columns for branch, rank, and war, I’d be able to find some logical correlations: in past wars, many privates and junior-enlisted sergeants would have died, while fewer senior-enlisted personnel and officers would have. I also figured that a d_death_year value (the year taken from d_death_date) might correlate with the dates of the US’s multiple wars, with deaths elevated during those periods.
I guessed that loading all 51 datasets would begin to fill up my system RAM (4GB on this MacBook Air). Look at my memory usage!
I skipped the files for the US-owned territories and “foreign addresses”, since I wouldn’t be able to find normalized population data for those. I also cleaned up the dataset so that it kept only the veterans themselves, not their beneficiaries, who may also have been listed in the data as deceased.
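As a rough sketch of that filtering step in R, the snippet below keeps only the rows whose relationship column marks the veteran. The toy data frame and the label "Veteran (Self)" are assumptions for illustration only; the real files may use a different value, so it is worth running unique() on the relationship column first.

```r
# Toy stand-in for one state's file; the rows are made up for illustration.
graves <- data.frame(
  d_last_name  = c("Smith", "Smith", "Jones"),
  relationship = c("Veteran (Self)", "Wife", "Veteran (Self)"),
  stringsAsFactors = FALSE
)

# ASSUMPTION: this is the label that marks the veteran's own record.
veteran_label <- "Veteran (Self)"

# Keep only veteran rows; rows with a missing relationship are dropped too.
is_veteran <- !is.na(graves$relationship) & graves$relationship == veteran_label
veterans_only <- graves[is_veteran, ]

nrow(veterans_only)
```

The same pattern would work for dropping unwanted files' rows by the state column, if everything were loaded into a single data frame.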
Loading the Data into R
Given that I’m not very comfortable with R, I started out by loading just the Washington DC dataset, since it only has 986 entries. Problem? The download for the file didn’t work: “ngl_washington%20dc.csv” was not found. %20 is the URL-encoded representation of a space. Luckily, getting rid of the %20 revealed the proper filename, “ngl_washingtondc.csv”. I also found that the .csv files would not import into RStudio right away; I’d get something similar to an uneven-rows error. For each state’s .csv file, I had to open it in Excel first and then save it from Excel, which formatted the file properly so that it could be imported into RStudio.
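Two small R sketches related to the problems above. The first shows what %20 decodes to, using URLdecode() from the utils package. The second is one possible way to find out which rows of a file have an unexpected number of fields before importing it, using count.fields(). The ragged three-line file is made up for illustration; the actual cause of the import error in the real files is not known, so this is a diagnostic idea rather than a replacement for the Excel round trip.

```r
# 1. %20 is a percent-encoded space.
decoded <- URLdecode("ngl_washington%20dc.csv")
decoded  # "ngl_washington dc.csv"

# 2. Diagnose uneven rows in a made-up file (the last row is one field short).
ragged <- c(
  "d_first_name,d_last_name,state",
  "John,Smith,DC",
  "Jane,Doe"
)
path <- tempfile(fileext = ".csv")
writeLines(ragged, path)

# Count the comma-separated fields on every line, header included.
field_counts <- count.fields(path, sep = ",", quote = "\"", comment.char = "")

# Line numbers whose field count differs from the header's.
bad_lines <- which(field_counts != field_counts[1])
bad_lines

# read.csv() pads short rows with NA (fill = TRUE is its default).
dc <- read.csv(path, stringsAsFactors = FALSE)
```

Knowing the offending line numbers makes it possible to inspect those rows directly with readLines() instead of re-saving every file by hand.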
The next step was to break up d_death_date (which came in a variety of formats, such as “1993”, “9/3/98”, and “07/11/1864”) so that I could extract the year. I had to check the number of characters in the string, then figure out whether the year was 2 digits or 4. If it was 4 digits, I knew the year for sure (e.g. “2008”). If it was 2 digits and less than “15”, I figured it probably referred to the year 2000 or later (lazy data entry). If it was higher than “15”, I assumed it was probably the 20th century. Finally, I had to convert the result from a string to a numeric so I could do math on it.
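The year-extraction rules above can be sketched in R roughly as follows. This is a minimal illustration, not the code actually used for the project: the function name extract_death_year and the cutoff argument are hypothetical, it takes the text after the last "/" as the year, and it treats a 2-digit year of exactly 15 as 20th century, which the description above leaves open.

```r
# Hypothetical helper: pull a numeric year out of mixed-format date strings.
extract_death_year <- function(dates, cutoff = 15) {
  dates <- trimws(as.character(dates))

  # Keep whatever follows the last "/" (or the whole string if there is none).
  year_text <- sub("^.*/", "", dates)
  n_chars   <- nchar(year_text)

  # Convert from string to numeric; anything non-numeric becomes NA.
  year_num <- suppressWarnings(as.numeric(year_text))

  ifelse(
    n_chars == 4, year_num,                    # 4 digits: the year is certain
    ifelse(
      n_chars == 2,
      ifelse(year_num < cutoff,
             2000 + year_num,                  # e.g. "05" -> 2005
             1900 + year_num),                 # e.g. "98" -> 1998
      NA_real_                                 # anything else: unknown
    )
  )
}

years <- extract_death_year(c("1993", "9/3/98", "07/11/1864", "1/2/05"))
years  # 1993 1998 1864 2005
```

One limitation of the 2-digit rule is that a death in, say, 1898 entered as “98” would be read as 1998, so the results for older graves deserve a sanity check against d_birth_date.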