While between major projects at work, I have been looking for various things to investigate and learn. IT is a field that requires constant learning to keep one's skills fresh and relevant!
I decided to look for large, publicly available data sets that would be useful for analysis. My plan was to first fully normalize the data set (if it was not already in at least third normal form) and then denormalize it into a star schema for use in data warehousing.
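To make the normalize-then-denormalize idea concrete, here is a minimal sketch in Python with SQLite. The tables and figures are hypothetical stand-ins, not actual Census entities: a small normalized pair of tables is flattened into a star-schema-style fact table with a geography dimension.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Normalized (3NF-style) source tables: population counts reference
# a geography row via a foreign key, so each geography fact is stored once.
cur.executescript("""
CREATE TABLE geography (
    geo_id INTEGER PRIMARY KEY,
    state  TEXT NOT NULL,
    county TEXT NOT NULL
);
CREATE TABLE population_count (
    geo_id    INTEGER NOT NULL REFERENCES geography(geo_id),
    age_group TEXT NOT NULL,
    persons   INTEGER NOT NULL
);
""")
cur.executemany("INSERT INTO geography VALUES (?, ?, ?)",
                [(1, "NY", "Kings"), (2, "NY", "Queens")])
cur.executemany("INSERT INTO population_count VALUES (?, ?, ?)",
                [(1, "0-17", 600000), (1, "18+", 1900000),
                 (2, "0-17", 500000), (2, "18+", 1700000)])

# Denormalize into a star schema: a fact table surrounded by
# dimension tables, ready for warehouse-style aggregation.
cur.executescript("""
CREATE TABLE dim_geography AS SELECT * FROM geography;
CREATE TABLE fact_population AS
SELECT g.geo_id, p.age_group, p.persons
FROM population_count p
JOIN geography g ON g.geo_id = p.geo_id;
""")

# A typical warehouse query against the star schema: persons per county.
rows = cur.execute("""
SELECT d.county, SUM(f.persons)
FROM fact_population f
JOIN dim_geography d ON d.geo_id = f.geo_id
GROUP BY d.county
ORDER BY d.county
""").fetchall()
print(rows)  # [('Kings', 2500000), ('Queens', 2200000)]
```

In a real Oracle warehouse the same shape would apply, just with many more dimensions (age, sex, housing type, and so on) hanging off the fact table.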
There are various data sets out there, but the most interesting to me was the United States Census. In my searching I did not find other data sets of comparable size that were easily imported into a relational database. I am sure others could point out many data sets I overlooked, but this was a good place to start.
What surprised me as I looked into the Census data was how far it is from third normal form, or from any truly relational format. It is really a bunch of scattered files of varying types. What I wanted was an Entity Relationship Diagram showing the exact entities, their columns and fields, and the relationships between each entity. I then expected to be able to download each of these tables in a CSV file (or something similar), upload them column-for-column and row-for-row into a database (Oracle was my preferred choice in this case), and already be pretty close to being able to denormalize selected portions of the data. I didn't find anything even close to this!
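The column-for-column, row-for-row load I was hoping for can be sketched in a few lines. This assumes a hypothetical CSV whose columns already match a target table; the file contents and table below are illustrative, not an actual Census extract.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE county (state TEXT, county TEXT, population INTEGER)")

# Hypothetical CSV content standing in for a downloaded Census file;
# in practice this would come from open("some_census_table.csv").
csv_text = "state,county,population\nNY,Kings,2500000\nNY,Queens,2200000\n"
reader = csv.reader(io.StringIO(csv_text))
next(reader)  # skip the header row
cur.executemany("INSERT INTO county VALUES (?, ?, ?)", reader)

count = cur.execute("SELECT COUNT(*) FROM county").fetchone()[0]
print(count)  # 2
```

When the published files map one-to-one onto well-defined tables, the entire import is this mechanical; the frustration with the Census data is that no such mapping is provided.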
Here is the closest thing to an ERD that I could find: http://www.census.gov/geo/reference/pdfs/geodiagram.pdf
Keep in mind that for now I was only interested in the 2010 census data, not previous censuses (which in the US happen every ten years) and not the American Community Survey data, which is collected yearly. So we are not even getting into the complexity of longitudinally comparing censuses from different years; we are just looking at the 2010 census.
After speaking with a friend of mine who is an economics professor, it seems likely to me that the data was organized and uploaded by people who are primarily economists and statisticians, not data modelers. I say this because my friend wasn't aware of what data normalization was, never mind referential integrity or check constraints. Not knowing at least what the normal forms are and why they are useful poses problems when you try to work with larger data sets.
This will be an ongoing project, and I plan to upload all documents and scripts that I have created in this process for others to use and verify.