On Thu, Jan 5, 2012 at 10:19 PM, Darren Duncan <darren@xxxxxxxxxxxxxxxx> wrote:
This all being said, 23K values per row just sounds wrong, and I can't imagine any census forms having that many details.
Do you, by chance, have multiple values of the same type that are in different fields, eg telephone_1, telephone_2 or child_1, child_2 etc? You should take any of those and collect them into array-typed fields, or separate tables with just telephone or child columns. Or do you say have a set of coordinates in separate fields? Or you may have other kinds of redundancy within single rows that are best normalized into separate rows.
With 23K values, these probably have many mutual associations, and you could split that table into a bunch of other ones where columns that relate more closely together are collected.
My question is already answered, so this is mostly for anyone curious about Census data, and if anyone wants to add to this, feel free.
You're right that no census form has that many details. The American Community Survey replaced the old Census long form, so it does have more details than the form sent out for the 2010 Census, but not 23,000 questions. It might ask, e.g. income, sex, and how you travel to work. But the forms are private, so that data (the so-called microdata) is not released in its entirety. What I am working with is called the summary file, which presents the data in aggregate. That means you have an income table with 16 income classes, plus a total column. Then you have 9 more tables which show the same income classes by 9 racial and ethnic categories, for a total of 153 more columns. Then you also have a table which crosses 9 income classes by 5 mobility statuses (living in same house, moved within county, moved within state, etc.) for a total of 55 columns. Then you have income classes crossed with sex, income classes crossed with mode of transportation to work, sex crossed with mode of transportation to work, etc. When all is said and done, you have 23,000 columns.
Believe me, I would all love to be working directly with the microdata. All of these different ways of slicing and dicing the categories are basically how the Census Bureau provides as much detail as possible without providing so much detail that privacy would be compromised (i.e., enough information is available that specific individuals could be identified). That inevitably leads to a great deal of redundancy in the data, since the same individuals are just being counted in different groups in different tables.
Given all that, one could still take the data that came from Census and try to normalize it and organize it, but my main goal given the size of the dataset is to keep it as similar as possible to the source, so that a researcher familiar with the Census data can work with our database using the Census documentation without our having to produce a new set of documentation. The way I had done that was to store the sequences (which are merely a data dissemination convenience, and have no relationship to the logic of the data) in Postgres tables, and make the subject tables (which do represent logical groupings of data) into views. I'm thinking about making the sequences into array columns. The subject tables would still be views.
--Lee
You're right that no census form has that many details. The American Community Survey replaced the old Census long form, so it does have more details than the form sent out for the 2010 Census, but not 23,000 questions. It might ask, e.g. income, sex, and how you travel to work. But the forms are private, so that data (the so-called microdata) is not released in its entirety. What I am working with is called the summary file, which presents the data in aggregate. That means you have an income table with 16 income classes, plus a total column. Then you have 9 more tables which show the same income classes by 9 racial and ethnic categories, for a total of 153 more columns. Then you also have a table which crosses 9 income classes by 5 mobility statuses (living in same house, moved within county, moved within state, etc.) for a total of 55 columns. Then you have income classes crossed with sex, income classes crossed with mode of transportation to work, sex crossed with mode of transportation to work, etc. When all is said and done, you have 23,000 columns.
Believe me, I would all love to be working directly with the microdata. All of these different ways of slicing and dicing the categories are basically how the Census Bureau provides as much detail as possible without providing so much detail that privacy would be compromised (i.e., enough information is available that specific individuals could be identified). That inevitably leads to a great deal of redundancy in the data, since the same individuals are just being counted in different groups in different tables.
Given all that, one could still take the data that came from Census and try to normalize it and organize it, but my main goal given the size of the dataset is to keep it as similar as possible to the source, so that a researcher familiar with the Census data can work with our database using the Census documentation without our having to produce a new set of documentation. The way I had done that was to store the sequences (which are merely a data dissemination convenience, and have no relationship to the logic of the data) in Postgres tables, and make the subject tables (which do represent logical groupings of data) into views. I'm thinking about making the sequences into array columns. The subject tables would still be views.
--Lee
Lee Hachadoorian
PhD, Earth & Environmental Sciences (Geography)
Research Associate, CUNY Center for Urban Research
http://freecity.commons.gc.cuny.edu/