To store the detailed records the SQL novice would construct one table pr. file and exclude any constant columns since these are redundant (given in summary table). These detailed tables would then have different column schemas - (mn,tmp,wind) for ~58%, (hr,mn,tmp,wind) for ~40%, (d_o_y,hr,mn,tmp,wind) for ~2% and (yr,d_o_y,hr,mn,tmp,wind) for ~0.005%.
I don't understand this at all - why would you have one table per file? Why not effectively append all of your per-file tables into one huge table, with an additional column indicating which file the data comes from? Then you have only two tables, one with a row for each observation in your data, one with a row for each file. Some queries need to join the two tables, but that's not a big deal.
There may be a lot of redundancy in your data, but that's life. If it's not deterministic redundancy (e.g., file ID => station ID), then I would just live with it.
It also seems to me that you may be tying your schema design too closely to the current way that the data is represented. Do you really need to have the data file figure so prominently in your design? If you do need to keep track of which file an observation came from, I might have one huge table of observations, each with an observation ID, and have another table that maps filenames to ranges of observation IDs (filename, startObs, endObs). A range query could then quickly find which table is responsible for an observation ID.
Hmm, in fact if the redundant values you're worried about come in long stretches (e.g., temperature is the same for many observations in a row), I suppose you could do the same thing - map a constant value to the range of observation IDs for which it holds. This gets back to having many tables, though.
- John Burger MITRE ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match