Re: SQL - planet redundant data

"John D. Burger" <john@xxxxxxxxx> · Fri, 9 Sep 2005 09:06:17 -0400

To store the detailed records the SQL novice would construct one table 
pr. file and exclude any constant columns since these are redundant 
(given in summary table). These detailed tables would then have 
different column schemas - (mn,tmp,wind) for ~58%, (hr,mn,tmp,wind) 
for ~40%, (d_o_y,hr,mn,tmp,wind) for ~2% and (yr,d_o_y,hr,mn,tmp,wind) 
for ~0.005%.

I don't understand this at all - why would you have one table per file? 
 Why not effectively append all of your per-file tables into one huge 
table, with an additional column indicating which file the data comes 
from?  Then you have only two tables, one with a row for each 
observation in your data, one with a row for each file.  Some queries 
need to join the two tables, but that's not a big deal.

There may be a lot of redundancy in your data, but that's life.  If 
it's not deterministic redundancy (e.g., file ID => station ID), then I 
would just live with it.

It also seems to me that you may be tying your schema design too 
closely to the current way that the data is represented.  Do you really 
need to have the data file figure so prominently in your design?  If 
you do need to keep track of which file an observation came from, I 
might have one huge table of observations, each with an observation ID, 
and have another table that maps filenames to ranges of observation IDs 
(filename, startObs, endObs).  A range query could then quickly find 
which table is responsible for an observation ID.

Hmm, in fact if the redundant values you're worried about come in long 
stretches (e.g., temperature is the same for many observations in a 
row), I suppose you could do the same thing - map a constant value to 
the range of observation IDs for which it holds.  This gets back to 
having many tables, though.

- John Burger
  MITRE

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
      choose an index scan if your joining column's datatypes do not
      match