I have ~500,000 data files each containing ~1,000 records that I want to
put into a database for easy access.
A fictitious example for illustration: a file with meteorological data
from a given station.
stat_id | yr | d_o_y | hr | mn | tmp | wind
---------|------|-------|----|----|------|------
78 | 2001 | 211 | 14 | 0 | 15.3 | 4.7
78 | 2001 | 211 | 14 | 1 | 15.4 | 5.4
78 | 2001 | 211 | 14 | 2 | 15.4 | 5.0
.. | .... | ... | .. | . | .... | ...
78 | 2001 | 211 | 14 | 24 | 15.8 | 2.7
Some parameters are constant within any one file, e.g. the station id.
Others are constant within a varying percentage of the files (hr in
~58%, d_o_y in ~98%, yr in ~99.995%, due to the oddly chosen 25 minutes
per file).
It seems like a table of file summaries (one row per file) would be
very useful. Some of the columns could be:
file | st_id | yr | d_o_y | avg_tmp | std_dev_tmp
-------------|-------|------|-------|---------|--------------
78_f6548.dat | 78 | 2001 | 211 | 15.5 | 0.24
57_f4296.dat | 57 | 2000 | -1 | 8.3 | 0.11
... | ... | ... | ... | ... | ...
(A d_o_y of -1 indicates an odd file that spans midnight.)
To store the detailed records, the SQL novice would construct one table
per file and exclude any constant columns, since these are redundant
(they are given in the summary table). These detailed tables would then
have different column schemas: (mn,tmp,wind) for ~58%, (hr,mn,tmp,wind)
for ~40%, (d_o_y,hr,mn,tmp,wind) for ~2%, and (yr,d_o_y,hr,mn,tmp,wind)
for ~0.005%.
SQL will discourage this novice. To enable simple queries, the data
tables must be combined. Say I want to find all summer data with a high
temperature standard deviation and low wind. If I made the ultimate
table for easy queries, containing all the columns of both the detailed
data and the file summary, I could find the data like this:
SELECT file, std_dev_tmp, tmp, wind FROM huge_table
WHERE std_dev_tmp > 0.2 AND d_o_y > 150 AND d_o_y < 240 AND wind < 4.0;
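To make the redundancy concrete, here is a toy sketch of the huge_table approach, written with Python's sqlite3 for brevity rather than PostgreSQL (the column names are from the example above; the data values are made up):

```python
# Toy sketch of the denormalized huge_table approach (sqlite3 used for
# brevity; the real database would be PostgreSQL). Every detail row
# repeats the per-file summary columns, which is the redundancy at issue.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE huge_table (
        file        TEXT,     -- repeated on every row of the file
        st_id       INTEGER,  -- repeated on every row of the file
        yr          INTEGER,
        d_o_y       INTEGER,
        hr          INTEGER,
        mn          INTEGER,
        tmp         REAL,
        wind        REAL,
        std_dev_tmp REAL      -- per-file statistic, repeated on every row
    )
""")

# A few rows from two files (values invented for illustration).
rows = [
    ("78_f6548.dat", 78, 2001, 211, 14, 0, 15.3, 4.7, 0.24),
    ("78_f6548.dat", 78, 2001, 211, 14, 1, 15.4, 3.2, 0.24),
    ("57_f4296.dat", 57, 2000, 100,  9, 0,  8.3, 1.2, 0.11),
]
conn.executemany("INSERT INTO huge_table VALUES (?,?,?,?,?,?,?,?,?)", rows)

hits = conn.execute("""
    SELECT file, std_dev_tmp, tmp, wind FROM huge_table
    WHERE std_dev_tmp > 0.2 AND d_o_y > 150 AND d_o_y < 240 AND wind < 4.0
""").fetchall()
print(hits)  # [('78_f6548.dat', 0.24, 15.4, 3.2)]
```

With ~500,000 files of ~1,000 records each, that is roughly 500,000,000 rows, each carrying the repeated per-file columns.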
The ability to make this easy query comes at the cost of a ridiculous
amount of redundant data in huge_table. To remove everything that is
already given by the summary table, I must split the huge table into a
number of tables, one for each of the column schemas mentioned above
(in my case, many more than in the example). That is difficult to
build, and there are no easy queries anyway: I'd need to write a C
program to launch the many queries that would, in effect, return the
same data as the single query above. This defeats the purpose of easy,
efficient access.
This novice must be missing a sneaky way to avoid massive redundancy
while still maintaining easy access. It has been suggested that I look
at inheritance and foreign keys. I don't see how to use foreign keys
here, but I could make an inheritance hierarchy of the split-up tables
to make the parent table's columns (mn, tmp, wind in the example above)
available across tables. However, accessing the rest of the columns
still requires child-table-specific queries, and my many child tables
will not be nicely nested as in the example above, so I still don't see
how inheritance can get me there. I could really use some input.
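For comparison, my current (possibly wrong) understanding of the foreign-key suggestion is sketched below, again in Python/sqlite3 and with invented table names: one shared detail table for all files, referencing the summary table by file name, so each constant column is stored only once per file. A join then reproduces the single easy query, though the file key itself is still stored on every detail row:

```python
# Sketch of a foreign-key layout (invented names, sqlite3 for brevity):
# the constant columns live once per file in file_summary, and one shared
# detail table references it by file name.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_summary (          -- one row per file
        file        TEXT PRIMARY KEY,
        st_id       INTEGER,
        yr          INTEGER,
        d_o_y       INTEGER,
        avg_tmp     REAL,
        std_dev_tmp REAL
    );
    CREATE TABLE detail (                -- one row per record, all files
        file TEXT REFERENCES file_summary(file),  -- the foreign key
        hr   INTEGER,
        mn   INTEGER,
        tmp  REAL,
        wind REAL
    );
""")
conn.execute(
    "INSERT INTO file_summary VALUES ('78_f6548.dat', 78, 2001, 211, 15.5, 0.24)")
conn.executemany("INSERT INTO detail VALUES (?,?,?,?,?)",
                 [("78_f6548.dat", 14, 0, 15.3, 4.7),
                  ("78_f6548.dat", 14, 1, 15.4, 3.2)])

# The single easy query, now a join instead of one huge table:
hits = conn.execute("""
    SELECT d.file, s.std_dev_tmp, d.tmp, d.wind
    FROM detail d JOIN file_summary s ON d.file = s.file
    WHERE s.std_dev_tmp > 0.2 AND s.d_o_y > 150 AND s.d_o_y < 240
      AND d.wind < 4.0
""").fetchall()
print(hits)  # [('78_f6548.dat', 0.24, 15.4, 3.2)]
```

Is something along these lines what was meant, and does it scale to my ~500,000 files?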
One thought: PostgreSQL adds a system column, tableoid, to every table,
and within a table it holds a constant value. Is that value really
stored 1,000 times in a 1,000-row table? Or...?
(Apologies for the length of this post)