I have ~500,000 data files each containing ~1,000 records that I want to
put into a database for easy access.
A fictitious example for illustration: a file with meteorological data
from a given station.
stat_id | yr | d_o_y | hr | mn | tmp | wind
---------|------|-------|----|----|------|------
78 | 2001 | 211 | 14 | 0 | 15.3 | 4.7
78 | 2001 | 211 | 14 | 1 | 15.4 | 5.4
78 | 2001 | 211 | 14 | 2 | 15.4 | 5.0
.. | .... | ... | .. | . | .... | ...
78 | 2001 | 211 | 14 | 24 | 15.8 | 2.7
Some parameters are constant within any one file, e.g. the station id.
Others are constant within a varying percentage of the files (hr in
~58%, d_o_y in ~98%, yr in ~99.995%, due to the oddly chosen 25 minutes
per file).
It seems like a table of file summaries (one row per file) would be
very useful. Some of the columns could be:
file | st_id | yr | d_o_y | avg_tmp | std_dev_tmp
-------------|-------|------|-------|---------|--------------
78_f6548.dat | 78 | 2001 | 211 | 15.5 | 0.24
57_f4296.dat | 57 | 2000 | -1 | 8.3 | 0.11
... | ... | ... | ... | ... | ...
(A d_o_y of -1 indicates an odd file that spans midnight.)
To store the detailed records, the SQL novice would construct one table
per file and exclude any constant columns, since these are redundant
(they are given in the summary table). These detailed tables would then
have different column schemas: (mn,tmp,wind) for ~58%, (hr,mn,tmp,wind)
for ~40%, (d_o_y,hr,mn,tmp,wind) for ~2%, and (yr,d_o_y,hr,mn,tmp,wind)
for ~0.005%.
SQL will discourage this novice. To enable simple queries, the data
tables must be combined. Say I want to find all summer data with a high
temperature standard deviation and low wind. If I made the ultimate
table for easy queries, containing all the columns of both the detailed
data and the file summary, I could find the data like this:
SELECT file, std_dev_tmp, tmp, wind FROM huge_table
WHERE std_dev_tmp > 0.2 AND d_o_y > 150 AND d_o_y < 240 AND wind < 4.0;
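To make the redundancy concrete, here is a toy sketch of the huge_table approach, written with Python's sqlite3 for brevity rather than PostgreSQL (the column names are from the example above; the data values are made up):

```python
# Toy sketch of the denormalized huge_table approach (sqlite3 used for
# brevity; the real database would be PostgreSQL). Every detail row
# repeats the per-file summary columns, which is the redundancy at issue.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE huge_table (
        file        TEXT,     -- repeated on every row of the file
        st_id       INTEGER,  -- repeated on every row of the file
        yr          INTEGER,
        d_o_y       INTEGER,
        hr          INTEGER,
        mn          INTEGER,
        tmp         REAL,
        wind        REAL,
        std_dev_tmp REAL      -- per-file statistic, repeated on every row
    )
""")

# A few rows from two files (values invented for illustration).
rows = [
    ("78_f6548.dat", 78, 2001, 211, 14, 0, 15.3, 4.7, 0.24),
    ("78_f6548.dat", 78, 2001, 211, 14, 1, 15.4, 3.2, 0.24),
    ("57_f4296.dat", 57, 2000, 100,  9, 0,  8.3, 1.2, 0.11),
]
conn.executemany("INSERT INTO huge_table VALUES (?,?,?,?,?,?,?,?,?)", rows)

hits = conn.execute("""
    SELECT file, std_dev_tmp, tmp, wind FROM huge_table
    WHERE std_dev_tmp > 0.2 AND d_o_y > 150 AND d_o_y < 240 AND wind < 4.0
""").fetchall()
print(hits)  # [('78_f6548.dat', 0.24, 15.4, 3.2)]
```

With ~500,000 files of ~1,000 records each, that is roughly 500,000,000 rows, each carrying the repeated per-file columns.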
The ability to make this easy query comes at the cost of a ridiculous
amount of redundant data in huge_table. To remove everything that is
already given by the summary table, I must split the huge table into a
number of tables, one for each of the column schemas mentioned above
(in my case, many more than in the example). That is difficult to
build, and there are no easy queries anyway: I'd need to write a C
program to launch the many queries that would, in effect, return the
same data as the single query above. This defeats the purpose of easy,
efficient access.
This novice must be missing a sneaky way to avoid massive redundancy
while still maintaining easy access. It has been suggested that I look
at inheritance and foreign keys. I don't see how to use foreign keys
here, but I could make an inheritance hierarchy of the split-up tables
to make the parent table's columns (mn, tmp, wind in the example above)
available across tables. However, accessing the rest of the columns
still requires child-table-specific queries, and my many child tables
will not be nicely nested as in the example above, so I still don't see
how inheritance can get me there. I could really use some input.
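For comparison, my current (possibly wrong) understanding of the foreign-key suggestion is sketched below, again in Python/sqlite3 and with invented table names: one shared detail table for all files, referencing the summary table by file name, so each constant column is stored only once per file. A join then reproduces the single easy query, though the file key itself is still stored on every detail row:

```python
# Sketch of a foreign-key layout (invented names, sqlite3 for brevity):
# the constant columns live once per file in file_summary, and one shared
# detail table references it by file name.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE file_summary (          -- one row per file
        file        TEXT PRIMARY KEY,
        st_id       INTEGER,
        yr          INTEGER,
        d_o_y       INTEGER,
        avg_tmp     REAL,
        std_dev_tmp REAL
    );
    CREATE TABLE detail (                -- one row per record, all files
        file TEXT REFERENCES file_summary(file),  -- the foreign key
        hr   INTEGER,
        mn   INTEGER,
        tmp  REAL,
        wind REAL
    );
""")
conn.execute(
    "INSERT INTO file_summary VALUES ('78_f6548.dat', 78, 2001, 211, 15.5, 0.24)")
conn.executemany("INSERT INTO detail VALUES (?,?,?,?,?)",
                 [("78_f6548.dat", 14, 0, 15.3, 4.7),
                  ("78_f6548.dat", 14, 1, 15.4, 3.2)])

# The single easy query, now a join instead of one huge table:
hits = conn.execute("""
    SELECT d.file, s.std_dev_tmp, d.tmp, d.wind
    FROM detail d JOIN file_summary s ON d.file = s.file
    WHERE s.std_dev_tmp > 0.2 AND s.d_o_y > 150 AND s.d_o_y < 240
      AND d.wind < 4.0
""").fetchall()
print(hits)  # [('78_f6548.dat', 0.24, 15.4, 3.2)]
```

Is something along these lines what was meant, and does it scale to my ~500,000 files?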
One thought: PostgreSQL adds a system column, tableoid, to every table,
and within a table it holds a constant value. Is that value really
stored 1,000 times in a 1,000-row table? Or...?
(Apologies for the length of this post)