
Re: SQL - planet redundant data

Thank you for your input! Individual replies follow below.

##############################

Chris Travers wrote:

Ok.  Imagine two huge huge tables:

file_data    and additional_data

create table file_data (
   file_id serial primary key,
   station_id text,
   ....
);

create table additional_data (
   data_id bigserial primary key,
   file_id int REFERENCES file_data (file_id),
   temp smallint,
   ....
);

So the 500 million entries in additional_data.file_id are now restricted to the group of values found in file_data.file_id. But I don't see how this rids me of redundant data - I still have 500M entries. Consider the column "year": my 500M records come from 13 years. Storing 500M values of which only 13 are distinct is very redundant.


Also you can normalize your way out of the redundant data problem.

I can see how to do this for filename and station name - I save something by storing an integer rather than a text string. But all the rest of my columns are already smallints. Do I save anything by storing "5" rather than "1997"?
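For what it's worth, here's a minimal sketch of that normalization idea, using Python's sqlite3 module rather than PostgreSQL (all table and column names are made up for illustration). It shows the case where normalization does pay off: a repeated text string replaced by a small integer key. For a column that is already a smallint, the same trick buys nothing per row, as noted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Lookup table: each distinct station name is stored exactly once.
cur.execute("CREATE TABLE stations (station_id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
# The big table stores only the integer key, not the repeated text.
cur.execute("CREATE TABLE file_data (file_id INTEGER PRIMARY KEY, "
            "station_id INTEGER REFERENCES stations(station_id))")

for name in ["barrow", "fairbanks", "barrow", "barrow"]:
    cur.execute("INSERT OR IGNORE INTO stations (name) VALUES (?)", (name,))
    sid = cur.execute("SELECT station_id FROM stations WHERE name = ?",
                      (name,)).fetchone()[0]
    cur.execute("INSERT INTO file_data (station_id) VALUES (?)", (sid,))

# Four file rows, but only two stored copies of the station text.
n_stations = cur.execute("SELECT COUNT(*) FROM stations").fetchone()[0]
n_files = cur.execute("SELECT COUNT(*) FROM file_data").fetchone()[0]
```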


I would *highly* recommend waiting until 8.1 goes live to implement this in production.

It will be built/rebuilt after 8.1 is released.

##############################

John D. Burger wrote:

Why not effectively append all of your per-file tables into one huge table, with an additional column indicating which file the data comes from? Then you have only two tables, one with a row for each observation in your data, one with a row for each file. Some queries need to join the two tables, but that's not a big deal.

That big table with a row for each observation will have to include all columns, such as year and station_id. For such columns I store 500M values of which only ~15 are distinct in my case - exactly the redundancy I'm trying to minimize. By splitting into smaller tables, many columns would contain only one distinct value. It *should* be possible to delete such columns and instead store their corresponding values in a "table header". If I had one table per file, the table you suggest with one row per file could effectively serve as the collection of table headers.
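The "table header" idea above can be sketched as two tables: per-file constants live in a one-row-per-file table, and a join reattaches them to the observations. A toy version using Python's sqlite3 (hypothetical names and values throughout):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# "Header" table: one row per file carries the columns that are constant
# within that file (year, station), so the big table need not repeat them.
cur.executescript("""
CREATE TABLE files (file_id INTEGER PRIMARY KEY, year INTEGER, station TEXT);
CREATE TABLE observations (obs_id INTEGER PRIMARY KEY,
                           file_id INTEGER REFERENCES files(file_id),
                           temp INTEGER);
""")
cur.execute("INSERT INTO files VALUES (1, 1997, 'barrow')")
cur.executemany("INSERT INTO observations (file_id, temp) VALUES (?, ?)",
                [(1, t) for t in (-30, -28, -31)])

# A join recovers year/station for every observation without storing
# them once per observation row.
rows = cur.execute("""
    SELECT f.year, f.station, o.temp
    FROM observations o JOIN files f USING (file_id)
    ORDER BY o.obs_id
""").fetchall()
```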


It also seems to me that you may be tying your schema design too closely to the current way that the data is represented. Do you really need to have the data file figure so prominently in your design?

The archive of data files is shared by a number of research groups around the world. There's a whole software package that people use for data analysis, accessing the data files as they are. So I expect a lot of file-specific queries.

Hmm, in fact if the redundant values you're worried about come in long stretches (e.g., temperature is the same for many observations in a row), I suppose you could do the same thing - map a constant value to the range of observation IDs for which it holds. This gets back to having many tables, though.

This is in effect normalization? But if the observation ID takes just as much storage as the original value, have we gained anything? BTW, I'm not aiming at redundancy in the measurements - it is minimal compared to that in e.g. year and station ID. I do have an idea of how it *should* be possible to get rid of much of it - read the response to Tom Lane below.
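The range-mapping idea can be made concrete: the gain comes from storing one (start, end, value) triple per run rather than one ID per row, so it only pays when runs are long. A sketch with sqlite3 (invented table names, two runs of three rows each):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Run-length style side table: a constant value mapped to the range of
# observation IDs for which it holds.
cur.executescript("""
CREATE TABLE obs (obs_id INTEGER PRIMARY KEY);
CREATE TABLE year_runs (start_id INTEGER, end_id INTEGER, year INTEGER);
""")
cur.executemany("INSERT INTO obs (obs_id) VALUES (?)",
                [(i,) for i in range(1, 7)])
# Two triples replace six per-row year values.
cur.executemany("INSERT INTO year_runs VALUES (?, ?, ?)",
                [(1, 3, 1996), (4, 6, 1997)])

# A range join recovers the year for each observation.
rows = cur.execute("""
    SELECT o.obs_id, r.year
    FROM obs o JOIN year_runs r
      ON o.obs_id BETWEEN r.start_id AND r.end_id
    ORDER BY o.obs_id
""").fetchall()
```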

##############################

Tom Lane wrote:

No, tableoid is sort of a virtual column ... it doesn't exist on disk.
When you query it you get a value fetched from the internal data
structure representing the table.

So virtual columns are possible - THIS is a way to clear out redundant data! Is it possible for a user to create a virtual column? If not, adding that ability would be a big improvement.

What I really need are "partial virtual columns". I'm imagining an alternative version of VACUUM ANALYZE that could do the following:

1) Order the rows in the table so that for each column, identical values are placed next to each other for as far as possible (the row order that optimizes one column will probably not be optimal for the others).

2) For each column, identify the stretches that contain only one distinct value. Save that value together with the IDs of the start and end rows, then delete the stretch.

It is not obvious how to do a perfect job of step 1), at least not to me - I'm sure a skilled mathematician would know exactly how. But here's a simple approach that would get us part of the way:

1.1) Grab the column with the most redundancy (fewest distinct values) and sort it into groups according to the distinct values.

1.2) Within each of these groups, grab the column with the next most redundancy and sort into subgroups. And so on, stopping whenever the groups become so small that there's nothing left to gain.

Such an analysis would make it much less expensive to combine same-schema tables, and having everything in one table is really convenient. It would obviously save a lot of storage space, and I imagine it would enable more efficient queries too - checking just 3 values instead of the thousands (or even millions) they may replace must give a considerable gain.
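The stepwise ordering above can be sketched in a few lines of Python: sort the columns by ascending number of distinct values, sort the rows by that column order, then run-length encode each column. This is only an illustration of the idea with made-up data, not anything PostgreSQL does:

```python
from itertools import groupby

# Toy data: a handful of rows with redundant year/station values.
rows = [
    {"year": 1997, "station": 1, "temp": -30},
    {"year": 1996, "station": 2, "temp": -30},
    {"year": 1997, "station": 1, "temp": -28},
    {"year": 1996, "station": 1, "temp": -31},
]

# Step 1.1/1.2: order columns by redundancy (fewest distinct values first),
# then sort rows on that column order so identical values form long runs.
cols = sorted(rows[0], key=lambda c: len({r[c] for r in rows}))
ordered = sorted(rows, key=lambda r: [r[c] for c in cols])

# Step 2: each run could then be stored as a single (value, length) entry
# instead of one value per row.
def run_length(col):
    return [(v, len(list(g))) for v, g in groupby(r[col] for r in ordered)]

year_runs = run_length("year")
temp_runs = run_length("temp")
```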

What is the big benefit of not having ordered rows? I imagine unordered rows are an advantage for dynamic databases, but for a static database like mine, which won't be modified except for maybe adding new data once a year, I imagine an optimization including row ordering could be highly beneficial.

##############################

Jim C. Nasby wrote:

What you seem to be looking for is a form of partitioning. PostgreSQL
doesn't currently support partitioning of this form, but there's work in
progress to change that.

Any idea how far in the future this is? Would it make the optimization process described above (in the reply to Tom Lane) obsolete? Maybe my ideas about an optimal solution just illustrate a lack of knowledge about SQL, but I'm hoping somebody can see what I'm trying to suggest.

As someone else mentioned, you could do it with a union all view.
http://cvs.distributed.net/viewcvs.cgi/stats-sql/logdb/ has an example
of this.

Thank you - it does look as if some UNION ALL views could come in handy.
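For readers following along, the UNION ALL view approach looks roughly like this (sketched with sqlite3 and invented per-year tables; the linked stats-sql example is the real reference). The constant column is implied by which table a row lives in, and the view reattaches it as a literal:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Per-year tables: the year column is not stored per row at all.
cur.executescript("""
CREATE TABLE obs_1996 (obs_id INTEGER PRIMARY KEY, temp INTEGER);
CREATE TABLE obs_1997 (obs_id INTEGER PRIMARY KEY, temp INTEGER);
INSERT INTO obs_1996 (temp) VALUES (-31), (-30);
INSERT INTO obs_1997 (temp) VALUES (-28);

-- The view presents the partitions as one table, supplying the
-- "virtual" year column as a constant per branch.
CREATE VIEW obs AS
    SELECT 1996 AS year, temp FROM obs_1996
    UNION ALL
    SELECT 1997 AS year, temp FROM obs_1997;
""")

rows = cur.execute("SELECT year, temp FROM obs ORDER BY year, temp").fetchall()
```

A query that filters on year (WHERE year = 1997) can then, with a planner that understands such views, touch only the matching table.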

##############################

Thanks and best regards,

Poul Jensen

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

              http://archives.postgresql.org
