
Re: SQL - planet redundant data

Thank you for your input! Individual replies follow below.

##############################

Chris Travers wrote:

Ok.  Imagine two huge huge tables:

file_data    and additional_data

create table file_data (
   file_id serial primary key,
   station_id text,
   ....
);

create table additional_data (
   data_id bigserial primary key,
   file_id int REFERENCES file_data (file_id),
   temp smallint,
   ....
);

So the 500 million entries in additional_data.file_id are now restricted to the group of values found in file_data.file_id. But I don't see how this rids me of redundant data - I still have 500M entries. Consider the column "year": my 500M records come from 13 years. Storing 500M values of which only 13 are distinct is very redundant.


Also you can normalize your way out of the redundant data problem.

I can see how to do this for filename and station name - I save something by storing an integer rather than a text string. But all the rest of my columns are already smallints. Do I save anything by storing "5" rather than "1997"?
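For what it's worth, here's a minimal sketch of that normalization idea, using Python's sqlite3 module rather than PostgreSQL (all table and column names are made up for illustration). It shows the case where normalization does pay off: a repeated text string replaced by a small integer key. For a column that is already a smallint, the same trick buys nothing per row, as noted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Lookup table: each distinct station name is stored exactly once.
cur.execute("CREATE TABLE stations (station_id INTEGER PRIMARY KEY, name TEXT UNIQUE)")
# The big table stores only the integer key, not the repeated text.
cur.execute("CREATE TABLE file_data (file_id INTEGER PRIMARY KEY, "
            "station_id INTEGER REFERENCES stations(station_id))")

for name in ["barrow", "fairbanks", "barrow", "barrow"]:
    cur.execute("INSERT OR IGNORE INTO stations (name) VALUES (?)", (name,))
    sid = cur.execute("SELECT station_id FROM stations WHERE name = ?",
                      (name,)).fetchone()[0]
    cur.execute("INSERT INTO file_data (station_id) VALUES (?)", (sid,))

# Four file rows, but only two stored copies of the station text.
n_stations = cur.execute("SELECT COUNT(*) FROM stations").fetchone()[0]
n_files = cur.execute("SELECT COUNT(*) FROM file_data").fetchone()[0]
```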


I would *highly* recommend waiting until 8.1 goes live to implement this in production.

It will be built/rebuilt after 8.1 is released.

##############################

John D. Burger wrote:

Why not effectively append all of your per-file tables into one huge table, with an additional column indicating which file the data comes from? Then you have only two tables, one with a row for each observation in your data, one with a row for each file. Some queries need to join the two tables, but that's not a big deal.

That big table with a row for each observation will have to include all columns, such as year and station_id. For such columns I store 500M values of which only ~15 are distinct in my case - exactly the redundancy I'm trying to minimize. By splitting into smaller tables, many columns would contain only one distinct value. It *should* be possible to delete such columns and instead store their corresponding values in a "table header". If I had one table per file, the table you suggest with one row per file could effectively serve as the collection of table headers.
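The "table header" idea above can be sketched as two tables: per-file constants live in a one-row-per-file table, and a join reattaches them to the observations. A toy version using Python's sqlite3 (hypothetical names and values throughout):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# "Header" table: one row per file carries the columns that are constant
# within that file (year, station), so the big table need not repeat them.
cur.executescript("""
CREATE TABLE files (file_id INTEGER PRIMARY KEY, year INTEGER, station TEXT);
CREATE TABLE observations (obs_id INTEGER PRIMARY KEY,
                           file_id INTEGER REFERENCES files(file_id),
                           temp INTEGER);
""")
cur.execute("INSERT INTO files VALUES (1, 1997, 'barrow')")
cur.executemany("INSERT INTO observations (file_id, temp) VALUES (?, ?)",
                [(1, t) for t in (-30, -28, -31)])

# A join recovers year/station for every observation without storing
# them once per observation row.
rows = cur.execute("""
    SELECT f.year, f.station, o.temp
    FROM observations o JOIN files f USING (file_id)
    ORDER BY o.obs_id
""").fetchall()
```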


It also seems to me that you may be tying your schema design too closely to the current way that the data is represented. Do you really need to have the data file figure so prominently in your design?

The archive of data files is shared by a number of research groups around the world. There's a whole software package that people use for data analysis, accessing the data files as they are. So I expect a lot of file-specific queries.

Hmm, in fact if the redundant values you're worried about come in long stretches (e.g., temperature is the same for many observations in a row), I suppose you could do the same thing - map a constant value to the range of observation IDs for which it holds. This gets back to having many tables, though.

This is in effect normalization? But if the observation ID takes just as much storage as the original value, have we gained anything? BTW, I'm not aiming at redundancy in the measurements - it is minimal compared to that in e.g. year and station ID. I do have an idea of how it *should* be possible to get rid of much of it - read the response to Tom Lane below.
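The range-mapping idea can be made concrete: the gain comes from storing one (start, end, value) triple per run rather than one ID per row, so it only pays when runs are long. A sketch with sqlite3 (invented table names, two runs of three rows each):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Run-length style side table: a constant value mapped to the range of
# observation IDs for which it holds.
cur.executescript("""
CREATE TABLE obs (obs_id INTEGER PRIMARY KEY);
CREATE TABLE year_runs (start_id INTEGER, end_id INTEGER, year INTEGER);
""")
cur.executemany("INSERT INTO obs (obs_id) VALUES (?)",
                [(i,) for i in range(1, 7)])
# Two triples replace six per-row year values.
cur.executemany("INSERT INTO year_runs VALUES (?, ?, ?)",
                [(1, 3, 1996), (4, 6, 1997)])

# A range join recovers the year for each observation.
rows = cur.execute("""
    SELECT o.obs_id, r.year
    FROM obs o JOIN year_runs r
      ON o.obs_id BETWEEN r.start_id AND r.end_id
    ORDER BY o.obs_id
""").fetchall()
```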

##############################

Tom Lane wrote:

No, tableoid is sort of a virtual column ... it doesn't exist on disk.
When you query it you get a value fetched from the internal data
structure representing the table.

So virtual columns are possible - THIS is a way to clear out redundant data! Is it possible for a user to create a virtual column? If not, adding that ability would be a big improvement.

What I really need are "partial virtual columns". I'm imagining an alternative version of VACUUM ANALYZE that could do the following:

1) Order the rows in the table so that for each column, identical values are placed next to each other for as far as possible (the row order that optimizes one column will probably not be optimal for the others).

2) For each column, identify the stretches that contain only one distinct value. Save that value together with the IDs of the start and end rows, then delete the stretch.

It is not obvious how to do a perfect job of step 1), at least not to me - I'm sure a skilled mathematician would know exactly how. But here's a simple approach that would get us part of the way:

1.1) Grab the column with the most redundancy (fewest distinct values) and sort it into groups according to the distinct values.

1.2) Within each of these groups, grab the column with the next most redundancy and sort into subgroups. And so on, stopping whenever the groups become so small that there's nothing left to gain.

Such an analysis would make it much less expensive to combine same-schema tables, and having everything in one table is really convenient. It would obviously save a lot of storage space, and I imagine it would enable more efficient queries too - checking just 3 values instead of the thousands (or even millions) they may replace must give a considerable gain.
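The stepwise ordering above can be sketched in a few lines of Python: sort the columns by ascending number of distinct values, sort the rows by that column order, then run-length encode each column. This is only an illustration of the idea with made-up data, not anything PostgreSQL does:

```python
from itertools import groupby

# Toy data: a handful of rows with redundant year/station values.
rows = [
    {"year": 1997, "station": 1, "temp": -30},
    {"year": 1996, "station": 2, "temp": -30},
    {"year": 1997, "station": 1, "temp": -28},
    {"year": 1996, "station": 1, "temp": -31},
]

# Step 1.1/1.2: order columns by redundancy (fewest distinct values first),
# then sort rows on that column order so identical values form long runs.
cols = sorted(rows[0], key=lambda c: len({r[c] for r in rows}))
ordered = sorted(rows, key=lambda r: [r[c] for c in cols])

# Step 2: each run could then be stored as a single (value, length) entry
# instead of one value per row.
def run_length(col):
    return [(v, len(list(g))) for v, g in groupby(r[col] for r in ordered)]

year_runs = run_length("year")
temp_runs = run_length("temp")
```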

What is the big benefit of not having ordered rows? I imagine unordered rows are an advantage for dynamic databases, but for a static database like mine, which won't be modified except for maybe adding new data once a year, I imagine an optimization including row ordering could be highly beneficial.

##############################

Jim C. Nasby wrote:

What you seem to be looking for is a form of partitioning. PostgreSQL
doesn't currently support partitioning of this form, but there's work in
progress to change that.

Any idea how far in the future this is? Would it make the optimization process described above (in the reply to Tom Lane) obsolete? Maybe my ideas about an optimal solution just illustrate a lack of knowledge about SQL, but I'm hoping somebody can see what I'm trying to suggest.

As someone else mentioned, you could do it with a union all view.
http://cvs.distributed.net/viewcvs.cgi/stats-sql/logdb/ has an example
of this.

Thank you - it does look as if some UNION ALL views could come in handy.
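For readers following along, the UNION ALL view approach looks roughly like this (sketched with sqlite3 and invented per-year tables; the linked stats-sql example is the real reference). The constant column is implied by which table a row lives in, and the view reattaches it as a literal:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Per-year tables: the year column is not stored per row at all.
cur.executescript("""
CREATE TABLE obs_1996 (obs_id INTEGER PRIMARY KEY, temp INTEGER);
CREATE TABLE obs_1997 (obs_id INTEGER PRIMARY KEY, temp INTEGER);
INSERT INTO obs_1996 (temp) VALUES (-31), (-30);
INSERT INTO obs_1997 (temp) VALUES (-28);

-- The view presents the partitions as one table, supplying the
-- "virtual" year column as a constant per branch.
CREATE VIEW obs AS
    SELECT 1996 AS year, temp FROM obs_1996
    UNION ALL
    SELECT 1997 AS year, temp FROM obs_1997;
""")

rows = cur.execute("SELECT year, temp FROM obs ORDER BY year, temp").fetchall()
```

A query that filters on year (WHERE year = 1997) can then, with a planner that understands such views, touch only the matching table.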

##############################

Thanks and best regards,

Poul Jensen

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

              http://archives.postgresql.org
