Re: SQL - planet redundant data

>> Hmm, in fact if the redundant values you're worried about come in long stretches (e.g., temperature is the same for many observations in a row), I suppose you could do the same thing - map a constant value to the range of observation IDs for which it holds. This gets back to having many tables, though.

> This is in effect normalization? But if the observation ID takes just as much storage as the original value, have we gained anything? BTW, I'm not aiming at redundancy in the measurements - this is minimal compared to e.g. year and station ID.

I don't think this is exactly what people meant by normalization, but maybe. My basic thought was that, since you have some redundancy in your data, you might want to use some form of compression. One of the simplest forms of compression is called run-length encoding (http://en.wikipedia.org/wiki/Run_length_encoding). So you'd have most of your data in a main table:

  create table observations (
    obsID        integer primary key,    -- Maybe a BIGINT
    temperature  float
    -- etc. (other per-observation columns)
  );

and some other "compressed" tables for those features that have long runs of repetitive values:

  create table obsYears (
    startObs  integer  primary key  references observations (obsID),
    endObs    integer               references observations (obsID),
    year      integer
  );

  create table obsStations (
    startObs   integer  primary key  references observations (obsID),
    endObs     integer               references observations (obsID),
    stationID  integer
  );

(Caution, I haven't checked these for syntax.) I've introduced an observation ID, and then I have "compressed" tables that map =ranges= of these IDs to values that are constant for long stretches. Each year occupies only one row, same with each station. (I think your reply to Tom may have been getting at something like this.) Now you can do queries like this, say, for temperature statistics in a particular year:

select avg(temperature), stddev(temperature) from observations, obsYears
    where obsID between startObs and endObs
    and year = 2001;

You could join in other compressed tables in the same way. In fact, you could glue them all together with a VIEW, and you'd be able to treat the whole thing like one giant table, with much of the redundancy removed. Note that if you define indexes on the startObs and endObs columns, Postgresql might avoid scanning through the compressed tables every time you do a query. You might also benefit from a composite index on (startObs, endObs). For features like year, which are massively repetitive, this might even be faster than storing the feature in the main table, since the compressed table will easily fit in memory.
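
Concretely, the indexes and the view might look something like this (again untested, and the index and view names are just illustrative):

  create index obsYears_range on obsYears (startObs, endObs);
  create index obsStations_range on obsStations (startObs, endObs);

  create view allObservations as
    select o.obsID, o.temperature, y.year, s.stationID
      from observations o
      join obsYears y    on o.obsID between y.startObs and y.endObs
      join obsStations s on o.obsID between s.startObs and s.endObs;

After that, "select avg(temperature) from allObservations where year = 2001" reads as if everything were stored in one wide table.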

So the basic idea is run-length encoding for repetitive values. I think I can use this in some of my own data - I don't know why I never thought of it before.
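
For what it's worth, populating a compressed table from an uncompressed one is simple when each value really does occupy one contiguous block of IDs (rawObservations below is just a hypothetical name for the original table):

  insert into obsYears (startObs, endObs, year)
    select min(obsID), max(obsID), year
      from rawObservations
     group by year;

If the same value can recur in separate stretches, you'd need to find the run boundaries first, e.g. with a procedural pass over the data.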

- John D. Burger
  MITRE


