Hi All,

I have a large quantity of temporal data, 6 billion rows, which I would like to put into a table so I can exploit SQL datetime queries. Each row represents a geophysical observation at a particular time and place. The data is effectively read-only, i.e. updates will be very infrequent. The rows are very 'narrow' (~24 bytes of data per row).

When I ingested the data into PostgreSQL one row at a time, I discovered that the per-row overhead is significant (pg 8.3.7). The projected resources required to host this table prohibit that simple approach. To reduce the cost of the row overhead, I tried storing a whole minute's worth of data in an array, so that I only require one row per minute. The total row count decreased by a factor of 60, and the resources required became realistic. My schema is thus:

    CREATE TABLE geodata1sec (
        obstime        TIMESTAMP WITHOUT TIME ZONE NOT NULL,
        statid         SMALLINT NOT NULL,
        geovalue_array REAL[3][60] NOT NULL
    );

and after ingesting, I add these indexes:

    ALTER TABLE geodata1sec ADD PRIMARY KEY (obstime, statid);
    CREATE INDEX geodata1sec_statid_idx ON geodata1sec (statid);

Storing whole minutes in a row, with the data in an array, has the desired effect of making the table size on disk, and the index size in memory, manageable. However, my queries now need to be aware of this schema design decision. The following query runs nice and quick, but obviously doesn't return all the relevant results (because second resolution is specified):

    EXPLAIN ANALYZE SELECT * FROM geodata1sec
    WHERE obstime BETWEEN '2004-10-21 02:03:04' AND '2004-10-21 02:04:08';

                                  QUERY PLAN
    --------------------------------------------------------------------------
     Index Scan using geodata1sec_pkey on geodata1sec  (cost=0.00..38.19 rows=12 width=762) (actual time=0.071..0.148 rows=13 loops=1)
       Index Cond: ((obstime >= '2004-10-21 02:03:04'::timestamp without time zone) AND (obstime <= '2004-10-21 02:04:08'::timestamp without time zone))
     Total runtime: 0.292 ms
    (3 rows)

...
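(For a rough sense of where the overhead went: each heap tuple in 8.3 carries roughly a 23-byte header plus a 4-byte line pointer, so for ~24-byte rows the overhead exceeds the data. The actual sizes are best checked with the admin functions; the following is a sketch, not measured output:)

    -- Back-of-envelope: per-tuple overhead (~23-byte header + 4-byte line
    -- pointer, before alignment padding) roughly doubles each 24-byte row.
    -- Check the real on-disk cost of the packed table and its indexes:
    SELECT pg_size_pretty(pg_relation_size('geodata1sec'))       AS heap_size,
           pg_size_pretty(pg_total_relation_size('geodata1sec')) AS heap_plus_indexes;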
So, I constructed a view which presents my data as I originally intended. This also means that I don't have to give my applications detailed knowledge of the schema. The view is:

    CREATE VIEW geodataview AS
    SELECT obstime + (s.a*5 || ' seconds')::INTERVAL AS obstime,
           statid,
           geovalue_array[s.a+1][1] AS x_mag,
           geovalue_array[s.a+1][2] AS y_mag,
           geovalue_array[s.a+1][3] AS z_mag
    FROM generate_series(0, 11) AS s(a), geodata1sec;

Now my query returns _all_ the relevant data. However, it takes a long time. If I analyse the query I get:

    EXPLAIN ANALYZE SELECT * FROM geodataview
    WHERE obstime BETWEEN '2004-10-21 02:03:04' AND '2004-10-21 02:04:08';

                                  QUERY PLAN
    --------------------------------------------------------------------------
     Nested Loop  (cost=13.50..2314276295.50 rows=4088000000 width=766) (actual time=2072612.668..3081010.104 rows=169 loops=1)
       Join Filter: (((geodata1sec.obstime + ((((s.a * 5))::text || ' seconds'::text))::interval) >= '2004-10-21 02:03:04'::timestamp without time zone) AND ((geodata1sec.obstime + ((((s.a * 5))::text || ' seconds'::text))::interval) <= '2004-10-21 02:04:08'::timestamp without time zone))
       ->  Seq Scan on geodata1sec  (cost=0.00..4556282.00 rows=36792000 width=762) (actual time=17.072..414620.213 rows=36791999 loops=1)
       ->  Materialize  (cost=13.50..23.50 rows=1000 width=4) (actual time=0.002..0.027 rows=12 loops=36791999)
             ->  Function Scan on generate_series s  (cost=0.00..12.50 rows=1000 width=4) (actual time=0.075..0.102 rows=12 loops=1)
     Total runtime: 3081010.613 ms
    (6 rows)

This is clearly not going to perform well enough for any practical application. However, it struck me that others might have needed similar functionality for time data, so I thought I would air my experience here. Is it feasible to modify the query planner to make better decisions when dealing with time data behind a view? Are there any alternatives to vanilla PostgreSQL for storing this type of data? I'm imagining something like PostGIS, but for time-based data.
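(One workaround sketch, untested: the planner can't push the range condition through the computed obstime expression, but a redundant condition on the base table's raw obstime column, widened to cover the minute containing the range start, would let it use the primary-key index, with the exact per-second filter applied on top:)

    -- Untested sketch: constrain the stored minute-resolution obstime so the
    -- (obstime, statid) index is usable, then re-apply the second-resolution
    -- filter on the computed timestamp. date_trunc widens the lower bound to
    -- the start of its minute so no samples are missed.
    SELECT obstime + (s.a*5 || ' seconds')::INTERVAL AS obstime,
           statid,
           geovalue_array[s.a+1][1] AS x_mag,
           geovalue_array[s.a+1][2] AS y_mag,
           geovalue_array[s.a+1][3] AS z_mag
    FROM generate_series(0, 11) AS s(a), geodata1sec
    WHERE geodata1sec.obstime
              BETWEEN date_trunc('minute', TIMESTAMP '2004-10-21 02:03:04')
                  AND TIMESTAMP '2004-10-21 02:04:08'
      AND geodata1sec.obstime + (s.a*5 || ' seconds')::INTERVAL
              BETWEEN '2004-10-21 02:03:04' AND '2004-10-21 02:04:08';

The awkward part is that the widened bounds can't live inside the view definition, since they depend on the query's own range; the application (or a set-returning function taking the two timestamps as parameters) would have to supply them.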
Your time and thoughts are appreciated,

Cheers,
Richard

--
Scanned by iCritical.

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance