Hi Richard,

> CREATE VIEW geodataview AS SELECT obstime + (s.a*5 || '
> seconds')::INTERVAL AS obstime, statid, geovalue_array[s.a+1][1] AS
> x_mag, geovalue_array[s.a+1][2] AS y_mag, geovalue_array[s.a+1][3] AS
> z_mag FROM generate_series(0, 11) AS s(a), geodata1sec;

To my (admittedly untrained) eye, it seems that the JOIN that will
implicitly happen (generate_series(0, 11) against geodata1sec) will be
over all records in geodata1sec, and the EXPLAIN ANALYZE of the view you
posted seems to corroborate that. (I suspect the JOIN also defeats the
time filter on geodata1sec, which would make things worse.)

> EXPLAIN ANALYZE SELECT * FROM geodataview WHERE obstime BETWEEN
> '2004-10-21 02:03:04' AND '2004-10-21 02:04:08';
>
> Nested Loop  (cost=13.50..2314276295.50 rows=4088000000 width=766)
>   (actual time=2072612.668..3081010.104 rows=169 loops=1)
>   Join Filter: (((geodata1sec.obstime + ((((s.a * 5))::text || '
>   seconds'::text))::interval) >= '2004-10-21 02:03:04'::timestamp without
>   time zone) AND ((geodata1sec.obstime + ((((s.a * 5))::text || '
>   seconds'::text))::interval) <= '2004-10-21 02:04:08'::timestamp without
>   time zone))
>   ->  Seq Scan on geodata1sec  (cost=0.00..4556282.00 rows=36792000
>       width=762) (actual time=17.072..414620.213 rows=36791999 loops=1)

The seq scan should return only 12 rows (as per your original EXPLAIN
ANALYZE output), but actually returns 37 million.

> This is clearly not going to perform for any practical applications.
> However, it struck me that others might have needed similar
> functionality for time data so I thought I would air my experience here.
>
> Is it feasible to modify the query planner to make better decisions
> when dealing with time data behind a view?

You could use table partitioning and split your geodata1sec table into
(say) one table per hour, each of which then holds far fewer records to
JOIN against (with PG 8.3.7 you need to explicitly enable
constraint_exclusion in the config file for this to work).
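For concreteness, partitioning on 8.3 is done with table inheritance plus
CHECK constraints; a minimal sketch (the partition and index names here are
made up, only obstime comes from your schema):

```sql
-- One child table per hour; the CHECK constraint is what lets the
-- planner skip partitions when constraint_exclusion is on.
CREATE TABLE geodata1sec_2004102102 (
    CHECK (obstime >= TIMESTAMP '2004-10-21 02:00:00'
       AND obstime <  TIMESTAMP '2004-10-21 03:00:00')
) INHERITS (geodata1sec);

CREATE INDEX geodata1sec_2004102102_obstime_idx
    ON geodata1sec_2004102102 (obstime);

-- In postgresql.conf (required on 8.3 for partition pruning):
--   constraint_exclusion = on
```

You'd also need a trigger or application logic to route inserts into the
right child table; queries against the parent then only touch the
partitions whose CHECK constraints overlap the WHERE clause.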
You could change the view to be a stored procedure instead, but I'm
guessing you don't want to (or cannot) change the application that makes
the query.

You could also change the view to call a stored procedure that does, in
essence,

    for i in (0..11) { query geodata1sec for t+i; }

and returns the resulting record set, which might be faster.

If you're dealing with mostly invariant-after-insert data, you can use
partitioning and then CLUSTER any tables that won't be touched again on an
appropriate column, so that the seq scan (if there is one) is faster, and
VACUUM ANALYZE each table once it has been clustered.

> Are there any alternatives to vanilla Postgresql for storing this type
> of data? I'm imagining PostGIS but for time based data?

I recently had to deal with something similar (though not on your scale)
for network monitoring - the thread is available at
http://archives.postgresql.org/pgsql-performance/2009-08/msg00275.php

Cheers,
Hrishi

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance
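P.S. A sketch of the loop-per-offset set-returning function I had in mind,
in PL/pgSQL; the function name is made up and the column types (statid
integer, magnitudes double precision) are assumptions, since only the
column names appear in your view definition:

```sql
CREATE OR REPLACE FUNCTION geodata_expand(
    t_from timestamp, t_to timestamp,
    OUT obstime timestamp, OUT statid integer,
    OUT x_mag double precision, OUT y_mag double precision,
    OUT z_mag double precision)
RETURNS SETOF record AS $$
BEGIN
    FOR i IN 0..11 LOOP
        -- Shift the time filter back onto the raw obstime column, so an
        -- index on geodata1sec(obstime) can be used instead of a seq scan.
        FOR obstime, statid, x_mag, y_mag, z_mag IN
            SELECT g.obstime + (i * 5 || ' seconds')::interval,
                   g.statid,
                   g.geovalue_array[i+1][1],
                   g.geovalue_array[i+1][2],
                   g.geovalue_array[i+1][3]
            FROM geodata1sec g
            WHERE g.obstime BETWEEN t_from - (i * 5 || ' seconds')::interval
                                AND t_to   - (i * 5 || ' seconds')::interval
        LOOP
            RETURN NEXT;
        END LOOP;
    END LOOP;
    RETURN;
END;
$$ LANGUAGE plpgsql;

-- Called like:
--   SELECT * FROM geodata_expand('2004-10-21 02:03:04',
--                                '2004-10-21 02:04:08');
```

The point of the rewrite is that each of the 12 inner queries filters on
the bare obstime column, which the planner can satisfy with an index scan,
rather than on the computed expression that defeats the index in the view.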