Re: SELECTing every Nth record for better performance

Tom wrote:
> 1. taking the primary key (which is an auto increasing integer) and
> running modulo on it. This will not work in my case as the database is
> being used for a number of different logging 'runs', and these runs
> are not necessarily equally distributed over the primary keys.

Problem #1 with what you're trying to do is that it's tricky to get SQL to have a notion of "row number" in a result set so that you can then filter on that number. The best approach to this is PostgreSQL 8.4, where the SQL window functions can be used for exactly this purpose.
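Something along these lines is what I have in mind (untested, and "samples", "run_id", and "ts" are made-up names standing in for whatever your logging table actually looks like):

    -- Number the rows of each logging run in timestamp order,
    -- then keep every 100th one.
    SELECT *
      FROM (SELECT s.*,
                   row_number() OVER (PARTITION BY run_id ORDER BY ts) AS rn
              FROM samples s) numbered
     WHERE rn % 100 = 1;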

> 2. Subqueries that do SELECT COUNT to generate a row number are too
> slow as I'm dealing with thousands to tens of thousands of records.

Problem #2 is that if you're looking at only a mildly filtered version of your data, you're going to pull in the whole set anyway. Random data point in this area: if you run a query that needs to look at 20% of a table using an index, what will happen when you execute it? The correct answer is likely "sequential scan of the entire table", because that's actually more efficient than trying to grab only some of the records once the percentage gets large enough. Given the way multiple records get packed onto a single page, you're likely to read every page of the data anyway even when trying to grab a subset of the rows, unless that subset is very small relative to the whole and you can traverse an index usefully.
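You can watch the planner make that decision with something like this (made-up names again; the exact cutoff depends on your data and settings):

    -- A predicate matching roughly 20% of the table usually plans as a
    -- sequential scan even when an index on ts exists.
    EXPLAIN SELECT * FROM samples WHERE ts > '2009-06-01';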

Once the "n" in your zoom gets large enough, it's possible to make this worthwhile. Unless your records are really wide, I would guess it would take a 1000:1 reduction or more before you'd end up with a query that's truly cheaper than scanning the whole set. And that wouldn't look anything like what you're trying to do right now: you'd instead have to know the bounds of the data set, generate a sequence of points within that range, and then grab the record best matching each one to get a useful zoomed-out subset. Basically, determine where the records you need should be, then go into the data set to fetch just those using something like "WHERE ts > x LIMIT 1"; that's the only way to avoid scanning the whole thing.
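A rough sketch of that idea, with the same made-up names and assuming an index on ts (8.4's generate_series over timestamps makes the list of probe points easy to produce):

    -- One probe per minute across the run's time range; each correlated
    -- subquery uses the ts index to grab the first record at or after
    -- the probe point.
    SELECT (SELECT s.id
              FROM samples s
             WHERE s.ts >= p.point
             ORDER BY s.ts
             LIMIT 1) AS id
      FROM generate_series('2009-06-01'::timestamptz,
                           '2009-06-02'::timestamptz,
                           interval '1 minute') AS p(point);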

I don't think this whole approach is ever likely to converge on what you want. The direction I think you should be going is to consider whether it's possible to create materialized views of your data that summarize it at wider time scales. You can't usefully compute such a thing in real time without reading the whole data set, and once you accept that, you might as well figure out how to compute the summarized version only once. The last comment in this thread as I write this, from Grzegorz, suggests one approach for something like that.
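On 8.4 that just means a summary table you maintain yourself; a sketch, again with placeholder names, and the aggregates depend on what your log records actually hold:

    -- One row per run per minute; rebuild or append to this as data arrives.
    CREATE TABLE samples_by_minute AS
    SELECT run_id,
           date_trunc('minute', ts) AS minute,
           count(*) AS samples,
           avg(value) AS avg_value
      FROM samples
     GROUP BY run_id, date_trunc('minute', ts);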

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@xxxxxxxxxxxxxxx  www.2ndQuadrant.com

