Re: SELECTing every Nth record for better performance

Tom wrote:
> 1. taking the primary key (which is an auto increasing integer) and
> running modulo on it. This will not work in my case as the database is
> being used for a number of different logging 'runs', and these runs
> are not necessarily equally distributed over the primary keys.

Problem #1 with what you're trying to do is that it's tricky to get SQL to have a notion of "row number" in a result set so that you can then filter on that number. The best approach to this is PostgreSQL 8.4, where the SQL window functions can be used for exactly this purpose.
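Something along these lines is what I have in mind (untested, and "samples", "run_id", and "ts" are made-up names standing in for whatever your logging table actually looks like):

    -- Number the rows of each logging run in timestamp order,
    -- then keep every 100th one.
    SELECT *
      FROM (SELECT s.*,
                   row_number() OVER (PARTITION BY run_id ORDER BY ts) AS rn
              FROM samples s) numbered
     WHERE rn % 100 = 1;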

> 2. Subqueries that do SELECT COUNT to generate a row number are too
> slow as I'm dealing with thousands to tens of thousands of records.

Problem #2 is that if you're looking at only a mildly filtered version of your data, you're going to pull in the whole set anyway. Random data point in this area: if you run a query that needs to look at 20% of a table using an index, what will happen when you execute it? The correct answer is likely "sequential scan of the entire table", because that's actually more efficient than trying to grab only some of the records once the percentage gets large enough. Given the way multiple records get packed onto a single page, you're likely to read every page of the data anyway even when trying to grab a subset of the rows, unless that subset is very small relative to the whole and you can traverse an index usefully.
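You can watch the planner make that decision with something like this (made-up names again; the exact cutoff depends on your data and settings):

    -- A predicate matching roughly 20% of the table usually plans as a
    -- sequential scan even when an index on ts exists.
    EXPLAIN SELECT * FROM samples WHERE ts > '2009-06-01';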

Once the "n" in your zoom gets large enough, it's possible to make this worthwhile. Unless your records are really wide, I would guess it would take a 1000:1 reduction or more before you'd end up with a query that's truly cheaper than scanning the whole set. And that wouldn't look anything like what you're trying to do right now: you'd instead have to know the bounds of the data set, generate a sequence of points within that range, and then grab the record best matching each one to get a useful zoomed-out subset. Basically, determine where the records you need should be, then go into the data set to fetch just those using something like "WHERE ts > x LIMIT 1"; that's the only way to avoid scanning the whole thing.
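A rough sketch of that idea, with the same made-up names and assuming an index on ts (8.4's generate_series over timestamps makes the list of probe points easy to produce):

    -- One probe per minute across the run's time range; each correlated
    -- subquery uses the ts index to grab the first record at or after
    -- the probe point.
    SELECT (SELECT s.id
              FROM samples s
             WHERE s.ts >= p.point
             ORDER BY s.ts
             LIMIT 1) AS id
      FROM generate_series('2009-06-01'::timestamptz,
                           '2009-06-02'::timestamptz,
                           interval '1 minute') AS p(point);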

I don't think this whole approach is ever likely to converge on what you want. The direction I think you should be going is to consider whether it's possible to create materialized views of your data that summarize it at wider time scales. You can't usefully compute such a thing in real time without reading the whole data set, and once you accept that, you might as well figure out how to compute the summarized version only once. The last comment in this thread as I write this, from Grzegorz, suggests one approach for something like that.
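On 8.4 that just means a summary table you maintain yourself; a sketch, again with placeholder names, and the aggregates depend on what your log records actually hold:

    -- One row per run per minute; rebuild or append to this as data arrives.
    CREATE TABLE samples_by_minute AS
    SELECT run_id,
           date_trunc('minute', ts) AS minute,
           count(*) AS samples,
           avg(value) AS avg_value
      FROM samples
     GROUP BY run_id, date_trunc('minute', ts);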

--
Greg Smith    2ndQuadrant   Baltimore, MD
PostgreSQL Training, Services and Support
greg@xxxxxxxxxxxxxxx  www.2ndQuadrant.com

