On Tue, Jul 5, 2011 at 12:48 AM, Daniel Farina <daniel@xxxxxxxxxx> wrote: > This is basically exactly the same as > http://archives.postgresql.org/pgsql-sql/2008-10/msg00009.php; I'm > just asking again, to see if thinking on the problem has changed: > > The basic problem, restated, is one has a relation with tuples like this: > > (key, recency_stamp, other_columns*) > > Some example data may be: > > 'ice cream', 0, 'vanilla' > 'ice cream', 1, 'chocolate' > 'beer', 0, 'ale' > 'beer', 3, 'lager' > > The goal is to extract from this relation the following: > > 'ice cream', '1', 'chocolate' > 'beer', 3, 'lager' > > In my dataset, the distinct number of keys is << the number of rows; > in fact, that ratio grows ever smaller until the data is truncated, > with a convergence point probably resting at a negligible fraction of > a percent -- definitely less than 0.1% in short order, easily getting > to 0.01% and 0.001% in not a vast amount of time. There is a standard > btree index on (key, recency_stamp), which ideally is exploitable to > compute the 'most recent' relation very quickly. Approaches I've > investigated include DISTINCT ON (this may be the cleanest), windowing > functions (rank, and a predicate), and a GROUP BY with a self-join, > but none of them avoid a full relation scan, which is just crippling > performance-wise. > > The only way I could find to trick the optimizer into doing something > non-crippling was to trick it with > http://wiki.postgresql.org/wiki/Loose_indexscan to find distinct > values quickly and subsequently using a subquery in the target list > that the optimizer decides to run over every distinct value (it > estimates a very high cost for this, even though it is actually rather > fast, but the ridiculous costing can subsequent joins against the > relation in unfortunate ways). This mechanism is obscure enough that > I may just write a plpgsql function as a workaround, as that may well > be more lucid. I think its a pretty common requirement and we should be looking to optimize it if it isn't handled well. The only problem is that there is a few ways of writing that in SQL and we would probably need to recognise how to transform between query types to achieve a common form. For example, the post you cite uses a correlated subquery whereas its possible to write it using an IN subselect. -- Simon Riggs http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general