Re: Finding latest record for a number of groups in an INSERT-only table

Gavin Flower <GavinFlower@xxxxxxxxxxxxxxxxx> · Tue, 05 Jul 2011 21:45:44 +1200

On 05/07/11 11:48, Daniel Farina wrote:
This is basically exactly the same as
http://archives.postgresql.org/pgsql-sql/2008-10/msg00009.php; I'm
just asking again, to see if thinking on the problem has changed:

The basic problem, restated, is one has a relation with tuples like this:

(key, recency_stamp, other_columns*)

Some example data may be:

'ice cream', 0, 'vanilla'
'ice cream', 1, 'chocolate'
'beer', 0, 'ale'
'beer', 3, 'lager'

The goal is to extract from this relation the following:

'ice cream', '1', 'chocolate'
'beer', 3, 'lager'

In my dataset, the distinct number of keys is<<  the number of rows;
in fact, that ratio grows ever smaller until the data is truncated,
with a convergence point probably resting at a negligible fraction of
a percent -- definitely less than 0.1% in short order, easily getting
to 0.01% and 0.001% in not a vast amount of time.  There is a standard
btree index on (key, recency_stamp), which ideally is exploitable to
compute the 'most recent' relation very quickly.  Approaches I've
investigated include DISTINCT ON (this may be the cleanest), windowing
functions (rank, and a predicate), and a GROUP BY with a self-join,
but none of them avoid a full relation scan, which is just crippling
performance-wise.

The only way I could find to trick the optimizer into doing something
non-crippling was to trick it with
http://wiki.postgresql.org/wiki/Loose_indexscan to find distinct
values quickly and subsequently using a subquery in the target list
that the optimizer decides to run over every distinct value (it
estimates a very high cost for this, even though it is actually rather
fast, but the ridiculous costing can subsequent joins against the
relation in unfortunate ways).  This mechanism is obscure enough that
I may just write a plpgsql function as a workaround, as that may well
be more lucid.

I am using PostgreSQL 9.

Does anyone have fresh thoughts or suggestions for dealing with
INSERT-mostly tables conceived in this manner?

My solution, which is not especially optimised for an 'insert only' 
table, is:

DROP TABLE IF EXISTS T;

CREATE TABLE T
(
    type    text,
    id      int,
    flavour text,
    PRIMARY kEY (type, id)
);

INSERT INTO T (type, id, flavour) VALUES
    ('ice cream', 0, 'vanilla'),
    ('ice cream', 1, 'chocolate'),
    ('beer', 0, 'ale'),
    ('beer', 3, 'lager');

WITH
    M (type, id) AS
    (
        SELECT
            type,
            max(id)
        FROM
            T
        GROUP BY
            type
    )
SELECT
    T.type,
    T.id,
    T.flavour
FROM
    M,
    T
WHERE
    T.type = M.type AND
    T.id = M.id
;

Cheers,
Gavin

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general