Re: Window functions and index usage

Anssi Kääriäinen <anssi.kaariainen@xxxxxx> · Tue, 4 Oct 2011 17:06:45 +0300

On 10/04/2011 04:27 PM, Robert Klemme wrote:
On Tue, Oct 4, 2011 at 11:39 AM, Anssi Kääriäinen
<anssi.kaariainen@xxxxxx>  wrote:
I have the following setup:

create table test(id integer, seq integer);
insert into test select generate_series(0, 100), generate_series(0, 1000);
create unique index test_idx on test(id, seq);
analyze test;

Now I try to fetch the latest 5 values per id, ordered by seq from the
table:

select * from (
select id, seq, row_number() over (partition by id order by seq)
  from test
  where id in (1, 2, 3)
) where row_number()<= 5;
It seems this fetches the *first* 5 values per id - and not the
latest.  For that you would need to "order by seq desc" in the window
and probably also in the index.

Yeah. Sorry, wrong order. And the last line is wrong, it should be ") 
tmp where row_number <= 5;".
This does not use the index on id, seq for sorting the data. It uses a
bitmap index scan, and sequential scan when issued SET enable_bitmapscan to
false. Tested both on git head and 8.4.8. See end of post for plans. It
seems it would be possible to fetch the first values per id using the index,
or at least skip the sorting.
Just guessing: since row_number is an analytic function and it can be
combined with any type of windowing only in "rare" cases do the
ordering of index columns and the ordering in the window align.  So
while your particular use case could benefit from this optimization
the overall judgement might be that it's not worthwhile to make the
optimizer more complex to cover this case - or I fail to see the more
general pattern here. :-)

I think there are common use cases for this - see end of message for an 
example.
Is there some other way to run the query so that it would use the index? Is
there plans to support the index usage for the above query assuming that it
is possible to use the index for that query?

The real world use case would be to fetch latest 5 threads per discussion
forum in one query. Or fetch 3 latest comments for all patches in given
commit fest in single query.
Is it really that realistic that someone wants the latest n entries
for *all* threads / patches?  It seems since this can result in a very
large data set this is probably not the type of query which is done
often.
The idea is that the dataset isn't that large. And you don't have to 
fetch them for all threads / patches. You might fetch them only for 
patches in currently viewed commit fest. See 
https://commitfest.postgresql.org/action/commitfest_view?id=12 for one 
such use. What I have in mind is fetching first all the patches in the 
commit fest in one go. Then issue query which would look something like:
 select * from
    (select comment_data, row_number() over (partition by patch_id 
order by comment_date desc)
       from patch_comments
     where patch_id in (list of patch_ids fetched in first query)
   ) tmp where row_number <= 3;

Now you have all the data needed for the first column in the above 
mentioned page.

I guess the commit fest application is fetching all the comments for the 
patches in the commit fest in one query, and then doing the limit in 
application code. This is fine because there aren't that many comments 
per patch. But if you have dozens of forums and thousands of threads per 
forum you can't do that.

This is useful in any situation where you want to show n latest entries 
instead of the last entry in a list view. Latest modifications to an 
object, latest commits to a file, latest messages to a discussion thread 
or latest payments per project. Or 5 most popular videos per category, 
10 highest paid employees per department and so on.

 - Anssi Kääriäinen

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance