Re: Slow query: select * order by XXX desc offset 10 limit 10

On 14 Oct 2011, at 13:58, Alexander Farber wrote:

> Hi Alban and others -
> 
> On Fri, Oct 14, 2011 at 1:34 PM, Alban Hertroys <haramrae@xxxxxxxxx> wrote:
>> Anyway, I think you get the sequential scans because the UNION requires sorting all the data from both tables to guarantee that the results are unique (hence that long Sort Key in the explain output).
>> For that reason, an index on qdatetime alone won't help much, especially when most of your data has qdatetime <= now(), which is probably the case.
>> 
>> It doesn't matter that you only want 10 results from that set, the database will first have to figure out which those rows are. That gets more complicated because they can come from two different tables, due to the UNION.
>> 
>> Do you really need unique results from that view, or are duplicates acceptable (one from each table)? In that case, try UNION ALL instead of UNION.
> 
> I don't need unique at all!
> 
> So I've run "explain analyse" on the old view:
> 
> quincy=> explain analyse select to_char(qdatetime, 'YYYY-MM-DD') as
> QDATETIME,ID,NAME,CATEGORY,APPSVERSION,OSVERSION,DETAILS,DEVINFO from
> quincyview where qdatetime <= now() order by QDATETIME desc offset 10
> limit 10;
> 
>                                                              QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=559200.14..559200.16 rows=10 width=1172) (actual time=11311.537..11311.541 rows=10 loops=1)
>    ->  Sort  (cost=559200.11..561534.85 rows=933894 width=1172) (actual time=11311.532..11311.536 rows=20 loops=1)
>          Sort Key: (to_char(quincyview.qdatetime, 'YYYY-MM-DD'::text))
>          Sort Method:  top-N heapsort  Memory: 27kB
>          ->  Subquery Scan quincyview  (cost=482985.36..534349.53 rows=933894 width=1172) (actual time=5778.592..9004.663 rows=934084 loops=1)
>                ->  Unique  (cost=482985.36..522675.85 rows=933894 width=254) (actual time=5777.972..7320.816 rows=934084 loops=1)
>                      ->  Sort  (cost=482985.36..485320.09 rows=933894 width=254) (actual time=5777.969..6557.012 rows=934084 loops=1)
>                            Sort Key: quincynoreset.qdatetime, quincynoreset.id, quincynoreset.name, quincynoreset.category, quincynoreset.appsversion, quincynoreset.osversion, quincynoreset.beta_prog, quincynoreset.catinfo, quincynoreset.details, quincynoreset.devinfo, quincynoreset.email, quincynoreset.emailid, quincynoreset.imei, quincynoreset.pin, quincynoreset.formfactor, quincynoreset.copied
>                            Sort Method:  external merge  Disk: 180992kB
>                            ->  Append  (cost=0.00..55205.73 rows=933894 width=254) (actual time=11.592..2242.501 rows=934084 loops=1)
>                                  ->  Seq Scan on quincynoreset  (cost=0.00..39188.71 rows=808567 width=244) (actual time=11.591..1739.695 rows=808647 loops=1)
>                                        Filter: (qdatetime <= now())
>                                  ->  Seq Scan on quincytrack  (cost=0.00..6678.09 rows=125327 width=315) (actual time=6.801..298.642 rows=125437 loops=1)
>                                        Filter: (qdatetime <= now())
>  Total runtime: 11363.393 ms
> (15 rows)

You can paste those into http://explain.depesz.com/ to get an analysis of where most of the time gets spent in your query. That's often a bit quicker at pointing you to the sore points.

It shows that much of the time in the query is spent on those sorts, and we just eliminated the worst offender ;)

> Then I've dropped and recreated the view with "union all":
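
For reference, the recreated view would look something like this (I'm inferring the column lists from the plans above, so treat it as a sketch):

DROP VIEW quincyview;
CREATE VIEW quincyview AS
    SELECT qdatetime, id, name, category, appsversion, osversion, beta_prog,
           catinfo, details, devinfo, email, emailid, imei, pin, formfactor, copied
      FROM quincynoreset
    UNION ALL
    SELECT qdatetime, id, name, category, appsversion, osversion, beta_prog,
           catinfo, details, devinfo, email, emailid, imei, pin, formfactor, copied
      FROM quincytrack;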

> Then "explain analyse" on the same select query again:
> 
> quincy=> explain analyse select to_char(qdatetime, 'YYYY-MM-DD') as
> QDATETIME,ID,NAME,CATEGORY,APPSVERSION,OSVERSION,DETAILS,DEVINFO from
> quincyview where qdatetime <= now() order by QDATETIME desc offset 10
> limit 10;
>                                                              QUERY PLAN
> ------------------------------------------------------------------------------------------------------------------------------
>  Limit  (cost=73052.13..73052.16 rows=10 width=111) (actual time=3782.645..3782.649 rows=10 loops=1)
>    ->  Sort  (cost=73052.11..75386.84 rows=933894 width=111) (actual time=3782.640..3782.643 rows=20 loops=1)
>          Sort Key: (to_char(quincynoreset.qdatetime, 'YYYY-MM-DD'::text))
>          Sort Method:  top-N heapsort  Memory: 27kB
>          ->  Result  (cost=0.00..48201.53 rows=933894 width=111) (actual time=0.039..2660.561 rows=934084 loops=1)
>                ->  Append  (cost=0.00..45866.79 rows=933894 width=111) (actual time=0.021..1239.916 rows=934084 loops=1)
>                      ->  Seq Scan on quincynoreset  (cost=0.00..39188.71 rows=808567 width=95) (actual time=0.020..916.249 rows=808647 loops=1)
>                            Filter: (qdatetime <= now())
>                      ->  Seq Scan on quincytrack  (cost=0.00..6678.09 rows=125327 width=215) (actual time=0.030..125.649 rows=125437 loops=1)
>                            Filter: (qdatetime <= now())
>  Total runtime: 3782.759 ms
> (11 rows)
> 
> Now the script is noticeably more enjoyable, thank you!

This time it's spending a large portion of its time sorting on that to_char function. As Bill mentioned, aliasing the to_char expression as QDATETIME in the SELECT list causes the QDATETIME in the ORDER BY to pick up that "updated definition" instead of the original column. It doesn't need to do that; the actual timestamp column sorts in the same order, and on that column you have an index!
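
Rewritten, that would look something like this (a sketch; note I've renamed the output column to qdate, an arbitrary name, so the ORDER BY unambiguously refers to the timestamp column rather than the alias):

explain analyse select to_char(qdatetime, 'YYYY-MM-DD') as qdate,
    id, name, category, appsversion, osversion, details, devinfo
  from quincyview
 where qdatetime <= now()
 order by qdatetime desc offset 10 limit 10;

Whether the planner can use the index for the ordering across the UNION ALL depends on your PostgreSQL version, but at the very least the top-N sort now compares raw timestamps instead of formatted text.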

> Do I still need to add indices over the whole union
> and what's the syntax please?

Nope, you only needed that because the query was sorting on all those columns.

For the record, an index like that is called a multi-column index and the definition would be:
CREATE INDEX quincynoreset_full_idx ON quincynoreset (qdatetime, id, name, category, appsversion, osversion, beta_prog, catinfo, details, devinfo, email, emailid, imei, pin, formfactor, copied);

(Don't pay too much attention to the order of the columns there; I just took the columns and their order from the earlier query plan.)

> I'm also thinking about adding some "pipelining"
> (i.e. prefetching 5-10 pages for the HTML-table):
> http://datatables.net/release-datatables/examples/server_side/pipeline.html


Firing more selective queries at the database would help as well.

It looks like you're implementing some kind of paging through the result set, but do you really think people will want to page through 100,000 pages of results?

I think they'd be much happier if you would rank the results somehow, so that they get what they're looking for relatively quickly.
Or you could divide up the information - if they have some idea of when the event they're looking for occurred, that helps narrow down the data set a lot and actually increases their chances of finding it.
Those are just a few examples; it very much depends on the data and the users you're working with.
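
For example, if the user can narrow things down to a date range, a query along these lines (a sketch with made-up dates) touches far fewer rows and can make use of the indexes on qdatetime:

select to_char(qdatetime, 'YYYY-MM-DD') as qdate,
    id, name, category, appsversion, osversion, details, devinfo
  from quincyview
 where qdatetime >= timestamp '2011-10-01'
   and qdatetime <  timestamp '2011-10-14'
 order by qdatetime desc
 limit 10;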

Alban Hertroys

--
The scale of a problem often equals the size of an ego.



-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general


