Re: Optimizing queries that use multiple tables and many order by columns

"Wappler, Robert" <rwappler@xxxxxxxxxxx> · Thu, 26 Aug 2010 09:51:38 +0200

On 2010-08-25, Joshua Berry wrote:

> --Here's what explain analyze says for the query
> explain analyze
> declare "SQL_CUR0453D910" cursor with hold for
> select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
> JOB.CompanyCode, Anl.SampleName
> from analysis anl join job on anl.job = job.job
> order by job.companycode, anl.job, anl.lab;
> 
> Sort  (cost=38047.92..38495.65 rows=179095 width=32) (actual
> time=1890.796..2271.248 rows=178979 loops=1)
>   Sort Key: job.companycode, anl.job, anl.lab
>   Sort Method:  external merge  Disk: 8416kB
>   ->  Hash Join  (cost=451.20..18134.05 rows=179095 width=32)
> (actual time=8.239..260.848 rows=178979 loops=1)
>         Hash Cond: (anl.job = job.job) ->  Seq Scan on analysis anl 
>         (cost=0.00..14100.95 rows=179095 width=23) (actual
>         time=0.026..91.602 rows=178979 loops=1) ->  Hash 
>         (cost=287.20..287.20 rows=13120 width=17)
> (actual time=8.197..8.197 rows=13120 loops=1)
>               ->  Seq Scan on job  (cost=0.00..287.20
> rows=13120 width=17) (actual time=0.007..4.166 rows=13120 loops=1)
> Total runtime: 2286.224 ms
> 
> 
> 
> 	Maybe, the planner decides for a Sort Join, if there
> are sorted indexes
> 	
> 	for anl.job and job.job. But the speed-up may vary
> depending on the
> 	data.
> 	
> 
> 
> It seems to be reading the entire dataset, then sorting,
> right? There's not much more that could be done to improve
> such queries, aside from increasing memory and IO bandwidth.
> 

It has to, because it has to start with the smallest row in the
resultset, which may be the last one read, if it cannot read the tuples
in sorted order.

> But now that I've said that, there's the following query that
> deals with exactly the same set of data, but the ordering
> involves only one of the two joined tables.
> 
> explain analyze
> declare "SQL_CUR0453D910" cursor with hold for
> select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
> JOB.CompanyCode, Anl.SampleName
> from analysis anl join job on anl.job = job.job
> order by job.companycode --, anl.job, anl.lab; --Only order
> by indexed columns from job.
> 
> Nested Loop  (cost=0.00..65305.66 rows=179095 width=32)
> (actual time=0.084..288.976 rows=178979 loops=1)
>   ->  Index Scan using job_companycode on job
> (cost=0.00..972.67 rows=13120 width=17) (actual
> time=0.045..7.328 rows=13120 loops=1)
>   ->  Index Scan using analysis_job_lab on analysis anl
> (cost=0.00..4.63 rows=22 width=23) (actual time=0.006..0.015
> rows=14 loops=13120)
>         Index Cond: (anl.job = job.job)
> Total runtime: 303.230 ms
> 
> If I order by columns from the other table, analysis only, I
> get the follow query and results:
> explain analyze
> declare "SQL_CUR0453D910" cursor with hold for
> select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
> JOB.CompanyCode, Anl.SampleName
> from analysis anl join job on anl.job = job.job
> order by --job.companycode,
> anl.job, anl.lab; --Only order by indexed columns from analysis.
> 
> Merge Join  (cost=0.56..44872.45 rows=179095 width=32)
> (actual time=0.078..368.620 rows=178979 loops=1)
>   Merge Cond: (anl.job = job.job)
>   ->  Index Scan using analysis_job_lab on analysis anl
> (cost=0.00..35245.47 rows=179095 width=23) (actual
> time=0.035..128.460 rows=178979 loops=1)
>   ->  Index Scan using job_job_pk on job  (cost=0.00..508.53
> rows=13120 width=17) (actual time=0.039..53.733 rows=179005 loops=1)
> Total runtime: 388.884 ms
> 
> 
> Notice that in these cases the query completes in <400 ms and
> the other query that involves ordering on columns from both
> of the joined tables completes in >2300ms.
>

Because, these queries don't need to sort the result, they can read it
in order.

What I don't really get is, that you compare queries with different sort
orders, especially with a different number of sort keys. Of course, that
has a big influence on the time needed for sorting. While your first
query, which sorts on three keys really does a sort on the result, the
latter two don't need that, because they can read the tuples in the
correct order from the indexes. If I read the first plan correctly, that
sort costs you about 2 sec in query execution time, because the Hash
Join is done after 260ms.

Do you really have the requirement to sort anything? Or let me ask it
the other way round: Assuming you have too much data, to sort it on the
application side, which user can read all this from one single table in
the user interface?

> In the application here, these queries are used by a client
> application to fill a window's listbox that can be scrolled
> up or down. If the user changes direction of the scroll, it
> initiates a new cursor and query to fetch a page of results.
> If the scrolling motion is in the same direction, it simply
> continues to fetch more results from the cursor. But each
> time the direction of movement changes, there can be a
> significant lag.
> 

Then, obviously you shouldn't create a new cursor. You can create
backwards scrollable cursors. See the SCROLL option of the DECLARE
statement.

> Any suggestions would be helpful! I'll assume for now that
> the indexes and queries can't be improved, but rather that I
> should tweak more of the postmaster settings. Please correct
> me if you know better and have time to reply.
>

These options heavily depend on the environment and the data set, I
always see them as some last resort, because they might slow down other
queries if tweaked to much towards a specific thing. I have not yet
played around with this a lot. The things simply work fast enough here.
Others can give you better hints on this.

> Thanks,
> -Joshua
> 
> P.S. Is it possible to have indexes that involves several
> columns from different but related tables? If so, where can I
> learn about them?
> 
> 

Nope. An index is tied to one table only. But another option is, to
precalculate the join. Depending on your needs (especially INSERT/UPDATE
performance), you could use triggers and/or a regular batch job, which
writes the joined results in another table. There you can index these
columns accordingly. In general, this is ugly and leads to redundancy
but can give a big performance boost and is sometimes the only option.

-- 
Robert...

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general