Re: Optimizing queries that use multiple tables and many order by columns

Joshua Berry <yoberi@xxxxxxxxx> · Wed, 25 Aug 2010 12:29:06 -0500

On Wed, Aug 25, 2010 at 10:40 AM, Wappler, Robert <rwappler@xxxxxxxxxxx> wrote:

On 2010-08-25, Joshua Berry wrote:

> Here's my latest culprit:

>

> select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,

> JOB.CompanyCode, Anl.SampleName

> from analysis anl join job on anl.job = job.job

> order by job.companycode, anl.job, anl.lab

> limit 10;

>

Could you try to remove the limit clause? I have seen several times,

that it may slow down a query. Although I haven't tested it, that an

OFFSET 0 clause can improve the situation, iirc.

The actual query uses a cursor, which seems to run the query and after the entire set is ready to be fetched, it will be able to allow fetching. So, I'm not using a limit in the actual queries, but something like this:

declare "SQL_CUR0453D910" cursor with hold for 
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode, anl.job, anl.lab;

fetch 10 in "SQL_CUR0453D910";
close "SQL_CUR0453D910";

>From an algebraic point of view, I cannot see obvious inefficiencies.

Others, which now the internals of pg better, might see more.

That are estimated query plans, what does EXPLAIN ANALYZE say? The query

plans above do not execute the query instead they just make a rough

guess about the costs. Reality might be different. Also you may want to

run VACUUM ANALYZE before.

The database is vacuum analyze'd and the stat target is the default of 100. I'm also using PG 8.4.4 running on Centos 5.5 x86_64

--Here's what explain analyze says for the query

explain analyze
declare "SQL_CUR0453D910" cursor with hold for 
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode, anl.job, anl.lab;

Sort  (cost=38047.92..38495.65 rows=179095 width=32) (actual time=1890.796..2271.248 rows=178979 loops=1)
  Sort Key: job.companycode, anl.job, anl.lab
  Sort Method:  external merge  Disk: 8416kB
  ->  Hash Join  (cost=451.20..18134.05 rows=179095 width=32) (actual time=8.239..260.848 rows=178979 loops=1)

        Hash Cond: (anl.job = job.job)
        ->  Seq Scan on analysis anl  (cost=0.00..14100.95 rows=179095 width=23) (actual time=0.026..91.602 rows=178979 loops=1)
        ->  Hash  (cost=287.20..287.20 rows=13120 width=17) (actual time=8.197..8.197 rows=13120 loops=1)

              ->  Seq Scan on job  (cost=0.00..287.20 rows=13120 width=17) (actual time=0.007..4.166 rows=13120 loops=1)
Total runtime: 2286.224 ms

Maybe, the planner decides for a Sort Join, if there are sorted indexes

for anl.job and job.job. But the speed-up may vary depending on the

data.

It seems to be reading the entire dataset, then sorting, right? There's not much more that could be done to improve such queries, aside from increasing memory and IO bandwidth.

But now that I've said that, there's the following query that deals with exactly the same set of data, but the ordering involves only one of the two joined tables.

explain analyze
declare "SQL_CUR0453D910" cursor with hold for 
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job

order by job.companycode --, anl.job, anl.lab; --Only order by indexed columns from job.

Nested Loop  (cost=0.00..65305.66 rows=179095 width=32) (actual time=0.084..288.976 rows=178979 loops=1)
  ->  Index Scan using job_companycode on job  (cost=0.00..972.67 rows=13120 width=17) (actual time=0.045..7.328 rows=13120 loops=1)

  ->  Index Scan using analysis_job_lab on analysis anl  (cost=0.00..4.63 rows=22 width=23) (actual time=0.006..0.015 rows=14 loops=13120)
        Index Cond: (anl.job = job.job)
Total runtime: 303.230 ms

If I order by columns from the other table, analysis only, I get the follow query and results:
explain analyze
declare "SQL_CUR0453D910" cursor with hold for 
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,

JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by --job.companycode, 
anl.job, anl.lab; --Only order by indexed columns from analysis.

Merge Join  (cost=0.56..44872.45 rows=179095 width=32) (actual time=0.078..368.620 rows=178979 loops=1)

  Merge Cond: (anl.job = job.job)
  ->  Index Scan using analysis_job_lab on analysis anl  (cost=0.00..35245.47 rows=179095 width=23) (actual time=0.035..128.460 rows=178979 loops=1)
  ->  Index Scan using job_job_pk on job  (cost=0.00..508.53 rows=13120 width=17) (actual time=0.039..53.733 rows=179005 loops=1)

Total runtime: 388.884 ms

Notice that in these cases the query completes in <400 ms and the other query that involves ordering on columns from both of the joined tables completes in >2300ms.

In the application here, these queries are used by a client application to fill a window's listbox that can be scrolled up or down. If the user changes direction of the scroll, it initiates a new cursor and query to fetch a page of results. If the scrolling motion is in the same direction, it simply continues to fetch more results from the cursor. But each time the direction of movement changes, there can be a significant lag.

Any suggestions would be helpful! I'll assume for now that the indexes and queries can't be improved, but rather that I should tweak more of the postmaster settings. Please correct me if you know better and have time to reply.

Thanks,
-Joshua

P.S. Is it possible to have indexes that involves several columns from different but related tables? If so, where can I learn about them?

> Thanks for any insights and tips you can provide!

>

> Kind Regards,

> -Joshua Berry

>

>

>

HTH

--

Robert...