Re: How to avoid seq scans for joins between union-all views (test case included)

Denis de Bernardy <ddebernardy@xxxxxxxxx> · Fri, 13 May 2011 07:23:19 -0700 (PDT)

I might have misread, but:

> select * from connections where locked_by <> 4711
> union all
> select * from connections_locked where locked_by = 4711;

The first part will result in a seq scan irrespective of indexes, and the second has no index on locked_by. The best you can do is to eliminate the seq scan on the second by adding the missing index on locked_by.

That said, note that index usage depends on your data distribution: postgres may identify that it'll read most/all of the table anyway, and opt to do a (cheaper) seq scan instead.

D

----- Original Message -----
> From: Fredrik Widlert <fredrik.widlert@xxxxxxxxx>
> To: pgsql-performance@xxxxxxxxxxxxxx
> Cc: 
> Sent: Friday, May 13, 2011 1:55 PM
> Subject:  How to avoid seq scans for joins between union-all views (test case included)
> 
> Hi everyone
> 
> We have recently started to port an application from Oracle to PostgreSQL.
> So far, we are amazed with how great most things work.
> 
> However, we have run into performance problems in one type of query which
> is quite common in our application. We have created a (simplified)
> reproducible test case which (hopefully!) creates all necessary tables
> and data to
> show the problem.
> 
> Plain-text description of the data model in the test case:
> 
> We have a set of objects (like electrical cables), each having
> two nodes in the table "connections" (think of these two rows together
> as an edge in a graph).
> 
> Another table "connections_locked" contains rows for some of
> the same objects, which are locked by a long transaction.
> 
> The view connections_v performs a union all of the rows from
> "connections" which are not modified in the current long
> transaction with the rows from "connections_locked" which
> are modified in the current long transaction.
> 
> Goal:
> Given an object id, we want to find all neighbors for this
> object (that is, objects which share a node with this object).
> 
> Problem:
> We think that our query used to find neighbors would benefit
> greatly from using some of our indexes, but we fail to make it
> do so.
> 
> 
> Over to the actual test case:
> 
> ----------------------------------------------
> 
> -- Tested on (from select version ()):
> -- PostgreSQL 9.0.1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC)
> 4.1.2 20080704 (Red Hat 4.1.2-46), 32-bit
> -- PostgreSQL 9.1beta1 on i686-pc-linux-gnu, compiled by GCC gcc (GCC)
> 4.1.2 20080704 (Red Hat 4.1.2-46), 32-bit
> 
> -- Ubuntu 11.04, uname -a output:
> -- Linux <hostname> 2.6.38-8-generic-pae #42-Ubuntu SMP Mon Apr 11
> 05:17:09 UTC 2011 i686 i686 i386 GNU/Linux
> -- Processor: Intel(R) Core(TM)2 Quad  CPU   Q9450  @ 2.66GHz
> -- Drive: Intel X25-M SSD
> 
> 
> drop table if exists connections cascade;
> drop table if exists connections_locked cascade;
> 
> 
> create table connections (
>   con_id serial primary key,
>   locked_by integer not null,
>   obj_id integer not null,
>   node integer not null
> );
> 
> 
> -- create test nodes, two per obj_id
> insert into connections (locked_by, obj_id, node)
> select 0, n/2, 1000 + (n + 1)/2 from generate_series (1,500000) as n;
> 
> create index connections_node_idx on connections (node);
> create index connections_obj_idx on connections (obj_id);
> vacuum analyze connections;
> 
> 
> 
> create table connections_locked (
>   con_id integer not null,
>   locked_by integer not null,
>   obj_id integer not null,
>   node integer not null,
>   constraint locked_pk primary key (con_id, locked_by)
> );
> 
> -- mark a few of the objects as locked by a long transaction
> insert into connections_locked (con_id, locked_by, obj_id, node)
> select n, 1 + n/50, n/2, 1000 + (n + 1)/2 from generate_series (1,25000) as n;
> 
> create index connections_locked_node_idx on connections_locked (node);
> create index connections_locked_obj_idx on connections_locked (obj_id);
> vacuum analyze connections_locked;
> 
> 
> -- Create a view showing the world as seen by long transaction 4711.
> -- In real life, this uses a session variable instead of a hard-coded value.
> create or replace view connections_v as
> select * from connections where locked_by <> 4711
> union all
> select * from connections_locked where locked_by = 4711;
> 
> 
> -- This is the query we are trying to optimize.
> -- We expect this to be able to use our indexes, but instead get
> sequential scans
> explain analyze
> select
>      con2.obj_id
> from
>      connections_v con1,
>      connections_v con2
> where
>      con1.obj_id = 17 and
>      con2.node = con1.node
> ;
> 
> 
> -- Output:
> -- Hash Join  (cost=16.69..16368.89 rows=7501 width=4) (actual
> time=0.096..778.830 rows=4 loops=1)
> --   Hash Cond: ("*SELECT* 1".node = "*SELECT* 1".node)
> --   ->  Append  (cost=0.00..14402.00 rows=500050 width=8) (actual
> time=0.011..640.163 rows=500000 loops=1)
> --         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..13953.00
> rows=500000 width=8) (actual time=0.011..430.645 rows=500000 loops=1)
> --               ->  Seq Scan on connections  (cost=0.00..8953.00
> rows=500000 width=16) (actual time=0.009..178.535 rows=500000 loops=1)
> --                     Filter: (locked_by <> 4711)
> --         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..449.00
> rows=50 width=8) (actual time=3.254..3.254 rows=0 loops=1)
> --               ->  Seq Scan on connections_locked
> (cost=0.00..448.50 rows=50 width=16) (actual time=3.253..3.253 rows=0
> loops=1)
> --                     Filter: (locked_by = 4711)
> --   ->  Hash  (cost=16.66..16.66 rows=3 width=4) (actual
> time=0.028..0.028 rows=2 loops=1)
> --         Buckets: 1024  Batches: 1  Memory Usage: 1kB
> --         ->  Append  (cost=0.00..16.66 rows=3 width=4) (actual
> time=0.013..0.025 rows=2 loops=1)
> --               ->  Subquery Scan on "*SELECT* 1"  
> (cost=0.00..8.35
> rows=2 width=4) (actual time=0.013..0.016 rows=2 loops=1)
> --                     ->  Index Scan using connections_obj_idx on
> connections  (cost=0.00..8.33 rows=2 width=16) (actual
> time=0.012..0.014 rows=2 loops=1)
> --                           Index Cond: (obj_id = 17)
> --                           Filter: (locked_by <> 4711)
> --               ->  Subquery Scan on "*SELECT* 2"  
> (cost=0.00..8.30
> rows=1 width=4) (actual time=0.008..0.008 rows=0 loops=1)
> --                     ->  Index Scan using connections_locked_obj_idx
> on connections_locked  (cost=0.00..8.29 rows=1 width=16) (actual
> time=0.007..0.007 rows=0 loops=1)
> --                           Index Cond: (obj_id = 17)
> --                           Filter: (locked_by = 4711)
> 
> 
> 
> -- Rewriting the query to an almost-equivalent form yields almost the
> same result (that is, seq scans)
> explain analyze
> select
>     con2.obj_id
> from
>    connections_v con2
>    where con2.node in (select node from connections_v con1 where
> con1.obj_id = 17);
> 
> 
> -- Simplifying the query even more to use a sub-select with a
> hard-coded value still results in seq scans
> explain analyze
> select
>     con2.obj_id
> from
>    connections_v con2
>    where con2.node in (select 1015);
> 
> 
> -- Finally, when we simplify even more and just use a constant, we get
> the index accesses we were hoping
> -- for all along.
> explain analyze
> select
>     con2.obj_id
> from
>    connections_v con2
>    where con2.node in (1015);
> 
> -- Result  (cost=0.00..16.66 rows=3 width=4) (actual time=0.048..0.079
> rows=2 loops=1)
> --   ->  Append  (cost=0.00..16.66 rows=3 width=4) (actual
> time=0.047..0.076 rows=2 loops=1)
> --         ->  Subquery Scan on "*SELECT* 1"  (cost=0.00..8.35 
> rows=2
> width=4) (actual time=0.046..0.049 rows=2 loops=1)
> --               ->  Index Scan using connections_node_idx on
> connections  (cost=0.00..8.33 rows=2 width=16) (actual
> time=0.046..0.048 rows=2 loops=1)
> --                     Index Cond: (node = 1015)
> --                     Filter: (locked_by <> 4711)
> --         ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..8.30 
> rows=1
> width=4) (actual time=0.025..0.025 rows=0 loops=1)
> --               ->  Index Scan using connections_locked_node_idx on
> connections_locked  (cost=0.00..8.29 rows=1 width=16) (actual
> time=0.024..0.024 rows=0 loops=1)
> --                     Index Cond: (node = 1015)
> --                     Filter: (locked_by = 4711)
> 
> 
> 
> ------- end of test case -----
> 
> Can someone explain what is happening here? Is there some way we can
> rewrite our query or some setting we could turn on or off to get the
> optimizer to choose to use our indexes?
> 
> (testing with "set enable_seqscan = false;" does not make a difference
> as far as we can see)
> 
> To verify that we have really created all necessary indexes, we have
> converted this simplified test case to Oracle syntax and tested it on
> our Oracle server. In this case, we do get the expected index accesses,
> so we think that we have in fact managed to isolate the problem using
> this test case.
> 
> What we are hoping for:
> Since we have lots of queries joining these kind of "union all"-views
> between a master table and a transaction table, we would be really
> glad to hear something like "when you use these kinds of views, you
> need to do X, Y and Z to get good performance".
> 
> Thanks in advance for any help!
> /Fredrik
> 
> -- 
> Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-performance
>

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance