Re: Avoiding cartesian product

"Virag Saksena" <virag@xxxxxxxxxxx> · Sun, 19 Feb 2006 22:42:23 -0800

Szűcs,
    Thanks for your suggestion, I 
guess there is more than one way to attack the problem.

I ended up using a trick with limit to get the next 
row ...

select (b.gc_minor- a.gc_minor), (b.gc_major- 
a.gc_major)
from jam_trace_sys a join jam_trace_sys b on
(b.seq_no = 
(select c.seq_no from jam_trace_sys c
 where c.trace_id = a.trace_id and 
c.seq_no > a.seq_no
 order by c.trace_id, c.seq_no limit 
1)
 and b.trace_id = a.trace_id),
jam_tracesnap s1, jam_tracesnap 
s2
where s1.trace_id = a.trace_id
and s1.seq_no = a.seq_no
and 
s2.trace_id = b.trace_id
and s2.seq_no = b.seq_no
and a.trace_id = 
22
order by a.seq_no;

This gave me a nice clean execution plan (there are 
some extra sources listed, but that is a bug in postgresql) ...

QUERY 
PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
Sort  
(cost=11.24..11.25 rows=1 width=20) (actual time=0.040..0.040 rows=0 
loops=1)
 Sort Key: a.seq_no
 ->Nested Loop  
(cost=0.00..11.23 rows=1 width=20) (actual time=0.028..0.028 rows=0 
loops=1)
    Join Filter: ("inner".seq_no = 
"outer".seq_no)
    ->Nested Loop  (cost=0.00..9.20 
rows=1 width=32) (actual time=0.024..0.024 rows=0 
loops=1)
       Join Filter: ("inner".seq_no = 
"outer".seq_no)
       ->Nested Loop  
(cost=0.00..7.17 rows=1 width=32) (actual time=0.020..0.020 rows=0 
loops=1)
          Join Filter: 
("inner".seq_no = 
(subplan))
          ->Index 
Scan using jam_trace_sys_n1 on jam_trace_sys a  (cost=0.00..3.41 rows=1 
width=16) (actual time=0.016..0.016 rows=0 
loops=1)

Index Cond: (trace_id = 
22)
          ->Index Scan 
using jam_trace_sys_n1 on jam_trace_sys b  (cost=0.00..3.41 rows=1 
width=16) (never 
executed)

Index Cond: (22 = 
trace_id)

SubPlan

->Limit  (cost=0.00..0.33 rows=1 width=8) (never 
executed)

->Index Scan using jam_trace_sys_n1 on jam_trace_sys c  (cost=0.00..6.36 
rows=19 width=8) (never 
executed)

Index Cond: ((trace_id = $0) AND (seq_no > 
$1))
       ->Index Scan using 
jam_tracesnap_n1 on jam_tracesnap s1  (cost=0.00..2.01 rows=1 width=8) 
(never executed)
          Index 
Cond: (22 = trace_id)
    ->Index Scan using 
jam_tracesnap_n1 on jam_tracesnap s2  (cost=0.00..2.01 rows=1 width=8) 
(never executed)
       Index Cond: (22 = 
trace_id)

Regards,

Virag

----- Original Message ----- 
From: "Szűcs Gábor" <surrano@xxxxxxxxx>
To: <pgsql-performance@xxxxxxxxxxxxxx>
Sent: Monday, January 09, 2006 7:59 AM
Subject: Re: Avoiding cartesian 
product

> Dear Virag,
> 
> AFAIK aggregates aren't indexed in 
postgres (at least not before 8.1, which 
> indexes min and max, 
iirc).
> 
> Also, I don't think you need to exactly determine the 
trace_id. Try this one 
> (OTOH; might be wrong):
> 
> select 
DISTINCT ON (a.trace_id, a.seq_no) -- See below
>    b.gc_minor 
- a.gc_minor, b.gc_major - a.gc_major
> from jam_trace_sys a, 
jam_trace_sys b
> where a.trace_id = 22
>    and 
b.trace_id = a.trace_id
>    and b.seq_no > a.seq_no -- 
Simply ">" is enough
> order by a.trace_id, a.seq_no, b.seq_no; -- 
DISTINCT, see below
> 
> The trick is that DISTINCT takes the first 
one in each group (IIRC it is 
> granted, at least someone told me on one 
of these lists :) ) so if you order 
> by the DISTINCT attributes and then 
by b.seq_no, you'll get the smallest of 
> appropriate b.seq_no values for 
each DISTINCT values.
> 
> The idea of DISTINCTing by both columns 
is to make sure the planner finds 
> the index. (lately I had a similar 
problem: WHERE a=1 ORDER BY b LIMIT 1 
> used an index on b, instead of an 
(a,b) index. Using ORDER BY a,b solved it)
> 
> HTH,
> 

> --
> G.
> 
> 
> On 2006.01.04. 5:12, Virag 
Saksena wrote:
> > 
> > I have a table which stores cumulative 
values
> > I would like to display/chart the deltas between successive 
data 
> > collections
> >  
> > If my primary 
key only increments by 1, I could write a simple query
> >  

> > select b.gc_minor - a.gc_minor, b.gc_major - a.gc_major
> 
>   from jam_trace_sys a, jam_trace_sys b
> >  where 
a.trace_id = 22
> >    and b.trace_id = 
a.trace_id
> >    and b.seq_no = a.seq_no + 1
> 
>  order by a.seq_no;
> >  
> > However the 
difference in sequence number is variable.
> > So (in Oracle) I used to 
extract the next seq_no using a correlated 
> > sub-query
> 
>  
> > select b.gc_minor - a.gc_minor, b.gc_major - 
a.gc_major
> > from jam_trace_sys a, jam_trace_sys b
> > where 
a.trace_id = 22
> > and (b.trace_id, b.seq_no) =
> > (select 
a.trace_id, min(c.seq_no) from jam_trace_sys c
> > where c.trace_id = 
a.trace_id and c.seq_no > a.seq_no)
> >  order by 
a.seq_no;
> >  
> > For every row in A, The correlated 
sub-query from C will execute
> > With an appropriate index, it will 
just descend the index Btree
> > go one row to the right and return 
that row (min > :value)
> > and join to table B
> >  

> > SELECT STATEMENT
> >   SORT ORDER BY
> 
>    TABLE ACCESS BY INDEX ROWID JAM_TRACE_SYS B
> 
>      NESTED LOOPS
> 
>        TABLE ACCESS BY INDEX ROWID 
JAM_TRACE_SYS  A
> 
>          INDEX RANGE SCAN 
JAM_TRACE_SYS_N1  A
> >        
INDEX RANGE SCAN JAM_TRACE_SYS_N1 B
> 
>          SORT 
AGGREGATE
> 
>            INDEX 
RANGE SCAN JAM_TRACE_SYS_N1 C
> >  
> > In postgreSQL A 
and B are doing a cartesian product
> > then C gets executed for every 
row in this cartesian product
> > and most of the extra rows get thrown 
out.
> > Is there any way to force an execution plan like above where 
the 
> > correlated subquery runs before going to B.
> > The 
table is small right now, but it will grow to have millions of rows
> > 
QUERY PLAN
> > 
-----------------------------------------------------------------------------------------------------------------------------------
> 
>  Sort  (cost=124911.81..124944.84 rows=13213 width=20) (actual 

> > time=13096.754..13097.053 rows=149 loops=1)
> 
>    Sort Key: a.seq_no
> >    
->  Nested Loop  (cost=4.34..124007.40 rows=13213 width=20) (actual 

> > time=1948.300..13096.329 rows=149 loops=1)
> 
>          Join Filter: 
(subplan)
> >          
->  Seq Scan on jam_trace_sys b  (cost=0.00..3.75 rows=175 
> 
> width=16) (actual time=0.005..0.534 rows=175 loops=1)
> 
>          ->  
Materialize  (cost=4.34..5.85 rows=151 width=16) (actual 
> > 
time=0.002..0.324 rows=150 loops=175)
> 
>                
->  Seq Scan on jam_trace_sys a  (cost=0.00..4.19 
> > 
rows=151 width=16) (actual time=0.022..0.687 rows=150 loops=1)
> 
>                      
Filter: (trace_id = 22)
> 
>          SubPlan
> 
>            
->  Aggregate  (cost=4.67..4.67 rows=1 width=4) (actual 
> 
> time=0.486..0.488 rows=1 loops=26250)
> 
>                  
->  Seq Scan on jam_trace_sys c  (cost=0.00..4.62 
> > 
rows=15 width=4) (actual time=0.058..0.311 rows=74 loops=26250)
> 
>                        
Filter: ((trace_id = $0) AND (seq_no > $1))
> >  Total runtime: 
13097.557 ms
> > (13 rows)
> >  
> > 
pglnx01=> \d jam_trace_sys
> >      Table 
"public.jam_trace_sys"
> >      
Column      |  Type   | 
Modifiers
> > -----------------+---------+-----------
> 
>  trace_id        | integer 
|
> >  
seq_no          | integer |
> 
>  cpu_utilization | integer |
> >  
gc_minor        | integer |
> 
>  gc_major        | integer 
|
> >  heap_used       | integer 
|
> > Indexes:
> >     "jam_trace_sys_n1" 
btree (trace_id, seq_no)
> >  
> > pglnx01=> select 
count(*) from jam_trace_Sys ;
> >  count
> > 
-------
> >    175
> > (1 row)
> 
>  
> > pglnx01=> select trace_id, count(*) from 
jam_trace_sys group by trace_id ;
> >  trace_id | count
> 
> ----------+-------
> >        
15 |     2
> 
>        18 |    21
> 
>        22 |   150
> 
>        16 |     
2
> > (4 rows)
>