Re: Propogating conditions into a query

Kim Bisgaard <kib+pg@xxxxxx> · Thu, 23 Jun 2005 09:35:47 +0200

Hi Tom,

I have now completed the move to PG8.0.3, and feel that I have confirmed 
that this problem is related to the problem I'm having:

Formulated like this, it is not performing:

SELECT station_id, timeobs,temp_grass, temp_dry_at_2m
      FROM temp_dry_at_2m a
      FULL OUTER JOIN temp_grass b
      USING (station_id, timeobs)
      WHERE station_id = 52981
        AND timeobs = '2004-1-1 0:0:0';

Merge Full Join  (cost=1598312.83..11032924.48 rows=6956994 width=32) (actual time=119061.098..133314.306 rows=1 loops=1)
 Merge Cond: (("outer".timeobs = "inner".timeobs) AND ("outer".station_id = "inner".station_id))
 Filter: ((COALESCE("inner".station_id, "outer".station_id) = 52981) AND (COALESCE("inner".timeobs, "outer".timeobs) = '2004-01-01 00:00:00'::timestamp without time zone))
 ->  Sort  (cost=346429.38..352445.11 rows=2406292 width=16) (actual time=20315.241..23850.529 rows=2406292 loops=1)
       Sort Key: b.timeobs, b.station_id
       ->  Seq Scan on temp_grass b  (cost=0.00..41756.92 rows=2406292 width=16) (actual time=10.517..7003.468 rows=2406292 loops=1)
 ->  Sort  (cost=1251883.44..1269275.93 rows=6956994 width=16) (actual time=82122.354..92027.850 rows=6956994 loops=1)
       Sort Key: a.timeobs, a.station_id
       ->  Seq Scan on temp_dry_at_2m a  (cost=0.00..117549.94 rows=6956994 width=16) (actual time=23.759..39930.741 rows=6956994 loops=1)
Total runtime: 133623.422 ms

But Postgresql can do the work, if it is reformulated into:

SELECT station_id, timeobs, temp_grass, temp_dry_at_2m
FROM
  (SELECT station_id, timeobs, temp_dry_at_2m
    FROM temp_dry_at_2m
    WHERE station_id = 52981 AND timeobs = '2004-1-1 0:0:0') a
  FULL OUTER JOIN
  (SELECT station_id, timeobs, temp_grass
    FROM temp_grass
    WHERE station_id = 52981 AND timeobs = '2004-1-1 0:0:0') b
  USING (station_id, timeobs)

Merge Full Join  (cost=0.00..43023.64 rows=10614 width=32) (actual time=0.056..0.064 rows=1 loops=1)
 ->  Index Scan using temp_grass_idx on temp_grass  (cost=0.00..246.55 rows=61 width=16) (actual time=0.029..0.031 rows=1 loops=1)
       Index Cond: ((timeobs = '2004-01-01 00:00:00'::timestamp without time zone) AND (station_id = 52981))
 ->  Index Scan using temp_dry_at_2m_idx on temp_dry_at_2m  (cost=0.00..699.52 rows=174 width=16) (actual time=0.017..0.020 rows=1 loops=1)
       Index Cond: ((timeobs = '2004-01-01 00:00:00'::timestamp without time zone) AND (station_id = 52981))
Total runtime: 0.163 ms

The reason the first query is not performing is because the query 
optimizer does not push the conditions down into the sub-queries - right??

The reason that I do not just use the reformulated query, is that e.g. 
the station_id comes from another table (and there can be more of them), 
so it is bloody inconvenient to first select them, and then repeat them 
a number of time in the above transformation (I need to outer join more 
than two tables) ........

Best regards,

Kim Bisgaard wrote:

Hi Tom,

This sounds like the same "problem" which prevented PG from using the 
indices, and thus giving abyssmal performance in this other thread:

I have two BIG tables (virtually identical) with 3 NOT NULL columns 
Station_id, TimeObs, Temp_XXXX, with unique indexes on (Station_id, 
TimeObs) and valid ANALYSE (set statistics=100). I want to join the 
two tables with a FULL OUTER JOIN.

When I specify the query as:

SELECT station_id, timeobs,temp_grass, temp_dry_at_2m
       FROM temp_dry_at_2m a
       FULL OUTER JOIN temp_grass b        USING (station_id, timeobs)
       WHERE station_id = 52981
         AND timeobs = '2004-1-1 0:0:0' 

Then I would also vote for improving the inteligence of the optimizer! 
:-)

Regards,
Kim.

Tom Lane wrote:

Phil Endecott <spam_from_postgresql_general@xxxxxxxxxxxx> writes:

I don't see anything in there about LEFT OUTER JOIN though.  Any ideas?

Oh, I missed that part of your message.  Hmm, I think the issue is 
that in

D join (M join G on (M.g=G.id)) on (D.id=M.b) where D.id=nnn

the planner deduces M.b=nnn by transitivity, but when the join is an
outer join it can't make the same deduction.

[ thinks some more... ]  If we distinguished conditions that hold below
the join from those that hold above it, we could deduce that M.b=nnn can
be enforced below the join even though it might not be true above it.
There's no such mechanism in existence now, though.

A possible workaround is to generate your query like

D left join (M join G on (M.g=G.id)) on (D.id=M.b AND M.b=nnn) where 
D.id=nnn

but I don't know how practical that is for you.

            regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

              http://www.postgresql.org/docs/faq

---------------------------(end of broadcast)---------------------------
TIP 5: Have you checked our extensive FAQ?

              http://www.postgresql.org/docs/faq

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend