Planner Conceptual Error when Joining a Subquery -- Outer Query Condition not Pulled Into Subquery

Gavin Wahl <gavinwahl@xxxxxxxxx> · Wed, 16 Oct 2013 00:33:56 -0600

I have a pretty simple parent-child relationship, where parents are
segmented into many bins (actually states). I need to query over the
(parent, child) join but filter based on aggregates of the parent. That

is -- all parent, child pairs for parents that are in bin x and also
have more than y children. Here's what the data looks like:

    CREATE TABLE parent (
      id SERIAL PRIMARY KEY,

      bin INTEGER
    );

    CREATE INDEX foo ON parent(bin);

    CREATE TABLE child (
      parent_id INTEGER REFERENCES parent(id),
      data INTEGER

    );

    INSERT INTO parent (bin) SELECT s.a % 50 FROM generate_series(1, 100000) AS s(a);

    INSERT INTO CHILD (parent_id, data) SELECT id, floor(random() * 50) FROM parent;

    INSERT INTO CHILD (parent_id, data) SELECT id, floor(random() * 50) FROM parent;
    INSERT INTO CHILD (parent_id, data) SELECT id, floor(random() * 50) FROM parent;

    ANALYZE parent;

    ANALYZE child;

And the query: (1)

    SELECT *
    FROM parent
    INNER JOIN child ON child.parent_id = parent.id
    LEFT JOIN (SELECT parent.id, COUNT(*) AS c

      FROM parent
      INNER JOIN child ON child.parent_id = parent.id
      WHERE child.data > 25
      GROUP BY 1) agg1 ON agg1.id = parent.id

    WHERE parent.bin = 1
     AND agg1.c >= 3;
This does not perform very well, because the subquery is calculated across
all bins, even the 49 that will be discarded by the base query. The

query plan: http://explain.depesz.com/s/Ty4. I feel like the planner
should be able to move the bin condition into the subquery, like this: (2)

    SELECT *
    FROM parent
    INNER JOIN child ON child.parent_id = parent.id
    LEFT JOIN (SELECT parent.id, COUNT(*) AS c

      FROM parent
      INNER JOIN child ON child.parent_id = parent.id
      WHERE child.data > 25
        -- manually move the base query's condition into the subquery

        AND parent.bin = 1
      GROUP BY 1) agg1 ON agg1.id = parent.id
    WHERE parent.bin = 1
     AND agg1.c >= 3;

This query produces the ideal query plan
(http://explain.depesz.com/s/8aRo), but it feels like we're doing the
planner's work for it. This SQL is generated from a reporting interface,

so it would be nice if this stuff could be figured out automatically. I
know there are other ways to write this query, but this style of joining
an aggregation is really nice for reporting. I actually end up joining

several different aggregations and produce a condition across all of them.

Now, maybe the planner doesn't know about the primary key, and that if
parent.id is the same, parent.bin must be the same. Let's try to give

it this information as part of the join clause: (3)

    SELECT *
    FROM parent
    INNER JOIN child ON child.parent_id = parent.id

    LEFT JOIN (SELECT parent.id, parent.bin, COUNT(*) AS c
      FROM parent
      INNER JOIN child ON child.parent_id = parent.id

      WHERE child.data > 25
      GROUP BY 1, 2) agg1 ON agg1.id = parent.id AND agg1.bin = parent.bin
    WHERE parent.bin = 1

     AND agg1.c >= 3;

This works! Well, at first. As soon as we say `bin IN (1, 2)` 
instead of `bin = 1`, the query plan falls down again:
http://explain.depesz.com/s/u7R: (4)

    SELECT *
    FROM parent
    INNER JOIN child ON child.parent_id = parent.id
    LEFT JOIN (SELECT parent.id, parent.bin, COUNT(*) AS c

      FROM parent
      INNER JOIN child ON child.parent_id = parent.id
      WHERE child.data > 25
      GROUP BY 1, 2) agg1 ON agg1.id = parent.id AND agg1.bin = parent.bin

    WHERE parent.bin IN (1, 2)
     AND agg1.c >= 3;

Note that again, moving the condition inside the subquery produces the
correct plan.

It'd be nice if the planner could optimize the query (1) by turning it

into (2). I understand that it might not be able to, but if it can pull
the condition up in (3), why can't it in (4)?

PS: This is on postgres 9.3