Re: window function induces full table scan

Thomas Mayer <thomas.mayer@xxxxxxxxxxxxxxx> · Fri, 3 Jan 2014 18:44:50 +0100

Am 03.01.2014 15:54, schrieb Tom Lane:
Thomas Mayer <thomas.mayer@xxxxxxxxxxxxxxx> writes:
To implement the optimization, subquery_is_pushdown_safe() needs to
return true if pushing down the quals to a subquery which has window
functions is in fact safe ("quals that only reference subquery
outputs that are listed in the PARTITION clauses of all window functions
in the subquery").

I'd just remove that check.

Plus, there is a function qual_is_pushdown_safe(...) which contains an
assertion, which might possibly become obsolete:

No, that should stay.  There are no window functions in the upper query's
WHERE, there will be none pushed into the lower's WHERE, and that's as it
must be.

Tom, do you think that these two changes could be sufficient?

Certainly not.  What you'd need to do is include the
is-it-listed-in-all-PARTITION-clauses consideration in the code that marks
"unsafe" subquery output columns.

Clear. That's what I intended to write ;)

For better performance, we could first check subquery->hasWindowFuncs 
and only if this evaluates to true, check if the attribute is marked as 
a "unsafe subquery output column". If it's unsafe 
subquery_is_pushdown_safe() needs to return false.

I was first thinking to do the decision safe/unsafe in the 
subquery_is_pushdown_safe() function or in a new function that is called 
by subquery_is_pushdown_safe().

... "mark" ...: Do I understand you correctly, that you prefer doing the 
decision elsewhere and store the result (safe/unsafe) boolean value 
besides to the subquery output fields? For the push-down, a subquery 
output field must be available anyways.

More in general, one could possibly "mark" safe subquery output fields 
for all the tests in subquery_is_pushdown_safe(). So this function would 
not necessarily have to be in allpaths.c at all. But that kind of 
refactoring would be a different issue which also could be implemented 
separately.

And update all the relevant comments.
And maybe add a couple of regression test cases.

Indeed, that would be nice ;) Straightforward, there should be
- A test case with one window function (in the subquery) and one 
condition, which is safe to be pushed down. Then, assert that 
subquery_is_pushdown_safe() returns true
- A test case with one window function and one condition, which is 
unsafe to be pushed down. Then, assert that subquery_is_pushdown_safe() 
returns false
- A test case with multiple window functions and multiple conditions, 
all safe to be pushed down. Then, assert that 
subquery_is_pushdown_safe() returns true
- A test case with multiple window functions and multiple conditions, 
all except one safe to be pushed down. Then, assert that 
subquery_is_pushdown_safe() returns true for the safe ones and false for 
the unsafe ones
- a test case that ensures that, after the change, the right query plan 
is chosen (integration test).
- execute some example queries (integration test).

What do you think about it? What else needs to be tested?

Offhand I think the details of testing whether a given output column
appears in a given partition clause are identical to testing whether
it appears in the distinctClause.  So you'd just be mechanizing running
through the windowClause list to verify whether this holds for all
the WINDOW clauses.

When a field is element of all PARTITION BY clauses of all window 
functions, it does not mean that this field is distinct. Think of a 
query like

SELECT * FROM (
  SELECT
  b,
  rank() OVER (PARTITION BY b ORDER BY c DESC) AS rankc
  FROM tbl;
) as tmp
WHERE tmp.b=1

In that case, the field b is not distinct (as there is no GROUP BY b). 
Anyways, tmp.b=1 would be safe to to be pushed down.

Do you mean that a GROUP BY b is done implicitly (and internally) at a 
deeper level just for the window function and _therefore_ is distinct at 
that point?

Does the safe/unsafe marking survive recursiveness (over subquery 
levels) then?

Note that if you just look at the windowClause list, then you might
be filtering by named window definitions that appeared in the WINDOW
clause but were never actually referenced by any window function.
I don't have a problem with blowing off the optimization in such cases.
I don't think it's appropriate to expend the cycles that would be needed
to discover whether they're all referenced at this point.  (If anyone ever
complains, it'd be much cheaper to modify the parser to get rid of
unreferenced window definitions.)

			regards, tom lane
.

Best regards
Thomas

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance