Re: [HACKERS] Big IN() clauses etc : feature proposal

PFC <lists@xxxxxxxxxx> · Wed, 10 May 2006 16:38:31 +0200

    The problem is that you need a set-returning function to retrieve
the  values. SRFs don't have rowcount estimates, so the plans suck.

What about adding some way of rowcount estimation to SRFs, in the way of:

CREATE FUNCTION foo (para, meters) RETURNS SETOF bar AS
$$ ... function code ... $$ LANGUAGE plpgsql
ROWCOUNT_ESTIMATOR $$ ... estimation code ... $$ ;

Internally, this could create two functions, foo (para, meters) and
estimate_foo(para, meters) that are the same language and coupled
together (just like a SERIAL column and its sequence). The estimator
functions have an implicit return parameter of int8. Parameters may be
NULL when they are not known at query planning time.

What do you think about this idea?

	It would be very useful.
	A few thoughts...

	You need to do some processing to know how many rows the function would  
return.
	Often, this processing will be repeated in the function itself.
	Sometimes it's very simple (ie. the function will RETURN NEXT each  
element in an array, you know the array length...)
	Sometimes, for functions returning few rows, it might be faster to  
compute the entire result set in the cost estimator.

	So, it might be a bit hairy to find a good compromise.

	Ideas on how to do this (clueless hand-waving mode) :

	1- Add new attributes to set-returning functions ; basically a list of  
functions, each returning an estimation parameter (rowcount, cpu tuple  
cost, etc).
	This is just like you said.

	2- Add an "estimator", to a function, which would just be another  
function, returning one row, a record, containing the estimations in  
several columns (rowcount, cpu tuple cost, etc).
	Pros : only one function call to estimate, easier and faster, the  
estimator just leaves the unknown columns to NULL.
	The estimator needs not be in the same language as the function itself.  
It's just another function.

	3- The estimator could be a set-returning function itself which would  
return rows mimicking pg_statistics
	Pros : planner-friendly, the planner would SELECT from the SRF instead of  
looking in pg_statistics, and the estimator could tell the planner that,  
for instance, the function will return unique values.
	Cons : complex, maybe slow

	4- Add simple flags to a function, like :
	- returns unique values
	- returns sorted values (no need to sort my results)
	- please execute me and store my results in a temporary storage, count  
the rows returned, and plan the outer query accordingly
	- etc.