Logarithmic data frequency distributions and the query planner

Jerry Gamache <jerry.gamache@xxxxxxxxxx> · Wed, 07 Jul 2010 16:54:48 -0400

On 8.1, I have a very interesting database where the distributions of 
some values in a multi-million rows table is logarithmic (i.e. the most 
frequent value is an order of magnitude more frequent than the next 
ones). If I analyze the table, the statistics become extremely skewed 
towards the most frequent values and this prevents the planner from 
giving any good results on queries that do not target these entries.

In a recent case, the planner estimated that the number of returned rows 
would be ~13% of the table size and from this bad assumption generated a 
very slow conservative plan that executed in days. If I set the 
statistics at zero for that table, the planner uses a hardcoded ratio 
(seems like 0.5%) for the number of returned rows and this helps 
generating a plan that executes in 3 minutes (still sub-optimal, but not 
as bad).

Generating partial index for the less frequent cases helps, but this 
solution is not flexible enough for our needs as the number of complex 
queries grow. We are mostly left with pre-generating a lot of temporary 
tables whenever the planner over-evaluates the number of rows generated 
by a subquery (query execution was trimmed from 3 minutes to 30 seconds 
using this technique) or using the OFFSET 0 tweak, but it would be nice 
if the planner could handle this on its own.

Am I missing something obvious? Setting the statistics for this table to 
zero seems awkward even if it gives good results.
Jerry.

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance