Logarithmic data frequency distributions and the query planner

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



On 8.1, I have a very interesting database where the distributions of some values in a multi-million rows table is logarithmic (i.e. the most frequent value is an order of magnitude more frequent than the next ones). If I analyze the table, the statistics become extremely skewed towards the most frequent values and this prevents the planner from giving any good results on queries that do not target these entries.

In a recent case, the planner estimated that the number of returned rows would be ~13% of the table size and from this bad assumption generated a very slow conservative plan that executed in days. If I set the statistics at zero for that table, the planner uses a hardcoded ratio (seems like 0.5%) for the number of returned rows and this helps generating a plan that executes in 3 minutes (still sub-optimal, but not as bad).

Generating partial index for the less frequent cases helps, but this solution is not flexible enough for our needs as the number of complex queries grow. We are mostly left with pre-generating a lot of temporary tables whenever the planner over-evaluates the number of rows generated by a subquery (query execution was trimmed from 3 minutes to 30 seconds using this technique) or using the OFFSET 0 tweak, but it would be nice if the planner could handle this on its own.

Am I missing something obvious? Setting the statistics for this table to zero seems awkward even if it gives good results.
Jerry.



--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance


[Postgresql General]     [Postgresql PHP]     [PHP Users]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Yosemite]

  Powered by Linux