Added to TODO list:

	* Log queries where the optimizer row estimates were dramatically
	  different from the number of rows actually found (?)

---------------------------------------------------------------------------

Doug Fields wrote:
> 
> >Tom Lane wrote:
> > > Bruce Momjian <pgman@xxxxxxxxxxxxxxxx> writes:
> > > > Martijn van Oosterhout wrote:
> > > >> Firstly, I was only thinking of going for the basic nodes (Index
> > > >> Scan, Seq Scan, Distinct). Other types have far more variables.
> > > >> Secondly, even if you only count, it's useful. For example, if it
> > > >> tells you that the planner is off by a factor of 10 more than 75%
> > > >> of the time, that's useful information independent of what the
> > > >> actual variables are.
> > >
> > > And reduce the usefulness even more ;-). As a pure stats-gathering
> > > exercise it might be worth doing, but not if you only log the failure
> > > cases. How will you know how well you are doing if you take a
> > > biased-by-design sample?
> 
> Personally, given that it seems like at least once or twice a day someone
> asks about performance or "why isn't my index being used" and similar
> things, I think doing this would be a great idea.
> 
> Perhaps not necessarily in the full-fledged way, but by creating a sort of
> "ANALYZE log," wherein it logs the optimizer's estimate of a query and the
> actual results of the query, for every query. This, of course, could be
> enabled/disabled on a per-connection basis, a per-table basis (like
> OIDs), or whatever other basis makes life easiest for the developers.
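The estimate-vs-actual gap the thread wants logged can already be seen one query at a time with EXPLAIN ANALYZE. A hedged illustration (the table and column names, and the numbers in the sample output, are made up for this sketch):

```sql
-- Compare the planner's row estimate with the actual row count for one
-- query.  "orders" and "customer_id" are hypothetical names.
EXPLAIN ANALYZE
SELECT * FROM orders WHERE customer_id = 42;

-- Illustrative output shape (values invented):
--   Index Scan using orders_customer_idx on orders
--     (cost=0.00..8.27 rows=1 width=64)
--     (actual time=0.04..0.91 rows=850 loops=1)
--
-- The planner estimated 1 row but found 850: exactly the kind of mismatch
-- the proposed "ANALYZE log" would capture automatically across all queries.
```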
> 
> Then, when the next ANALYZE is run, it could do its usual analysis and
> apply some additional heuristics based upon what it learns from the
> "ANALYZE log," possibly to do several things:
> 
> 1) Automatically increase/decrease the SET STATISTICS target used by the
> analysis, for example, increasing it as a table grows larger and the
> "randomness" grows less than linearly with size (e.g., if you have 50 or
> 60 groups in a 1,000,000-row table, that certainly needs a higher SET
> STATISTICS, and I do it on my tables).
> 
> 2) Have an additional value in the statistics table called the
> "index_heuristic" or "random_page_adjustment_heuristic" which, when 1,
> does nothing, but otherwise modifies the cost of using an index/seq scan
> by that factor - and never change this more than a few percent each
> ANALYZE.
> 
> 3) Flag really bad query estimates in a second log (maybe the regular
> log) - let it do an analysis of the queries and flag anything two or
> three standard deviations outside.
> 
> Now, I suggest all this stuff in the name of usability and
> self-maintainability. Unfortunately, I don't have the wherewithal to
> actually assist in development.
> 
> Another possibility is to default "enable_seqscan" to OFF, or whatever
> the parameter is (I did my optimizing a while ago so it's fading), so
> that if there's an index, it will be used, regardless - as this seems to
> be what the great majority of people expect to happen. And/or add this to
> a FAQ, and let us all reply "see http://.../indexfaq.html." :)
> 
> Cheers,
> 
> Doug
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 5: Have you checked our extensive FAQ?
> 
> http://www.postgresql.org/users-lounge/docs/faq.html

-- 
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@xxxxxxxxxxxxxxxx               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend
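For reference, Doug's points 1), 2), and the enable_seqscan idea all correspond to knobs that can be turned by hand today; the automatic feedback loop is what is being proposed. A hedged sketch ("orders" and "customer_id" are hypothetical names, and the numeric values are illustrative, not recommendations):

```sql
-- Point 1: raise the per-column statistics target on a large table so
-- ANALYZE samples more values than the (much lower) default:
ALTER TABLE orders ALTER COLUMN customer_id SET STATISTICS 100;
ANALYZE orders;

-- Point 2: the cost factor the "random_page_adjustment_heuristic" would
-- tune automatically can be nudged manually:
SET random_page_cost = 2.0;

-- The parameter Doug half-remembers as "use_seq_scan":
SET enable_seqscan = off;
```

Note that `enable_seqscan = off` does not forbid sequential scans outright; it only makes them prohibitively expensive to the planner, so a plan with no usable index still falls back to one.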