On 10.10.2014 16:21, Craig James wrote: > On Fri, Oct 10, 2014 at 5:10 AM, Tomas Vondra <tv@xxxxxxxx > <mailto:tv@xxxxxxxx>> wrote: > > > I've gone looking for papers on this topic but from what I read this > > isn't so. To get any noticeable improvement you need to read 10-50% of > > the table and that's effectively the same as reading the entire table > > -- and it still had pretty poor results. All the research I could find > > went into how to analyze the whole table while using a reasonable > > amount of scratch space and how to do it incrementally. > > I think it's really difficult to discuss the estimation without some > basic > agreement on what are the goals. Naturally, we can't get a perfect > estimator with small samples (especially when the sample size is > fixed and > not scaling with the table). But maybe we can improve the estimates > without scanning most of the table? > > FWIW I've been playing with the adaptive estimator described in [1] and > the results looks really interesting, IMHO. So far I was testing it on > synthetic datasets outside the database, but I plan to use it instead of > our estimator, and do some more tests. > > > We've solved this problem using an external (non-Postgres) dynamically > optimizing index. In addition to the "early abort," we also require an > efficient "late start", the equivalent of "offset 100 limit 10". It's a > common problem for web sites that let users page through data with just > a tiny amount of state information (a cookie). Yeah, paging is a known example, both for the inefficiency once you get to pages far away, and because of the planning challenges. I think there are known solutions to this problem (http://use-the-index-luke.com/blog/2013-07/pagination-done-the-postgresql-way), although those are not applicable to all cases. But I'm not sure how that's related to the ndistinct estimation problem, discussed in this thread (or rather in this subthread)? > Our index is for chemical structures. Chemicals are indexed on > chemical fragments > <http://emolecules.com/info/molecular-informatics>. A search > typically starts with 50-200 indexed "columns" (chemical fragments). > The query is always flat, "A and B and ... and Z". The indexed > fragments are both correlated (the existence of one strongly raises > the chances of another) and anti-correlated (certain combinations are > very rare). Maybe I don't understand the problem well enough, but isn't this a perfect match for GIN indexes? I mean, you essentially need to do queries like "WHERE substance @@ ('A & B & !C')" etc. Which is exactly what GIN does, because it keeps pointers to tuples for each fragment. > Static planners simply can't handle the "early abort" condition, > even with good statistics. Many have pointed out that data are > "lumpy" rather than well distributed. A more subtle problem is that > you can have evenly distributed data, but badly distributed > correlations. "Agnes" and "Bob" may be names that are distributed > well in a real-estate database, but it might happen that all of the > information about homes whose owners' names are "Agnes" and "Bob" > occurs at the very end of all of your data because they just got > married and bought a house. > > The end result is that even with perfect statistics on each column, > you're still screwed. The combinatorial explosion of possible > correlations between indexes is intractable. Static planners clearly have limitations, but we don't have dynamic planning in PostgreSQL, so we have to live with them. And if we could improve the quality of estimates - lowering the probability of poorly performing plans, it's probably good to do that. It won't be perfect, but until we have dynamic planning it's better than nothing. regards Tomas -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance