[Please keep replies on the list]

On 2020-06-24 11:02:22 +0000, Klaudie Willis wrote:
> Holzer, thanks for your feedback. Yes, your guess is very good. The
> data consists of millions of instruments that occur in only a handful
> of cases (rare), and instruments that are very common.
>
> I am still a little surprised that it is so hard to sample data and
> estimate distinct values to within a factor of 1000. My intuition
> misleads me into thinking this should be simpler, but I understand
> that this is no simple task at all. For your information, it seems
> like SQL Server 2016 makes the same "mistake" when calculating
> distincts (or 1/density, as they call it). I have similar data there
> that I have looked into, and it seems "fooled" as well.

Yes, estimating the number of distinct values from a relatively small
sample is hard when you don't know the underlying distribution. It
might be possible to analyze the sample to find the distribution and
get a better estimate. But I'm not sure how useful that would really
be: if a few values are very common and most are very rare, you are
probably also much more likely to use the common values in a query,
and for those you would massively underestimate their frequency if
you had an accurate n_distinct value. That might be just as bad or
even worse. (A toy simulation of the sampling problem is sketched at
the end of this mail.)

I'm wondering whether it would make sense to use a range instead of a
single value in cost calculations. Then the planner could prefer
plans which have reasonable cost over the whole range over plans
which are good at one end of the range but horrible at the other. I'm
afraid that would lead to a combinatorial explosion, though. Even
with a small number of ranges there would be a large number of
possible combinations to calculate, so one would probably have to
resort to Monte Carlo simulation or something like that (a rough
sketch of that idea is also at the end of this mail).

        hp

-- 
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@xxxxxx         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
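
A quick toy simulation of the sampling problem (plain Python, made-up
data, not PostgreSQL code): a handful of very common values plus a
couple of hundred thousand rare ones, a 30,000-row sample, and a
Haas/Stokes-style estimator similar in spirit to the one ANALYZE uses.

# Toy simulation (illustrative only): why a small random sample badly
# underestimates the number of distinct values on a skewed column where
# a few values are very common and most are rare.
import random
from collections import Counter

random.seed(1)

# Synthetic "instrument" column: 10 very common values plus 200,000
# rare ones that each occur only a handful of times.
column = []
for i in range(10):
    column += [f"common_{i}"] * 50_000
for i in range(200_000):
    column += [f"rare_{i}"] * random.randint(1, 3)

N = len(column)
true_distinct = len(set(column))

# Sample roughly what a default statistics target would look at.
n = 30_000
counts = Counter(random.sample(column, n))
d = len(counts)                                  # distinct values seen
f1 = sum(1 for c in counts.values() if c == 1)   # values seen exactly once

# Haas/Stokes-style estimate: D_hat = n*d / (n - f1 + f1*n/N)
est = n * d / (n - f1 + f1 * n / N)

print(f"rows={N}  true distinct={true_distinct}")
print(f"sample: n={n}  d={d}  f1={f1}")
print(f"estimated distinct={est:.0f}  "
      f"(off by a factor of {true_distinct / est:.1f})")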
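
And a rough sketch of the "range instead of a single value" idea
(again just illustrative Python; the plans, their cost formulas and
the uncertainty range are completely invented and bear no resemblance
to the real planner): draw n_distinct from the assumed range, cost
each candidate plan at every draw, and compare plans on their average
and worst-case cost over the whole range rather than at a single
point estimate.

# Monte Carlo sketch of costing plans over a range of n_distinct
# estimates instead of a single point estimate.
import math
import random

random.seed(1)

ROWS = 10_000_000

# Hypothetical candidate plans, costed as a function of n_distinct.
def cost_index_scan(n_distinct):
    # cheap when each value is rare, expensive when values repeat a lot
    return 50 + 100.0 * ROWS / n_distinct

def cost_seq_scan(n_distinct):
    # flat: never great, never horrible
    return 50_000.0

plans = {"index scan": cost_index_scan, "seq scan": cost_seq_scan}

# Assume only a factor-of-1000 range for n_distinct is known and
# sample it log-uniformly.
low, high = 2_000, 2_000_000
draws = [10 ** random.uniform(math.log10(low), math.log10(high))
         for _ in range(1_000)]

for name, cost in plans.items():
    costs = sorted(cost(nd) for nd in draws)
    avg = sum(costs) / len(costs)
    print(f"{name:10s}  avg cost={avg:9.0f}  worst cost={costs[-1]:9.0f}")

# With a point estimate at the high end of the range the index scan
# looks unbeatable; judged over the whole range, the seq scan is the
# plan that is never horrible.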