On 10.10.2014 14:10, Tomas Vondra wrote:
> On 10 October 2014 13:16, Greg Stark wrote:
>> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh@xxxxxxxxxxxx> wrote:
>>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>>> fixed-size sample. If we're allowed to sample, say, 1% of the table,
>>> we can get a MUCH more accurate n_distinct estimate using multiple
>>> algorithms, of which HLL is one. While n_distinct will still have
>>> some variance, it'll be over a much smaller range.
>>
>> I've gone looking for papers on this topic but from what I read this
>> isn't so. To get any noticeable improvement you need to read 10-50%
>> of the table and that's effectively the same as reading the entire
>> table -- and it still had pretty poor results. All the research I
>> could find went into how to analyze the whole table while using a
>> reasonable amount of scratch space and how to do it incrementally.
>
> I think it's really difficult to discuss the estimation without some
> basic agreement on what the goals are. Naturally, we can't get a
> perfect estimator with small samples (especially when the sample size
> is fixed and does not scale with the table). But maybe we can improve
> the estimates without scanning most of the table?
>
> FWIW I've been playing with the adaptive estimator described in [1]
> and the results look really interesting, IMHO. So far I've been
> testing it on synthetic datasets outside the database, but I plan to
> plug it in instead of our current estimator and do some more tests.

Attached is an experimental patch implementing the adaptive estimator.
It was fairly simple, although the code is a bit messy. It only
computes the estimate for the "scalar" case (i.e. data types we can
sort). Implementing it for the "minimal" case is possible, but requires
a bit more work. For now it only computes the estimate and prints a
WARNING with both the current and the new value - the old estimate is
still what gets stored.

I also attach a few examples of synthetic datasets with various
distributions, which I used for testing. In all cases there's a single
table with 10M rows and a single INT column. There are two kinds of
skew:

1) smooth skew

   - N distinct values (100, 10,000 and 100,000)
   - the average moves towards 0 as 'k' increases ('k' between 1 and 9)
   - smooth distribution of the frequencies

   INSERT INTO test
        SELECT pow(random(), k) * 10000
          FROM generate_series(1, 10000000);

2) step skew

   - a few very frequent values, many rare values
   - for example this generates 5 very frequent and ~10k rare values

   INSERT INTO test
        SELECT (CASE WHEN (v < 90000) THEN MOD(v, 5) ELSE v END)
          FROM (SELECT (random() * 100000)::int AS v
                  FROM generate_series(1, 10000000)) foo;


Results
=======

I tested this with various statistics target settings (10, 100, 1000),
which translate to different sample sizes.
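For reference, the numbers can be compared to reality e.g. like this
(just a sketch - the column name "val" is made up, as the INSERTs
above don't name the column; PostgreSQL samples 300 * statistics
target rows, and with the patch applied ANALYZE prints both estimates
in a WARNING while pg_stats still shows the current one):

   -- per-column statistics target (300 * target rows get sampled)
   ALTER TABLE test ALTER COLUMN val SET STATISTICS 100;
   ANALYZE test;

   -- estimate stored by ANALYZE (negative = fraction of the row count)
   SELECT n_distinct FROM pg_stats
    WHERE tablename = 'test' AND attname = 'val';

   -- actual number of distinct values, for comparison
   SELECT COUNT(DISTINCT val) FROM test;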
statistics target 100 (default, 30k rows, 0.3% sample)
======================================================

a) smooth skew, 101 values, different skew ('k')

     k     current    adaptive
   ----------------------------
     1        101         102
     3        101         102
     5        101         102
     7        101         102
     9        101         102

b) smooth skew, 10,001 values, different skew ('k')

     k     current    adaptive
   ----------------------------
     1       9986       10542
     3       8902       10883
     5       7579       10824
     7       6639       10188
     9       5947       10013

c) step skew (different numbers of values)

    values    current    adaptive
   -------------------------------
       106        106         107
       106         35         104
      1006        259        1262
     10006       2823       11047


statistics target 10 (3k rows, 0.03% sample)
============================================

a) smooth skew, 101 values, different skew ('k')

     k     current    adaptive
   ----------------------------
     1        101         102
     3        101         102
     5        101         102
     7        101         102
     9        101         102

b) smooth skew, 10,001 values, different skew ('k')

     k     current    adaptive
   ----------------------------
     1       9846       10014
     3       4399        7190
     5       2532        5477
     7       1938        4932
     9       1623        1623

c) step skew (different numbers of values)

    values    current    adaptive
   -------------------------------
       106        100         114
       106          5           5
      1006         37         532
     10006        323       20970


statistics target 1000 (300k rows, 3% sample)
=============================================

a) smooth skew, 101 values, different skew ('k')

     k     current    adaptive
   ----------------------------
     1        101         102
     3        101         102
     5        101         102
     7        101         102
     9        101         102

b) smooth skew, 10,001 values, different skew ('k')

     k     current    adaptive
   ----------------------------
     1      10001       10002
     3      10000       10000
     5       9998       10011
     7       9973       10045
     9       9939       10114

c) step skew (different numbers of values)

    values    current    adaptive
   -------------------------------
       106        106         107
       106        101         107
      1006        957        1096
     10006       9551       10550


Summary
=======

I have yet to see an example where the adaptive estimator produces
worse results than the current estimator, irrespective of the
distribution and the sample size / statistics target.

What really matters is the sample size relative to the table size, so
I'll refer to the samples as 0.03%, 0.3% and 3% rather than by the
statistics target.

For the large sample (3%) both estimators produce reasonably accurate
results. That however won't last as tables grow, because the number of
rows we sample is static (it does not grow with the table).

As the sample shrinks, the adaptive estimator starts winning. For the
0.3% sample the difference is easily 3x for the high-skew cases. E.g.
for one of the "step skew" distributions the actual ndistinct value is
10006; the current estimator gives 2823 while the adaptive one gives
11047. That's a ratio error of ~3.5x vs. ~1.1x.

For the tiny 0.03% sample the difference gets even more significant,
especially for the step-skew cases, where the improvement is often an
order of magnitude.


Proposal
========

I think the adaptive estimator works very well, and I plan to submit
it to the next commitfest after a bit of polishing. Examples of
distributions that break it are of course welcome.

Also, I think it's clear it would be useful to be able to scale the
sample proportionally to the table (instead of only allowing the
current fixed-size statistics target approach). I understand that may
result in scanning a large part of the table, but I don't see a
problem with that as long as it's not the default behavior (and is
clearly documented). I don't see a way around it - small samples
simply result in poor estimates.

I was thinking this could be driven by the statistics target, but
that's already used for other things (e.g. the size of the MCV list
and the histogram), and we don't want to break those by giving it yet
another function. Ideas?

regards
Tomas
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index c09ca7e..7018987 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -110,6 +110,7 @@ static void update_attstats(Oid relid, bool inh,
 static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 
+static double optimize_estimate(int total_rows, int d, int *f, int r);
 
 /*
  * analyze_rel() -- analyze one relation
@@ -2360,6 +2361,11 @@ compute_scalar_stats(VacAttrStatsP stats,
 	int			slot_idx = 0;
 	CompareScalarsContext cxt;
 
+	/* f values for the estimator - messy and we likely need much
+	 * less memory, but who cares (indexes go up to values_cnt) */
+	int			f_max = 0;		/* max number of duplicates */
+	int		   *f_count = (int *) palloc0(sizeof(int) * (values_cnt + 1));
+
 	/* Sort the collected values */
 	cxt.ssup = &ssup;
 	cxt.tupnoLink = tupnoLink;
@@ -2401,6 +2407,7 @@ compute_scalar_stats(VacAttrStatsP stats,
 				ndistinct++;
 				if (dups_cnt > 1)
 				{
+
 					nmultiple++;
 					if (track_cnt < num_mcv ||
 						dups_cnt > track[track_cnt - 1].count)
@@ -2426,6 +2433,12 @@ compute_scalar_stats(VacAttrStatsP stats,
 						track[j].first = i + 1 - dups_cnt;
 					}
 				}
+
+				/* increment the number of values with this number of
+				 * repetitions, and update the largest repetition count */
+				f_count[dups_cnt] += 1;
+				f_max = (f_max < dups_cnt) ? dups_cnt : f_max;
+
 				dups_cnt = 0;
 			}
 		}
@@ -2472,6 +2485,7 @@ compute_scalar_stats(VacAttrStatsP stats,
 		double		numer,
 					denom,
 					stadistinct;
+		double		adaptdistinct;	/* adaptive estimate */
 
 		numer = (double) samplerows *(double) d;
 
@@ -2485,6 +2499,12 @@ compute_scalar_stats(VacAttrStatsP stats,
 		if (stadistinct > totalrows)
 			stadistinct = totalrows;
 		stats->stadistinct = floor(stadistinct + 0.5);
+
+		/* compute the adaptive estimate */
+		adaptdistinct = optimize_estimate(totalrows, d, f_count, f_max);
+
+		elog(WARNING, "ndistinct estimate current=%.2f adaptive=%.2f", stadistinct, adaptdistinct);
+
 	}
 
 	/*
@@ -2810,3 +2830,84 @@ compare_mcvs(const void *a, const void *b)
 
 	return da - db;
 }
+
+
+/*
+ * We need to minimize this equality (find "m" solving it):
+ *
+ *     m - f1 - f2 = f1 * (A + A(m)) / (B + B(m))
+ *
+ * where A, B are effectively constants (not depending on m), and A(m)
+ * and B(m) are functions of m. This is equivalent to solving
+ *
+ *     0 = f1 * (A + A(m)) / (B + B(m)) - (m - f1 - f2)
+ *
+ * Instead of looking for the exact solution to this equation (which
+ * might be fractional), we'll look for a natural number minimizing
+ * the absolute difference. The number of (distinct) elements is a
+ * natural number, and we don't mind if the number is slightly wrong.
+ * It's just an estimate, after all, and the error from sampling will
+ * be much worse in most cases.
+ *
+ * We know the acceptable values of 'm' are [d,N], where 'd' is the
+ * number of distinct elements in the sample and N is the number of
+ * rows in the table (not just the sample). For large tables (billions
+ * of rows) walking that whole range would be quite time-consuming, so
+ * we approximate the solution by gradually increasing the step to ~1%
+ * of the current value of 'm'. This makes it much faster and yet very
+ * accurate.
+ *
+ * All this of course assumes the function behaves reasonably (does
+ * not oscillate etc.), but that's a safe assumption as the estimator
+ * would perform terribly otherwise.
+ */
+static double
+optimize_estimate(int total_rows, int d, int *f, int r)
+{
+	int			i, m;
+	double		A = 0.0, B = 0.0;
+	int			opt_m = 0;
+	double		opt_diff = total_rows;
+	int			step = 1;
+	int			ndistinct;
+
+	/* compute the 'constant' parts of the equality (A, B) */
+	for (i = 3; i <= r; i++)
+	{
+		double	p = i / (double) r;
+
+		A += pow((1 - p), r) * f[i];
+		B += i * pow((1 - p), r - 1) * f[i];
+	}
+
+	/* find the 'm' value minimizing the difference */
+	for (m = 1; m <= total_rows; m += step)
+	{
+		int		k = (f[1] + 2 * f[2]);
+		double	q = k / ((double) (r * 2 * m));
+
+		double	A_m = A + m * pow((1 - q), r);
+		double	B_m = B + k * pow((1 - q), r - 1);
+
+		double	diff = fabs(f[1] * (A_m / B_m) - (m - f[1] - f[2]));
+
+		/* if this is a better solution */
+		if (diff < opt_diff)
+		{
+			opt_diff = diff;
+			opt_m = m;
+		}
+
+		/* tweak the step to ~1% of 'm' to make it faster */
+		step = ((int) (0.01 * m) > 1) ? (int) (0.01 * m) : 1;
+	}
+
+	/* compute the final estimate */
+	ndistinct = d + opt_m - f[1] - f[2];
+
+	/* sanity checks that the estimate is within [d, total_rows] */
+	if (ndistinct < d)
+		ndistinct = d;
+	else if (ndistinct > total_rows)
+		ndistinct = total_rows;
+
+	return ndistinct;
+}
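
BTW the f[] array the patch builds is just a "frequency of
frequencies" histogram of the sample: f[1] is the number of values
seen exactly once, f[2] the number seen exactly twice, and so on. For
sanity-checking, the analogous histogram over the whole table can be
computed with plain SQL - again just a sketch, assuming the column is
called "val" (the test tables above don't name the column):

   SELECT cnt AS repetitions, COUNT(*) AS num_values
     FROM (SELECT val, COUNT(*) AS cnt
             FROM test
            GROUP BY val) foo
    GROUP BY cnt
    ORDER BY cnt;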
Attachment: ndistinct.sql (application/sql)