On 10.10.2014 14:10, Tomas Vondra wrote:
> On 10 October 2014 13:16, Greg Stark wrote:
>> On Thu, Oct 2, 2014 at 8:56 PM, Josh Berkus <josh@xxxxxxxxxxxx> wrote:
>>> Yes, it's only intractable if you're wedded to the idea of a tiny,
>>> fixed-size sample. If we're allowed to sample, say, 1% of the table,
>>> we can get a MUCH more accurate n_distinct estimate using multiple
>>> algorithms, of which HLL is one. While n_distinct will still have
>>> some variance, it'll be over a much smaller range.
>>
>> I've gone looking for papers on this topic but from what I read this
>> isn't so. To get any noticeable improvement you need to read 10-50%
>> of the table and that's effectively the same as reading the entire
>> table -- and it still had pretty poor results. All the research I
>> could find went into how to analyze the whole table while using a
>> reasonable amount of scratch space and how to do it incrementally.
>
> I think it's really difficult to discuss the estimation without some
> basic agreement on what the goals are. Naturally, we can't get a
> perfect estimator with small samples (especially when the sample size
> is fixed and does not scale with the table). But maybe we can improve
> the estimates without scanning most of the table?
>
> FWIW I've been playing with the adaptive estimator described in [1]
> and the results look really interesting, IMHO. So far I've been
> testing it on synthetic datasets outside the database, but I plan to
> plug it in instead of our current estimator and do some more tests.

Attached is an experimental patch implementing the adaptive estimator.
It was fairly simple, although the code is a bit messy. It only
computes the estimate for the "scalar" case (i.e. data types we can
sort). Implementing it for the "minimal" case is possible, but requires
a bit more work. For now it only computes the estimate and prints a
WARNING with both the current and the new value - the old estimate is
still what gets stored.

I also attach a few examples of synthetic datasets with various
distributions, which I used for testing. In all cases there's a single
table with 10M rows and a single INT column. There are two kinds of
skew:

1) smooth skew

   - N distinct values (100, 10,000 and 100,000)
   - the average moves towards 0 as 'k' increases ('k' between 1 and 9)
   - smooth distribution of the frequencies

   INSERT INTO test
        SELECT pow(random(), k) * 10000
          FROM generate_series(1, 10000000);

2) step skew

   - a few very frequent values, many rare values
   - for example this generates 5 very frequent and ~10k rare values

   INSERT INTO test
        SELECT (CASE WHEN (v < 90000) THEN MOD(v, 5) ELSE v END)
          FROM (SELECT (random() * 100000)::int AS v
                  FROM generate_series(1, 10000000)) foo;


Results
=======

I tested this with various statistics target settings (10, 100, 1000),
which translate to different sample sizes.
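For reference, the numbers can be compared to reality e.g. like this
(just a sketch - the column name "val" is made up, as the INSERTs
above don't name the column; PostgreSQL samples 300 * statistics
target rows, and with the patch applied ANALYZE prints both estimates
in a WARNING while pg_stats still shows the current one):

   -- per-column statistics target (300 * target rows get sampled)
   ALTER TABLE test ALTER COLUMN val SET STATISTICS 100;
   ANALYZE test;

   -- estimate stored by ANALYZE (negative = fraction of the row count)
   SELECT n_distinct FROM pg_stats
    WHERE tablename = 'test' AND attname = 'val';

   -- actual number of distinct values, for comparison
   SELECT COUNT(DISTINCT val) FROM test;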
statistics target 100 (default, 30k rows, 0.3% sample)
======================================================

a) smooth skew, 101 values, different skew ('k')

     k     current    adaptive
   ----------------------------
     1        101         102
     3        101         102
     5        101         102
     7        101         102
     9        101         102

b) smooth skew, 10,001 values, different skew ('k')

     k     current    adaptive
   ----------------------------
     1       9986       10542
     3       8902       10883
     5       7579       10824
     7       6639       10188
     9       5947       10013

c) step skew (different numbers of values)

    values    current    adaptive
   -------------------------------
       106        106         107
       106         35         104
      1006        259        1262
     10006       2823       11047


statistics target 10 (3k rows, 0.03% sample)
============================================

a) smooth skew, 101 values, different skew ('k')

     k     current    adaptive
   ----------------------------
     1        101         102
     3        101         102
     5        101         102
     7        101         102
     9        101         102

b) smooth skew, 10,001 values, different skew ('k')

     k     current    adaptive
   ----------------------------
     1       9846       10014
     3       4399        7190
     5       2532        5477
     7       1938        4932
     9       1623        1623

c) step skew (different numbers of values)

    values    current    adaptive
   -------------------------------
       106        100         114
       106          5           5
      1006         37         532
     10006        323       20970


statistics target 1000 (300k rows, 3% sample)
=============================================

a) smooth skew, 101 values, different skew ('k')

     k     current    adaptive
   ----------------------------
     1        101         102
     3        101         102
     5        101         102
     7        101         102
     9        101         102

b) smooth skew, 10,001 values, different skew ('k')

     k     current    adaptive
   ----------------------------
     1      10001       10002
     3      10000       10000
     5       9998       10011
     7       9973       10045
     9       9939       10114

c) step skew (different numbers of values)

    values    current    adaptive
   -------------------------------
       106        106         107
       106        101         107
      1006        957        1096
     10006       9551       10550


Summary
=======

I have yet to see an example where the adaptive estimator produces
worse results than the current estimator, irrespective of the
distribution and the sample size / statistics target.

What really matters is the sample size relative to the table size, so
I'll refer to the samples as 0.03%, 0.3% and 3% rather than by the
statistics target.

For the large sample (3%) both estimators produce reasonably accurate
results. That however won't last as tables grow, because the number of
rows we sample is static (it does not grow with the table).

As the sample shrinks, the adaptive estimator starts winning. For the
0.3% sample the difference is easily 3x for the high-skew cases. E.g.
for one of the "step skew" distributions the actual ndistinct value is
10006; the current estimator gives 2823 while the adaptive one gives
11047. That's a ratio error of ~3.5x vs. ~1.1x.

For the tiny 0.03% sample the difference gets even more significant,
especially for the step-skew cases, where the improvement is often an
order of magnitude.


Proposal
========

I think the adaptive estimator works very well, and I plan to submit
it to the next commitfest after a bit of polishing. Examples of
distributions that break it are of course welcome.

Also, I think it's clear it would be useful to be able to scale the
sample proportionally to the table (instead of only allowing the
current fixed-size statistics target approach). I understand that may
result in scanning a large part of the table, but I don't see a
problem with that as long as it's not the default behavior (and is
clearly documented). I don't see a way around it - small samples
simply result in poor estimates.

I was thinking this could be driven by the statistics target, but
that's already used for other things (e.g. the size of the MCV list
and the histogram), and we don't want to break those by giving it yet
another function. Ideas?

regards
Tomas
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index c09ca7e..7018987 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -110,6 +110,7 @@ static void update_attstats(Oid relid, bool inh,
 static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 static Datum ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
 
+static double optimize_estimate(int total_rows, int d, int *f, int r);
 
 /*
  * analyze_rel() -- analyze one relation
@@ -2360,6 +2361,11 @@ compute_scalar_stats(VacAttrStatsP stats,
 	int			slot_idx = 0;
 	CompareScalarsContext cxt;
 
+	/* f values for the estimator - messy and we likely need much
+	 * less memory, but who cares (indexes go up to values_cnt) */
+	int			f_max = 0;		/* max number of duplicates */
+	int		   *f_count = (int *) palloc0(sizeof(int) * (values_cnt + 1));
+
 	/* Sort the collected values */
 	cxt.ssup = &ssup;
 	cxt.tupnoLink = tupnoLink;
@@ -2401,6 +2407,7 @@ compute_scalar_stats(VacAttrStatsP stats,
 				ndistinct++;
 				if (dups_cnt > 1)
 				{
+
 					nmultiple++;
 					if (track_cnt < num_mcv ||
 						dups_cnt > track[track_cnt - 1].count)
@@ -2426,6 +2433,12 @@ compute_scalar_stats(VacAttrStatsP stats,
 						track[j].first = i + 1 - dups_cnt;
 					}
 				}
+
+				/* increment the number of values with this number of
+				 * repetitions, and update the largest repetition count */
+				f_count[dups_cnt] += 1;
+				f_max = (f_max < dups_cnt) ? dups_cnt : f_max;
+
 				dups_cnt = 0;
 			}
 		}
@@ -2472,6 +2485,7 @@ compute_scalar_stats(VacAttrStatsP stats,
 		double		numer,
 					denom,
 					stadistinct;
+		double		adaptdistinct;	/* adaptive estimate */
 
 		numer = (double) samplerows *(double) d;
 
@@ -2485,6 +2499,12 @@ compute_scalar_stats(VacAttrStatsP stats,
 		if (stadistinct > totalrows)
 			stadistinct = totalrows;
 		stats->stadistinct = floor(stadistinct + 0.5);
+
+		/* compute the adaptive estimate */
+		adaptdistinct = optimize_estimate(totalrows, d, f_count, f_max);
+
+		elog(WARNING, "ndistinct estimate current=%.2f adaptive=%.2f", stadistinct, adaptdistinct);
+
 	}
 
 	/*
@@ -2810,3 +2830,84 @@ compare_mcvs(const void *a, const void *b)
 
 	return da - db;
 }
+
+
+/*
+ * We need to minimize this equality (find "m" solving it):
+ *
+ *     m - f1 - f2 = f1 * (A + A(m)) / (B + B(m))
+ *
+ * where A, B are effectively constants (not depending on m), and A(m)
+ * and B(m) are functions of m. This is equivalent to solving
+ *
+ *     0 = f1 * (A + A(m)) / (B + B(m)) - (m - f1 - f2)
+ *
+ * Instead of looking for the exact solution to this equation (which
+ * might be fractional), we'll look for a natural number minimizing
+ * the absolute difference. The number of (distinct) elements is a
+ * natural number, and we don't mind if the number is slightly wrong.
+ * It's just an estimate, after all, and the error from sampling will
+ * be much worse in most cases.
+ *
+ * We know the acceptable values of 'm' are [d,N], where 'd' is the
+ * number of distinct elements in the sample and N is the number of
+ * rows in the table (not just the sample). For large tables (billions
+ * of rows) walking that whole range would be quite time-consuming, so
+ * we approximate the solution by gradually increasing the step to ~1%
+ * of the current value of 'm'. This makes it much faster and yet very
+ * accurate.
+ *
+ * All this of course assumes the function behaves reasonably (does
+ * not oscillate etc.), but that's a safe assumption as the estimator
+ * would perform terribly otherwise.
+ */
+static double
+optimize_estimate(int total_rows, int d, int *f, int r)
+{
+	int			i, m;
+	double		A = 0.0, B = 0.0;
+	int			opt_m = 0;
+	double		opt_diff = total_rows;
+	int			step = 1;
+	int			ndistinct;
+
+	/* compute the 'constant' parts of the equality (A, B) */
+	for (i = 3; i <= r; i++)
+	{
+		double	p = i / (double) r;
+
+		A += pow((1 - p), r) * f[i];
+		B += i * pow((1 - p), r - 1) * f[i];
+	}
+
+	/* find the 'm' value minimizing the difference */
+	for (m = 1; m <= total_rows; m += step)
+	{
+		int		k = (f[1] + 2 * f[2]);
+		double	q = k / ((double) (r * 2 * m));
+
+		double	A_m = A + m * pow((1 - q), r);
+		double	B_m = B + k * pow((1 - q), r - 1);
+
+		double	diff = fabs(f[1] * (A_m / B_m) - (m - f[1] - f[2]));
+
+		/* if this is a better solution */
+		if (diff < opt_diff)
+		{
+			opt_diff = diff;
+			opt_m = m;
+		}
+
+		/* tweak the step to ~1% of 'm' to make it faster */
+		step = ((int) (0.01 * m) > 1) ? (int) (0.01 * m) : 1;
+	}
+
+	/* compute the final estimate */
+	ndistinct = d + opt_m - f[1] - f[2];
+
+	/* sanity checks that the estimate is within [d, total_rows] */
+	if (ndistinct < d)
+		ndistinct = d;
+	else if (ndistinct > total_rows)
+		ndistinct = total_rows;
+
+	return ndistinct;
+}
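
BTW the f[] array the patch builds is just a "frequency of
frequencies" histogram of the sample: f[1] is the number of values
seen exactly once, f[2] the number seen exactly twice, and so on. For
sanity-checking, the analogous histogram over the whole table can be
computed with plain SQL - again just a sketch, assuming the column is
called "val" (the test tables above don't name the column):

   SELECT cnt AS repetitions, COUNT(*) AS num_values
     FROM (SELECT val, COUNT(*) AS cnt
             FROM test
            GROUP BY val) foo
    GROUP BY cnt
    ORDER BY cnt;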
Attachment: ndistinct.sql (application/sql)