Months ago I reported an issue with very slow index scans on tables with high "correlation" of their indexed column, due to (we concluded at the time) duplicate/repeated values of that column causing many lseek()s.  References:

https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.GA11880@xxxxxxxxxxxxx
https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com#520D6610.8040907@xxxxxxxxxx
https://www.postgresql.org/message-id/flat/n6cmpug13b9rk1srebjvhphg0lm8dou1kn%404ax.com#n6cmpug13b9rk1srebjvhphg0lm8dou1kn@xxxxxxx

My interim workarounds were a reindex job and daily-granularity partitions (despite the excessive number of child tables) to encourage seq scans for daily analysis jobs rather than index scans.  I've now cobbled together a patch to throw around and see if we can improve on that.

I tried several changes, hoping to discourage index scans.  The logic that seems to most accurately reflect costs has two changes.

First: Postgres' existing behavior stores table correlation (heap value vs. heap position) but doesn't look at index correlation, so it can't distinguish between a just-built index and a highly fragmented one, or one with highly non-sequential TIDs.  My patch causes ANALYZE to do a TID scan (sampled across the MCVs) to determine the correlation of heap TID vs. index tuple logical location (as opposed to the table correlation, computed as heap TID vs. heap value).

Second: the patch averages separate correlation values of small lists of 300 consecutive TIDs, rather than the coarse-granularity/aggregate correlation of a small sample of pages across the entire index.  Postgres' existing sampling is designed to give an even sample across all rows.  An issue with that coarse-granularity correlation is that the index can have broad correlation (to physical heap location) but with many small-scale deviations, which don't show up when sampling a relatively small fraction of a large table; and/or the effect of the deviations is insignificant/noise and the correlation is still computed near 1.

I believe the "large scale" correlation that postgres computes from its block sample fails to represent small-scale uncorrelated reads, which contribute a large number of random reads but are not included in the planner's cost.  Not only are the index reads highly random (which the planner already assumes), but the CTIDs referenced within a given btree leaf page are also substantially non-sequential.

It seems to affect INSERTs which, over a short interval of time, have a small-to-moderate range of column values, but which still have a strong overall trend in that column WRT time (and heap location).  Think of inserting a "normal" distribution of timestamps centered around now().  My original report shows an lseek() for nearly every read on the *heap*.

The original problem was on a table of telecommunications CDRs, indexed by "record opening time" (start time) and with a high correlation value in pg_stats.  We import data records from a file, which is probably more or less in order of "end time".  That still displays broad correlation on "start time", but individual TIDs are nowhere near sequential.  Two phone calls which end in the same 1-second interval are not unlikely to have started many minutes apart... but two calls which end within the same second are very likely to have started within an hour of each other, since the typical duration is <1h.
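To make "correlation of heap TID vs. index tuple logical location" concrete, here is a rough manual approximation (a hedged sketch, not what the patch computes internally: it looks only at the heap block number, and uses value correlation rather than the rank correlation ANALYZE uses), run against the toy table t defined further below:

| -- position in index order vs. heap block number of each tuple
| SELECT corr(logical_pos, heap_block) FROM (
|     SELECT row_number() OVER (ORDER BY i) AS logical_pos,
|            (ctid::text::point)[0]         AS heap_block
|     FROM t
| ) s;

For a freshly loaded, ordered table this comes out near 1; for the problem cases generated below it should come out noticeably lower than the table-level correlation reported in pg_stats.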
Back to the CDR table: insert volume is high, and there are substantial repeated keys, so the portion of an index scan returning CTIDs for some certain key value (a timestamp with second resolution in our case) ends up reading heap tuples for a non-negligible fraction of the table: maybe only 1%, but that's still ~300 seeks across a table which is 10s of GB, and it takes long enough that cache is inadequate to substantially mitigate the cost.

It's not clear to me that there's a good way to evenly *sample* a fraction of the index blocks in a manner which is agnostic to different AMs, and scanning the entirety of all indices on relations during (auto)analyze may be too expensive.  So I implemented an index scan over the MCV list.  I'm guessing this might cause the correlation to be under-estimated, and bitmap scans to be preferred somewhat more than justified, due to the btree insertion logic for repeated keys (used to avoid O(n^2) insertion behavior).  The MCV list isn't a perfect sampling basis either, since poor index correlation can also happen without repeated keys, eg. with normally distributed floating point values (or timestamps with fractional seconds).

I ran pageinspect on a recent child of the table that triggered the original report:

ts=# SELECT itemoffset, ctid FROM bt_page_items('cdrs_huawei_pgwrecord_2017_07_07_recordopeningtime_idx',6) LIMIT 22 OFFSET 1;
 itemoffset |  ctid
------------+--------
          2 | (81,4)
          3 | (5,6)
          4 | (6,5)
          5 | (9,1)
          6 | (10,1)
          7 | (14,1)
          8 | (21,8)
          9 | (25,1)
         10 | (30,5)
         11 | (41,1)
         12 | (48,7)
         13 | (49,3)
         14 | (50,2)
         15 | (51,1)
         16 | (54,1)
         17 | (57,7)
         18 | (58,6)
         19 | (59,6)
         20 | (63,5)
         21 | (71,6)
         22 | (72,5)
         23 | (78,7)
(22 rows)

=> 22 adjacent index tuples referencing 22 different heap pages, only 6 of which are sequential => 16 lseek()s, aka random page cost.

To generate data demonstrating this problem you can do things like:

| CREATE TABLE t(i int, j int);
| CREATE INDEX ON t(i);
\! time for a in `seq 99`; do psql -qc 'INSERT INTO t SELECT * FROM generate_series(1,99)'; done

-- or:
| INSERT INTO t SELECT (0.001*a+9*(random()-0.5))::int FROM generate_series(1,99999) a;

-- or this one, perhaps closest to our problem case:
| INSERT INTO t SELECT a FROM generate_series(1,999) a, generate_series(1,999) b ORDER BY a+b/9.9;

Note, I was able to create a case using floats without repeated keys:
| INSERT INTO w SELECT i/99999.0+pow(2,(-random())) FROM generate_series(1,999999) i ORDER BY i

| ANALYZE t;

-- note: the correlation is even higher for larger tables with the same number of
-- repeated keys, which is bad since the cost should go up linearly with the
-- size and the associated number of lseek()s.  That's one component of the problem,
-- and I think a necessary component of any solution.

postgres=# SELECT tablename, attname, correlation, array_length(most_common_vals,1) FROM pg_stats WHERE tablename LIKE 't%';
 tablename | attname | correlation | array_length
-----------+---------+-------------+--------------
 t         | i       |    0.995212 |           87
 t         | j       |             |
 t_i_idx   | i       |   -0.892874 |           87
 ^^^^^^^
... this last line is added by my patch.

That's a bad combination: the high table correlation means the planner thinks only a fraction of the heap will be read, and read sequentially, so it isn't discouraged from doing an index scan.  But the index TIDs are much less well correlated (0.89 vs 0.99).

Note: the negative correlation at the tuple level seems to be related to this comment:

src/backend/access/nbtree/nbtinsert.c- * Once we have chosen the page to put the key on, we'll insert it before
src/backend/access/nbtree/nbtinsert.c: * any existing equal keys because of the way _bt_binsrch() works.
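A quick (hedged) way to eyeball how scattered the heap pages are for a single key value in the toy table t above, i.e. roughly how many distinct heap pages an equality index scan would have to visit per key (this is just for illustration, not something the patch does):

| SELECT i, count(*) AS ntuples,
|        count(DISTINCT (ctid::text::point)[0]::int) AS heap_pages
| FROM t GROUP BY i ORDER BY heap_pages DESC LIMIT 10;

When heap_pages is close to ntuples for the popular keys, nearly every index tuple for that key implies a visit to a different heap page, which is what the strace output below shows at the syscall level.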
Note also:

|postgres=# SELECT leaf_fragmentation FROM pgstatindex('t_i_idx');
|leaf_fragmentation | 67.63

.. but keep in mind that leaf_fragmentation only checks leaf page order, and not CTID order within index pages.  The planner already assumes index pages are random reads.  Maybe it's a good indicator (?), but it doesn't lend itself to an accurate cost estimation.

For testing purposes, with:

| shared_buffers | 128kB
| public | t | table | pryzbyj | 35 MB |

| SET enable_bitmapscan=off;
| SET enable_seqscan=off;
| SELECT pg_backend_pid();
| SELECT sum(j) FROM t WHERE i<99;
| Time: 3974.205 ms

% sudo strace -p 530 2>/tmp/strace-pg
% sed -r '/^\(read|lseek/!d; s/ .*//' /tmp/strace-pg |sort |uniq -c |sort -nr |head
  39474 lseek(41,
    359 lseek(44,
      8 lseek(18,

=> 40k seeks on a 35MB table, that's 10 seeks per heap page!

open("base/12411/16634", O_RDWR) = 41
open("base/12411/16636", O_RDWR) = 44

postgres=# SELECT relfilenode, relname FROM pg_class WHERE relfilenode>16630;
16634 | t
16636 | t_i_idx

2017-07-07 17:45:54.075 CDT [6360] WARNING: HERE 1222: csquared=0.797225 minIO/R-P-C=109.750000 maxIO/R-P-C=4416.000000 costsize.c cost_index 702

With my patch, the index scan is estimated to take ~4400 seeks, rather than the actual 40k, probably because max_IO_cost assumes that a given heap page will be visited only once, whereas in the worst case each of its tuples requires a separate lseek().  I'm not suggesting changing that, since making max_IO 100x higher would probably change query plans more dramatically than desired.  But also note that, unpatched, with table correlation >0.99, postgres would've under-estimated the actual number of seeks not by a factor of ~10x but by ~400x (its estimate would be essentially min_IO_cost).

| postgres=# REINDEX INDEX t_i_idx;
| postgres=# ANALYZE t;
| postgres=# SELECT tablename, attname, correlation, array_length(most_common_vals,1) FROM pg_stats WHERE tablename LIKE 't%';
|  tablename | attname | correlation | array_length
| -----------+---------+-------------+--------------
|  t         | i       |    0.995235 |           67
|  t_i_idx   | i       |           1 |           67

=> Correctly distinguishes a freshly reindexed table.

% sudo strace -p 6428 2>/tmp/strace-pg8
% sed -r '/^\(read|lseek/!d; s/ .*//' /tmp/strace-pg8 |sort |uniq -c |sort -nr
     99 lseek(37,

2017-07-07 17:49:47.745 CDT [6428] WARNING: HERE 1222: csquared=1.000000 minIO/R-P-C=108.500000 maxIO/R-P-C=4416.000000 costsize.c cost_index 702

=> csquared=1, so the min_IO cost is used instead of something close to max_IO.  (The patch doesn't change the computation of those values at all; it just changes the correlation, whose squared value is used to interpolate between the correlated/min cost and the uncorrelated/max cost.)  The correlated estimate of 108 seeks vs. 99 actual is essentially what unpatched postgres would've computed by using the table correlation of 0.9952, implicitly assuming the index to be similarly correlated.

I hope that demonstrates the need to distinguish index correlation from table correlation.  I'm hoping for comments on the existing patch, specifically whether there's a better way to sample the index than "MCVs or partial index scan".  I've left in place some fragments of an earlier implementation involving btree page-level fragmentation (like pgstatindex).

Also: does it make sense to keep an MCV list/histogram for a non-expression index which duplicates a table column?  The stats lookup in selfuncs.c btcostestimate() would have to take the correlation from the index, and the rest of the stats from its table.  A bitmap layer adds overhead, which should be avoided if not needed.
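For reference, the interpolation those csquared numbers feed into is the existing logic in cost_index() (visible as context in the costsize.c hunk of the patch below); the patch only changes where indexCorrelation comes from, not this formula:

|     csquared = indexCorrelation * indexCorrelation;
|     run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);

So csquared near 1 yields the fully-correlated min_IO_cost, and csquared near 0 the fully-random max_IO_cost.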
That bitmap overhead shouldn't be a huge impact, though, and I think its relative effect is only high when returning a small number of rows.  I'm thinking of a few cases:

- unique or low-cardinality index scans, or scans without duplicate keys: these should have a low index_pages_fetched(), so max_IO_cost should not be very different from min_IO_cost, and the new index-based correlation shouldn't have an effect much different from the table correlation;
- unclustered/uncorrelated tables: tables whose heap has low correlation are already discouraged from index scans; this includes tables whose column is UPDATEd and not just INSERTed;
- tables with correlated heap AND index: csquared should still be ~0.99 and not change much;
- correlated table with uncorrelated index: this is the target case, with the intended behavior change.

My apologies: I think that's a bit of a brain dump.

Justin
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2b63827..1acf118 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -98,6 +98,11 @@ static int compare_rows(const void *a, const void *b);
 static int acquire_inherited_sample_rows(Relation onerel, int elevel,
                               HeapTuple *rows, int targrows,
                               double *totalrows, double *totaldeadrows);
+static int acquire_sample_index(Relation onerel, Relation index, int elevel,
+                              HeapTuple *rows, int targrows,
+                              double *totalrows, double *totaldeadrows);
+static double correlation(int values_cnt, double corr_xysum);
+static double idx_correlation(Oid indexoid, VacAttrStatsP stats, int targrows, int num_mcvs, Datum *mcv_values);
 static void update_attstats(Oid relid, bool inh,
                 int natts, VacAttrStats **vacattrstats);
 static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
@@ -206,6 +211,7 @@ analyze_rel(Oid relid, RangeVar *relation, int options,
      * used to do this in get_rel_oids() but seems safer to check after we've
      * locked the relation.
      */
+    // TODO: if (onerel->rd_rel->relkind == RELKIND_INDEX) => needs a separate sample_rows function?
     if (onerel->rd_rel->relkind == RELKIND_RELATION ||
         onerel->rd_rel->relkind == RELKIND_MATVIEW)
     {
@@ -433,37 +439,41 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
         {
             AnlIndexData *thisdata = &indexdata[ind];
             IndexInfo  *indexInfo;
+            ListCell   *indexpr_item;
 
             thisdata->indexInfo = indexInfo = BuildIndexInfo(Irel[ind]);
             thisdata->tupleFract = 1.0; /* fix later if partial */
-            if (indexInfo->ii_Expressions != NIL && va_cols == NIL)
+            if (va_cols != NIL)
             {
-                ListCell   *indexpr_item = list_head(indexInfo->ii_Expressions);
+                continue;
+            }
 
-                thisdata->vacattrstats = (VacAttrStats **)
+            indexpr_item = list_head(indexInfo->ii_Expressions);
+            thisdata->vacattrstats = (VacAttrStats **)
                     palloc(indexInfo->ii_NumIndexAttrs * sizeof(VacAttrStats *));
-                tcnt = 0;
-                for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
-                {
-                    int         keycol = indexInfo->ii_KeyAttrNumbers[i];
+            tcnt = 0;
+            for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+            {
+                int         keycol = indexInfo->ii_KeyAttrNumbers[i];
+                Node       *indexkey=NULL;
 
-                    if (keycol == 0)
-                    {
-                        /* Found an index expression */
-                        Node       *indexkey;
-
-                        if (indexpr_item == NULL)   /* shouldn't happen */
-                            elog(ERROR, "too few entries in indexprs list");
-                        indexkey = (Node *) lfirst(indexpr_item);
-                        indexpr_item = lnext(indexpr_item);
-                        thisdata->vacattrstats[tcnt] =
-                            examine_attribute(Irel[ind], i + 1, indexkey);
-                        if (thisdata->vacattrstats[tcnt] != NULL)
-                            tcnt++;
-                    }
+                if (keycol == 0)
+                {
+                    /* Found an index expression */
+                    if (indexpr_item == NULL)   /* shouldn't happen */
+                        elog(ERROR, "too few entries in indexprs list");
+                    indexkey = (Node *) lfirst(indexpr_item);
+                    indexpr_item = lnext(indexpr_item);
                 }
-                thisdata->attr_cnt = tcnt;
+
+                thisdata->vacattrstats[tcnt] =
+                    examine_attribute(Irel[ind], i + 1, indexkey);
+
+                if (thisdata->vacattrstats[tcnt] != NULL)
+                    tcnt++;
             }
+
+            thisdata->attr_cnt = tcnt;
         }
     }
@@ -548,11 +558,38 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
             MemoryContextResetAndDeleteChildren(col_context);
     }
 
+# if 0
+    for (ind = 0; ind < nindexes; ind++)
+    {
+        int         indnumrows;
+        double      indtotalrows,
+                    indtotaldeadrows;
+        HeapTuple  *indrows= palloc(targrows * sizeof(HeapTuple));
+        /*
+         * Attempted to do an index scan for each index, in order to get
+         * correlation stats of the HEAP; I was thinking
+         * (probably incorrectly) that the high random cost was
+         * due to index seeks, and that it was needed to
+         * traverse the index in logical rather than physical
+         * order to determine its correlation.  But the planner
+         * already assumes that index reads are random.
+         */
+
+        indnumrows = acquire_sample_index (onerel, Irel[ind], elevel,
+                indrows, targrows, &indtotalrows, &indtotaldeadrows);
+        compute_index_stats(Irel[ind], indnumrows,
+                indexdata+ind, 1, /* XXX: should update c_i_s to do one index only? */
+                indrows, indnumrows,
+                col_context);
+
+    }
+# else
     if (hasindex)
-        compute_index_stats(onerel, totalrows,
-                            indexdata, nindexes,
-                            rows, numrows,
-                            col_context);
+        compute_index_stats(onerel, numrows,
+                            indexdata, nindexes,
+                            rows, numrows,
+                            col_context);
+# endif
 
     MemoryContextSwitchTo(old_context);
     MemoryContextDelete(col_context);
@@ -721,10 +758,6 @@ compute_index_stats(Relation onerel, double totalrows,
                     rowno;
         double      totalindexrows;
 
-        /* Ignore index if no columns to analyze and not partial */
-        if (attr_cnt == 0 && indexInfo->ii_Predicate == NIL)
-            continue;
-
         /*
          * Need an EState for evaluation of index expressions and
         * partial-index predicates.  Create it in the per-index context to be
@@ -767,6 +800,7 @@ compute_index_stats(Relation onerel, double totalrows,
                 if (!ExecQual(predicate, econtext))
                     continue;
             }
+
             numindexrows++;
 
             if (attr_cnt > 0)
@@ -790,6 +824,7 @@ compute_index_stats(Relation onerel, double totalrows,
                     VacAttrStats *stats = thisdata->vacattrstats[i];
                     int         attnum = stats->attr->attnum;
 
+                    stats->rows=rows;
                     if (isnull[attnum - 1])
                     {
                         exprvals[tcnt] = (Datum) 0;
@@ -815,7 +850,7 @@ compute_index_stats(Relation onerel, double totalrows,
         totalindexrows = ceil(thisdata->tupleFract * totalrows);
 
         /*
-         * Now we can compute the statistics for the expression columns.
+         * Now we can compute the statistics.
          */
         if (numindexrows > 0)
         {
@@ -830,11 +865,16 @@ compute_index_stats(Relation onerel, double totalrows,
                 stats->exprvals = exprvals + i;
                 stats->exprnulls = exprnulls + i;
                 stats->rowstride = attr_cnt;
+// Low stats target for non-expr indices to avoid having a duplicate MCV and/or hist ?
+// avoid keeping a (duplicate) MCV list or
+// histogram for non-expression columns?
+// if (indexInfo->ii_KeyAttrNumbers[i]!=0) // && stats->attr->attstattarget!=0)
+//     stats->attr->attstattarget=0;
+
                 (*stats->compute_stats) (stats,
                                          ind_fetch_func,
                                          numindexrows,
                                          totalindexrows);
-
                 /*
                  * If the n_distinct option is specified, it overrides the
                  * above computation.  For indices, we always use just
@@ -896,7 +936,7 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
     /*
      * When analyzing an expression index, believe the expression tree's type
      * not the column datatype --- the latter might be the opckeytype storage
-     * type of the opclass, which is not interesting for our purposes.  (Note:
+     * type of the opclass, which is not interesting for our purposes.  XXX (Note:
      * if we did anything with non-expression index columns, we'd need to
      * figure out where to get the correct type info from, but for now that's
      * not a problem.)  It's not clear whether anyone will care about the
@@ -1478,6 +1518,305 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 }
 
 
+// this doesn't work, since broad/coarse correlation doesn't well represent
+// non-sequential heap page access which may happen at a small scale..
+static int
+acquire_sample_index(Relation onerel, Relation index, int elevel,
+                     HeapTuple *rows, int targrows,
+                     double *totalrows, double *totaldeadrows)
+{
+    int         numrows = 0;    /* # rows now in reservoir */
+    double      samplerows = 0; /* total # rows collected */
+    double      rowstoskip = -1;    /* -1 means not set yet */
+
+    HeapTuple   targtuple;
+    // Snapshot snapshot = RegisterSnapshot(GetTransactionSnapshot());
+    // Snapshot snapshot = GetLatestSnapshot();
+    Snapshot    snapshot = GetActiveSnapshot();
+    // Snapshot snapshot = GetOldestSnapshot();
+
+    /* Prepare for sampling rows */
+    ReservoirStateData rstate;
+    IndexScanDesc index_scan;
+    reservoir_init_selection_state(&rstate, targrows);
+
+    // ScanKeyData scankeys[1];
+    // ScanKeyEntryInitialize(scankeys, SK_ISNULL|SK_SEARCHNOTNULL, 1, InvalidStrategy, InvalidOid, InvalidOid, InvalidOid, (Datum) 0);
+    // XXX search not null ??
+    // SK_SEARCHNOTNULL
+    index_scan= index_beginscan(onerel, index, snapshot, 0, 0);
+    // index_rescan(index_scan, scankeys, 0, NULL, 0);
+    index_rescan(index_scan, NULL, 0, NULL, 0);
+
+    // while ((targtuple = index_getnext(index_scan, ForwardScanDirection)) != NULL)
+    // XXX: it's pretty unfortunate to do a full index scan for each table
+    // being analyzed ... esp. since a design goal is to avoid very
+    // inefficient random read patterns during index scans... consider
+    // alternatives like index scanning for each MCV ?  However I suspect
+    // that would tend to strongly underestimate correlation in some cases
+    // and too strongly discourage index scans..
+    // Probably go back to a block sample of leaf pages, counting the
+    // linearity of its referenced heap blocks within an index page, but
+    // not from one index page to another???
+
+    // for (unsigned long int c=0; (targtuple = index_getnext(index_scan, BackwardScanDirection)) != NULL ; ++c) // Is there a more useful value to use here ???
+    for (unsigned long int c=0; (targtuple = index_getnext(index_scan, ForwardScanDirection)) != NULL ; ++c) // Is there a more useful value to use here ???
+    // ItemPointer tid;
+    // while ((tid = index_getnext_tid(index_scan, ForwardScanDirection)) != NULL)
+    {
+        HeapTuple   h = heap_copytuple(targtuple);
+
+        // ItemPointerSet(&(h->t_self), c>>16, c&0xffff );
+        // index_fetch_heap(index_scan);
+
+        ItemPointerSet(&(h->t_self), c>>16, c&0xffff ); // no good, the correlation of heap value WRT index location is 100% ...
+        // h->t_self.ip_blkid.bi_lo=c&0xffff; // >>8; // ItemPointerSet(&(h->t_self), c>>16, c&0xffff );
+
+        /* Vitter's algorithm, see above */
+        // ItemPointerSetBlockNumber(&h->t_self, c);
+        // ItemPointerSetOffsetNumber(&h->t_self, c);
+        // h->t_self.ip_blkid.bi_hi=c>>16;
+        // h->t_self.ip_blkid.bi_lo=c&OffsetNumberMask;
+        // h->t_self.ip_blkid.bi_hi+=c/30;
+
+        if (numrows < targrows) {
+            rows[numrows++] = h;
+            samplerows += 1;
+            continue;
+        }
+
+        if (rowstoskip < 0)
+            rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
+
+        if (rowstoskip <= 0)
+        {
+            int         k = (int) (targrows * sampler_random_fract(rstate.randstate));
+
+            Assert(k >= 0 && k < targrows);
+            heap_freetuple(rows[k]);
+            rows[k] = h;
+        }
+
+        rowstoskip -= 1;
+        samplerows += 1;
+    }
+
+    index_endscan(index_scan);
+
+    if (numrows == targrows)
+        qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
+
+    *totalrows = samplerows;
+    // totalblocks = RelationGetNumberOfBlocks(onerel);
+    // *totalrows = vac_estimate_reltuples(onerel, true, totalblocks, bs.m, liverows);
+    // if (bs.m > 0)
+    //     *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
+    // else
+    //     *totaldeadrows = 0.0;
+
+    ereport(elevel,
+            (errmsg("\"%s\": scanned %d of %u pages, "
+                    "containing %.0f live rows and %.0f dead rows; "
+                    "%d rows in sample, %.0f estimated total rows",
+                    RelationGetRelationName(index),
+                    -1, -1,
+                    -1.0, -1.0,
+                    numrows, *totalrows)));
+
+
+    return numrows ;
+}
+
+/*
+ * Return correlation coefficient given sum(x*y), where x is a list of
+ * positions ordered by one thing, and y is a list of positions ordered by
+ * another thing (eg. heap value vs. heap location).
+ */
+static double
+correlation(int values_cnt, double corr_xysum)
+{
+    /*----------
+     * Since we know the x and y value sets are both
+     *      0, 1, ..., values_cnt-1
+     * we have sum(x) = sum(y) =
+     *      (values_cnt-1)*values_cnt / 2
+     * and sum(x^2) = sum(y^2) =
+     *      (values_cnt-1)*values_cnt*(2*values_cnt-1) / 6.
+     *----------
+     * ... and the correlation coefficient reduces to */
+    double      corr_xsum = ((double) (values_cnt - 1)) *
+        ((double) values_cnt) / 2.0;
+    double      corr_x2sum = ((double) (values_cnt - 1)) *
+        ((double) values_cnt) * (double) (2 * values_cnt - 1) / 6.0;
+
+    return (values_cnt * fabs(corr_xysum) - corr_xsum * corr_xsum) /
+        (values_cnt * corr_x2sum - corr_xsum * corr_xsum);
+}
+
+/*
+ * Compute avg of separately-computed correlation values, rather than
+ * correlation of single sample across entire table, which overestimates
+ * correlation for large tables.  targrows is the number of pages to sample in
+ * each batch.
+ */
+static double
+idx_correlation (Oid indexoid, VacAttrStatsP stats, int targrows, int num_mcvs, Datum *mcv_values)
+{
+    HeapTuple  *rows= stats->rows; // reusing this allocation
+    double      corr_sum=0;
+    int         corr_samples=0;
+    int         numrows = 0;
+    double      deadrows = 0;
+    double      liverows = 0;
+    long int    batch;
+    double      samplerows = 0;
+
+    Snapshot    snapshot = GetActiveSnapshot();
+    Oid         heapoid = IndexGetRelation(indexoid, false);
+    Relation    heap = relation_open(heapoid, AccessShareLock);
+    Relation    index= RelationIdGetRelation(indexoid);
+    IndexScanDesc index_scan = index_beginscan(heap, index, snapshot, 1, 0);
+
+    // int strategy=BTGreaterEqualStrategyNumber;
+    // Oid opfam=get_opfamily_member(opfamily, stats->attrtypid, stats->attrtypid, BTGreaterEqualStrategyNumber);
+
+    /* For sampling: read the first TARGROWS TIDs from each value in the MCV list */
+    // XXX: consider only the first handful of MCVs ?
+    // XXX: poor correlation in index can prolly happen without MCVs, eg. normally-distributed float values without repeated keys */
+    // XXX: .. should we just read the first targrows TIDs returned by the index or is there a better way ??
+
+    for (batch=0; batch<num_mcvs || (!num_mcvs && !batch); ) { // XXX: protect against empty index ?
+        ScanKeyData scankeys;
+        double      corr_xysum=0;
+        ItemPointer tid;
+
+        if (num_mcvs) {
+            // XXX: GreaterEqual ?
+            ScanKeyInit(&scankeys, stats->attr->attnum, BTEqualStrategyNumber, get_opcode( ((StdAnalyzeData *)stats->extra_data)->eqopr), mcv_values[batch]);
+        } else {
+            /* No MCVs: single iteration over first targrows tuples returned by index */
+            // XXX SK_SEARCHNOTNULL
+            ScanKeyEntryInitialize(&scankeys, SK_ISNULL|SK_SEARCHNOTNULL, stats->attr->attnum, InvalidStrategy, InvalidOid, InvalidOid, InvalidOid, (Datum)0);
+        }
+
+        index_rescan(index_scan, &scankeys, 1, NULL, 0);
+        for ( ; numrows<targrows && NULL!=(tid=index_getnext_tid(index_scan, ForwardScanDirection)); samplerows++) {
+            if (!ItemPointerIsValid(tid)) {
+                // Shouldn't happen ?
+                deadrows++;
+                continue;
+            }
+
+            // elog(WARNING, "%s %s %d TID: %d,%d", __FILE__, __FUNCTION__, __LINE__, ItemPointerGetBlockNumber(tid), ItemPointerGetOffsetNumber(tid) );
+            rows[numrows]->t_self = *tid;
+            rows[numrows]->t_len=numrows; // abusing this field;
+            numrows++;
+            liverows++;
+        }
+
+        // avoid NaN if many dead tuples? if (!numrows) continue;
+
+        /* Retrieved consecutive TIDs, now compute their (fine-level) correlation */
+        qsort((void *) rows, numrows, sizeof(*rows), compare_rows);
+        for (int j=0; j<numrows; ++j)
+            corr_xysum+=j*rows[j]->t_len;
+
+        corr_samples++;
+        corr_sum+=correlation(numrows, corr_xysum);
+
+        numrows=0;
+        ++batch;
+        if (tid==NULL)
+            break;
+        /* Ran out of index in fewer than targrows */
+    }
+
+    ereport(LOG,
+            (errmsg("\"%s(%s)\": scanned %ld batches with total %.0f TIDs, "
+                    "containing %.0f live and %.0f dead TIDs; ",
+                    RelationGetRelationName(index),
+                    NameStr(stats->attr->attname),
+                    batch, samplerows,
+                    liverows, deadrows)));
+
+    index_endscan(index_scan);
+    relation_close(index, NoLock);
+    relation_close(heap, NoLock);
+    return corr_sum/corr_samples;
+}
+
+#define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
+#define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
+
+/*
+ * Do a Vitter block walk along entire index, in physical order, to determine
+ * fraction of LEAF nodes which have "next" pointers with a higher block number
+ * than themselves.  For a highly-correlated table, this will be ~1; for a
+ * poorly correlated table(???), or one with many repeated keys, this will be
+ * between 0 and ~0.5, and index scan across those duplicate keys will have a
+ * high random component.
+ * Logic bits stolen from pgstatindex.
+ */
+
+#include "access/nbtree.h"
+#include "catalog/pg_am.h"
+
+static double
+idx_corr_fudge(Relation index, int targrows)
+{
+    BlockNumber totalblocks;
+    BlockSamplerData bs;
+
+    double      leaf_pages = 0;
+    double      fragments = 0;
+
+    // TransactionId OldestXmin;
+    // OldestXmin = GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM);
+    // Snapshot snapshot = RegisterSnapshot(GetTransactionSnapshot());
+    // Snapshot snapshot = GetLatestSnapshot();
+    // Snapshot snapshot = GetActiveSnapshot();
+    // Snapshot snapshot = GetOldestSnapshot();
+    if (!IS_BTREE(index)) {
+        relation_close(index, AccessShareLock);
+        return 1;
+    }
+
+    totalblocks = RelationGetNumberOfBlocks(index);
+    BlockSampler_Init(&bs, totalblocks-1, targrows, random());
+
+    while (BlockSampler_HasMore(&bs))
+    {
+        BlockNumber blkno = BlockSampler_Next(&bs);
+        Buffer      buffer;
+        Page        page;
+        BTPageOpaque opaque;
+
+        vacuum_delay_point();
+        CHECK_FOR_INTERRUPTS();
+
+        buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno, RBM_NORMAL, vac_strategy); // bstrategy
+        LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+        page = BufferGetPage(buffer);
+        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+        if (P_ISDELETED(opaque) || P_IGNORE(opaque))
+            continue;
+        else if (P_ISLEAF(opaque))
+        {
+            leaf_pages++;
+            if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno)
+                fragments++;
+        }
+
+        UnlockReleaseBuffer(buffer);
+
+    }
+
+    relation_close(index, AccessShareLock);
+    return fragments/leaf_pages;
+}
+
 /*
  * update_attstats() -- update attribute statistics for one relation
  *
@@ -1658,6 +1997,7 @@ ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull)
     /* exprvals and exprnulls are already offset for proper column */
     i = rownum * stats->rowstride;
     *isNull = stats->exprnulls[i];
+
     return stats->exprvals[i];
 }
 
@@ -2258,7 +2598,7 @@ compute_scalar_stats(VacAttrStatsP stats,
                   stats->attrtype->typlen == -1);
     bool        is_varwidth = (!stats->attrtype->typbyval &&
                                stats->attrtype->typlen < 0);
-    double      corr_xysum;
+    double      corr_xysum=0;
     SortSupportData ssup;
     ScalarItem *values;
     int         values_cnt = 0;
@@ -2352,10 +2692,17 @@ compute_scalar_stats(VacAttrStatsP stats,
                     dups_cnt;
         int         slot_idx = 0;
         CompareScalarsContext cxt;
+        float4     *corrs;
 
         /* Sort the collected values */
         cxt.ssup = &ssup;
         cxt.tupnoLink = tupnoLink;
+
+        /*
+         * tuples were previously sorted by TID; now, sort by heap
+         * value, as needed for stats computations, and for order
+         * relative to the original sort for the correlation computation.
+         */
         qsort_arg((void *) values, values_cnt, sizeof(ScalarItem),
                   compare_scalars, (void *) &cxt);
 
@@ -2378,7 +2725,6 @@ compute_scalar_stats(VacAttrStatsP stats,
          * is the last item of its group of duplicates (since the group will
          * be ordered by tupno).
          */
-        corr_xysum = 0;
         ndistinct = 0;
         nmultiple = 0;
         dups_cnt = 0;
@@ -2566,11 +2912,11 @@ compute_scalar_stats(VacAttrStatsP stats,
             }
         }
 
+        Datum      *mcv_values=NULL;
         /* Generate MCV slot entry */
         if (num_mcv > 0)
        {
             MemoryContext old_context;
-            Datum      *mcv_values;
             float4     *mcv_freqs;
 
             /* Must copy the target values into anl_context */
@@ -2713,37 +3059,42 @@ compute_scalar_stats(VacAttrStatsP stats,
             slot_idx++;
         }
 
-        /* Generate a correlation entry if there are multiple values */
-        if (values_cnt > 1)
-        {
+        /* Will generate a correlation entry if there are multiple values */
+        if (values_cnt>1) {
             MemoryContext old_context;
-            float4     *corrs;
-            double      corr_xsum,
-                        corr_x2sum;
-
-            /* Must copy the target values into anl_context */
             old_context = MemoryContextSwitchTo(stats->anl_context);
+            /* Must copy the target values into anl_context */
             corrs = (float4 *) palloc(sizeof(float4));
             MemoryContextSwitchTo(old_context);
+        }
 
-            /*----------
-             * Since we know the x and y value sets are both
-             *      0, 1, ..., values_cnt-1
-             * we have sum(x) = sum(y) =
-             *      (values_cnt-1)*values_cnt / 2
-             * and sum(x^2) = sum(y^2) =
-             *      (values_cnt-1)*values_cnt*(2*values_cnt-1) / 6.
-             *----------
-             */
-            corr_xsum = ((double) (values_cnt - 1)) *
-                ((double) values_cnt) / 2.0;
-            corr_x2sum = ((double) (values_cnt - 1)) *
-                ((double) values_cnt) * (double) (2 * values_cnt - 1) / 6.0;
-
-            /* And the correlation coefficient reduces to */
-            corrs[0] = (values_cnt * corr_xysum - corr_xsum * corr_xsum) /
-                (values_cnt * corr_x2sum - corr_xsum * corr_xsum);
+        if (values_cnt>1 && fetchfunc==ind_fetch_func) { /* && num_mcv>0) { */
+            /* Compute alternate/fine-grained correlation if there is an MCV list with repeated values... */
+            // elog(WARNING, "%s %s %d value0:%lu", __FILE__, __FUNCTION__, __LINE__, num_mcv?mcv_values[0] : 1 );
+            corrs[0] = idx_correlation (stats->attr->attrelid, stats, samplerows, num_mcv, mcv_values);
+            elog(WARNING, "%s %s %d cors %lf", __FILE__, __FUNCTION__, __LINE__, corrs[0]);
+            stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION;
+            stats->staop[slot_idx] = mystats->ltopr;
+            stats->stanumbers[slot_idx] = corrs;
+            stats->numnumbers[slot_idx] = 1;
+            slot_idx++;
+        }
+        else if (values_cnt>1) // XXX && fetchfunc==ind_fetch_func? // correlation is now strictly a per-index attribute and not a per-column one
+        {
+            double      fudge=1;
+
+            if (fetchfunc==ind_fetch_func) {
+                /* Compute correlation fudge factor for indices with
+                 * high number of duplicate values for an index column,
+                 * causing index scan to be highly random, due to btree
+                 * random insertion logic being used to avoid O(N^2)
+                 * insertion behavior in that case.
+                 */
+                fudge = 1-idx_corr_fudge(RelationIdGetRelation(stats->attr->attrelid), samplerows);
+            }
+            corrs[0] = correlation(values_cnt, corr_xysum);
+            // XXX: corrs[0] *= fudge;
             stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION;
             stats->staop[slot_idx] = mystats->ltopr;
             stats->stanumbers[slot_idx] = corrs;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index eb653cf..4529725 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -693,8 +693,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
     /*
      * Now interpolate based on estimated index order correlation to get total
      * disk I/O cost for main table accesses.
+     * Note the sign of this expression: as csquared approaches 0, the
+     * estimate approaches the max_IO_cost estimate;
      */
     csquared = indexCorrelation * indexCorrelation;
+    elog(WARNING, "HERE 1222: csquared=%f minIO/R-P-C=%f maxIO/R-P-C=%f %s %s %d",
+            csquared, min_IO_cost/spc_random_page_cost, max_IO_cost/spc_random_page_cost,
+            __FILE__, __FUNCTION__, __LINE__);
 
     run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);
 
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index e103f5e..f097580 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6874,9 +6874,34 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
      */
     MemSet(&vardata, 0, sizeof(vardata));
 
-    if (index->indexkeys[0] != 0)
+    /* Expression --- maybe there are stats for the index itself */
+    relid = index->indexoid;
+    colnum = 1;
+    if (get_index_stats_hook &&
+        (*get_index_stats_hook) (root, relid, colnum, &vardata))
+    {
+        /*
+         * The hook took control of acquiring a stats tuple.  If it did
+         * supply a tuple, it'd better have supplied a freefunc.
+         */
+        elog(WARNING, "HERE 1223: indexCorrelation %s %s %d", __FILE__, __FUNCTION__, __LINE__);
+
+        if (HeapTupleIsValid(vardata.statsTuple) &&
+            !vardata.freefunc)
+            elog(ERROR, "no function provided to release variable stats with");
+    }
+    else if ( NULL != (vardata.statsTuple = SearchSysCache3(STATRELATTINH,
+                                         ObjectIdGetDatum(relid),
+                                         Int16GetDatum(colnum),
+                                         BoolGetDatum(false))))
+    {
+        elog(WARNING, "HERE 1224: indexCorrelation %s %s %d", __FILE__, __FUNCTION__, __LINE__);
+        vardata.freefunc = ReleaseSysCache;
+    }
+    else if (index->indexkeys[0] != 0)
     {
         /* Simple variable --- look to stats for the underlying table */
+        elog(WARNING, "HERE 1225: indexCorrelation %s %s %d", __FILE__, __FUNCTION__, __LINE__);
         RangeTblEntry *rte = planner_rt_fetch(index->rel->relid, root);
 
         Assert(rte->rtekind == RTE_RELATION);
@@ -6904,32 +6929,6 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
             vardata.freefunc = ReleaseSysCache;
         }
     }
-    else
-    {
-        /* Expression --- maybe there are stats for the index itself */
-        relid = index->indexoid;
-        colnum = 1;
-
-        if (get_index_stats_hook &&
-            (*get_index_stats_hook) (root, relid, colnum, &vardata))
-        {
-            /*
-             * The hook took control of acquiring a stats tuple.  If it did
-             * supply a tuple, it'd better have supplied a freefunc.
-             */
-            if (HeapTupleIsValid(vardata.statsTuple) &&
-                !vardata.freefunc)
-                elog(ERROR, "no function provided to release variable stats with");
-        }
-        else
-        {
-            vardata.statsTuple = SearchSysCache3(STATRELATTINH,
-                                                 ObjectIdGetDatum(relid),
-                                                 Int16GetDatum(colnum),
-                                                 BoolGetDatum(false));
-            vardata.freefunc = ReleaseSysCache;
-        }
-    }
 
     if (HeapTupleIsValid(vardata.statsTuple))
     {