Months ago I reported an issue with very slow index scans on tables with high "correlation" of their indexed column, due to (we concluded at the time) duplicate/repeated values of that column causing many lseek()s.  References:

https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#20160524173914.GA11880@xxxxxxxxxxxxx
https://www.postgresql.org/message-id/flat/520D6610.8040907%40emulex.com#520D6610.8040907@xxxxxxxxxx
https://www.postgresql.org/message-id/flat/n6cmpug13b9rk1srebjvhphg0lm8dou1kn%404ax.com#n6cmpug13b9rk1srebjvhphg0lm8dou1kn@xxxxxxx

My interim workarounds were a reindex job and daily-granularity partitions (despite the excessive number of child tables) to encourage seq scans for daily analysis jobs rather than index scans.  I've now cobbled together a patch to throw around and see if we can improve on that.

I tried several changes, hoping to discourage index scans.  The logic that seems to most accurately reflect costs has two changes.

First: Postgres' existing behavior stores table correlation (heap value vs. heap position) but doesn't look at index correlation, so it can't distinguish between a just-built index and a highly fragmented one, or one with highly non-sequential TIDs.  My patch causes ANALYZE to do a TID scan (sampled across the MCVs) to determine the correlation of heap TID vs. index tuple logical location (as opposed to the table correlation, computed as heap TID vs. heap value).

Second: the patch averages separate correlation values of small lists of 300 consecutive TIDs, rather than the coarse-granularity/aggregate correlation of a small sample of pages across the entire index.  Postgres' existing sampling is designed to give an even sample across all rows.  An issue with that coarse-granularity correlation is that the index can have broad correlation (to physical heap location) but with many small-scale deviations, which don't show up when sampling a relatively small fraction of a large table; and/or the effect of the deviations is insignificant/noise and the correlation is still computed near 1.

I believe the "large scale" correlation that postgres computes from its block sample fails to represent small-scale uncorrelated reads, which contribute a large number of random reads but are not included in the planner's cost.  Not only are the index reads highly random (which the planner already assumes), but the CTIDs referenced within a given btree leaf page are also substantially non-sequential.

It seems to affect INSERTs which, over a short interval of time, have a small-to-moderate range of column values, but which still have a strong overall trend in that column WRT time (and heap location).  Think of inserting a "normal" distribution of timestamps centered around now().  My original report shows an lseek() for nearly every read on the *heap*.

The original problem was on a table of telecommunications CDRs, indexed by "record opening time" (start time) and with a high correlation value in pg_stats.  We import data records from a file, which is probably more or less in order of "end time".  That still displays broad correlation on "start time", but individual TIDs are nowhere near sequential.  Two phone calls which end in the same 1-second interval are not unlikely to have started many minutes apart... but two calls which end within the same second are very likely to have started within an hour of each other, since the typical duration is <1h.
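To make "correlation of heap TID vs. index tuple logical location" concrete, here is a rough manual approximation (a hedged sketch, not what the patch computes internally: it looks only at the heap block number, and uses value correlation rather than the rank correlation ANALYZE uses), run against the toy table t defined further below:

| -- position in index order vs. heap block number of each tuple
| SELECT corr(logical_pos, heap_block) FROM (
|     SELECT row_number() OVER (ORDER BY i) AS logical_pos,
|            (ctid::text::point)[0]         AS heap_block
|     FROM t
| ) s;

For a freshly loaded, ordered table this comes out near 1; for the problem cases generated below it should come out noticeably lower than the table-level correlation reported in pg_stats.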
Back to the CDR table: insert volume is high, and there are substantial repeated keys, so the portion of an index scan returning CTIDs for some certain key value (a timestamp with second resolution in our case) ends up reading heap tuples for a non-negligible fraction of the table: maybe only 1%, but that's still ~300 seeks across a table which is 10s of GB, and it takes long enough that cache is inadequate to substantially mitigate the cost.

It's not clear to me that there's a good way to evenly *sample* a fraction of the index blocks in a manner which is agnostic to different AMs, and scanning the entirety of all indices on relations during (auto)analyze may be too expensive.  So I implemented an index scan over the MCV list.  I'm guessing this might cause the correlation to be under-estimated, and bitmap scans to be preferred somewhat more than justified, due to the btree insertion logic for repeated keys (used to avoid O(n^2) insertion behavior).  The MCV list isn't a perfect sampling basis either, since poor index correlation can also happen without repeated keys, eg. with normally distributed floating point values (or timestamps with fractional seconds).

I ran pageinspect on a recent child of the table that triggered the original report:

ts=# SELECT itemoffset, ctid FROM bt_page_items('cdrs_huawei_pgwrecord_2017_07_07_recordopeningtime_idx',6) LIMIT 22 OFFSET 1;
 itemoffset |  ctid
------------+--------
          2 | (81,4)
          3 | (5,6)
          4 | (6,5)
          5 | (9,1)
          6 | (10,1)
          7 | (14,1)
          8 | (21,8)
          9 | (25,1)
         10 | (30,5)
         11 | (41,1)
         12 | (48,7)
         13 | (49,3)
         14 | (50,2)
         15 | (51,1)
         16 | (54,1)
         17 | (57,7)
         18 | (58,6)
         19 | (59,6)
         20 | (63,5)
         21 | (71,6)
         22 | (72,5)
         23 | (78,7)
(22 rows)

=> 22 adjacent index tuples referencing 22 different heap pages, only 6 of which are sequential => 16 lseek()s, aka random page cost.

To generate data demonstrating this problem you can do things like:

| CREATE TABLE t(i int, j int);
| CREATE INDEX ON t(i);
\! time for a in `seq 99`; do psql -qc 'INSERT INTO t SELECT * FROM generate_series(1,99)'; done

-- or:
| INSERT INTO t SELECT (0.001*a+9*(random()-0.5))::int FROM generate_series(1,99999) a;

-- or this one, perhaps closest to our problem case:
| INSERT INTO t SELECT a FROM generate_series(1,999) a, generate_series(1,999) b ORDER BY a+b/9.9;

Note, I was able to create a case using floats without repeated keys:
| INSERT INTO w SELECT i/99999.0+pow(2,(-random())) FROM generate_series(1,999999) i ORDER BY i

| ANALYZE t;

-- note: the correlation is even higher for larger tables with the same number of
-- repeated keys, which is bad since the cost should go up linearly with the
-- size and the associated number of lseek()s.  That's one component of the problem,
-- and I think a necessary component of any solution.

postgres=# SELECT tablename, attname, correlation, array_length(most_common_vals,1) FROM pg_stats WHERE tablename LIKE 't%';
 tablename | attname | correlation | array_length
-----------+---------+-------------+--------------
 t         | i       |    0.995212 |           87
 t         | j       |             |
 t_i_idx   | i       |   -0.892874 |           87
 ^^^^^^^
... this last line is added by my patch.

That's a bad combination: the high table correlation means the planner thinks only a fraction of the heap will be read, and read sequentially, so it isn't discouraged from doing an index scan.  But the index TIDs are much less well correlated (0.89 vs 0.99).

Note: the negative correlation at the tuple level seems to be related to this comment:

src/backend/access/nbtree/nbtinsert.c- * Once we have chosen the page to put the key on, we'll insert it before
src/backend/access/nbtree/nbtinsert.c: * any existing equal keys because of the way _bt_binsrch() works.
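A quick (hedged) way to eyeball how scattered the heap pages are for a single key value in the toy table t above, i.e. roughly how many distinct heap pages an equality index scan would have to visit per key (this is just for illustration, not something the patch does):

| SELECT i, count(*) AS ntuples,
|        count(DISTINCT (ctid::text::point)[0]::int) AS heap_pages
| FROM t GROUP BY i ORDER BY heap_pages DESC LIMIT 10;

When heap_pages is close to ntuples for the popular keys, nearly every index tuple for that key implies a visit to a different heap page, which is what the strace output below shows at the syscall level.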
Note also:

|postgres=# SELECT leaf_fragmentation FROM pgstatindex('t_i_idx');
|leaf_fragmentation | 67.63

.. but keep in mind that leaf_fragmentation only checks leaf page order, and not CTID order within index pages.  The planner already assumes index pages are random reads.  Maybe it's a good indicator (?), but it doesn't lend itself to an accurate cost estimation.

For testing purposes, with:

| shared_buffers | 128kB
| public | t | table | pryzbyj | 35 MB |

| SET enable_bitmapscan=off;
| SET enable_seqscan=off;
| SELECT pg_backend_pid();
| SELECT sum(j) FROM t WHERE i<99;
| Time: 3974.205 ms

% sudo strace -p 530 2>/tmp/strace-pg
% sed -r '/^\(read|lseek/!d; s/ .*//' /tmp/strace-pg |sort |uniq -c |sort -nr |head
  39474 lseek(41,
    359 lseek(44,
      8 lseek(18,

=> 40k seeks on a 35MB table, that's 10 seeks per heap page!

open("base/12411/16634", O_RDWR) = 41
open("base/12411/16636", O_RDWR) = 44

postgres=# SELECT relfilenode, relname FROM pg_class WHERE relfilenode>16630;
16634 | t
16636 | t_i_idx

2017-07-07 17:45:54.075 CDT [6360] WARNING: HERE 1222: csquared=0.797225 minIO/R-P-C=109.750000 maxIO/R-P-C=4416.000000 costsize.c cost_index 702

With my patch, the index scan is estimated to take ~4400 seeks, rather than the actual 40k, probably because max_IO_cost assumes that a given heap page will be visited only once, whereas in the worst case each of its tuples requires a separate lseek().  I'm not suggesting changing that, since making max_IO 100x higher would probably change query plans more dramatically than desired.  But also note that, unpatched, with table correlation >0.99, postgres would've under-estimated the actual number of seeks not by a factor of ~10x but by ~400x (its estimate would be essentially min_IO_cost).

| postgres=# REINDEX INDEX t_i_idx;
| postgres=# ANALYZE t;
| postgres=# SELECT tablename, attname, correlation, array_length(most_common_vals,1) FROM pg_stats WHERE tablename LIKE 't%';
|  tablename | attname | correlation | array_length
| -----------+---------+-------------+--------------
|  t         | i       |    0.995235 |           67
|  t_i_idx   | i       |           1 |           67

=> Correctly distinguishes a freshly reindexed table.

% sudo strace -p 6428 2>/tmp/strace-pg8
% sed -r '/^\(read|lseek/!d; s/ .*//' /tmp/strace-pg8 |sort |uniq -c |sort -nr
     99 lseek(37,

2017-07-07 17:49:47.745 CDT [6428] WARNING: HERE 1222: csquared=1.000000 minIO/R-P-C=108.500000 maxIO/R-P-C=4416.000000 costsize.c cost_index 702

=> csquared=1, so the min_IO cost is used instead of something close to max_IO.  (The patch doesn't change the computation of those values at all; it just changes the correlation, whose squared value is used to interpolate between the correlated/min cost and the uncorrelated/max cost.)  The correlated estimate of 108 seeks vs. 99 actual is essentially what unpatched postgres would've computed by using the table correlation of 0.9952, implicitly assuming the index to be similarly correlated.

I hope that demonstrates the need to distinguish index correlation from table correlation.  I'm hoping for comments on the existing patch, specifically whether there's a better way to sample the index than "MCVs or partial index scan".  I've left in place some fragments of an earlier implementation involving btree page-level fragmentation (like pgstatindex).

Also: does it make sense to keep an MCV list/histogram for a non-expression index which duplicates a table column?  The stats lookup in selfuncs.c btcostestimate() would have to take the correlation from the index, and the rest of the stats from its table.  A bitmap layer adds overhead, which should be avoided if not needed.
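For reference, the interpolation those csquared numbers feed into is the existing logic in cost_index() (visible as context in the costsize.c hunk of the patch below); the patch only changes where indexCorrelation comes from, not this formula:

|     csquared = indexCorrelation * indexCorrelation;
|     run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);

So csquared near 1 yields the fully-correlated min_IO_cost, and csquared near 0 the fully-random max_IO_cost.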
That bitmap overhead shouldn't be a huge impact, though, and I think its relative effect is only high when returning a small number of rows.  I'm thinking of a few cases:

- unique or low-cardinality index scans, or scans without duplicate keys: these should have a low index_pages_fetched(), so max_IO_cost should not be very different from min_IO_cost, and the new index-based correlation shouldn't have an effect much different from the table correlation;
- unclustered/uncorrelated tables: tables whose heap has low correlation are already discouraged from index scans; this includes tables whose column is UPDATEd and not just INSERTed;
- tables with correlated heap AND index: csquared should still be ~0.99 and not change much;
- correlated table with uncorrelated index: this is the target case, with the intended behavior change.

My apologies: I think that's a bit of a brain dump.

Justin
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 2b63827..1acf118 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -98,6 +98,11 @@ static int compare_rows(const void *a, const void *b);
 static int acquire_inherited_sample_rows(Relation onerel, int elevel,
                               HeapTuple *rows, int targrows,
                               double *totalrows, double *totaldeadrows);
+static int acquire_sample_index(Relation onerel, Relation index, int elevel,
+                              HeapTuple *rows, int targrows,
+                              double *totalrows, double *totaldeadrows);
+static double correlation(int values_cnt, double corr_xysum);
+static double idx_correlation(Oid indexoid, VacAttrStatsP stats, int targrows, int num_mcvs, Datum *mcv_values);
 static void update_attstats(Oid relid, bool inh,
                 int natts, VacAttrStats **vacattrstats);
 static Datum std_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull);
@@ -206,6 +211,7 @@ analyze_rel(Oid relid, RangeVar *relation, int options,
      * used to do this in get_rel_oids() but seems safer to check after we've
      * locked the relation.
      */
+    // TODO: if (onerel->rd_rel->relkind == RELKIND_INDEX) => needs a separate sample_rows function?
     if (onerel->rd_rel->relkind == RELKIND_RELATION ||
         onerel->rd_rel->relkind == RELKIND_MATVIEW)
     {
@@ -433,37 +439,41 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
         {
             AnlIndexData *thisdata = &indexdata[ind];
             IndexInfo  *indexInfo;
+            ListCell   *indexpr_item;
 
             thisdata->indexInfo = indexInfo = BuildIndexInfo(Irel[ind]);
             thisdata->tupleFract = 1.0; /* fix later if partial */
-            if (indexInfo->ii_Expressions != NIL && va_cols == NIL)
+            if (va_cols != NIL)
             {
-                ListCell   *indexpr_item = list_head(indexInfo->ii_Expressions);
+                continue;
+            }
 
-                thisdata->vacattrstats = (VacAttrStats **)
+            indexpr_item = list_head(indexInfo->ii_Expressions);
+            thisdata->vacattrstats = (VacAttrStats **)
                     palloc(indexInfo->ii_NumIndexAttrs * sizeof(VacAttrStats *));
-                tcnt = 0;
-                for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
-                {
-                    int         keycol = indexInfo->ii_KeyAttrNumbers[i];
+            tcnt = 0;
+            for (i = 0; i < indexInfo->ii_NumIndexAttrs; i++)
+            {
+                int         keycol = indexInfo->ii_KeyAttrNumbers[i];
+                Node       *indexkey=NULL;
 
-                    if (keycol == 0)
-                    {
-                        /* Found an index expression */
-                        Node       *indexkey;
-
-                        if (indexpr_item == NULL)   /* shouldn't happen */
-                            elog(ERROR, "too few entries in indexprs list");
-                        indexkey = (Node *) lfirst(indexpr_item);
-                        indexpr_item = lnext(indexpr_item);
-                        thisdata->vacattrstats[tcnt] =
-                            examine_attribute(Irel[ind], i + 1, indexkey);
-                        if (thisdata->vacattrstats[tcnt] != NULL)
-                            tcnt++;
-                    }
+                if (keycol == 0)
+                {
+                    /* Found an index expression */
+                    if (indexpr_item == NULL)   /* shouldn't happen */
+                        elog(ERROR, "too few entries in indexprs list");
+                    indexkey = (Node *) lfirst(indexpr_item);
+                    indexpr_item = lnext(indexpr_item);
                 }
-                thisdata->attr_cnt = tcnt;
+
+                thisdata->vacattrstats[tcnt] =
+                    examine_attribute(Irel[ind], i + 1, indexkey);
+
+                if (thisdata->vacattrstats[tcnt] != NULL)
+                    tcnt++;
             }
+
+            thisdata->attr_cnt = tcnt;
         }
     }
@@ -548,11 +558,38 @@ do_analyze_rel(Relation onerel, int options, VacuumParams *params,
             MemoryContextResetAndDeleteChildren(col_context);
     }
 
+# if 0
+    for (ind = 0; ind < nindexes; ind++)
+    {
+        int         indnumrows;
+        double      indtotalrows,
+                    indtotaldeadrows;
+        HeapTuple  *indrows= palloc(targrows * sizeof(HeapTuple));
+        /*
+         * Attempted to do an index scan for each index, in order to get
+         * correlation stats of the HEAP; I was thinking
+         * (probably incorrectly) that the high random cost was
+         * due to index seeks, and that it was needed to
+         * traverse the index in logical rather than physical
+         * order to determine its correlation.  But the planner
+         * already assumes that index reads are random.
+         */
+
+        indnumrows = acquire_sample_index (onerel, Irel[ind], elevel,
+                indrows, targrows, &indtotalrows, &indtotaldeadrows);
+        compute_index_stats(Irel[ind], indnumrows,
+                indexdata+ind, 1, /* XXX: should update c_i_s to do one index only? */
+                indrows, indnumrows,
+                col_context);
+
+    }
+# else
     if (hasindex)
-        compute_index_stats(onerel, totalrows,
-                            indexdata, nindexes,
-                            rows, numrows,
-                            col_context);
+        compute_index_stats(onerel, numrows,
+                            indexdata, nindexes,
+                            rows, numrows,
+                            col_context);
+# endif
 
     MemoryContextSwitchTo(old_context);
     MemoryContextDelete(col_context);
@@ -721,10 +758,6 @@ compute_index_stats(Relation onerel, double totalrows,
                     rowno;
         double      totalindexrows;
 
-        /* Ignore index if no columns to analyze and not partial */
-        if (attr_cnt == 0 && indexInfo->ii_Predicate == NIL)
-            continue;
-
         /*
          * Need an EState for evaluation of index expressions and
         * partial-index predicates.  Create it in the per-index context to be
@@ -767,6 +800,7 @@ compute_index_stats(Relation onerel, double totalrows,
                 if (!ExecQual(predicate, econtext))
                     continue;
             }
+
             numindexrows++;
 
             if (attr_cnt > 0)
@@ -790,6 +824,7 @@ compute_index_stats(Relation onerel, double totalrows,
                     VacAttrStats *stats = thisdata->vacattrstats[i];
                     int         attnum = stats->attr->attnum;
 
+                    stats->rows=rows;
                     if (isnull[attnum - 1])
                     {
                         exprvals[tcnt] = (Datum) 0;
@@ -815,7 +850,7 @@ compute_index_stats(Relation onerel, double totalrows,
         totalindexrows = ceil(thisdata->tupleFract * totalrows);
 
         /*
-         * Now we can compute the statistics for the expression columns.
+         * Now we can compute the statistics.
          */
         if (numindexrows > 0)
         {
@@ -830,11 +865,16 @@ compute_index_stats(Relation onerel, double totalrows,
                 stats->exprvals = exprvals + i;
                 stats->exprnulls = exprnulls + i;
                 stats->rowstride = attr_cnt;
+// Low stats target for non-expr indices to avoid having a duplicate MCV and/or hist ?
+// avoid keeping a (duplicate) MCV list or
+// histogram for non-expression columns?
+// if (indexInfo->ii_KeyAttrNumbers[i]!=0) // && stats->attr->attstattarget!=0)
+//     stats->attr->attstattarget=0;
+
                 (*stats->compute_stats) (stats,
                                          ind_fetch_func,
                                          numindexrows,
                                          totalindexrows);
-
                 /*
                  * If the n_distinct option is specified, it overrides the
                  * above computation.  For indices, we always use just
@@ -896,7 +936,7 @@ examine_attribute(Relation onerel, int attnum, Node *index_expr)
     /*
      * When analyzing an expression index, believe the expression tree's type
      * not the column datatype --- the latter might be the opckeytype storage
-     * type of the opclass, which is not interesting for our purposes.  (Note:
+     * type of the opclass, which is not interesting for our purposes.  XXX (Note:
      * if we did anything with non-expression index columns, we'd need to
      * figure out where to get the correct type info from, but for now that's
      * not a problem.)  It's not clear whether anyone will care about the
@@ -1478,6 +1518,305 @@ acquire_inherited_sample_rows(Relation onerel, int elevel,
 }
 
 
+// this doesn't work, since broad/coarse correlation doesn't well represent
+// non-sequential heap page access which may happen at a small scale..
+static int
+acquire_sample_index(Relation onerel, Relation index, int elevel,
+                     HeapTuple *rows, int targrows,
+                     double *totalrows, double *totaldeadrows)
+{
+    int         numrows = 0;    /* # rows now in reservoir */
+    double      samplerows = 0; /* total # rows collected */
+    double      rowstoskip = -1;    /* -1 means not set yet */
+
+    HeapTuple   targtuple;
+    // Snapshot snapshot = RegisterSnapshot(GetTransactionSnapshot());
+    // Snapshot snapshot = GetLatestSnapshot();
+    Snapshot    snapshot = GetActiveSnapshot();
+    // Snapshot snapshot = GetOldestSnapshot();
+
+    /* Prepare for sampling rows */
+    ReservoirStateData rstate;
+    IndexScanDesc index_scan;
+    reservoir_init_selection_state(&rstate, targrows);
+
+    // ScanKeyData scankeys[1];
+    // ScanKeyEntryInitialize(scankeys, SK_ISNULL|SK_SEARCHNOTNULL, 1, InvalidStrategy, InvalidOid, InvalidOid, InvalidOid, (Datum) 0);
+    // XXX search not null ??
+    // SK_SEARCHNOTNULL
+    index_scan= index_beginscan(onerel, index, snapshot, 0, 0);
+    // index_rescan(index_scan, scankeys, 0, NULL, 0);
+    index_rescan(index_scan, NULL, 0, NULL, 0);
+
+    // while ((targtuple = index_getnext(index_scan, ForwardScanDirection)) != NULL)
+    // XXX: it's pretty unfortunate to do a full index scan for each table
+    // being analyzed ... esp. since a design goal is to avoid very
+    // inefficient random read patterns during index scans... consider
+    // alternatives like index scanning for each MCV ?  However I suspect
+    // that would tend to strongly underestimate correlation in some cases
+    // and too strongly discourage index scans..
+    // Probably go back to a block sample of leaf pages, counting the
+    // linearity of its referenced heap blocks within an index page, but
+    // not from one index page to another???
+
+    // for (unsigned long int c=0; (targtuple = index_getnext(index_scan, BackwardScanDirection)) != NULL ; ++c) // Is there a more useful value to use here ???
+    for (unsigned long int c=0; (targtuple = index_getnext(index_scan, ForwardScanDirection)) != NULL ; ++c) // Is there a more useful value to use here ???
+    // ItemPointer tid;
+    // while ((tid = index_getnext_tid(index_scan, ForwardScanDirection)) != NULL)
+    {
+        HeapTuple   h = heap_copytuple(targtuple);
+
+        // ItemPointerSet(&(h->t_self), c>>16, c&0xffff );
+        // index_fetch_heap(index_scan);
+
+        ItemPointerSet(&(h->t_self), c>>16, c&0xffff ); // no good, the correlation of heap value WRT index location is 100% ...
+        // h->t_self.ip_blkid.bi_lo=c&0xffff; // >>8; // ItemPointerSet(&(h->t_self), c>>16, c&0xffff );
+
+        /* Vitter's algorithm, see above */
+        // ItemPointerSetBlockNumber(&h->t_self, c);
+        // ItemPointerSetOffsetNumber(&h->t_self, c);
+        // h->t_self.ip_blkid.bi_hi=c>>16;
+        // h->t_self.ip_blkid.bi_lo=c&OffsetNumberMask;
+        // h->t_self.ip_blkid.bi_hi+=c/30;
+
+        if (numrows < targrows) {
+            rows[numrows++] = h;
+            samplerows += 1;
+            continue;
+        }
+
+        if (rowstoskip < 0)
+            rowstoskip = reservoir_get_next_S(&rstate, samplerows, targrows);
+
+        if (rowstoskip <= 0)
+        {
+            int         k = (int) (targrows * sampler_random_fract(rstate.randstate));
+
+            Assert(k >= 0 && k < targrows);
+            heap_freetuple(rows[k]);
+            rows[k] = h;
+        }
+
+        rowstoskip -= 1;
+        samplerows += 1;
+    }
+
+    index_endscan(index_scan);
+
+    if (numrows == targrows)
+        qsort((void *) rows, numrows, sizeof(HeapTuple), compare_rows);
+
+    *totalrows = samplerows;
+    // totalblocks = RelationGetNumberOfBlocks(onerel);
+    // *totalrows = vac_estimate_reltuples(onerel, true, totalblocks, bs.m, liverows);
+    // if (bs.m > 0)
+    //     *totaldeadrows = floor((deadrows / bs.m) * totalblocks + 0.5);
+    // else
+    //     *totaldeadrows = 0.0;
+
+    ereport(elevel,
+            (errmsg("\"%s\": scanned %d of %u pages, "
+                    "containing %.0f live rows and %.0f dead rows; "
+                    "%d rows in sample, %.0f estimated total rows",
+                    RelationGetRelationName(index),
+                    -1, -1,
+                    -1.0, -1.0,
+                    numrows, *totalrows)));
+
+
+    return numrows ;
+}
+
+/*
+ * Return correlation coefficient given sum(x*y), where x is a list of
+ * positions ordered by one thing, and y is a list of positions ordered by
+ * another thing (eg. heap value vs. heap location).
+ */
+static double
+correlation(int values_cnt, double corr_xysum)
+{
+    /*----------
+     * Since we know the x and y value sets are both
+     *      0, 1, ..., values_cnt-1
+     * we have sum(x) = sum(y) =
+     *      (values_cnt-1)*values_cnt / 2
+     * and sum(x^2) = sum(y^2) =
+     *      (values_cnt-1)*values_cnt*(2*values_cnt-1) / 6.
+     *----------
+     * ... and the correlation coefficient reduces to */
+    double      corr_xsum = ((double) (values_cnt - 1)) *
+        ((double) values_cnt) / 2.0;
+    double      corr_x2sum = ((double) (values_cnt - 1)) *
+        ((double) values_cnt) * (double) (2 * values_cnt - 1) / 6.0;
+
+    return (values_cnt * fabs(corr_xysum) - corr_xsum * corr_xsum) /
+        (values_cnt * corr_x2sum - corr_xsum * corr_xsum);
+}
+
+/*
+ * Compute avg of separately-computed correlation values, rather than
+ * correlation of single sample across entire table, which overestimates
+ * correlation for large tables.  targrows is the number of pages to sample in
+ * each batch.
+ */
+static double
+idx_correlation (Oid indexoid, VacAttrStatsP stats, int targrows, int num_mcvs, Datum *mcv_values)
+{
+    HeapTuple  *rows= stats->rows; // reusing this allocation
+    double      corr_sum=0;
+    int         corr_samples=0;
+    int         numrows = 0;
+    double      deadrows = 0;
+    double      liverows = 0;
+    long int    batch;
+    double      samplerows = 0;
+
+    Snapshot    snapshot = GetActiveSnapshot();
+    Oid         heapoid = IndexGetRelation(indexoid, false);
+    Relation    heap = relation_open(heapoid, AccessShareLock);
+    Relation    index= RelationIdGetRelation(indexoid);
+    IndexScanDesc index_scan = index_beginscan(heap, index, snapshot, 1, 0);
+
+    // int strategy=BTGreaterEqualStrategyNumber;
+    // Oid opfam=get_opfamily_member(opfamily, stats->attrtypid, stats->attrtypid, BTGreaterEqualStrategyNumber);
+
+    /* For sampling: read the first TARGROWS TIDs from each value in the MCV list */
+    // XXX: consider only the first handful of MCVs ?
+    // XXX: poor correlation in index can prolly happen without MCVs, eg. normally-distributed float values without repeated keys */
+    // XXX: .. should we just read the first targrows TIDs returned by the index or is there a better way ??
+
+    for (batch=0; batch<num_mcvs || (!num_mcvs && !batch); ) { // XXX: protect against empty index ?
+        ScanKeyData scankeys;
+        double      corr_xysum=0;
+        ItemPointer tid;
+
+        if (num_mcvs) {
+            // XXX: GreaterEqual ?
+            ScanKeyInit(&scankeys, stats->attr->attnum, BTEqualStrategyNumber, get_opcode( ((StdAnalyzeData *)stats->extra_data)->eqopr), mcv_values[batch]);
+        } else {
+            /* No MCVs: single iteration over first targrows tuples returned by index */
+            // XXX SK_SEARCHNOTNULL
+            ScanKeyEntryInitialize(&scankeys, SK_ISNULL|SK_SEARCHNOTNULL, stats->attr->attnum, InvalidStrategy, InvalidOid, InvalidOid, InvalidOid, (Datum)0);
+        }
+
+        index_rescan(index_scan, &scankeys, 1, NULL, 0);
+        for ( ; numrows<targrows && NULL!=(tid=index_getnext_tid(index_scan, ForwardScanDirection)); samplerows++) {
+            if (!ItemPointerIsValid(tid)) {
+                // Shouldn't happen ?
+                deadrows++;
+                continue;
+            }
+
+            // elog(WARNING, "%s %s %d TID: %d,%d", __FILE__, __FUNCTION__, __LINE__, ItemPointerGetBlockNumber(tid), ItemPointerGetOffsetNumber(tid) );
+            rows[numrows]->t_self = *tid;
+            rows[numrows]->t_len=numrows; // abusing this field;
+            numrows++;
+            liverows++;
+        }
+
+        // avoid NaN if many dead tuples? if (!numrows) continue;
+
+        /* Retrieved consecutive TIDs, now compute their (fine-level) correlation */
+        qsort((void *) rows, numrows, sizeof(*rows), compare_rows);
+        for (int j=0; j<numrows; ++j)
+            corr_xysum+=j*rows[j]->t_len;
+
+        corr_samples++;
+        corr_sum+=correlation(numrows, corr_xysum);
+
+        numrows=0;
+        ++batch;
+        if (tid==NULL)
+            break;
+        /* Ran out of index in fewer than targrows */
+    }
+
+    ereport(LOG,
+            (errmsg("\"%s(%s)\": scanned %ld batches with total %.0f TIDs, "
+                    "containing %.0f live and %.0f dead TIDs; ",
+                    RelationGetRelationName(index),
+                    NameStr(stats->attr->attname),
+                    batch, samplerows,
+                    liverows, deadrows)));
+
+    index_endscan(index_scan);
+    relation_close(index, NoLock);
+    relation_close(heap, NoLock);
+    return corr_sum/corr_samples;
+}
+
+#define IS_INDEX(r) ((r)->rd_rel->relkind == RELKIND_INDEX)
+#define IS_BTREE(r) ((r)->rd_rel->relam == BTREE_AM_OID)
+
+/*
+ * Do a Vitter block walk along entire index, in physical order, to determine
+ * fraction of LEAF nodes which have "next" pointers with a higher block number
+ * than themselves.  For a highly-correlated table, this will be ~1; for a
+ * poorly correlated table(???), or one with many repeated keys, this will be
+ * between 0 and ~0.5, and index scan across those duplicate keys will have a
+ * high random component.
+ * Logic bits stolen from pgstatindex.
+ */
+
+#include "access/nbtree.h"
+#include "catalog/pg_am.h"
+
+static double
+idx_corr_fudge(Relation index, int targrows)
+{
+    BlockNumber totalblocks;
+    BlockSamplerData bs;
+
+    double      leaf_pages = 0;
+    double      fragments = 0;
+
+    // TransactionId OldestXmin;
+    // OldestXmin = GetOldestXmin(onerel, PROCARRAY_FLAGS_VACUUM);
+    // Snapshot snapshot = RegisterSnapshot(GetTransactionSnapshot());
+    // Snapshot snapshot = GetLatestSnapshot();
+    // Snapshot snapshot = GetActiveSnapshot();
+    // Snapshot snapshot = GetOldestSnapshot();
+    if (!IS_BTREE(index)) {
+        relation_close(index, AccessShareLock);
+        return 1;
+    }
+
+    totalblocks = RelationGetNumberOfBlocks(index);
+    BlockSampler_Init(&bs, totalblocks-1, targrows, random());
+
+    while (BlockSampler_HasMore(&bs))
+    {
+        BlockNumber blkno = BlockSampler_Next(&bs);
+        Buffer      buffer;
+        Page        page;
+        BTPageOpaque opaque;
+
+        vacuum_delay_point();
+        CHECK_FOR_INTERRUPTS();
+
+        buffer = ReadBufferExtended(index, MAIN_FORKNUM, blkno, RBM_NORMAL, vac_strategy); // bstrategy
+        LockBuffer(buffer, BUFFER_LOCK_SHARE);
+
+        page = BufferGetPage(buffer);
+        opaque = (BTPageOpaque) PageGetSpecialPointer(page);
+
+        if (P_ISDELETED(opaque) || P_IGNORE(opaque))
+            continue;
+        else if (P_ISLEAF(opaque))
+        {
+            leaf_pages++;
+            if (opaque->btpo_next != P_NONE && opaque->btpo_next < blkno)
+                fragments++;
+        }
+
+        UnlockReleaseBuffer(buffer);
+
+    }
+
+    relation_close(index, AccessShareLock);
+    return fragments/leaf_pages;
+}
+
 /*
  * update_attstats() -- update attribute statistics for one relation
  *
@@ -1658,6 +1997,7 @@ ind_fetch_func(VacAttrStatsP stats, int rownum, bool *isNull)
     /* exprvals and exprnulls are already offset for proper column */
     i = rownum * stats->rowstride;
     *isNull = stats->exprnulls[i];
+
     return stats->exprvals[i];
 }
 
@@ -2258,7 +2598,7 @@ compute_scalar_stats(VacAttrStatsP stats,
                   stats->attrtype->typlen == -1);
     bool        is_varwidth = (!stats->attrtype->typbyval &&
                                stats->attrtype->typlen < 0);
-    double      corr_xysum;
+    double      corr_xysum=0;
     SortSupportData ssup;
     ScalarItem *values;
     int         values_cnt = 0;
@@ -2352,10 +2692,17 @@ compute_scalar_stats(VacAttrStatsP stats,
                     dups_cnt;
         int         slot_idx = 0;
         CompareScalarsContext cxt;
+        float4     *corrs;
 
         /* Sort the collected values */
         cxt.ssup = &ssup;
         cxt.tupnoLink = tupnoLink;
+
+        /*
+         * tuples were previously sorted by TID; now, sort by heap
+         * value, as needed for stats computations, and for order
+         * relative to the original sort for the correlation computation.
+         */
         qsort_arg((void *) values, values_cnt, sizeof(ScalarItem),
                   compare_scalars, (void *) &cxt);
 
@@ -2378,7 +2725,6 @@ compute_scalar_stats(VacAttrStatsP stats,
          * is the last item of its group of duplicates (since the group will
          * be ordered by tupno).
          */
-        corr_xysum = 0;
         ndistinct = 0;
         nmultiple = 0;
         dups_cnt = 0;
@@ -2566,11 +2912,11 @@ compute_scalar_stats(VacAttrStatsP stats,
             }
         }
 
+        Datum      *mcv_values=NULL;
         /* Generate MCV slot entry */
         if (num_mcv > 0)
        {
             MemoryContext old_context;
-            Datum      *mcv_values;
             float4     *mcv_freqs;
 
             /* Must copy the target values into anl_context */
@@ -2713,37 +3059,42 @@ compute_scalar_stats(VacAttrStatsP stats,
             slot_idx++;
         }
 
-        /* Generate a correlation entry if there are multiple values */
-        if (values_cnt > 1)
-        {
+        /* Will generate a correlation entry if there are multiple values */
+        if (values_cnt>1) {
             MemoryContext old_context;
-            float4     *corrs;
-            double      corr_xsum,
-                        corr_x2sum;
-
-            /* Must copy the target values into anl_context */
             old_context = MemoryContextSwitchTo(stats->anl_context);
+            /* Must copy the target values into anl_context */
             corrs = (float4 *) palloc(sizeof(float4));
             MemoryContextSwitchTo(old_context);
+        }
 
-            /*----------
-             * Since we know the x and y value sets are both
-             *      0, 1, ..., values_cnt-1
-             * we have sum(x) = sum(y) =
-             *      (values_cnt-1)*values_cnt / 2
-             * and sum(x^2) = sum(y^2) =
-             *      (values_cnt-1)*values_cnt*(2*values_cnt-1) / 6.
-             *----------
-             */
-            corr_xsum = ((double) (values_cnt - 1)) *
-                ((double) values_cnt) / 2.0;
-            corr_x2sum = ((double) (values_cnt - 1)) *
-                ((double) values_cnt) * (double) (2 * values_cnt - 1) / 6.0;
-
-            /* And the correlation coefficient reduces to */
-            corrs[0] = (values_cnt * corr_xysum - corr_xsum * corr_xsum) /
-                (values_cnt * corr_x2sum - corr_xsum * corr_xsum);
+        if (values_cnt>1 && fetchfunc==ind_fetch_func) { /* && num_mcv>0) { */
+            /* Compute alternate/fine-grained correlation if there is an MCV list with repeated values... */
+            // elog(WARNING, "%s %s %d value0:%lu", __FILE__, __FUNCTION__, __LINE__, num_mcv?mcv_values[0] : 1 );
+            corrs[0] = idx_correlation (stats->attr->attrelid, stats, samplerows, num_mcv, mcv_values);
+            elog(WARNING, "%s %s %d cors %lf", __FILE__, __FUNCTION__, __LINE__, corrs[0]);
+            stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION;
+            stats->staop[slot_idx] = mystats->ltopr;
+            stats->stanumbers[slot_idx] = corrs;
+            stats->numnumbers[slot_idx] = 1;
+            slot_idx++;
+        }
+        else if (values_cnt>1) // XXX && fetchfunc==ind_fetch_func? // correlation is now strictly a per-index attribute and not a per-column one
+        {
+            double      fudge=1;
+
+            if (fetchfunc==ind_fetch_func) {
+                /* Compute correlation fudge factor for indices with
+                 * high number of duplicate values for an index column,
+                 * causing index scan to be highly random, due to btree
+                 * random insertion logic being used to avoid O(N^2)
+                 * insertion behavior in that case.
+                 */
+                fudge = 1-idx_corr_fudge(RelationIdGetRelation(stats->attr->attrelid), samplerows);
+            }
+            corrs[0] = correlation(values_cnt, corr_xysum);
+            // XXX: corrs[0] *= fudge;
             stats->stakind[slot_idx] = STATISTIC_KIND_CORRELATION;
             stats->staop[slot_idx] = mystats->ltopr;
             stats->stanumbers[slot_idx] = corrs;
diff --git a/src/backend/optimizer/path/costsize.c b/src/backend/optimizer/path/costsize.c
index eb653cf..4529725 100644
--- a/src/backend/optimizer/path/costsize.c
+++ b/src/backend/optimizer/path/costsize.c
@@ -693,8 +693,13 @@ cost_index(IndexPath *path, PlannerInfo *root, double loop_count,
     /*
      * Now interpolate based on estimated index order correlation to get total
      * disk I/O cost for main table accesses.
+     * Note the sign of this expression: as csquared approaches 0, the
+     * estimate approaches the max_IO_cost estimate;
      */
     csquared = indexCorrelation * indexCorrelation;
+    elog(WARNING, "HERE 1222: csquared=%f minIO/R-P-C=%f maxIO/R-P-C=%f %s %s %d",
+            csquared, min_IO_cost/spc_random_page_cost, max_IO_cost/spc_random_page_cost,
+            __FILE__, __FUNCTION__, __LINE__);
 
     run_cost += max_IO_cost + csquared * (min_IO_cost - max_IO_cost);
 
diff --git a/src/backend/utils/adt/selfuncs.c b/src/backend/utils/adt/selfuncs.c
index e103f5e..f097580 100644
--- a/src/backend/utils/adt/selfuncs.c
+++ b/src/backend/utils/adt/selfuncs.c
@@ -6874,9 +6874,34 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
      */
     MemSet(&vardata, 0, sizeof(vardata));
 
-    if (index->indexkeys[0] != 0)
+    /* Expression --- maybe there are stats for the index itself */
+    relid = index->indexoid;
+    colnum = 1;
+    if (get_index_stats_hook &&
+        (*get_index_stats_hook) (root, relid, colnum, &vardata))
+    {
+        /*
+         * The hook took control of acquiring a stats tuple.  If it did
+         * supply a tuple, it'd better have supplied a freefunc.
+         */
+        elog(WARNING, "HERE 1223: indexCorrelation %s %s %d", __FILE__, __FUNCTION__, __LINE__);
+
+        if (HeapTupleIsValid(vardata.statsTuple) &&
+            !vardata.freefunc)
+            elog(ERROR, "no function provided to release variable stats with");
+    }
+    else if ( NULL != (vardata.statsTuple = SearchSysCache3(STATRELATTINH,
+                                         ObjectIdGetDatum(relid),
+                                         Int16GetDatum(colnum),
+                                         BoolGetDatum(false))))
+    {
+        elog(WARNING, "HERE 1224: indexCorrelation %s %s %d", __FILE__, __FUNCTION__, __LINE__);
+        vardata.freefunc = ReleaseSysCache;
+    }
+    else if (index->indexkeys[0] != 0)
     {
         /* Simple variable --- look to stats for the underlying table */
+        elog(WARNING, "HERE 1225: indexCorrelation %s %s %d", __FILE__, __FUNCTION__, __LINE__);
         RangeTblEntry *rte = planner_rt_fetch(index->rel->relid, root);
 
         Assert(rte->rtekind == RTE_RELATION);
@@ -6904,32 +6929,6 @@ btcostestimate(PlannerInfo *root, IndexPath *path, double loop_count,
             vardata.freefunc = ReleaseSysCache;
         }
     }
-    else
-    {
-        /* Expression --- maybe there are stats for the index itself */
-        relid = index->indexoid;
-        colnum = 1;
-
-        if (get_index_stats_hook &&
-            (*get_index_stats_hook) (root, relid, colnum, &vardata))
-        {
-            /*
-             * The hook took control of acquiring a stats tuple.  If it did
-             * supply a tuple, it'd better have supplied a freefunc.
-             */
-            if (HeapTupleIsValid(vardata.statsTuple) &&
-                !vardata.freefunc)
-                elog(ERROR, "no function provided to release variable stats with");
-        }
-        else
-        {
-            vardata.statsTuple = SearchSysCache3(STATRELATTINH,
-                                                 ObjectIdGetDatum(relid),
-                                                 Int16GetDatum(colnum),
-                                                 BoolGetDatum(false));
-            vardata.freefunc = ReleaseSysCache;
-        }
-    }
 
     if (HeapTupleIsValid(vardata.statsTuple))
     {