On Tue, Dec 12, 2017 at 01:29:48AM -0800, Jeff Janes wrote:
> On Wed, Dec 6, 2017 at 1:46 PM, Justin Pryzby <pryzby@xxxxxxxxxxxxx> wrote:
> > On Tue, Dec 05, 2017 at 01:50:11PM -0500, Tom Lane wrote:
> > > In any case, given that we do this calculation without regard
> > > to any specific index,
> >
> > One solution is to compute stats (at least correlation) for all indices,
> > not just expr inds.  I did that earlier this year while throwing
> > around/out ideas.
> > https://www.postgresql.org/message-id/20170707234119.GN17566%40telsasoft.com
>
> When is the correlation of a column which is not the leading column of a
> btree index or in a brin index ever used?  If we did compute index-specific
> correlations, we could maybe just drop pure-column correlations.

Yes, I think so: correlation is collected for every column, but only used for
indices.  I also have a comment to myself in that patch to force
attstattarget=0 for non-expr indices, to avoid keeping MCV/histogram which
duplicates that of their column.

> > Trying to keep it all in my own head: For sufficiently large number of
> > pages, bitmap scan should be preferred to idx scan due to reduced
> > random-page-cost outweighing its overhead in CPU cost.
>
> But CPU cost is probably not why it is losing anyway.
>
> Index scans get a double bonus from high correlation.  It assumes that only
> a small fraction of the table will be visited.  And then it assumes that
> the visits it does make will be largely sequential.  I think that you are
> saying that for a large enough table, that last assumption is wrong, that
> the residual amount of non-correlation is enough to make the table reads
> more random than sequential.  Maybe.  Do you have a test case that
> demonstrates this?  If so, how big do we need to go, and can you see the
> problem on SSD as well as HDD?
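(The "double bonus" being cost_index's interpolation of IO cost between a
mostly-sequential best case and an all-random worst case, weighted by
correlation squared.  A simplified Python sketch of that interpolation - not
the actual costsize.c code; pages_fetched and the page-cost constants here are
placeholders:)

```python
# Simplified sketch of cost_index's IO-cost interpolation (costsize.c);
# the constants and pages_fetched value are placeholder assumptions.
def index_scan_io_cost(pages_fetched, correlation,
                       seq_page_cost=1.0, random_page_cost=4.0):
    # Worst case: every heap page fetched with a random seek.
    max_io_cost = pages_fetched * random_page_cost
    # Best case: one seek, then (nearly) sequential heap reads.
    min_io_cost = random_page_cost + (pages_fetched - 1) * seq_page_cost
    # Interpolation parameter is correlation squared.
    csquared = correlation * correlation
    return max_io_cost + csquared * (min_io_cost - max_io_cost)

print(index_scan_io_cost(100000, 0.0))  # 400000.0: all-random
print(index_scan_io_cost(100000, 1.0))  # 100003.0: nearly sequential
```

So any correlation near +/-1 prices the heap fetches as almost entirely
sequential, regardless of fine-scale ordering.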
Right: the "residual"/fine-scale variations (those which are not adequately
represented by the correlation metric) are/may be non-sequential, so don't get
good readahead.

The original issue was with a 75GB table (an inheritance child) and an
analytic query previously taking ~30min, at that point taking 4-5 hours due to
random seeks (from duplicate values in a timestamp column with 1-second
resolution).  There would've been very little if any of the previous day's
table cached: the child table being queried (by way of its parent) had size
roughly the same as the server's RAM, and would've been loaded over the course
of the preceding 6-30 hours, and not frequently accessed.  It may be that
there's a sharp change once cache no longer effectively mitigates the random
heap reads.

SSD: good question.  Here's a Rackspace VM with PG9.6.6, 2GB shared_buffers,
8GB RAM (~4GB of which is being used as OS page cache), and 32GB SSD (with
random_page_cost=1).  The server is in use by our application.  I believe you
could scale up the size of the table to see this behavior with any cache size.
The 0.0001 below controls the "jitter", with smaller values being more jittery.
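As a standalone sanity check of why the correlation statistic doesn't capture
this (a Python sketch mimicking the INSERT below at 1e6 rows instead of 1e8;
not part of the test case itself): the jittered column's correlation with row
position stays essentially 1, even though nearly half of the adjacent
row-to-row steps go backward - exactly the fine-scale disorder that defeats
readahead.

```python
import random

random.seed(0)
N = 1_000_000  # scaled down from the 1e8-row test case below

# Same shape as the INSERT below: i = int(0.0001*a + 9*(random()-0.5))
vals = [int(0.0001 * a + 9 * (random.random() - 0.5))
        for a in range(1, N + 1)]

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Correlation with physical position: essentially 1...
print(pearson(range(N), vals))
# ...yet a large fraction of adjacent steps decrease, so heap access
# within any index-key range keeps jumping backward:
print(sum(1 for x, y in zip(vals, vals[1:]) if y < x) / (N - 1))
```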
postgres=# CREATE TABLE t(i int, j int) TABLESPACE tmp;
postgres=# CREATE INDEX ON t(i);
postgres=# INSERT INTO t SELECT (0.0001*a+9*(random()-0.5))::int FROM generate_series(1,99999999) a;
postgres=# VACUUM ANALYZE t;

 public | t | table | pryzbyj | 3458 MB
 relpages | 442478

For comparison purposes/baseline, here's a scan on a SEPARATE index freshly built AFTER the insertions:

postgres=# explain(analyze,buffers) SELECT COUNT(j) FROM t WHERE i BETWEEN 0 AND 4000;

First invocation:
#1 ->  Index Scan using t_i_idx1 on t  (cost=0.57..1413352.60 rows=39933001 width=4) (actual time=25.660..52575.127 rows=39996029 loops=1)
         Buffers: shared hit=1578644 read=286489 written=1084

Subsequent invocations with (extra) effect from OS cache:
#2 ->  Index Scan using t_i_idx1 on t  (cost=0.57..1413352.60 rows=39933001 width=4) (actual time=61.054..37646.556 rows=39996029 loops=1)
         Buffers: shared hit=1578644 read=286489 written=2223
#3 ->  Index Scan using t_i_idx1 on t  (cost=0.57..1413352.60 rows=39933001 width=4) (actual time=9.344..31265.398 rows=39996029 loops=1)
         Buffers: shared hit=1578644 read=286489 written=1192

Dropping that index, and scanning a different range on the non-fresh index:

postgres=# explain(analyze,buffers) SELECT COUNT(j) FROM t WHERE i BETWEEN 4000 AND 8000;

#1 ->  Index Scan using t_i_idx on t  (cost=0.57..1546440.47 rows=40298277 width=4) (actual time=95.815..139152.147 rows=40009853 loops=1)
         Buffers: shared hit=1948069 read=316536 written=3411

Rerunning with cache effects:
#2 ->  Index Scan using t_i_idx on t  (cost=0.57..1546440.47 rows=40298277 width=4) (actual time=203.590..87547.287 rows=40009853 loops=1)
         Buffers: shared hit=1948069 read=316536 written=5712
#3 ->  Index Scan using t_i_idx on t  (cost=0.57..1546440.47 rows=40298277 width=4) (actual time=164.504..83768.890 rows=40009853 loops=1)
         Buffers: shared hit=1948069 read=316536 written=1979

Compare to a seq scan:
 ->  Seq Scan on t  (cost=0.00..1942478.00 rows=40298277 width=4) (actual time=1173.162..20980.069 rows=40009853 loops=1)
       Buffers: shared hit=47341 read=395137

And a bitmap scan:
 ->  Bitmap Heap Scan on t  (cost=975197.91..2022150.06 rows=40298277 width=4) (actual time=24396.270..39304.813 rows=40009853 loops=1)
       Buffers: shared read=316536 written=1431

The index scan reads 2.3e6 pages, compared to 4e5 pages (seq) and 3e5 pages
(bitmap), and the idx scans were 4-7x slower than the seq scan.  Was the index
scan actually that badly affected by the CPU cost of revisiting pages (rather
than by IO costs)?  Or did the OS actually fail to cache the 3e5 pages "read"?
That would be consistent with running almost 2x faster on the 3rd invocation.

The "hits" are largely from pages being revisited and recently accessed.  Are
the misses (reads) mostly from pages being revisited after already falling out
of cache?  Or mostly initial access?  Or ??

If the index scan is really paying a high IO cost for rereads and not
primarily a CPU cost, this seems to be something like the "correlated index
scan" variant of the traditional failure to cache effectively when seq
scanning a sufficiently large table with an MRU buffer: the cache is
ineffective at adequately mitigating IO costs when the (re)reads have
sufficient "spread".

postgres=# SELECT tablename, attname, correlation FROM pg_stats WHERE tablename='t';
tablename   | t
attname     | i
correlation | 1

I still want to say that's unreasonable, due to (I think) the high fraction of
non-sequential reads and the associated absence of readahead.

> I think it would be easier to give bitmap scans their due rather than try
> to knock down index scans.  But of course a synthetic test case would go a
> long way to either one.

As Tom said: index scans involving repeated keys are assuming best-case
sequential reads for a given computed correlation.  I'd be happy if they were
costed to avoid that assumption, and instead used some "middle of the road"
interpretation (probably based on correlation and MCV fraction?); but, in the
alternative, we'd need to distinguish the cost_index cases, rather than
adjusting bitmap.
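To make "middle of the road" concrete, here's a hypothetical sketch (the
linear discount by a duplicate-key/MCV fraction is only meant to illustrate
the shape of the idea, not a worked-out formula; the cost constants and
pages_fetched are placeholders):

```python
def io_cost(pages_fetched, correlation, dup_fraction,
            seq_page_cost=1.0, random_page_cost=4.0):
    # Stock cost_index interpolates between all-random and
    # mostly-sequential IO using correlation squared alone.
    max_io = pages_fetched * random_page_cost
    min_io = random_page_cost + (pages_fetched - 1) * seq_page_cost
    # Hypothetical tweak: discount the correlation bonus by the
    # fraction of duplicate keys (e.g. derived from the MCV list),
    # since reads within a run of duplicates are effectively unordered.
    csquared = correlation * correlation * (1.0 - dup_fraction)
    return max_io + csquared * (min_io - max_io)

print(io_cost(100, 1.0, 0.0))  # 103.0: best case, as today
print(io_cost(100, 1.0, 0.5))  # 251.5: halfway toward random
```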
This is what led me to play around with stats computation for all indices.

Justin