Summary: Non-unique btree indices are returning CTIDs for rows with same value of indexed column not in logical order, imposing a high performance penalty. Running PG 9.5.3 now, we have a time-based partitions of append-only tables with data loaded from other sources. The tables are partitioned by time, and timestamp column has an non-unique, not-null btree index. The child tables are each ~75GB and expected to keep growing. For a child table with a week's worth of data: relpages | 11255802 reltuples | 5.90502e+07 The data is loaded shortly after it's available, so have high correlation in pg_statistic: [pryzbyj@viaero ~]$ psql ts -c "SELECT tablename, correlation, n_distinct FROM pg_stats s JOIN pg_class c ON (c.relname=s.tablename) WHERE tablename LIKE 'cdrs_huawei_pgwrecord%' AND attname='recordopeningtime' ORDER BY 1" |head tablename | correlation | n_distinct ----------------------------------+-------------+------------ cdrs_huawei_pgwrecord | 0.999997 | 102892 cdrs_huawei_pgwrecord_2016_02_15 | 0.999658 | 96145 cdrs_huawei_pgwrecord_2016_02_22 | 0.999943 | 91916 cdrs_huawei_pgwrecord_2016_02_29 | 0.997219 | 50341 cdrs_huawei_pgwrecord_2016_03_01 | 0.999947 | 97485 But note the non-uniqueness of the index column: ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 1 ORDER BY 2 DESC; recordopeningtime | count ---------------------+------- 2016-05-21 12:17:29 | 176 2016-05-21 12:17:25 | 171 2016-05-21 13:11:33 | 170 2016-05-21 10:20:02 | 169 2016-05-21 11:30:02 | 167 [...] We have an daily analytic query which processes the previous day's data. For new child tables, with only 1 days data loaded, this runs in ~30min, and for child tables with an entire week's worth of data loaded, takes several hours (even though both queries process the same amount of data). First, I found I was able to get 30-50min query results on full week's table by prefering a seq scan to an index scan. The row estimates seemed fine, and the only condition is the timestamp, so the planner's use of index scan is as expected. AFAICT what's happening is that the index scan was returning pages nonsequentially. strace-ing the backend showed alternating lseek()s and read()s, with the offsets not consistently increasing (nor consistently decreasing): % sudo strace -p 25588 2>&1 |grep -m9 'lseek(773' lseek(773, 1059766272, SEEK_SET) = 1059766272 lseek(773, 824926208, SEEK_SET) = 824926208 lseek(773, 990027776, SEEK_SET) = 990027776 lseek(773, 990330880, SEEK_SET) = 990330880 lseek(773, 1038942208, SEEK_SET) = 1038942208 lseek(773, 1059856384, SEEK_SET) = 1059856384 lseek(773, 977305600, SEEK_SET) = 977305600 lseek(773, 990347264, SEEK_SET) = 990347264 lseek(773, 871096320, SEEK_SET) = 871096320 .. and consecutive read()s being rare: read(802, "g"..., 8192) = 8192 lseek(802, 918003712, SEEK_SET) = 918003712 read(802, "c"..., 8192) = 8192 lseek(802, 859136000, SEEK_SET) = 859136000 read(802, "a"..., 8192) = 8192 lseek(802, 919601152, SEEK_SET) = 919601152 read(802, "d"..., 8192) = 8192 lseek(802, 905101312, SEEK_SET) = 905101312 read(802, "c"..., 8192) = 8192 lseek(801, 507863040, SEEK_SET) = 507863040 read(801, "p"..., 8192) = 8192 lseek(802, 914235392, SEEK_SET) = 914235392 read(802, "c"..., 8192) = 8192 I was able to see great improvement without planner parameters by REINDEX the timestamp index. My theory is that the index/planner doesn't handle well the case of many tuples with same column value, and returns pages out of logical order. Reindex fixes that, rewriting the index data with pages in order (confirmed with pageinspect), which causes index scans to fetch heap data more or less monotonically (if not consecutively). strace shows that consecutive read()s are common (without intervening seeks). I gather this allows the OS readahead to kick in. Postgres seems to assume that the high degree of correlation of the table column seen in pg_stats is how it will get data from the index scan, which assumption seems to be very poor on what turns out to be a higly fragmented index. Is there a way to help it to understand otherwise?? Maybe this is a well-understood problem/deficiency; but, is there any reason why Btree scan can't sort result by ctid for index tuples with same column value (_bt_steppage() or btgettuple())? Or maybe the problem could be mitigated by changing the behavior during INESRT? In the meantime, I'll be implementing a reindex job. Thanks, Justin -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance