Re: index fragmentation on insert-only table with non-unique column

Jeff Janes <jeff.janes@xxxxxxxxx> · Tue, 24 May 2016 23:23:48 -0700

On Tue, May 24, 2016 at 10:39 AM, Justin Pryzby <pryzby@xxxxxxxxxxxxx> wrote:
> Summary: Non-unique btree indices are returning CTIDs for rows with same
> value of indexed column not in logical order, imposing a high performance
> penalty.
>
> Running PG 9.5.3 now, we have a time-based partitions of append-only tables
> with data loaded from other sources.  The tables are partitioned by time, and
> timestamp column has an non-unique, not-null btree index.
>
> The child tables are each ~75GB and expected to keep growing.  For a child
> table with a week's worth of data:
> relpages  | 11255802
> reltuples | 5.90502e+07
>
> The data is loaded shortly after it's available, so have high correlation in
> pg_statistic:
> [pryzbyj@viaero ~]$ psql ts -c "SELECT tablename, correlation, n_distinct FROM pg_stats s JOIN pg_class c ON (c.relname=s.tablename) WHERE tablename LIKE 'cdrs_huawei_pgwrecord%' AND attname='recordopeningtime' ORDER BY 1" |head
>             tablename             | correlation | n_distinct
> ----------------------------------+-------------+------------
>  cdrs_huawei_pgwrecord            |    0.999997 |     102892
>  cdrs_huawei_pgwrecord_2016_02_15 |    0.999658 |      96145
>  cdrs_huawei_pgwrecord_2016_02_22 |    0.999943 |      91916
>  cdrs_huawei_pgwrecord_2016_02_29 |    0.997219 |      50341
>  cdrs_huawei_pgwrecord_2016_03_01 |    0.999947 |      97485
>
> But note the non-uniqueness of the index column:
> ts=# SELECT recordopeningtime, COUNT(1) FROM cdrs_huawei_pgwrecord WHERE recordopeningtime>='2016-05-21' AND recordopeningtime<'2016-05-22' GROUP BY 1 ORDER BY 2 DESC;
>   recordopeningtime  | count
> ---------------------+-------
>  2016-05-21 12:17:29 |   176
>  2016-05-21 12:17:25 |   171
>  2016-05-21 13:11:33 |   170
>  2016-05-21 10:20:02 |   169
>  2016-05-21 11:30:02 |   167
> [...]

That is not that much duplication.  You aren't going to have dozens or
hundreds of leaf pages all with equal values.   (and you only showed
the most highly duplicated ones, presumably the average is much less)

> We have an daily analytic query which processes the previous day's data.  For
> new child tables, with only 1 days data loaded, this runs in ~30min, and for
> child tables with an entire week's worth of data loaded, takes several hours
> (even though both queries process the same amount of data).

For an append only table, why would the first day of a new partition
be any less fragmented than that same day would be a week from now?
Are you sure it isn't just that your week-old data has all been aged
out of the cache?

> First, I found I was able to get 30-50min query results on full week's table by
> prefering a seq scan to an index scan.  The row estimates seemed fine, and the
> only condition is the timestamp, so the planner's use of index scan is as
> expected.

Can you show us the query?  I would expect a bitmap scan of the index
(which would do what you want, but even more so), instead.

>
> AFAICT what's happening is that the index scan was returning pages
> nonsequentially.  strace-ing the backend showed alternating lseek()s and
> read()s, with the offsets not consistently increasing (nor consistently
> decreasing):
> % sudo strace -p 25588 2>&1 |grep -m9 'lseek(773'
> lseek(773, 1059766272, SEEK_SET)        = 1059766272
> lseek(773, 824926208, SEEK_SET)         = 824926208
> lseek(773, 990027776, SEEK_SET)         = 990027776
> lseek(773, 990330880, SEEK_SET)         = 990330880
> lseek(773, 1038942208, SEEK_SET)        = 1038942208
> lseek(773, 1059856384, SEEK_SET)        = 1059856384
> lseek(773, 977305600, SEEK_SET)         = 977305600
> lseek(773, 990347264, SEEK_SET)         = 990347264
> lseek(773, 871096320, SEEK_SET)         = 871096320
>
> .. and consecutive read()s being rare:
> read(802, "g"..., 8192)                 = 8192
> lseek(802, 918003712, SEEK_SET)         = 918003712
> read(802, "c"..., 8192)                 = 8192
> lseek(802, 859136000, SEEK_SET)         = 859136000
> read(802, "a"..., 8192)                 = 8192
> lseek(802, 919601152, SEEK_SET)         = 919601152
> read(802, "d"..., 8192)                 = 8192
> lseek(802, 905101312, SEEK_SET)         = 905101312
> read(802, "c"..., 8192)                 = 8192
> lseek(801, 507863040, SEEK_SET)         = 507863040
> read(801, "p"..., 8192)                 = 8192
> lseek(802, 914235392, SEEK_SET)         = 914235392
> read(802, "c"..., 8192)                 = 8192

Which of those are the table, and which the index?

Something doesn't add up here.  How could an index of an append-only
table possibly become that fragmented, when the highest amount of key
duplication is about 170?

Cheers,

Jeff

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance