On 06/07/2019 17:57, Tomas Vondra wrote: > On Sat, Jul 06, 2019 at 05:35:33PM +0200, Ancoron Luciferis wrote: >> On 06/07/2019 15:38, Tomas Vondra wrote: >>> On Sat, Jul 06, 2019 at 11:02:27AM +0200, Ancoron Luciferis wrote: >>>> Hi, >>>> >>>> I've been wondering whether it is possible somehow to have the standard >>>> column statistics to respect a certain operator class? >>>> >>>> The reason why I am asking for this is that I have a UUID column with a >>>> unique index at it using a custom operator class which implies a >>>> different sort order than for the default UUID operator class. >>>> >>>> This results into planner mistakes when determining whether to use the >>>> index for row selection or not. Too often it falls back into sequential >>>> scan due to this. >>>> >>> >>> Can you share an example demonstrating the issue? >>> >>> >>> regards >>> >> >> Yes, I have an opclass as follows: >> >> CREATE OPERATOR CLASS uuid_timestamp_ops FOR TYPE uuid >> USING btree AS >> OPERATOR 1 <*, >> OPERATOR 1 <~ (uuid, timestamp with time zone), >> OPERATOR 2 <=*, >> OPERATOR 2 <=~ (uuid, timestamp with time zone), >> OPERATOR 3 =, >> OPERATOR 3 =~ (uuid, timestamp with time zone), >> OPERATOR 4 >=*, >> OPERATOR 4 >=~ (uuid, timestamp with time zone), >> OPERATOR 5 >*, >> OPERATOR 5 >~ (uuid, timestamp with time zone), >> FUNCTION 1 uuid_timestamp_cmp(uuid, uuid), >> FUNCTION 1 uuid_timestamp_only_cmp(uuid, timestamp >> with time zone), >> FUNCTION 2 uuid_timestamp_sortsupport(internal) >> ; >> >> ...and e.g. operator "<*" is defined as: >> >> CREATE FUNCTION uuid_timestamp_lt(uuid, uuid) >> RETURNS bool >> AS 'MODULE_PATHNAME', 'uuid_timestamp_lt' >> LANGUAGE C >> IMMUTABLE >> LEAKPROOF >> STRICT >> PARALLEL SAFE; >> >> COMMENT ON FUNCTION uuid_timestamp_lt(uuid, uuid) IS 'lower than'; >> >> CREATE OPERATOR <* ( >> LEFTARG = uuid, >> RIGHTARG = uuid, >> PROCEDURE = uuid_timestamp_lt, >> COMMUTATOR = '>*', >> NEGATOR = '>=*', >> RESTRICT = scalarltsel, >> JOIN = scalarltjoinsel >> ); >> >> >> The function "uuid_timestamp_lt" is basically defined as follows: >> 1. if not version 1 UUID fallback to standard uuid compare >> 2. extract timestamp values and compare >> 3. if equal timestamps fallback to standard uuid compare >> >> ...so that a chronological order is established. >> >> >> The test table is created as follows: >> >> CREATE TABLE uuid_v1_ext (id uuid); >> CREATE UNIQUE INDEX idx_uuid_v1_ext ON uuid_v1_ext (id >> uuid_timestamp_ops); >> >> >> The values for "histogram_bounds" of the test table look like this (due >> to the default sort order for standard type UUID): >> >> 00003789-97bf-11e9-b6bb-e03f49f7f733 >> 008b88f8-6deb-11e9-901a-e03f4947f477 >> 010a8b22-586a-11e9-8258-e03f49ce78f3 >> ... >> 6f682e68-978d-11e9-901a-e03f4947f477 >> 6ff412ee-926f-11e9-901a-e03f4947f477 >> 7079ffe2-642f-11e9-b0cc-e03f49e7fd3b >> 70ffaeca-4645-11e9-adf9-e03f494677fb >> ... >> fef26b41-9b9d-11e9-b0cc-e03f49e7fd3b >> ff779ce8-9e52-11e9-8258-e03f49ce78f3 >> ffff6bfc-4de4-11e9-b0d4-e03f49d6f6bf >> >> ...and I think that's where the planner gets the decision for a query >> such as: >> >> DELETE FROM uuid_v1_ext WHERE id <* >> '4bdf6f81-56ad-11e9-8258-e03f49ce78f3'; >> >> ...which then get's executed as sequential scan instead of an index scan. >> >> I was also thinking about changing the selectivity function used by the >> custom operator, but I didn't find any hints how to implement that >> without duplicating a lot of internal code. >> > > Not sure, I'm not very familiar with this code, so I'd have to play with > it and try things. But that's hard when I don't have any code. Would it > be possible to share a small self-contained test case? My test uses historically generated UUID's and is currently not automated (generates ~120M UUID's over a period of ~4 months). The tool I am using to generate the UUID's is a Java tool, so not very automation friendly (using Maven, which downloads a lot at first build time). The resulting data is ~4 GiB in size, but here is a guide I use for my manual testing: https://gist.github.com/ancoron/b08ac4b1ceafda2a38ff12030c011385 Please note that my testing involves 4 SSD's (to avoid too much mixed I/O but not for functionality): 1.) system + WAL 2.) generated UUID files (for reading) 3.) table data (tablespace "fast") 4.) index data (tablespace "faster") > > I wonder what does uuid_timestamp_cmp do? I suppose it first compares by > a timestamp extracted from the UUID, right? exactly: https://github.com/ancoron/pg-uuid-ext/blob/master/uuid_ext.c#L234 > > It'd be interesting to see > > (a) statistics for the column from pg_stats, both for the table and > index (which should have been built using the custom opclass, I think). For the table column: schemaname | public tablename | uuid_v1_ext attname | id inherited | f null_frac | 0 avg_width | 16 n_distinct | -1 most_common_vals | most_common_freqs | histogram_bounds | {00003789-97bf-11e9-b6bb-e03f49f7f733, ... correlation | 0.00448091 most_common_elems | most_common_elem_freqs | elem_count_histogram | There is no statistic for the index (or I don't know how to retrieve it). > > (b) EXPLAIN ANALYZE for queries with your opclass, and perhaps with the > default one (that can't use the timestamp condition, but it should be > possible to generate smallers/largest uuid for a timestamp). After some further testing it seems that I am comparing apples with bananas to a certain extend as I also have another index using a uuid timestamp extraction function and then using a timestamp for selecting the entries to delete. For uuid 4bdf6f81-56ad-11e9-8258-e03f49ce78f3, the query then looks as follows: DELETE FROM uuid_v1_timestamp WHERE uuid_v1_timestamp(id) < '2019-04-04 07:43:11.3776'; ...resulting in a plan like: Delete on uuid_v1_timestamp (cost=0.57..779651.49 rows=30077367 width=6) -> Index Scan using idx_uuid_v1_timestamp on uuid_v1_timestamp (cost=0.57..779651.49 rows=30077367 width=6) Index Cond: (uuid_v1_timestamp(id) < '2019-04-04 07:43:11.3776+00'::timestamp with time zone) So, as my opclass basically does the same internally, I was expecting the same behavior but it turned out that the statistics are just off a bit. When changing the timestamp to a lower value (resulting in less rows selected), the planner correctly uses an index scan, e.g.: Delete on uuid_v1_ext (cost=0.57..767334.76 rows=1829994 width=6) -> Index Scan using idx_uuid_v1_ext on uuid_v1_ext (cost=0.57..767334.76 rows=1829994 width=6) Index Cond: (id <* '4e91eb0a-4e91-11e9-83fd-e03f49ef76f7'::uuid) ...despite the fact that the bare uuid value is pretty high (the timestamp value is 2019-03-25 00:00:28.846626+00). What I can see is that the row estimates are really off here: Seq Scan on uuid_v1_ext (cost=0.00..2184455.80 rows=46969870 width=16) (actual time=0.029..9372.892 rows=30000000 loops=1) Filter: (id <* '4bdf6f81-56ad-11e9-8258-e03f49ce78f3'::uuid) Rows Removed by Filter: 92000000 Planning Time: 0.152 ms Execution Time: 10139.718 ms vs. Index Only Scan using idx_uuid_v1_ext on uuid_v1_ext (cost=0.57..767334.76 rows=1829994 width=16) (actual time=0.042..3255.211 rows=19709001 loops=1) Index Cond: (id <* '4e91eb0a-4e91-11e9-83fd-e03f49ef76f7'::uuid) Heap Fetches: 19709001 Planning Time: 0.182 ms Execution Time: 3763.380 ms ^^ last case off by a factor of 10? However, I think the lower rows range for switching to an index scan may also be influenced by the index width (16 byte uuid vs. 8 byte timestamp). Or am I wrong here? > > BTW which PostgreSQL version is this? I am testing mainly on 11.4. I did do any testing on 12 beta so far. > > regards >