Re: Custom opclass for column statistics?

Ancoron Luciferis <ancoron.luciferis@xxxxxxxxxxxxxx> · Sun, 7 Jul 2019 01:17:09 +0200

On 06/07/2019 17:57, Tomas Vondra wrote:
> On Sat, Jul 06, 2019 at 05:35:33PM +0200, Ancoron Luciferis wrote:
>> On 06/07/2019 15:38, Tomas Vondra wrote:
>>> On Sat, Jul 06, 2019 at 11:02:27AM +0200, Ancoron Luciferis wrote:
>>>> Hi,
>>>>
>>>> I've been wondering whether it is possible somehow to have the standard
>>>> column statistics to respect a certain operator class?
>>>>
>>>> The reason why I am asking for this is that I have a UUID column with a
>>>> unique index at it using a custom operator class which implies a
>>>> different sort order than for the default UUID operator class.
>>>>
>>>> This results into planner mistakes when determining whether to use the
>>>> index for row selection or not. Too often it falls back into sequential
>>>> scan due to this.
>>>>
>>>
>>> Can you share an example demonstrating the issue?
>>>
>>>
>>> regards
>>>
>>
>> Yes, I have an opclass as follows:
>>
>> CREATE OPERATOR CLASS uuid_timestamp_ops FOR TYPE uuid
>>    USING btree AS
>>        OPERATOR        1       <*,
>>        OPERATOR        1       <~ (uuid, timestamp with time zone),
>>        OPERATOR        2       <=*,
>>        OPERATOR        2       <=~ (uuid, timestamp with time zone),
>>        OPERATOR        3       =,
>>        OPERATOR        3       =~ (uuid, timestamp with time zone),
>>        OPERATOR        4       >=*,
>>        OPERATOR        4       >=~ (uuid, timestamp with time zone),
>>        OPERATOR        5       >*,
>>        OPERATOR        5       >~ (uuid, timestamp with time zone),
>>        FUNCTION        1       uuid_timestamp_cmp(uuid, uuid),
>>        FUNCTION        1       uuid_timestamp_only_cmp(uuid, timestamp
>> with time zone),
>>        FUNCTION        2       uuid_timestamp_sortsupport(internal)
>> ;
>>
>> ...and e.g. operator "<*" is defined as:
>>
>> CREATE FUNCTION uuid_timestamp_lt(uuid, uuid)
>> RETURNS bool
>> AS 'MODULE_PATHNAME', 'uuid_timestamp_lt'
>> LANGUAGE C
>> IMMUTABLE
>> LEAKPROOF
>> STRICT
>> PARALLEL SAFE;
>>
>> COMMENT ON FUNCTION uuid_timestamp_lt(uuid, uuid) IS 'lower than';
>>
>> CREATE OPERATOR <* (
>>    LEFTARG = uuid,
>>    RIGHTARG = uuid,
>>    PROCEDURE = uuid_timestamp_lt,
>>    COMMUTATOR = '>*',
>>    NEGATOR = '>=*',
>>    RESTRICT = scalarltsel,
>>    JOIN = scalarltjoinsel
>> );
>>
>>
>> The function "uuid_timestamp_lt" is basically defined as follows:
>> 1. if not version 1 UUID fallback to standard uuid compare
>> 2. extract timestamp values and compare
>> 3. if equal timestamps fallback to standard uuid compare
>>
>> ...so that a chronological order is established.
>>
>>
>> The test table is created as follows:
>>
>> CREATE TABLE uuid_v1_ext (id uuid);
>> CREATE UNIQUE INDEX idx_uuid_v1_ext ON uuid_v1_ext (id
>> uuid_timestamp_ops);
>>
>>
>> The values for "histogram_bounds" of the test table look like this (due
>> to the default sort order for standard type UUID):
>>
>> 00003789-97bf-11e9-b6bb-e03f49f7f733
>> 008b88f8-6deb-11e9-901a-e03f4947f477
>> 010a8b22-586a-11e9-8258-e03f49ce78f3
>> ...
>> 6f682e68-978d-11e9-901a-e03f4947f477
>> 6ff412ee-926f-11e9-901a-e03f4947f477
>> 7079ffe2-642f-11e9-b0cc-e03f49e7fd3b
>> 70ffaeca-4645-11e9-adf9-e03f494677fb
>> ...
>> fef26b41-9b9d-11e9-b0cc-e03f49e7fd3b
>> ff779ce8-9e52-11e9-8258-e03f49ce78f3
>> ffff6bfc-4de4-11e9-b0d4-e03f49d6f6bf
>>
>> ...and I think that's where the planner gets the decision for a query
>> such as:
>>
>> DELETE FROM uuid_v1_ext WHERE id <*
>> '4bdf6f81-56ad-11e9-8258-e03f49ce78f3';
>>
>> ...which then get's executed as sequential scan instead of an index scan.
>>
>> I was also thinking about changing the selectivity function used by the
>> custom operator, but I didn't find any hints how to implement that
>> without duplicating a lot of internal code.
>>
> 
> Not sure, I'm not very familiar with this code, so I'd have to play with
> it and try things. But that's hard when I don't have any code. Would it
> be possible to share a small self-contained test case?

My test uses historically generated UUID's and is currently not
automated (generates ~120M UUID's over a period of ~4 months). The tool
I am using to generate the UUID's is a Java tool, so not very automation
friendly (using Maven, which downloads a lot at first build time).

The resulting data is ~4 GiB in size, but here is a guide I use for my
manual testing:

https://gist.github.com/ancoron/b08ac4b1ceafda2a38ff12030c011385

Please note that my testing involves 4 SSD's (to avoid too much mixed
I/O but not for functionality):
1.) system + WAL
2.) generated UUID files (for reading)
3.) table data (tablespace "fast")
4.) index data (tablespace "faster")

> 
> I wonder what does uuid_timestamp_cmp do? I suppose it first compares by
> a timestamp extracted from the UUID, right?

exactly:

https://github.com/ancoron/pg-uuid-ext/blob/master/uuid_ext.c#L234

> 
> It'd be interesting to see
> 
> (a) statistics for the column from pg_stats, both for the table and
> index (which should have been built using the custom opclass, I think).

For the table column:

schemaname             | public
tablename              | uuid_v1_ext
attname                | id
inherited              | f
null_frac              | 0
avg_width              | 16
n_distinct             | -1
most_common_vals       |
most_common_freqs      |
histogram_bounds       | {00003789-97bf-11e9-b6bb-e03f49f7f733, ...
correlation            | 0.00448091
most_common_elems      |
most_common_elem_freqs |
elem_count_histogram   |

There is no statistic for the index (or I don't know how to retrieve it).

> 
> (b) EXPLAIN ANALYZE for queries with your opclass, and perhaps with the
> default one (that can't use the timestamp condition, but it should be
> possible to generate smallers/largest uuid for a timestamp).

After some further testing it seems that I am comparing apples with
bananas to a certain extend as I also have another index using a uuid
timestamp extraction function and then using a timestamp for selecting
the entries to delete.

For uuid 4bdf6f81-56ad-11e9-8258-e03f49ce78f3, the query then looks as
follows:

DELETE FROM uuid_v1_timestamp WHERE uuid_v1_timestamp(id) < '2019-04-04
07:43:11.3776';

...resulting in a plan like:

 Delete on uuid_v1_timestamp  (cost=0.57..779651.49 rows=30077367 width=6)
   ->  Index Scan using idx_uuid_v1_timestamp on uuid_v1_timestamp
(cost=0.57..779651.49 rows=30077367 width=6)
         Index Cond: (uuid_v1_timestamp(id) < '2019-04-04
07:43:11.3776+00'::timestamp with time zone)

So, as my opclass basically does the same internally, I was expecting
the same behavior but it turned out that the statistics are just off a
bit. When changing the timestamp to a lower value (resulting in less
rows selected), the planner correctly uses an index scan, e.g.:

 Delete on uuid_v1_ext  (cost=0.57..767334.76 rows=1829994 width=6)
   ->  Index Scan using idx_uuid_v1_ext on uuid_v1_ext
(cost=0.57..767334.76 rows=1829994 width=6)
         Index Cond: (id <* '4e91eb0a-4e91-11e9-83fd-e03f49ef76f7'::uuid)

...despite the fact that the bare uuid value is pretty high (the
timestamp value is 2019-03-25 00:00:28.846626+00).

What I can see is that the row estimates are really off here:

 Seq Scan on uuid_v1_ext  (cost=0.00..2184455.80 rows=46969870 width=16)
(actual time=0.029..9372.892 rows=30000000 loops=1)
   Filter: (id <* '4bdf6f81-56ad-11e9-8258-e03f49ce78f3'::uuid)
   Rows Removed by Filter: 92000000
 Planning Time: 0.152 ms
 Execution Time: 10139.718 ms

vs.

 Index Only Scan using idx_uuid_v1_ext on uuid_v1_ext
(cost=0.57..767334.76 rows=1829994 width=16) (actual
time=0.042..3255.211 rows=19709001 loops=1)
   Index Cond: (id <* '4e91eb0a-4e91-11e9-83fd-e03f49ef76f7'::uuid)
   Heap Fetches: 19709001
 Planning Time: 0.182 ms
 Execution Time: 3763.380 ms

^^ last case off by a factor of 10?

However, I think the lower rows range for switching to an index scan may
also be influenced by the index width (16 byte uuid vs. 8 byte
timestamp). Or am I wrong here?

> 
> BTW which PostgreSQL version is this?

I am testing mainly on 11.4. I did do any testing on 12 beta so far.

> 
> regards
>