Re: UUID v1 optimizations...

Ancoron Luciferis <ancoron.luciferis@xxxxxxxxxxxxxx> · Sun, 26 May 2019 11:01:36 +0200

On 26/05/2019 03:09, Tomas Vondra wrote:
> On Sun, May 26, 2019 at 01:49:30AM +0200, Ancoron Luciferis wrote:
>> On 26/05/2019 00:14, Tomas Vondra wrote:
>>> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>>>> Ancoron Luciferis <ancoron.luciferis@xxxxxxxxxxxxxx> writes:
>>>>> On 25/05/2019 16:57, Tom Lane wrote:
>>>>>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>>>>>> to sort UUIDs in the same order as today, meaning that btree index
>>>>>> behavior
>>>>>> would remain the same as before.  Plus UUID comparison would get a
>>>>>> lot
>>>>>> more complicated and slower than it is now.
>>>>
>>>>> I get your first sentence, but not your second. I know that when
>>>>> changing the internal byte order we'd have to completed re-compute
>>>>> everything on-disk (from table to index data), but why would the
>>>>> sorting
>>>>> in the index have to be the same?
>>>>
>>>> Because we aren't going to change the existing sort order of UUIDs.
>>>> We have no idea what applications might be dependent on that.
>>>>
>>>> As Vitalii correctly pointed out, your beef is not with the physical
>>>> storage of UUIDs anyway: you just wish they'd sort differently, since
>>>> that is what determines the behavior of a btree index.  But we aren't
>>>> going to change the sort ordering because that's an even bigger
>>>> compatibility break than changing the physical storage; it'd affect
>>>> application-visible semantics.
>>>>
>>>> What you might want to think about is creating a function that maps
>>>> UUIDs into an ordering that makes sense to you, and then creating
>>>> a unique index over that function instead of the raw UUIDs.  That
>>>> would give the results you want without having to negotiate with the
>>>> rest of the world about whether it's okay to change the semantics
>>>> of type uuid.
>>>>
>>>
>>> FWIW that's essentially what I implemented as an extension some time
>>> ago. See [1] for a more detailed explanation and some benchmarks.
>>
>> Yes, I've seen that before. Pretty nice work you but together there and
>> I'll surely have a look at it but we certainly need the node id in
>> compliance with v1 UUID's so that's why we've been generating UUID's at
>> the application side from day 1.
>>
>>>
>>> The thing is - it's not really desirable to get perfectly ordered
>>> ordering, because that would mean we never get back to older parts of
>>> the index (so if you delete data, we'd never fill that space).
>>
>> Wouldn't this apply also to any sequential-looking index (e.g. on
>> serial)?
> 
> Yes, it does apply to any index on sequential (ordered) data. If you
> delete data from the "old" part (but not all, so the pages don't get
> completely empty), that space is lost. It's available for new data, but
> if we only insert to "new" part of the index, that's useless.

OK, thanks for clearing that up for me.

> 
>> The main issue with the UUID's is that it almost instantly
>> consumes a big part of the total value space (e.g. first value is
>> '01...' and second is coming as 'f3...') which I would assume not being
>> very efficient with btrees (space reclaim? - bloat).
>>
> 
> I don't understand what you mean here. Perhaps you misunderstand how
> btree indexes grow? It's not like we allocate separate pages for
> different values/prefixes - we insert the data until a page gets full,
> then it's split in half. There is some dependency on the order in which
> the values are inserted, but AFAIK random order is generally fine.

OK, I might not understand the basics of the btree implementation. Sorry
for that.

However, one of our issues with standard v1 UUID's was bloat of the
indexes, although we kept only a few months of data in it. I think, this
was due to the pages still containing at least one value and not
reclaimed by vacuum. It just continued to grow.

Now, as we have this different ever-increasing prefix, we still have
some constant growing but we see that whenever historic data get's
deleted (in a separate process), space get's reclaimed.

> 
>> One of our major concerns is to keep index size small (VACUUM can't be
>> run every minute) to fit into memory next to a lot of others.
>>
> 
> I don't think this has much to do with vacuum - I don't see how it's
> related to the ordering of generated UUID values. And I don't see where
> the "can't run vacuum every minute" comes from.

OK, numbers (after VACUUM) which I really found strange using the query
from pgexperts [1]:

     index_name      | bloat_pct | bloat_mb | index_mb | table_mb
---------------------+-----------+----------+----------+----------
 uuid_v1_pkey        |        38 |      363 |  965.742 |  950.172
 uuid_serial_pkey    |        11 |       74 |  676.844 |  950.172
 uuid_serial_8_pkey  |        46 |      519 | 1122.031 |  950.172
 uuid_serial_16_pkey |        39 |      389 |  991.195 |  950.172

...where the "8" and "16" is a "shift" of the timestamp value,
implemented with:

timestamp = (timestamp >>> (60 - shift)) | (timestamp << (shift + 4))

If someone could shed some light on why that is (the huge difference in
index sizes) I'd be happy.

> 
>> I've experimented with the rollover "prefix" myself but found that it
>> makes the index too big (same or larger index size than standard v1
>> UUIDs) and VACUUM too slow (almost as slow as a standard V1 UUID),
>> although INSERT performance wasn't that bad, our sequential UUID's where
>> way faster (at least pre-generated and imported with COPY to eliminate
>> any value generation impact).
>>
> 
> I very much doubt that has anything to do with the prefix. You'll need
> to share more details about how you did your tests.

OK, I'll see if I can prepare something and publish it.

> 
> 
> regards
> 

Refs:
[1]
https://github.com/pgexperts/pgx_scripts/blob/master/bloat/index_bloat_check.sql