On 26/05/2019 03:09, Tomas Vondra wrote:
> On Sun, May 26, 2019 at 01:49:30AM +0200, Ancoron Luciferis wrote:
>> On 26/05/2019 00:14, Tomas Vondra wrote:
>>> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>>>> Ancoron Luciferis <ancoron.luciferis@xxxxxxxxxxxxxx> writes:
>>>>> On 25/05/2019 16:57, Tom Lane wrote:
>>>>>> (4) it in fact *wouldn't* do anything useful, because we'd still have
>>>>>> to sort UUIDs in the same order as today, meaning that btree index
>>>>>> behavior would remain the same as before. Plus UUID comparison would
>>>>>> get a lot more complicated and slower than it is now.
>>>>
>>>>> I get your first sentence, but not your second. I know that when
>>>>> changing the internal byte order we'd have to completely re-compute
>>>>> everything on-disk (from table to index data), but why would the
>>>>> sorting in the index have to be the same?
>>>>
>>>> Because we aren't going to change the existing sort order of UUIDs.
>>>> We have no idea what applications might be dependent on that.
>>>>
>>>> As Vitalii correctly pointed out, your beef is not with the physical
>>>> storage of UUIDs anyway: you just wish they'd sort differently, since
>>>> that is what determines the behavior of a btree index. But we aren't
>>>> going to change the sort ordering because that's an even bigger
>>>> compatibility break than changing the physical storage; it'd affect
>>>> application-visible semantics.
>>>>
>>>> What you might want to think about is creating a function that maps
>>>> UUIDs into an ordering that makes sense to you, and then creating
>>>> a unique index over that function instead of the raw UUIDs. That
>>>> would give the results you want without having to negotiate with the
>>>> rest of the world about whether it's okay to change the semantics
>>>> of type uuid.
>>>>
>>>
>>> FWIW that's essentially what I implemented as an extension some time
>>> ago. See [1] for a more detailed explanation and some benchmarks.
>>
>> Yes, I've seen that before. Pretty nice work you put together there and
>> I'll surely have a look at it, but we certainly need the node id in
>> compliance with v1 UUIDs, so that's why we've been generating UUIDs at
>> the application side from day 1.
>>
>>>
>>> The thing is - it's not really desirable to get perfectly ordered
>>> ordering, because that would mean we never get back to older parts of
>>> the index (so if you delete data, we'd never fill that space).
>>
>> Wouldn't this apply also to any sequential-looking index (e.g. on
>> serial)?
>
> Yes, it does apply to any index on sequential (ordered) data. If you
> delete data from the "old" part (but not all, so the pages don't get
> completely empty), that space is lost. It's available for new data, but
> if we only insert to "new" part of the index, that's useless.

OK, thanks for clearing that up for me.

>
>> The main issue with the UUIDs is that they almost instantly consume a
>> big part of the total value space (e.g. the first value is '01...' and
>> the second is coming as 'f3...'), which I would assume is not very
>> efficient with btrees (space reclaim? - bloat).
>>
>
> I don't understand what you mean here. Perhaps you misunderstand how
> btree indexes grow? It's not like we allocate separate pages for
> different values/prefixes - we insert the data until a page gets full,
> then it's split in half. There is some dependency on the order in which
> the values are inserted, but AFAIK random order is generally fine.
OK, I might not understand the basics of the btree implementation. Sorry
for that.

However, one of our issues with standard v1 UUIDs was bloat of the
indexes, although we kept only a few months of data in them. I think this
was due to the pages still containing at least one value and therefore
not being reclaimed by vacuum. The indexes just continued to grow.

Now, as we have this different, ever-increasing prefix, we still see some
constant growth, but whenever historic data gets deleted (in a separate
process), space gets reclaimed.

>
>> One of our major concerns is to keep index size small (VACUUM can't be
>> run every minute) to fit into memory next to a lot of others.
>>
>
> I don't think this has much to do with vacuum - I don't see how it's
> related to the ordering of generated UUID values. And I don't see where
> the "can't run vacuum every minute" comes from.

OK, here are the numbers (after VACUUM) which I really found strange,
using the query from pgexperts [1]:

      index_name     | bloat_pct | bloat_mb | index_mb | table_mb
---------------------+-----------+----------+----------+----------
 uuid_v1_pkey        |        38 |      363 |  965.742 |  950.172
 uuid_serial_pkey    |        11 |       74 |  676.844 |  950.172
 uuid_serial_8_pkey  |        46 |      519 | 1122.031 |  950.172
 uuid_serial_16_pkey |        39 |      389 |  991.195 |  950.172

...where the "8" and "16" are a "shift" of the timestamp value,
implemented with:

timestamp = (timestamp >>> (60 - shift)) | (timestamp << (shift + 4))

(A small, self-contained Java sketch of this is included at the end of
this mail.)

If someone could shed some light on why that is (the huge difference in
index sizes), I'd be happy.

>
>> I've experimented with the rollover "prefix" myself but found that it
>> makes the index too big (same or larger index size than standard v1
>> UUIDs) and VACUUM too slow (almost as slow as with a standard v1 UUID),
>> although INSERT performance wasn't that bad; our sequential UUIDs were
>> way faster (at least pre-generated and imported with COPY to eliminate
>> any value generation impact).
>>
>
> I very much doubt that has anything to do with the prefix. You'll need
> to share more details about how you did your tests.

OK, I'll see if I can prepare something and publish it.

>
>
> regards
>

Refs:
[1] https://github.com/pgexperts/pgx_scripts/blob/master/bloat/index_bloat_check.sql
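
P.S.: To make the "shift" above a bit more concrete, here is a minimal,
self-contained Java sketch of it. This is not our actual generator code,
just an illustration under some assumptions: the 60-bit v1 timestamp is
held in the low bits of a long (as returned by java.util.UUID#timestamp()
for a version-1 UUID), the helper name shiftTimestamp() is made up for
this example, and the UUID in main() is an arbitrary sample value.

import java.util.UUID;

public class TimestampShift {

    // Moves the top `shift` bits of the 60-bit timestamp down to the
    // low end and shifts the remaining lower bits up by (shift + 4).
    static long shiftTimestamp(long timestamp, int shift) {
        return (timestamp >>> (60 - shift)) | (timestamp << (shift + 4));
    }

    public static void main(String[] args) {
        // Purely illustrative version-1 UUID (version nibble = 1).
        UUID u = UUID.fromString("99caf6ec-7fca-11e9-8f9e-2a86e4085a59");
        long ts = u.timestamp(); // 60-bit timestamp, high 4 bits are zero

        System.out.printf("original: %015x%n", ts);
        System.out.printf("shift  8: %016x%n", shiftTimestamp(ts, 8));
        System.out.printf("shift 16: %016x%n", shiftTimestamp(ts, 16));
    }
}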