Hi,

I've finally found some time to redo some tests using PostgreSQL 11.3.

The scenario is the following:

1.) add 10 M rows
2.) add another 30 M rows
3.) delete the first 10 M rows
4.) VACUUM
5.) REINDEX

My goal is to find the most efficient way for UUID values to somewhat
optimize for index page/node reclaim (or the least amount of "bloat")
without the need to execute a REINDEX (although we're currently doing
that on partitions where we can be sure that no rows will be
added/changed).

I have tested the following:

Table "uuid_v1":           standard V1 UUID
Table "uuid_v1_timestamp": standard V1 UUID with an additional index
                           using a timestamp extract function
Table "uuid_seq":          sequential UUID from Tomas Vondra
Table "uuid_serial":       "serial" UUID from my Java implementation

For the test, I created a little PostgreSQL extension with support
functions for standard version 1 UUIDs (mainly to extract the
timestamp): https://github.com/ancoron/pg-uuid-ext

My C skills really suck, so if someone could review that and tell me
if there are more efficient ways of doing certain things, I'd be happy!

And for generating the UUID values I've created another Java project:
https://github.com/ancoron/java-uuid-serial

Now some test results, which are somewhat in line with what I've
already seen in previous tests for PG 10, but with less dramatic
performance differences here:

1.) add 10 M rows

Table: uuid_v1            Time:  33183.639 ms (00:33.184)
Table: uuid_v1_timestamp  Time:  49869.763 ms (00:49.870)
Table: uuid_seq           Time:  35794.906 ms (00:35.795)
Table: uuid_serial        Time:  22835.071 ms (00:22.835)

As expected, the table with the additional function index is the
slowest, but surprisingly the sequential UUID was not faster than the
standard V1 UUID here. Still, the serial version is the fastest.

Primary key indexes after an ANALYZE:

    table_name     |     bloat      | index_mb | table_mb
-------------------+----------------+----------+----------
 uuid_v1           | 128 MiB (32 %) |  395.570 |  422.305
 uuid_v1_timestamp | 128 MiB (32 %) |  395.570 |  422.305
 uuid_seq          | 123 MiB (31 %) |  390.430 |  422.305
 uuid_serial       | 108 MiB (29 %) |  376.023 |  422.305

2.) add another 30 M rows

Table: uuid_v1            Time: 136109.046 ms (02:16.109)
Table: uuid_v1_timestamp  Time: 193172.454 ms (03:13.172)
Table: uuid_seq           Time: 124713.530 ms (02:04.714)
Table: uuid_serial        Time:  78362.209 ms (01:18.362)

Now the performance difference gets much more dramatic.

Primary key indexes after an ANALYZE:

    table_name     |     bloat      | index_mb | table_mb
-------------------+----------------+----------+----------
 uuid_v1           | 500 MiB (32 %) | 1571.039 | 1689.195
 uuid_v1_timestamp | 500 MiB (32 %) | 1571.039 | 1689.195
 uuid_seq          | 492 MiB (31 %) | 1562.766 | 1689.195
 uuid_serial       | 433 MiB (29 %) | 1504.047 | 1689.195

Still no noticeable difference between them, but I wonder why it looks
like a fill factor of 70 instead of the default 90.
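To double-check the actual leaf page fill, my next step will probably
be pgstatindex() from the pgstattuple extension; its avg_leaf_density
should show whether the leaf pages really end up at ~70 %. Something
like this (assuming the default primary key index name, e.g.
"uuid_v1_pkey"):

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- index name assumed to be the default "uuid_v1_pkey"
-- avg_leaf_density: average fill of leaf pages (percent)
-- leaf_fragmentation: leaf page fragmentation (percent)
SELECT avg_leaf_density, leaf_fragmentation
  FROM pgstatindex('uuid_v1_pkey');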
3.) delete the first 10 M rows

Implementations differ for the tables due to different capabilities:

DELETE FROM uuid_v1 WHERE id IN (select id from uuid_v1 limit 10000000);
DELETE FROM uuid_v1_timestamp WHERE uuid_v1_timestamp(id) < '2019-03-09T07:58:02.056';
DELETE FROM uuid_seq WHERE id IN (select id from uuid_seq limit 10000000);
DELETE FROM uuid_serial WHERE id < '1e942411-004c-1d50-0000-000000000000';

Table: uuid_v1            Time:  38308.299 ms (00:38.308)
Table: uuid_v1_timestamp  Time:  11589.941 ms (00:11.590)
Table: uuid_seq           Time:  37171.331 ms (00:37.171)
Table: uuid_serial        Time:  11694.893 ms (00:11.695)

As expected, using the timestamp index directly, or being able to
compare the UUIDs in a time-wise manner, produces the best results.

4.) VACUUM

Table: uuid_v1            Time:  69740.952 ms (01:09.741)
Table: uuid_v1_timestamp  Time:  67347.469 ms (01:07.347)
Table: uuid_seq           Time:  25832.355 ms (00:25.832)
Table: uuid_serial        Time:  12339.531 ms (00:12.340)

This is pretty important to us as it consumes system resources, and we
have quite a lot of big tables. So my serial implementation seems to
beat even the sequential one, which was pretty surprising to see.

Primary key indexes after an ANALYZE:

    table_name     |     bloat      | index_mb | table_mb
-------------------+----------------+----------+----------
 uuid_v1           | 767 MiB (49 %) | 1571.039 | 1689.195
 uuid_v1_timestamp | 768 MiB (49 %) | 1571.039 | 1689.195
 uuid_seq          | 759 MiB (49 %) | 1562.766 | 1689.195
 uuid_serial       | 700 MiB (47 %) | 1504.047 | 1689.195

OK, sadly no reclaim in any of them.

5.) REINDEX

Table: uuid_v1            Time:  21549.860 ms (00:21.550)
Table: uuid_v1_timestamp  Time:  27367.817 ms (00:27.368)
Table: uuid_seq           Time:  19142.711 ms (00:19.143)
Table: uuid_serial        Time:  16889.807 ms (00:16.890)

Even in this case it looks as if my implementation is faster than
anything else - which I really don't get.

Overall, as write performance is our major concern at the moment
(without sacrificing read performance), we'll continue using our
"serial" implementation for the time being.

Nevertheless, I am still puzzled by the index organization and how to
keep it within a certain size in cases where you have a rolling window
of value space, to avoid unnecessary disk I/O (even if it is SSD).

So I wonder if there is anything that can be done on the index side?

I might implement a different opclass for the standard UUID to enable
time-wise index sort order. That order will naturally be very close to
the physical order, but I doubt that this is something I can tell
PostgreSQL about, or?
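At least the "how close is the index order to the physical order" part
seems to be something PostgreSQL already tracks per column: the
"correlation" statistic, which the planner uses for index scan costing.
A quick way to look at it after an ANALYZE (assuming the PK column is
named "id", as in my tests):

-- PK column assumed to be named "id" for all four test tables
-- correlation ranges from -1 to 1; 1 means the heap (physical) order
-- matches the column's sort order almost perfectly
SELECT tablename, attname, correlation
  FROM pg_stats
 WHERE tablename IN ('uuid_v1', 'uuid_v1_timestamp', 'uuid_seq', 'uuid_serial')
   AND attname = 'id';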
Cheers,

	Ancoron


On 26/05/2019 11:01, Ancoron Luciferis wrote:
> On 26/05/2019 03:09, Tomas Vondra wrote:
>> On Sun, May 26, 2019 at 01:49:30AM +0200, Ancoron Luciferis wrote:
>>> On 26/05/2019 00:14, Tomas Vondra wrote:
>>>> On Sat, May 25, 2019 at 05:54:15PM -0400, Tom Lane wrote:
>>>>> Ancoron Luciferis <ancoron.luciferis@xxxxxxxxxxxxxx> writes:
>>>>>> On 25/05/2019 16:57, Tom Lane wrote:
>>>>>>> (4) it in fact *wouldn't* do anything useful, because we'd still
>>>>>>> have to sort UUIDs in the same order as today, meaning that btree
>>>>>>> index behavior would remain the same as before. Plus UUID
>>>>>>> comparison would get a lot more complicated and slower than it is
>>>>>>> now.
>>>>>
>>>>>> I get your first sentence, but not your second. I know that when
>>>>>> changing the internal byte order we'd have to completely re-compute
>>>>>> everything on-disk (from table to index data), but why would the
>>>>>> sorting in the index have to be the same?
>>>>>
>>>>> Because we aren't going to change the existing sort order of UUIDs.
>>>>> We have no idea what applications might be dependent on that.
>>>>>
>>>>> As Vitalii correctly pointed out, your beef is not with the physical
>>>>> storage of UUIDs anyway: you just wish they'd sort differently, since
>>>>> that is what determines the behavior of a btree index. But we aren't
>>>>> going to change the sort ordering because that's an even bigger
>>>>> compatibility break than changing the physical storage; it'd affect
>>>>> application-visible semantics.
>>>>>
>>>>> What you might want to think about is creating a function that maps
>>>>> UUIDs into an ordering that makes sense to you, and then creating
>>>>> a unique index over that function instead of the raw UUIDs. That
>>>>> would give the results you want without having to negotiate with the
>>>>> rest of the world about whether it's okay to change the semantics
>>>>> of type uuid.
>>>>>
>>>>
>>>> FWIW that's essentially what I implemented as an extension some time
>>>> ago. See [1] for a more detailed explanation and some benchmarks.
>>>
>>> Yes, I've seen that before. Pretty nice work you put together there
>>> and I'll surely have a look at it, but we certainly need the node id
>>> in compliance with v1 UUIDs, so that's why we've been generating
>>> UUIDs at the application side from day 1.
>>>
>>>>
>>>> The thing is - it's not really desirable to get perfectly ordered
>>>> ordering, because that would mean we never get back to older parts of
>>>> the index (so if you delete data, we'd never fill that space).
>>>
>>> Wouldn't this apply also to any sequential-looking index (e.g. on
>>> serial)?
>>
>> Yes, it does apply to any index on sequential (ordered) data. If you
>> delete data from the "old" part (but not all, so the pages don't get
>> completely empty), that space is lost. It's available for new data, but
>> if we only insert to the "new" part of the index, that's useless.
>
> OK, thanks for clearing that up for me.
>
>>
>>> The main issue with the UUIDs is that they almost instantly consume a
>>> big part of the total value space (e.g. the first value is '01...' and
>>> the second comes in as 'f3...'), which I would assume is not very
>>> efficient with btrees (space reclaim? - bloat).
>>>
>>
>> I don't understand what you mean here. Perhaps you misunderstand how
>> btree indexes grow? It's not like we allocate separate pages for
>> different values/prefixes - we insert the data until a page gets full,
>> then it's split in half. There is some dependency on the order in which
>> the values are inserted, but AFAIK random order is generally fine.
>
> OK, I might not understand the basics of the btree implementation.
> Sorry for that.
>
> However, one of our issues with standard v1 UUIDs was bloat of the
> indexes, although we kept only a few months of data in them. I think
> this was due to the pages still containing at least one value and thus
> not being reclaimed by vacuum. It just continued to grow.
>
> Now, as we have this different, ever-increasing prefix, we still have
> some constant growth, but we see that whenever historic data gets
> deleted (in a separate process), space gets reclaimed.
>
>
>>
>>> One of our major concerns is to keep index size small (VACUUM can't be
>>> run every minute) to fit into memory next to a lot of others.
>>>
>>
>> I don't think this has much to do with vacuum - I don't see how it's
>> related to the ordering of generated UUID values. And I don't see where
>> the "can't run vacuum every minute" comes from.
>
> OK, here are numbers (after VACUUM) which I really found strange, using
> the query from pgexperts [1]:
>
>      index_name      | bloat_pct | bloat_mb | index_mb | table_mb
> ---------------------+-----------+----------+----------+----------
>  uuid_v1_pkey        |        38 |      363 |  965.742 |  950.172
>  uuid_serial_pkey    |        11 |       74 |  676.844 |  950.172
>  uuid_serial_8_pkey  |        46 |      519 | 1122.031 |  950.172
>  uuid_serial_16_pkey |        39 |      389 |  991.195 |  950.172
>
> ...where the "8" and "16" are a "shift" of the timestamp value,
> implemented with:
>
> timestamp = (timestamp >>> (60 - shift)) | (timestamp << (shift + 4))
>
> If someone could shed some light on why that is (the huge difference in
> index sizes), I'd be happy.
>
>>
>>> I've experimented with the rollover "prefix" myself but found that it
>>> makes the index too big (same or larger index size than standard v1
>>> UUIDs) and VACUUM too slow (almost as slow as with a standard V1
>>> UUID), although INSERT performance wasn't that bad; our sequential
>>> UUIDs were way faster (at least pre-generated and imported with COPY
>>> to eliminate any value generation impact).
>>>
>>
>> I very much doubt that has anything to do with the prefix. You'll need
>> to share more details about how you did your tests.
>
> OK, I'll see if I can prepare something and publish it.
>
>>
>>
>> regards
>>
>
>
> Refs:
> [1]
> https://github.com/pgexperts/pgx_scripts/blob/master/bloat/index_bloat_check.sql
>