On Tue, 18 Aug 2020 at 18:44, Peter Geoghegan <pg@xxxxxxx> wrote: > > On Mon, Aug 17, 2020 at 11:44 PM Matthias van de Meent > <matthias.vandemeent@xxxxxxxxx> wrote: > > But, if the ordering of operator-class equal tuples is already > > system-defined, could the physical ordering of index tuples in a btree > > (with deduplication enabled for "unsafe" opclasses) be updated from > > [index_columns, tid] to [index_columns, > > image_compare(non_datum_equal_columns), tid], giving a stable sorting > > of opclass-equal and image-equal values and enabling safe consistent > > deduplication? > > The issue isn't the physical ordering. The issue is that we cannot > allow the implementation to destroy semantic differences among equal > datums. Deduplication does not need to destroy semantic differences? 'equal' can (in my book) mean: - 'opclass-equal', that is the opclass returns true for an equality check - 'binary equal' or 'datum-equal' (? maybe incorrect term), that is the unprocessed on-disk representations (datum image is the right term I believe?) of the compared values are indistinguishable. Runs of 'binary equal' datums can be freely deduplicated [0] when found. > We avoid deduplication with cases where two equal datums are > visibly different. For example, it would not be okay if we forgot that > your numeric datum was originally input as '5.000', and output '5' > later on. I agree on that point. But, isn't this display scale also stored in the 'datum image' [1] and could therefore be distinguished against for two opclass-equal values whilst deduplicating, only putting binary equal values in a posting list? e.g. all values generated through '0.000'::numeric -values can be compressed into one posting tuple without information loss, as the datum image is effectively the same for all of these (unless I have missed something), and distinct from the datum image of '0'::numeric. With enough equal values, you would eventually have a posting lists for each distinct datum image of equal values. Given that the above could work, the current btree tuple ordering is not optimized for opclass-equal but datum image-distinct values: ordering of opclass-equal values is currently determined only by tid, with as an example current ordering ['0.0', '0', '0.00', '0', '0.0', '0']. It would be more optimized for deduplication if that was stored as e.g. ['0', '0', '0', '0.0', '0.0', '0.00'], which is why I suggested to add an ordering by the datum image before the tid ordering. Additionally, this extra ordering also prevents the problem of [0] by never attempting an insertion of non-equal image datums in a posting list of otherwise equal values, as it would be ordered either before or after the posting list, never inside the list. > If we wanted to fix this for numeric, we'd have to invent a new > numeric datatype (called numeric2, say). That probably isn't as hard > as it sounds, since it could be part of the same B-Tree operator > family as numeric. It could also be implicitly cast to numeric. > > -- > Peter Geoghegan [0] Inserting a row in a deduplicated index with in, with TID ntid, can encounter a posting list of a opclass-equal but not datum image-equal tuples where the lowest TID of the posting list is less than ntid, and ntid is less than the highest TID of the posting list. This would require a posting list split to accomodate the new tuples' index entry in order to not lose data. [1] # a table with 1000 distinct values, each duplicated 700 times, but split into 7x100 'datum-equal' value groups that should be deduplicatable. SELECT (n % 1000 || '.' || repeat('0', n%7))::numeric AS num INTO numbers FROM generate_series(1, 700000) series(n); CREATE INDEX numeric_index ON numbers (num); ANALYZE numbers; SELECT num FROM numbers ORDER BY num LIMIT 700; -- this is an index-only scan, and results show numeric scale, so the information must be stored in the index.