Re: Questions about btree_gin vs btree_gist for low cardinality columns

Peter Geoghegan <pg@xxxxxxx> · Mon, 3 Jun 2019 16:17:22 -0700

On Sun, Jun 2, 2019 at 4:07 PM Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
> Anyway, I said to Jeremy in the hallway that it might not be that
> hard to bolt IOS support onto GIN for cases where the opclass is
> a non-subdividing one, but after looking at the code I'm less sure
> about that.  GIN hasn't even got an "amgettuple" code path, just
> "amgetbitmap", and a big part of the reason why is the need to merge
> results from the fastupdate pending list with results from the main
> index area.  Not sure how we could deal with that.

I suspect that GIN also avoids several other problems by only offering
an "amgetbitmap", and not an "amgettuple". For example, it doesn't have
to worry about things like numeric display scale, where a datum is
substantively different to another datum, while still being equal
according to opclass semantics (this is an example that I end up
citing in many discussions about indexing). I bet that there are a few
more problems like those two that I haven't thought about.
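
To make the display scale point concrete, here is a minimal sketch. It
uses Python's decimal module purely as an analogy for the numeric type
(it is not PostgreSQL code, and the variable names are made up for
illustration): two values compare as equal, yet they are not
interchangeable when handed back to the user.

```python
from decimal import Decimal

# Two values that differ only in display scale.
a = Decimal("1.0")
b = Decimal("1.00")

# Equal according to comparison semantics, which is all that an
# equality-based index lookup cares about.
equal_for_index = (a == b)

# Not identical as far as the user can tell, which matters as soon as
# the index has to return the original datum (an index-only scan).
same_display = (str(a) == str(b))

print(equal_for_index, same_display)
```

An index that keeps only one representative per group of equal values
cannot know which display form each heap tuple actually had.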

> Anyway, the larger point here is that right now btree_gin is just a quick
> hack, and it seems like it might be worth putting some more effort into
> it, because the addition of duplicate-compression changes the calculus
> for whether it's useful.

There was also discussion during the hallway track about making nbtree
support deduplication. Jim Finnerty reminded me that there is a patch
from Anastasia that did deduplication in nbtree, but it didn't go
anywhere. Heikki independently talked about the possibility that he
would work on this project in the next release, without being prompted
by me.

I think that the fact that nbtree sorts duplicate entries in heap TID
order these days makes that worth looking into again. We can use
something like GIN's varbyte encoding process to compress duplicates
effectively. A lot of the problems (the numeric display scale problem,
pg_upgrade) can be avoided by defining deduplication as something that
happens on a best-effort basis. This is not the case within GIN, where
it's impossible for the main entry tree to have duplicates without at
least storing them in a posting list.
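
As a rough illustration of why heap TID order matters for compression,
here is a sketch of delta plus varbyte encoding over a sorted list of
TIDs. It is only in the spirit of what GIN does; it is not GIN's actual
on-disk format. The 16-bit packing in tid_to_int() and all of the
function names are assumptions made up for this example.

```python
def tid_to_int(block, offset):
    # Hypothetical packing: block number in the high bits, offset in
    # the low 16 bits. Not the layout PostgreSQL actually uses.
    return (block << 16) | offset


def encode_varbyte(value):
    # 7 payload bits per byte; the high bit means "more bytes follow".
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def compress_tids(tids):
    # Sorted input keeps the deltas small, so most of them fit in one
    # or two bytes.
    out = bytearray()
    prev = 0
    for block, offset in sorted(tids):
        cur = tid_to_int(block, offset)
        out += encode_varbyte(cur - prev)
        prev = cur
    return bytes(out)


def decompress_tids(data):
    tids = []
    prev = 0
    value = 0
    shift = 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            tids.append((prev >> 16, prev & 0xFFFF))
            value = 0
            shift = 0
    return tids


tids = [(10, 1), (10, 2), (10, 5), (11, 1)]
packed = compress_tids(tids)
print(len(packed), decompress_tids(packed) == tids)
```

With the duplicates already stored in TID order, the encoder never has
to sort anything itself, and neighboring TIDs on the same heap block
cost about a byte each in this scheme.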

-- 
Peter Geoghegan