Re: Questions about btree_gin vs btree_gist for low cardinality columns

Morris de Oryx <morrisdeoryx@xxxxxxxxx> · Sat, 1 Jun 2019 12:52:29 +1000

Jeremy's question is great, and really well presented. I can't answer his questions, but I am keenly interested in this subject as well. The links he provides lead to some really interesting and well-though-out pieces, well worth reading.

I'm going to try restating things in my own way in hopes of getting some good feedback and a basic question:

What are the best ways to index low cardinality values in Postgres?

For an example, imagine an address table with 100M US street addresses with two character state abbreviations. So, say there are around 60 values in there (the USPS is the mail system for a variety of US territories, possessions and friends in the Pacific.) Okay, so what's the best index type for state abbreviation? For the sake of argument, assume a normal distribution so something like FM (Federated States of Micronesia) is on a tail end and CA or NY are a whole lot more common.

A B-tree is obviously a pretty bad match for this sort of situation. It works, but B-trees are ideal for unique values, and super large for repeated values. Not getting into the details or Postgres specifics of various kinds of traditional B-trees. (I think B*?) Doesn't matter. You have a huge index because the index size is closely related to the number of rows, not the number of distinct values.

Alternatively, you could set up partial indexes for the distinct values, like so:

Running 10 Million PostgreSQL Indexes In Production (And Counting)
https://heap.io/blog/engineering/running-10-million-postgresql-indexes-in-production

Like Jeremy, I've wondered about GIN indexes for low-cardinality columns. Has anyone tried this out in PG 10 or 11? It sounds like a good idea. As I understand it, GIN indexes are something like a B-tree of unique values that link to another data structure, like a tree, bitmap, etc. So, in my imaginary example, there are 60 nodes for the state codes [internally there would be more for padding free nodes, evenly sized pages, etc....just pretend there are 60] and then linked to that, 60 data structures with the actual row references. Somehow.

It can imagine things going quite well with a GIN or btree_gin. I can also imagine that the secondary data structure could get bloaty, slow, and horrible. I've worked with a system which uses bitmaps as the secondary structure (at least some of the time), and it can work quite well...not sure how it's implemented in Postgres.

So, does anyone have any links or results about efficiently (space and/or time) indexing Boolean and other low-cardinality columns in Postgres? I'm on PG 11, but figure many are on 9.x or 10.x. 

References and results much appreciated. 

Thanks!