Re: Statistics and Multi-Column indexes

"Kevin Grittner" <Kevin.Grittner@xxxxxxxxxxxx> · Mon, 11 Jul 2011 15:32:05 -0500

lars <lhofhansl@xxxxxxxxx> wrote:

> We are maintaining a large multi tenant database where *all*
> tables have a tenant-id and all indexes and PKs lead with the
> tenant-id. Statistics and counts for the all other columns are
> only really meaningful within the context of the tenant they
> belong to.
> 
> There appear to be five options for me:
> 1. Using single column indexes on all interesting columns and rely
> on PostgreSQLs bitmap indexes to combine them (which are pretty
> cool).

Those are cool -- when programmers are skeptical that they should
just say *what* they want and let the database figure out how to get
it, I like to point out that this option is available to the
planner, but not to application programming.  Of course, there are a
great many other reason *I* find more compelling, but this one tends
to impress application programmers.

Assuming you keep the primary key as a multi-column index, this
seems like a good place to start.

> 2. Use multi column indexes and accept that sometimes Postgres
> pick the wrong index (because a non-tenant-id column might seem
> highly selective over the table, but it is not for a particular
> tenant - or vice versa).

If you have a lot of queries which access data based on matching
some set of columns, an occasional multicolumn index in addition to
the individual column index may be worth it.

You might want to avoid prepared statements, since these are planned
for the general case and can fall down badly for the extremes.

> 3. Use a functional index that combines multiple columns and only
> query via these, that causes statistics gathering for the
> expression. I.e. create index i on t((tenantid||column1)) and
> SELECT ... FROM t WHERE tenantid||column1 = '...'

I would hold off on that until I saw evidence of a specific need.

> 4. Play with n_distinct and/or set the statistics for the inner
> columns to some fixed values that lead to the plans that we want.

Try not to think in terms of "plans we want", but in terms of
modeling your costs so that, given your tables and indexes, the
PostgreSQL planner will do a good job of picking a fast plan.  You
normally need to tweak a few of the costing factors to match the
reality of your server and load.

> 5. Have a completely different schema and maybe a database per
> tenant.

> Are there any other options?

If most queries operate within a single tenant and you have less
than 100 tenants, you might think about partitioned tables.  Beyond
100, or if most queries need to look at many partitions, it becomes
problematic.

> I don't think #2 is actually an option. We have some tenants with
> many (sometimes 100s) millions of rows per table, and picking the
> wrong index would be disastrous.

You can always drop counter-productive or unused indexes.  I think
it's best to look at indexes, as much as possible, as database
tuning options, rather than something with any semantic meaning.  If
you're doing things right, an index will never change the result of
any query, so you are free to try different tunings and see what's
fastest with your schema and your production load.

We have tables with 100s of millions of rows where everything is
indexed by county number.  The biggest county has about 20% of the
rows; the smallest about 0.05%.  We generally do pretty well with
(1) and (2).  We do occasionally find it useful to create an index
with a WHERE clause, though.  You might want to consider those.

-Kevin

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance