Howdy all,
We maintain a hosted multi-tenant system for a large number of users. Each user has what we call an "organization" or "org" which is their sandbox. In that organization they can manage contacts and define custom fields on those contacts, which then have values per contact.
We have thousands of organizations, some organizations have hundreds of custom fields. Some organizations have millions of contacts. We have tens of millions of values across organizations, so relatively sparse with a large number of tiny / unused organizations. All these numbers will continue to grow, though single organizations greater than 10 million contacts is unlikely.
What is the recommended way of modeling this in order to allow performant queries on the custom fields?
Right now we basically use an EAV model (a single contact_fields table) with compound indexes that join the field key and field value so as to force locality in the index. That has worked ok, but occasionally falls on its face when the query planner makes a wrong guess due to the (obviously skewed) statistics it has to work with. Multi-field queries can also get painful, especially on the largest organizations with millions of contacts.
What other approaches should we be looking at? We've brainstormed different approaches but would love some wisdom to help us narrow down what are reasonable things to try testing out. Most of our designs hone in on creating a table per organizations to hold field values and dynamically creating indexes on that. The idea being that we won't kill our insert performance as much by only having one index per field to check on inserts and table statistics should still be ok. (how are statistics managed for JSONB fields?) The main question we have is what is going to happen if we have thousands (or tens of thousands) of tables on a single database? The good news is the vast majority of our organizations are idle at any point in time.
Approaches we've thought about and questions / pros / cons:
1) Add a JSONB field on our (shared across organizations) contact table, store field values there. Create JSONB indexes per unique field.
pros: nice having the locality of data on contact, multi field queries are likely way better, query performance should be good
cons: we have to create thousands of indexes? we have to use uuids as keys to keep our indexes org-specific? insert performance suffers from having thousands of partial indexes (how badly)?
2) Create a table per organization `contact_fields_[org_id]` containing a column per field. Create columns and indexes per unique field.
pros: locality is nice again, multi field queries are better, query and insert performance should be good.
cons: thousands of tables with up to 100 indexes per, is that going to blow up?
3) Create a table per organization `contacts_fields_[org_id]` which contains a `fields` JSONB column, Create JSONB indexes per unique field.
pros: locality is good, multi field queries good, query performance should be good. Adding and removing fields is a bit simpler than 2) case above and naming can be a bit clearer than 1) as we don't have to worry about multi-org key name collisions
cons: same as 2) but with a JSONB flavor, hundreds of JSONB indexes on thousands of tables, thousands of tables
4) Create a database per organization? Use 1) above
pros: all data is localized, might see performance improvements elsewhere, query and insert performance should be good
cons: mother of a refactor :) what happens with thousands of databases on a single box? is this actually better from the perspective of getting to offload currently inactive orgs?
What other approaches should we be considering? I know EAV is a pain, especially in multi-tenant situations, but we'd love to hear success (and failure) stories from the community on how they've dealt with these.
Cheers,
-Nic