Re: One database vs. hundreds?

Shane Ambler <pgsql@xxxxxxxxxx> · Wed, 29 Aug 2007 01:14:14 +0930

Kynn Jones wrote:
I'm hoping to get some advice on a design question I'm grappling with.
 I have a database now that in many respects may be regarded as an
collection of a few hundred much smaller "parallel databases", all
having the same schema.  What I mean by this is that, as far as the
intended use of this particular system there are no meaningful queries
whose results would include information from more than one of these
parallel component databases.  Furthermore, one could delete all the
records of any one of these parallel components without affecting the
referential integrity of the rest of the database.

Therefore, both for performance and maintenance reasons, the idea of
splitting this database into its components looks very attractive.
This would result in a system with hundreds of small databases (and in
the future possibly reaching into the low thousands).  I don't have
experience with such a situation, and I'm wondering if there are
issues I should be concerned about.

Alternatively, maybe there are techniques to achieve the benefits of
this split without actually carrying it out.  The two benefits I see
are in the areas of performance and maintenance.  As for performance,
I assume (naively, I'm sure) that searches will be faster in the
individual component databases, simply because it's a search among
fewer pieces of information.  And for maintenance, I think the split
would make the system more robust during database updates, because
only a small component would be updated at a time, and the rest of the
system would completely insulated from this.

I'd very much appreciate your thoughts on these issues.

I imagine I'm not the first person to confront this kind of design
choice.  Does it have a standard name that I could use in a Google
search?

I see four ways you can go with this and the choice will be the one that 
makes more sense to the way you see yourself accessing the data with 
only one of which affecting the performance the way you mention.

1. is to have each set of tables and data in a separate database.
2. is to have each set of tables duplicated in a different schema name.
3. is to have it all in one set of tables with a column to identify 
which set the row belongs to.
4. is to use the above with ddl-partitioning to achieve data separation 
removing the performance costs of having it all in one table.

The first leaves you with complete data separation, you log in 
differently to access each data set.
If you using psql then you would use a connect command to look at 
different data. Other clients may mean you need to disconnect then connect.

The second allows you to to use the same log in command and you can 
either use schema qualifications with the table names to access the data 
set you want, or change the default schema to change between data sets.
The difference will depend on your client. You can use schema qualified 
names with your client inserting the selected schema when generating the 
sql. Or you may use SET search_path TO myschema; then SELECT without 
schema qualified names, leaving your sql as it is.

The last two are similar with the forth using the ddl setup to separate 
the data into separate tables for you, so you will still have the same 
(or very close) performance as if each is separated by schema or db.
Here you would have to include datasetidcolumn=x within each sql 
statement. Again this may be inserted by your client as it generates the 
sql.

Whether the tables are separated by dbname, schema name or 
ddl-partitioning each will be a separate table using their own indexes etc.

As far as backup/restore goes, you won't find much difference with 
either option - 3 will mean they will be all together in one backup 
file, the others mean adding a line to your backup script to dump the 
data set to it's own file. If you just want to dumpall to one file then 
it won't matter either way.

Apart from the first the other options allow you to share data between 
data sets - a postcode list, state names, country names...

The choice may also be affected by your knowledge of sql. The 
ddl-partitioning may make your table structure more complex and harder 
for you to think with (mainly when defining a new data set). Yet adding 
a column will mean adding it in one place (the parent table) not 
repeating it for each db or schema.

As you mention increasing to a few thousand db's - what sort of server 
load are you expecting? Would moving a few hundred of the db's to 
another server be easier than setting up load balancing/replication when 
things get too busy for your current server?

Of course if you ever want to report on stats comparing all db's then 
the last two will be the easiest.
You may change your mind there later - even if each data set is a 
clients financial records - what if you want to ask which clients spend 
too much money on office supplies? which clients will be affected by the 
new tax changes? or which clients have the largest profit margins? list 
top ten clients with the most turnover.
If they all relate to different websites then which is the least 
productive and not worth continuing?

I guess I rattled on a bit there - but that should cover most of the 
points you will need to consider to make the choice. Even though you say 
you don't need the data all together think forward to what may happen in 
the future, and how much work it will be if you do change your mind.

--

Shane Ambler
pgSQL@xxxxxxxxxx

Get Sheeky @ http://Sheeky.Biz

---------------------------(end of broadcast)---------------------------
TIP 1: if posting/reading through Usenet, please send an appropriate
      subscribe-nomail command to majordomo@xxxxxxxxxxxxxx so that your
      message can get through to the mailing list cleanly