Re: [NOVICE] Partitioning

Kevin Hunter <hunteke@xxxxxxxxxxx> · Tue, 26 Dec 2006 17:29:58 -0500

On 26 Dec 2006 at  2:55p -0500, Tom Lane wrote:
Kevin Hunter <hunteke@xxxxxxxxxxx> writes:
A friend has asked me about creating a unique table for individual users 
that sign up for his site.  (In essence, each user who signs up would 
get a set of CREATE TABLE {users,friends,movies,eats}_<id> ( 
... ); statements executed, the idea being to reduce the number of rows 
that the DB needs to search or index/update in regards to a particular 
user id.)  This just seems ludicrous to me, because the database still 
needs to find those tables from its internal structure, not to mention 
that it just seems silly to me from a design perspective.  Something 
about being unable to optimize any queries because not only is the WHERE 
clause in flux, but so is the FROM clause.

Question: Could someone explain to me why this would be a bad idea, 
because I can't put into words why it is.
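
For contrast with the per-user scheme described above, here is a minimal
sketch of the conventional layout: one shared table with a user_id column
and an index on it.  It uses Python's built-in sqlite3 module purely so
that it runs anywhere; the table, column and index names are made up for
illustration, and SQLite is standing in for Postgres here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One shared table keyed by user_id, instead of one movies_<id> table per user.
# (Hypothetical schema; SQLite used only as a stand-in.)
conn.execute("CREATE TABLE movies (user_id INTEGER NOT NULL, title TEXT NOT NULL)")
conn.execute("CREATE INDEX movies_user_id_idx ON movies (user_id)")

# 100 users with 5 rows each.
rows = [(uid, f"movie-{uid}-{n}") for uid in range(1, 101) for n in range(5)]
conn.executemany("INSERT INTO movies (user_id, title) VALUES (?, ?)", rows)


def movies_for(user_id):
    """Return one user's titles. Only the WHERE parameter changes per user;
    the FROM clause, and so the query text, stays the same."""
    cur = conn.execute(
        "SELECT title FROM movies WHERE user_id = ? ORDER BY title",
        (user_id,),
    )
    return [title for (title,) in cur]
```

Because the statement text is identical for every user, it can be prepared
once and reused, which is not possible when the table name itself varies.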

I thought you did a fine job right there ;-).  In essence this would be 
replacing one level of indexing with two, which is unlikely to be a win. 
If you have exactly M rows in each of N tables then theoretically your 
lookup costs would be about O(log(N) + log(M)), which is nominally the 
same as O(log(M*N)) which is the cost to index into one big table --- so 
at best you break even, and that's ignoring the fact that index search 
has a nonzero startup cost that'll be paid twice in the first case. 
But the real problem is that if the N tables contain different numbers 
of rows then you have an unevenly filled search tree, which is a net 
loss.
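
The arithmetic above can be checked with a rough cost model.  This is
only a sketch under stated assumptions: cost is taken to be log2 of the
number of entries searched, constant factors and index startup cost are
ignored, and every row is assumed equally likely to be looked up.  The
function names are hypothetical.

```python
import math


def one_table_cost(sizes):
    """Modelled cost of indexing into one big table holding every row."""
    return math.log2(sum(sizes))


def per_table_cost(sizes):
    """Modelled cost of finding the right table, then the row inside it.

    The second term is averaged over rows, so large tables count for more.
    Every size must be at least 1.
    """
    total = sum(sizes)
    find_table = math.log2(len(sizes))
    find_row = sum((m / total) * math.log2(m) for m in sizes)
    return find_table + find_row


# N = 1024 tables of exactly M = 4096 rows each: the two costs coincide.
even = [4096] * 1024

# Same table count and same total row count, but almost all rows in one table.
uneven = [1] * 1023 + [4096 * 1024 - 1023]
```

Under this model the even case breaks even, while the uneven case makes
per_table_cost(uneven) larger than one_table_cost(uneven), which is the
"unevenly filled search tree" effect described above.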

Hurm.  If I remember my Algorithms/Data Structures course, that implies
that table lookup is implemented with a B-Tree . . . right?  Since at
SQL preparation time the tables in the query are known, why couldn't you
use a hash lookup?  In the above case, that would make it effectively
O(1 + log(M)) or O(log(M)).  Granted, it's /still/ a bad idea because of
the next paragraph . . .
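
As a toy model of the hash-then-search idea in the previous paragraph
(and not a description of how Postgres actually resolves table names),
the sketch below uses a Python dict for the O(1) expected-time step and
binary search over a sorted list for the O(log(M)) step, next to a single
sorted list standing in for one big index.  All names and sizes are
invented for illustration.

```python
import bisect

# Per-user "tables": hash lookup by user id, then a sorted list per user.
per_user = {
    uid: sorted(f"item-{n:03d}" for n in range(50)) for uid in range(1000)
}

# One "big table": a single sorted list of (user_id, item) keys.
combined = sorted(
    (uid, item) for uid, items in per_user.items() for item in items
)


def find_two_level(uid, item):
    """Hash lookup for the user's list, then binary search inside it."""
    items = per_user.get(uid)  # O(1) expected
    if items is None:
        return False
    i = bisect.bisect_left(items, item)  # O(log M)
    return i < len(items) and items[i] == item


def find_one_level(uid, item):
    """Single binary search over all (user_id, item) keys: O(log(M*N))."""
    key = (uid, item)
    i = bisect.bisect_left(combined, key)
    return i < len(combined) and combined[i] == key
```

Both functions answer the same membership question; the two-level version
only saves the log(N) term, which is small next to the other costs raised
in this thread.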

Most DBMSes aren't really designed to scale to many thousands of tables 
anyway.  In Postgres this would result in many thousands of files in 
the same database directory, which probably creates some filesystem 
lookup inefficiencies in addition to whatever might be inherent to 
Postgres.

So, still a bad idea, but I couldn't immediately think of why.  Thank you.

Partitioning is indeed something that is commonly done, but on a very 
coarse grain --- you might have a dozen or two active partitions, not 
thousands.  The point of partitioning is either to spread a huge table 
across multiple filesystems (and how many filesystems have you got?) 
or else to make predictable removals of segments of the data cheap (for 
instance, dropping the oldest month's worth of data once a month, in a 
table where you only keep the last year or so's worth of data on-line).
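
The "drop the oldest month" case can be pictured with a small in-memory
model.  This is a hedged sketch, not Postgres syntax: each partition is
just a Python list stored under a "YYYY-MM" key, and the helper names are
hypothetical.  The point it illustrates is that removing a whole
partition is one cheap operation that never touches the rows of the
other partitions.

```python
# Month key "YYYY-MM" -> list of rows belonging to that month.
partitions = {}


def insert(month, row):
    """Route a row to the partition for its month, creating it if needed."""
    partitions.setdefault(month, []).append(row)


def drop_oldest():
    """Discard the oldest partition wholesale and return its key.

    "YYYY-MM" strings sort chronologically, so min() finds the oldest month.
    """
    oldest = min(partitions)
    del partitions[oldest]
    return oldest


# A year of coarse-grained partitions: twelve months, 1000 rows each.
for month_number in range(1, 13):
    for n in range(1000):
        insert(f"2006-{month_number:02d}", {"id": n})
```

Calling drop_oldest() once a month keeps a rolling window of data without
scanning or deleting individual rows, which is the second use of
partitioning described above.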

Ah!  I was missing where to hang/put partitioning in my head.  Thank you 
again!

I can't see doing it on a per-user basis.

Perhaps not on a per-user basis, but he could certainly improve access
times by partitioning even coarsely.  I'll point him in that direction. 
Are there other, perhaps better ways to improve the access times?  (Now 
I'm curious just for my sake.)  The best that I keep reading is just to 
do as much in parallel as possible (i.e. multiple systems) and to use 
Postgres ( :) ).

Thanks,

Kevin