On 4/29/15 6:04 PM, Dave Owens wrote:
> A. Text-processing a dump of the original database, filtering only the rows in which I am interested, while loading the dump into the new database.
>
> B. Copying the original database (CREATE DATABASE smalldb WITH TEMPLATE bigdb), then modifying the schema so that each FK constraint now includes ON DELETE CASCADE, then deleting rows from the few core tables while letting PostgreSQL cascade the deletes down the hierarchy. Then, of course, restoring the schema to its original state.
This would be quite slow, but will work.
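To make B concrete, the mechanics would be roughly the following (table, constraint and column names are made up, and you'd repeat the ALTERs for every FK in the hierarchy):

  -- Recreate an existing FK so that deletes cascade (hypothetical names)
  ALTER TABLE orders DROP CONSTRAINT orders_customer_id_fkey;
  ALTER TABLE orders ADD CONSTRAINT orders_customer_id_fkey
      FOREIGN KEY (customer_id) REFERENCES customers (id) ON DELETE CASCADE;

  -- Prune the core table; the cascades take care of the child tables
  DELETE FROM customers
   WHERE id NOT IN (SELECT customer_id FROM customers_to_keep);

The definitions in pg_constraint (via pg_get_constraintdef()) make it fairly easy to generate those ALTERs instead of writing them by hand.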
> C. Generating the myriad SELECTs necessary to export only the required data from every table, then importing those results (either using dblink or COPY FROMs) into the new database that has been populated with schema from the original database. Carefully ordering the imports to avoid missing FKs.
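Each of those per-table exports/imports would look something like this (hypothetical names; use \copy from psql if you don't want server-side files):

  -- On the original database: export only the rows you want
  COPY (SELECT * FROM orders
         WHERE customer_id IN (SELECT id FROM customers_to_keep))
      TO '/tmp/orders.copy';

  -- On the new database, after loading the schema: parents before children
  COPY orders FROM '/tmp/orders.copy';

It's having the FKs already in place that forces the careful ordering; creating them after the loads sidesteps that entirely.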
D. Combine A and B. In a SQL dump the FKs are added AFTER the data is loaded, so once the COPY commands are done you could run the relevant DELETEs. Brute force, but it'd do the job and be faster than cascading deletes.
E. A spin on C would be to put triggers on the tables to silently drop data that you don't want.
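As a sketch (plpgsql, with a made-up filter condition and names): returning NULL from a BEFORE trigger silently discards the row, and COPY does fire row-level triggers, so this filters a plain dump reload.

  CREATE FUNCTION drop_unwanted() RETURNS trigger LANGUAGE plpgsql AS $$
  BEGIN
      IF NEW.cluster_id <> 1 THEN   -- stand-in for whatever "data you don't want" means
          RETURN NULL;              -- NULL from a BEFORE trigger skips the row
      END IF;
      RETURN NEW;
  END;
  $$;

  CREATE TRIGGER filter_orders BEFORE INSERT ON orders
      FOR EACH ROW EXECUTE PROCEDURE drop_unwanted();

Drop the triggers again once the load is finished.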
All that said, my guess is you're doing this to support horizontal scale-out, which means you'll probably need to do this more than once, and it'd presumably be nice for you and your customers if this didn't require downtime. I would look at having a way to create a partial replica using londiste (or BDR, if it supports this). The trick there is having a way to identify whether you want each individual row on a replica. If you add some kind of cluster_id field to every table, that filtering becomes pretty easy; IIRC londiste supports that out of the box.
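The cluster_id idea is literally just this, repeated across the schema (column name is whatever you like):

  ALTER TABLE customers ADD COLUMN cluster_id int;
  ALTER TABLE orders    ADD COLUMN cluster_id int;
  -- backfill cluster_id, then the replication filter only has to compare it
  -- against the cluster a given replica should hold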
The other option is having functions that let you determine whether you want a row. I know you could get londiste to do this, but you'd have to implement some of the filtering in Python (Skype actually does, or at least did, this).
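Such a filter function could be as small as this (names and logic are made up; the point is just that the decision lives in one place the replication tooling can call):

  CREATE FUNCTION row_wanted(p_cluster_id int, p_replica_cluster int)
  RETURNS boolean LANGUAGE sql IMMUTABLE AS $$
      SELECT p_cluster_id = p_replica_cluster;
  $$;

With londiste you'd still need to hook that into its Python-side filtering, as mentioned above.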
One more thing to consider: you can scale PG to some serious database sizes if you're willing to spend money on hardware. Given what kind of a server you can get for $20k nowadays, I'd seriously investigate that if you haven't already.
BTW, there's also https://github.com/citusdata/pg_shard. I've never really looked at it, but it might make all of this a lot easier.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com