On 4/29/15 6:04 PM, Dave Owens wrote:
> A. Text-processing a dump of the original database, filtering only the rows in which I am interested, while loading the dump into the new database.
>
> B. Copying the original database (CREATE DATABASE smalldb WITH TEMPLATE bigdb), then modifying the schema so that each FK constraint now includes ON DELETE CASCADE, then deleting rows from the few core tables while letting PostgreSQL cascade the deletes down the hierarchy. Then, of course, restoring the schema to its original state.
This would be quite slow, but will work.
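To make B concrete, the mechanics would be roughly the following (table, constraint and column names are made up, and you'd repeat the ALTERs for every FK in the hierarchy):

  -- Recreate an existing FK so that deletes cascade (hypothetical names)
  ALTER TABLE orders DROP CONSTRAINT orders_customer_id_fkey;
  ALTER TABLE orders ADD CONSTRAINT orders_customer_id_fkey
      FOREIGN KEY (customer_id) REFERENCES customers (id) ON DELETE CASCADE;

  -- Prune the core table; the cascades take care of the child tables
  DELETE FROM customers
   WHERE id NOT IN (SELECT customer_id FROM customers_to_keep);

The definitions in pg_constraint (via pg_get_constraintdef()) make it fairly easy to generate those ALTERs instead of writing them by hand.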
> C. Generating the myriad SELECTs necessary to export only the required data from every table, then importing those results (either using dblink or COPY FROMs) into the new database that has been populated with schema from the original database. Carefully ordering the imports to avoid missing FKs.
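Each of those per-table exports/imports would look something like this (hypothetical names; use \copy from psql if you don't want server-side files):

  -- On the original database: export only the rows you want
  COPY (SELECT * FROM orders
         WHERE customer_id IN (SELECT id FROM customers_to_keep))
      TO '/tmp/orders.copy';

  -- On the new database, after loading the schema: parents before children
  COPY orders FROM '/tmp/orders.copy';

It's having the FKs already in place that forces the careful ordering; creating them after the loads sidesteps that entirely.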
D. Combine A and B. In a SQL dump the FKs are added AFTER the data is loaded, so once the COPY commands are done you could run the relevant DELETEs. Brute force, but it'd do the job and be faster than cascading deletes.
E. A spin on C would be to put triggers on the tables to silently drop data that you don't want.
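As a sketch (plpgsql, with a made-up filter condition and names): returning NULL from a BEFORE trigger silently discards the row, and COPY does fire row-level triggers, so this filters a plain dump reload.

  CREATE FUNCTION drop_unwanted() RETURNS trigger LANGUAGE plpgsql AS $$
  BEGIN
      IF NEW.cluster_id <> 1 THEN   -- stand-in for whatever "data you don't want" means
          RETURN NULL;              -- NULL from a BEFORE trigger skips the row
      END IF;
      RETURN NEW;
  END;
  $$;

  CREATE TRIGGER filter_orders BEFORE INSERT ON orders
      FOR EACH ROW EXECUTE PROCEDURE drop_unwanted();

Drop the triggers again once the load is finished.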
All that said, my guess is you're doing this to support horizontal scale-out, which means you'll probably need to do this more than once, and it'd presumably be nice for you and your customers if this didn't require downtime. I would look at having a way to create a partial replica using londiste (or BDR, if it supports this). The trick there is having a way to identify whether you want each individual row on a replica. If you add some kind of cluster_id field to every table, that filtering becomes pretty easy; IIRC londiste supports that out of the box.
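The cluster_id idea is literally just this, repeated across the schema (column name is whatever you like):

  ALTER TABLE customers ADD COLUMN cluster_id int;
  ALTER TABLE orders    ADD COLUMN cluster_id int;
  -- backfill cluster_id, then the replication filter only has to compare it
  -- against the cluster a given replica should hold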
The other option is having functions that let you determine whether you want a row. I know you could get londiste to do this, but you'd have to implement some of the filtering in Python (Skype actually does, or at least did, this).
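Such a filter function could be as small as this (names and logic are made up; the point is just that the decision lives in one place the replication tooling can call):

  CREATE FUNCTION row_wanted(p_cluster_id int, p_replica_cluster int)
  RETURNS boolean LANGUAGE sql IMMUTABLE AS $$
      SELECT p_cluster_id = p_replica_cluster;
  $$;

With londiste you'd still need to hook that into its Python-side filtering, as mentioned above.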
One more thing to consider: you can scale PG to some serious database sizes if you're willing to spend money on hardware. Given what kind of a server you can get for $20k nowadays, I'd seriously investigate that if you haven't already.
BTW, there's also https://github.com/citusdata/pg_shard. I've never really looked at it, but it might make all of this a lot easier.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com