I go more or less the other way. I have a script which follows the child → parent → grandparent (etc.) foreign keys in the source database and dumps only the records belonging to the selected "project" (your terminology; in my case it is "client"). I save the dumped data to an archive file.
The script has hardcoded knowledge of only a minimal set of root ancestor tables (plus a few special cases, mostly arising from the fact that it's a Django ORM-managed schema) and otherwise avoids knowing too much, so that the schema can evolve mostly without the script needing to change.
The script includes metadata about the dump in a "manifest". The manifest records, amongst other things, the order in which the tables were dumped. The restore operation uses this to correctly order the table restores.
Restoring is then simple: a script walks the manifest and restores the archive, table by table, from the individual dumps.
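For the curious, the manifest idea can be sketched in a few lines of Python. This is not the poster's actual script; the file names and table names are made up for illustration. The point is simply that the dump records table order, and the restore replays that same order so parents are loaded before children:

```python
import json

# Hypothetical sketch: record the order in which tables were dumped so
# the restore can replay them parent-first and satisfy FK dependencies.

def write_manifest(path, tables_in_dump_order):
    """Save dump metadata; 'tables' preserves the dump order."""
    manifest = {"version": 1, "tables": tables_in_dump_order}
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)

def restore_order(path):
    """Walk the manifest and return table names in restore order."""
    with open(path) as f:
        manifest = json.load(f)
    # Restoring in dump order works because parents were dumped
    # before their children.
    return list(manifest["tables"])

# Example: root ancestors first, then their descendants.
write_manifest("manifest.json", ["client", "project", "task", "task_note"])
print(restore_order("manifest.json"))
# → ['client', 'project', 'task', 'task_note']
```

A real version would also store per-table dump file names and row counts in the manifest, but the ordering is the part that matters for the restore.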
On Sat, 9 Mar 2024, 14:56 hector vass, <hector.vass@xxxxxxxxx> wrote:
On Fri, Mar 8, 2024 at 4:22 PM David Gauthier <dfgpostgres@xxxxxxxxx> wrote:

Here's the situation...

- The DB contains data for several projects.
- The tables of the DB contain data for all projects (the data is not partitioned on project name or anything like that).
- The "project" identifier (a table column) exists in a few "parent" tables, with many child, grandchild, ... tables under them connected by foreign keys defined with "on delete cascade". So if a record in one of the parent tables is deleted, all of its underlying, dependent records get deleted too.
- New projects come in, and old ones need to be removed and "archived" in DBs of their own. So there's a DB called "active_projects" and a DB called "project_a_archive" (identical metadata).
- The idea is to copy the data for project "a" from "active_projects" to the "project_a_archive" DB AND delete the project "a" data out of "active_projects".
- Leave "project_a_archive" up and running in case someone needs to attach to it and get some old/archived data.

The brute-force method I've been using is...

1) pg_dump "active_projects" to a (huge) file, then populate "project_a_archive" using that (I don't have the privs to create a database; IT creates an empty one for me, so this is how I do it).
2) Go into the "project_a_archive" DB and run "delete from par_tbl_1 where project <> 'a'", "delete from par_tbl_2 where project <> 'a'", etc., leaving only project "a" data in the DB.
3) Go into the "active_projects" DB and run "delete from par_tbl_1 where project = 'a'", etc., removing project "a" from the "active_projects" DB.

Yeah, not very elegant; it takes a long time and a lot of resources. So I'm looking for ideas on how to do this better.

Related question...
The "delete from par_tbl_a where project <> 'a'" is taking forever. I fear it's because it's trying to journal everything in case I want to roll back. But this is just in the archive DB, and I don't mind taking the risk if I can speed this up outside of a transaction. How can I run a delete command like this without the rollback recovery overhead?

> (I don't have the privs to create a database; IT creates an empty one for me, so this is how I do it.)

That's a shame. You can do something similar with tablespaces:

- Template your existing schema to create a new schema for the project (pg_dump -s).
- Create a tablespace for this new project and schema.
- You can then move the physical tablespace to cheaper disk and use symbolic links, or archive and/or back it up at the schema level with pg_dump -n.
...as long as you don't put anything in the public schema, all you are really sharing is roles; otherwise it's a bit like a separate database.
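To make the schema-per-project suggestion concrete, here is a dry-run sketch of the commands involved. The database, schema, tablespace, and path names are all hypothetical; pg_dump's -s (DDL only) and -n (single schema) flags are real, but the CREATE TABLESPACE step needs superuser privileges and a real directory, so the sketch only prints the commands rather than executing them:

```python
# Dry-run sketch of the schema-per-project workflow described above.
# All names (active_projects, project_a, paths) are illustrative only.

DB = "active_projects"
SCHEMA = "project_a"

commands = [
    # 1) Template the existing schema (DDL only) with pg_dump -s.
    ["pg_dump", "-s", DB, "-f", "schema_template.sql"],
    # 2) Create a tablespace on cheaper storage, then a schema for the
    #    project (requires superuser / appropriate privileges).
    ["psql", DB, "-c",
     "CREATE TABLESPACE project_a_space LOCATION '/mnt/cheap_disk/project_a'"],
    ["psql", DB, "-c", f"CREATE SCHEMA {SCHEMA}"],
    # 3) Later, archive or back up just that schema with pg_dump -n.
    ["pg_dump", "-n", SCHEMA, DB, "-f", "project_a_archive.dump"],
]

for cmd in commands:
    print("$", " ".join(cmd))
    # subprocess.run(cmd, check=True)  # uncomment to actually execute
```

The nice property, as noted above, is that pg_dump -n then gives a per-project archive without dumping the whole database.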