Need to replace SAN, best method with least downtime? (8.4.4)

Marinos Yannikos <mjy@xxxxxxxxxxx> · Sat, 23 Apr 2011 16:00:44 +0200

Hi,

I have a beefy server with 2 SANs, 1 "fast" (A) and 1 "slow" (B), and 1.3TB worth 
of 8.4.4 databases on A. A needs to be replaced/wiped completely with as little 
downtime as possible. It's flash-based and the modules need to be replaced, so 
"swapping the SAN and keeping the disks" is not an option. The databases are 
relatively busy, generating 8-50 WAL segments (16MB each) per minute.
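
To put that WAL rate in perspective, here is a rough back-of-the-envelope 
sketch using only the figures quoted above (8-50 segments per minute, 16MB 
each). The function name is made up for illustration, and treating the rate as 
sustained around the clock is an assumption; real traffic will vary.

```python
# Back-of-the-envelope WAL volume from the figures quoted above.
SEGMENT_MB = 16              # size of one WAL segment
SEGMENTS_PER_MIN = (8, 50)   # low and high segment rates

def wal_rate(segments_per_min, segment_mb=SEGMENT_MB):
    """Return (MB per minute, MB per second, GB per day) for a segment rate."""
    mb_per_min = segments_per_min * segment_mb
    # GB/day assumes the rate is sustained for a full 24 hours.
    return mb_per_min, mb_per_min / 60, mb_per_min * 60 * 24 / 1024

for rate in SEGMENTS_PER_MIN:
    per_min, per_sec, per_day = wal_rate(rate)
    print(f"{rate:>2} segments/min: {per_min} MB/min, "
          f"{per_sec:.1f} MB/s, {per_day:.0f} GB/day")
```

Any WAL archive or standby involved in the methods below would have to absorb 
the upper figure during busy periods.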

Several methods spring to mind:

a) pg_dumpall, wipe, restore (alternatively pg_dump global objects and all 
databases in parallel)

This will probably be 100% safe but take a long time (pg_dumpall takes ~440 
minutes currently), so it's not useful unless the other methods are all too 
risky. Access to the DB needs to be blocked during the dump to avoid losing 
data written after it starts.

b) set up a PITR slave (warm standby) on the same box, fail over to it, replace 
SAN A, then set up a PITR slave on A and fail over to it eventually

This would probably reduce my downtime to nearly nothing (except waiting for 
the slave to replay any backlog of archived WAL before restarting it as 
master). I cannot judge how risky it is in terms of data integrity. Also, it 
means running at reduced performance for a long time (a 1.3TB "hot backup" 
needs to be performed to fail back over to SAN A).

c) set up a tablespace on B and move as many tables/databases over to it as 
possible without severe service degradation. Then shut down Postgres, perform a 
filesystem-level backup of the remaining data on A, replace A, restore, then 
move things back to the default tablespace.

Moving big tables/databases will cause service degradation or interruption, but 
only a few objects are really big and those aren't critical. I am hoping to end 
up with <=150GB of data to back up/restore, which should take 20-30 minutes 
(possibly less with rsync).
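
As a sanity check on that estimate, this small sketch works out the sustained 
copy rate it implies. It assumes the 20-30 minutes covers a single one-way copy 
of 150GB; if backup and restore both have to fit in that window, the data is 
copied twice and the required rate doubles. The function name is made up for 
illustration.

```python
def required_mb_per_s(data_gb, minutes):
    """Sustained one-way copy rate (MB/s) needed to move data_gb in `minutes`."""
    return data_gb * 1024 / (minutes * 60)

for minutes in (20, 30):
    rate = required_mb_per_s(150, minutes)
    print(f"150 GB in {minutes} min needs about {rate:.0f} MB/s")
```

That works out to roughly 128 MB/s for 20 minutes and 85 MB/s for 30 minutes, 
which the slower of the source and target has to sustain.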

What would you do and why? I am considering c) at the moment because I am unsure 
about b): I cannot quickly check the integrity of the slave's datadir before I 
wipe the SAN (or can I?), and I don't know how well the slow SAN will hold up 
with all the busy tables on it. It also has to be done very carefully: a mistake 
in recovery.conf etc. could trash my datadir or WAL archive dir.

Is there anything unsafe about c) that I am missing here? Looking at a few 
hundred tables and indexes to classify and eventually move them is a lot of 
work, but it seems worth it to me.
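
The classification step for c) could be partly scripted. The sketch below is 
only an illustration: it takes a list of (relkind, name, size) tuples, which 
could for example be collected from pg_class together with pg_relation_size(), 
and emits ALTER ... SET TABLESPACE statements for everything above a size 
threshold, largest first. The tablespace name san_b, the sample relations and 
the function name are all hypothetical, and the names are assumed to be already 
schema-qualified and safely quoted. Indexes are listed separately because 
moving a table does not move its indexes.

```python
# relkind values as used in pg_class: 'r' = ordinary table, 'i' = index
KEYWORD = {"r": "TABLE", "i": "INDEX"}

def build_move_statements(relations, tablespace, min_bytes):
    """Return ALTER ... SET TABLESPACE statements, largest relation first.

    relations: iterable of (relkind, qualified_name, size_bytes).
    Relations smaller than min_bytes, or of another relkind, are skipped.
    """
    chosen = [r for r in relations if r[0] in KEYWORD and r[2] >= min_bytes]
    chosen.sort(key=lambda r: r[2], reverse=True)
    return [f"ALTER {KEYWORD[kind]} {name} SET TABLESPACE {tablespace};"
            for kind, name, _size in chosen]

# Hypothetical sample data, not taken from the real system.
GIB = 1024 ** 3
sample = [
    ("r", "public.events", 400 * GIB),
    ("i", "public.events_pkey", 60 * GIB),
    ("r", "public.users", 2 * GIB),
]

for stmt in build_move_statements(sample, "san_b", 10 * GIB):
    print(stmt)
```

Each generated statement locks the relation exclusively while its files are 
copied, so the output is a work list to review and schedule, not something to 
run blindly.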

Thanks,
 Marinos
