On Thursday 07 January 2010 09:57:59 Kevin Grittner wrote:
> Ludwik Dylag <ldylag@xxxxxxxxx> wrote:
> > I would suggest:
> > 1. turn off autovacuum
> > 1a. optionally tune the db for better performance for this kind of
> > operation (can't help here)
> > 2. restart database
> > 3. drop all indexes
> > 4. update
> > 5. vacuum full table
> > 6. create indexes
> > 7. turn on autovacuum
>
> I've only ever attempted something like that with a few tens of
> millions of rows. I gave up on waiting for the VACUUM FULL step
> after a few days.
>
> If some scheduled down time is acceptable (with "some" being kind of
> hard to estimate accurately) the best bet would be to add the column
> with the USING clause to fill in the value. (I think that would
> cause a table rewrite; if not, then add something to the ALTER TABLE
> which would.) My impression is that the OP would rather stretch out
> the implementation than suffer down time, which can certainly be a
> valid call.
>
> If that is the goal, then the real question is whether there's a way
> to tune the incremental updates to speed up that phase. Carlo, what
> version of PostgreSQL is this? Can you show us the results of an
> EXPLAIN ANALYZE for the run of one iteration of the UPDATE?
> Information on the OS, hardware, PostgreSQL build configuration, and
> the contents of postgresql.conf (excluding all comments) could help
> us spot possible techniques to speed this up.
>
> -Kevin

If you can come up with an effective method of tracking
inserts/updates/deletes, such as a trigger that writes the PK to a
separate table upon any insert, update or delete to the table, you
could do something like this:

1) create a new table (no indexes) with the structure you want the
table to have at the end of the process (i.e. the post-altered state)
[new_tab]

2) create the insert, update and delete triggers mentioned above on
the existing table [curr_tab], writing all the PK ids that change into
a 3rd table [changed_keys] (a sketch of such a trigger is appended at
the end of this message)

3) kick off a process that simply selects from curr_tab into new_tab
and populates/back-fills the new column as part of the query (see the
back-fill sketch appended below)

4) let it run as long as it takes

5) once it's complete, do this:

Create all the indexes on new_tab, then:

BEGIN;
LOCK TABLE curr_tab;
DELETE FROM new_tab
 WHERE pk_id IN (SELECT DISTINCT pk_id FROM changed_keys);
INSERT INTO new_tab
SELECT * FROM curr_tab
 WHERE curr_tab.pk_id IN (SELECT DISTINCT pk_id FROM changed_keys);
ALTER TABLE curr_tab RENAME TO old_tab;
ALTER TABLE new_tab RENAME TO curr_tab;
COMMIT;

Also, you might want to consider partitioning this table in the
process...

Once you're confident you no longer need the old table [old_tab], you
can drop it.
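
To make step 2 concrete, here is a minimal sketch of the
change-tracking trigger. The names (curr_tab, pk_id, changed_keys) are
hypothetical stand-ins for your actual schema, and it assumes the PK
column itself is never updated:

CREATE TABLE changed_keys (pk_id integer);

CREATE OR REPLACE FUNCTION log_changed_key() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        -- NEW is not available on DELETE; log the deleted row's key
        INSERT INTO changed_keys (pk_id) VALUES (OLD.pk_id);
    ELSE
        -- covers INSERT and UPDATE
        INSERT INTO changed_keys (pk_id) VALUES (NEW.pk_id);
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER ROW triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER curr_tab_track_changes
    AFTER INSERT OR UPDATE OR DELETE ON curr_tab
    FOR EACH ROW EXECUTE PROCEDURE log_changed_key();

If the PK can ever be updated you would also want to log OLD.pk_id on
UPDATE, otherwise the cutover DELETE would miss the old row in
new_tab.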
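
And a sketch of the back-fill from step 3, again with made-up column
names (col_a, col_b) and a made-up expression for the new column's
value; substitute your real column list and derivation:

INSERT INTO new_tab (pk_id, col_a, col_b, new_col)
SELECT pk_id, col_a, col_b,
       col_a * 100  -- stand-in for whatever computes the new value
  FROM curr_tab;

You could also run this in PK ranges (WHERE pk_id BETWEEN ... AND ...)
if you'd rather not hold one enormous transaction open for the whole
copy.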
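
For contrast, the down-time route Kevin describes in the quoted text
would look something like the following (hypothetical names again). If
I read the docs right, on the 8.x releases an ALTER COLUMN ... TYPE
with a USING clause rewrites the whole table under an exclusive lock,
which is the single-pass rewrite he is referring to:

ALTER TABLE curr_tab ADD COLUMN new_col integer;
ALTER TABLE curr_tab ALTER COLUMN new_col TYPE integer
    USING (col_a * 100);  -- stand-in expression; rewrites every row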