rafael.domiciano@xxxxxxxxx (Rafael Domiciano) writes:
> Hello Kevin, thanks for the response.
>
> Doing the ALTER TABLE to add the new column was fast:
>
>   ALTER TABLE table1 ADD COLUMN new_column date;
>
> The problem is that I now have to update this column, and the values
> will be a combination of two other columns of table1, something like
> this:
>
>   update table1
>      set new_column = (date)
>    where new_column is null;
>
> Postgres version: 8.3.6
> OS: Fedora Core 9
> RAM: 4 GB

If you try to do this in one "swell foop," it's going to take hours,
hold locks that block other writers against the table for the whole
duration, and bloat the table, all of which is exactly what you don't
want...

I'd suggest doing the updates in more bite-sized pieces, a few thousand
tuples at a time.

Further, to make that efficient, I'd suggest adding an index, at least
temporarily, on some column in the table that's generally unique.  (A
date stamp that *tends* to vary would be plenty good enough; it doesn't
need to be strictly unique.  What's important is that there shouldn't
be many repeated values in the column.)

The initial set of changes would then be done thus:

  alter table table1 add column new_column timestamptz;

  create index concurrently temp_newcol_idx
    on table1 (quasi_unique_column)
    where (new_column is null);

It'll take a while for that index to become available, but you don't
really need it until a lot of tuples have already had new_column set.

Then run a query like the following:

  update table1
     set new_column = [whatever calculation]
   where new_column is null
     and quasi_unique_column in
         (select quasi_unique_column
            from table1
           where new_column is null
           limit 1000);

Repeat this until it no longer finds any tuples to "fix" (a sketch of
such a driver loop follows at the end of this message).  Once that's
complete, the temporary index may be dropped.

The number 1000 is somewhat arbitrary:

 - 1 would be bad, as it means running 8 million queries to process an
   8M-tuple table;

 - 8000000 would be bad, as it tries to do the whole thing in one big
   batch, taking a long time, locking things, bloating things, and
   consuming a lot of memory.

1000 is materially larger than 1, but also materially smaller than
8000000.  Using 10000 instead would mean more work done in each
transaction; you might try varying the count, and stop increasing it
once you cease to see improvement from doing more work in bulk.  I
doubt there's a material difference between 1000 and 10000.

Make sure the table gets vacuumed once in a while.  That doesn't need
to be continuous, but if you want the table to bloat by only ~10%, you
should vacuum once for every 10% of the table that you update.

-- 
(reverse (concatenate 'string "gro.mca" "@" "enworbbc"))
http://www3.sympatico.ca/cbbrowne/advocacy.html
Mary had a little lambda
A sheep she couldn't clone
And everywhere that lambda went
Her calculus got blown
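
For concreteness, here is a minimal sketch of the repeat-until-done
driver loop described above, written with psycopg2 (Python).  The
connection string, the now() placeholder in the SET clause, and the
vacuum-every-800-batches interval are illustrative assumptions only;
substitute the real calculation and tune the numbers to taste.  What it
demonstrates is that each batch runs and commits in its own
transaction, with an occasional VACUUM so the table only bloats by
roughly 10% at a time.

  import psycopg2
  from psycopg2 import extensions as ext

  BATCH_SQL = """
      UPDATE table1
         SET new_column = now()   -- placeholder: use the real calculation here
       WHERE new_column IS NULL
         AND quasi_unique_column IN (SELECT quasi_unique_column
                                       FROM table1
                                      WHERE new_column IS NULL
                                      LIMIT 1000)
  """

  conn = psycopg2.connect("dbname=mydb")   # hypothetical connection string
  cur = conn.cursor()
  batches = 0
  while True:
      cur.execute(BATCH_SQL)
      rows = cur.rowcount                  # tuples fixed in this batch
      conn.commit()                        # each batch is its own transaction
      if rows == 0:
          break                            # nothing left to fix
      batches += 1
      if batches % 800 == 0:               # ~10% of an 8M-row table at 1000 rows/batch
          conn.set_isolation_level(ext.ISOLATION_LEVEL_AUTOCOMMIT)
          cur.execute("VACUUM table1")     # VACUUM must run outside a transaction
          conn.set_isolation_level(ext.ISOLATION_LEVEL_READ_COMMITTED)
  conn.close()

Running the same loop with LIMIT 10000 instead of 1000 just means
fewer, larger transactions; the structure doesn't change.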