Running update in chunks?

Tim Uckun <timuckun@xxxxxxxxx> · Mon, 21 Jan 2013 21:04:34 +1300

I have seen a lot of slow update questions asked both here and on
stack overflow  but they usually involve large tables. In my case the
dataset is kind of small.

I have an app in which I import data and then merge the records with
an existing table. Currently I do most of the heavy lifting with code
and it works well enough but as the imports get bigger I thought I
would rewrite the code to speed it up using postgres. Basically I get
the data which I consider to be dirty and I put it into a table using
hstore to store the data. I then run a series of update queries to
locate the "real" records in the various tables.  The import data
looks like this https://gist.github.com/4584366 and has about 98K
records in it. The lookup table is very small only a couple of hundred
records in it.

This is the query I am running

update cars.imports i
    set make_id = md.make_id
    from cars.models md where i.model_id = md.id;

Here is the analyse

"Update on imports i  (cost=2.46..49720.34 rows=138858 width=526)
(actual time=51968.553..51968.553 rows=0 loops=1)"
"  ->  Hash Join  (cost=2.46..49720.34 rows=138858 width=526) (actual
time=0.044..408.170 rows=98834 loops=1)"
"        Hash Cond: (i.model_id = md.id)"
"        ->  Seq Scan on imports i  (cost=0.00..47808.58 rows=138858
width=516) (actual time=0.010..323.616 rows=98834 loops=1)"
"        ->  Hash  (cost=1.65..1.65 rows=65 width=14) (actual
time=0.026..0.026 rows=65 loops=1)"
"              Buckets: 1024  Batches: 1  Memory Usage: 4kB"
"              ->  Seq Scan on models md  (cost=0.00..1.65 rows=65
width=14) (actual time=0.002..0.012 rows=65 loops=1)"
"Total runtime: 51968.602 ms"

This query takes fifty seconds on a macbook air with i7 processor and
eight gigs of RAM and SSD hard drive.  I am using postgres 9.2
installed with homebrew using the standard conf file.

So it seems to me that this query is running as fast as it could but
it's still much slower than doing things with code one record at a
time (using some memoization).

Anyway...  Presuming I can't really do anything to speed up this query
does it make sense to  try and do this in chunks and if so what is the
best technique for doing that.

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general