Hi,
I just created a new column on a relatively large table (~2 million rows, over 400 MB in table size) and am trying to populate it from another, much smaller table using an UPDATE query.
The purpose of the query is to set a boolean flag if the user for the action in table1 is the last user, as determined from table2.
The large table has a two-column primary key, and one of those columns is the primary key of the smaller table.
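For concreteness, the layout is roughly as follows (a hypothetical sketch only; the real column names and types are anonymized here):

    -- Hypothetical DDL sketch; names and types are placeholders
    CREATE TABLE table2 (
        pkey      integer PRIMARY KEY,
        lastuser  text
    );

    CREATE TABLE table1 (
        pkeycolumn1  integer,           -- matches table2.pkey
        pkeycolumn2  integer,
        "user"       text,
        lastuser     boolean,           -- the newly added flag column
        PRIMARY KEY (pkeycolumn1, pkeycolumn2)
    );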
My original update query is as follows:
    update table1
    set lastuser = (table1.user = table2.lastuser)::bool
    from table2
    where table1.pkeycolumn1 = table2.pkey
which has the following EXPLAIN output:
    Merge Join  (cost=883182.60..1024622.03 rows=6873573 width=89)
      Merge Cond: ("outer".pkeycolumn1 = "inner".pkey)
      ->  Sort  (cost=93387.50..94711.01 rows=529404 width=18)
            Sort Key: table2.pkey
            ->  Seq Scan on table2  (cost=0.00..15037.04 rows=529404 width=18)
      ->  Sort  (cost=789795.10..795605.39 rows=2324116 width=81)
            Sort Key: table1.pkeycolumn1
            ->  Seq Scan on table1  (cost=0.00..71900.16 rows=2324116 width=81)
This seems like it *should* be the best query to me, but when I try to run it, it takes forever.
However, when I add a subselect clause to the end to force the planner into a nested loop
on table2, the query runs MUCH faster (about 1 hour instead of more than 10; I never let the original finish).
Here is the query:
    update table1
    set lastuser = (table1.user = table2.lastuser)::bool
    from table2
    where table1.pkeycolumn1 = table2.pkey
      and table1.pkeycolumn1 in (select pkey from table3 limit 1000000)
where table1.pkeycolumn1 is a foreign key referencing table3.pkey. Also, note that the number of rows in table3 is *much* less than 1,000,000.
The LIMIT clause is necessary to force the planner into using the nested loop. (Aside: is there a better way to do this inside of a query?)
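The only other knob I know of is the session-level enable_* planner settings, which are not "inside the query" and which I haven't tried here:

    -- Untried alternative: steer the planner per-session instead of via LIMIT
    SET enable_mergejoin = off;  -- discourage the merge-join plan
    SET enable_sort = off;       -- discourage the big sorts feeding it

    update table1
    set lastuser = (table1.user = table2.lastuser)::bool
    from table2
    where table1.pkeycolumn1 = table2.pkey;

    RESET enable_mergejoin;
    RESET enable_sort;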
Here is the EXPLAIN output for the above query:
    Hash Join  (cost=13863.09..109298.79 rows=51388 width=89)
      Hash Cond: ("outer".pkeycolumn1 = "inner".pkey)
      ->  Seq Scan on table1  (cost=0.00..71900.16 rows=2324116 width=81)
      ->  Hash  (cost=13854.99..13854.99 rows=1621 width=26)
            ->  Nested Loop  (cost=8205.72..13854.99 rows=1621 width=26)
                  ->  HashAggregate  (cost=8205.72..8207.72 rows=200 width=8)
                        ->  Limit  (cost=0.00..5891.43 rows=185143 width=8)
                              ->  Seq Scan on table3  (cost=0.00..5891.43 rows=185143 width=8)
                  ->  Index Scan using "table2-pkey-index" on table2  (cost=0.00..28.14 rows=8 width=18)
                        Index Cond: ("outer".pkey = table2.pkey)
My only theory up to this point is that it has something to do with how the tablespace is being cached. I notice that when I use plan 1, my
machine goes through long periods of I/O with bursts of processor activity every minute or so. When I use plan 2, I/O sits at about 90%
of my total resource usage while my processor usage stays around 10%. Maybe it keeps trying to cache and re-sort table2? I don't really
have any idea, but that is my only guess.
If anyone knows why this may be happening, I would really appreciate the help.
Thanks,
Nathan