Yes, but it should become a bit slower if you fix your code :-)
where t_imp.id is null and test.id=t_imp.id;
=>
where t_imp.id is not null and test.id=t_imp.id;
and a partial index on matching rows might help (should be tested):
(after the first updat)
create index t_imp_ix on t_imp(t_value,t_record,output_id) where t_imp.id is not null.
regards,
Marc Mamin
-----Ursprüngliche Nachricht-----
Von: pgsql-performance-owner@xxxxxxxxxxxxxx im Auftrag von Jochen Erwied
Gesendet: Sa 1/7/2012 12:57
An: antoine@xxxxxxxxx
Cc: pgsql-performance@xxxxxxxxxxxxxx
Betreff: Re: Duplicate deletion optimizations
Friday, January 6, 2012, 4:21:06 PM you wrote:
>> Every 5 minutes, a process have to insert a few thousand of rows in this
>> table, but sometime, the process have to insert an already existing row
>> (based on values in the triplet (t_value, t_record, output_id). In this
>> case, the row must be updated with the new count value. I've tried some
>> solution given on this stackoverflow question [1] but the insertion rate
>> is always too low for my needs.
I did check the following in a loop, starting with an empty table, and
inserting/updating 50000 random unique entries. After 15 minutes I've got
about 10 million records, each loop takes about 3 seconds. After 30 minutes
the table contains approx. 18 million entries, time per loop only slightly
increased. After 90 minutes the database has about 30 million entries. The
speed has dropped to about 15-20 seconds per loop, but the server is doing
lots of other queries in parallel, so with an unloaded server the updates
should still take less than 10 seconds.
The generator runs in perl, and generates records for a maximum of 100
million different entries:
use strict;
srand time;
my $i = 0;
open FD, ">data.in";
for (1..50000)
{
$i += rand(2000);
print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255, $i%255, rand(1000));
}
close FD;
The SQL-script looks like this:
\timing on
begin;
create temp table t_imp(id bigint,t_value integer,t_record integer,output_id integer,count bigint);
\copy t_imp (t_value, t_record, output_id, count) from 'data.in'
--an index is not really needed, table is in memory anyway
--create index t_imp_ix on t_imp(t_value,t_record,output_id);
-- find matching rows
update t_imp
set id=test.id
from test
where (t_imp.t_value,t_imp.t_record,t_imp.output_id)=(test.t_value,test.t_record,test.output_id);
-- update matching rows using primary key
update test
set count=t_imp.count
from t_imp
where t_imp.id is null and test.id=t_imp.id;
-- insert missing rows
insert into test(t_value,t_record,output_id,count)
select t_value,t_record,output_id,count
from t_imp
where id is null;
commit;
Advantages of this solution:
- all updates are done in-place, no index modifications (except for the
inserts, of course)
- big table only gets inserts
- no dead tuples from deletes
- completely avoids sequential scans on the big table
Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6
GHz, table and indices stored on a SSD)
Table statistics:
relid | 14332525
schemaname | public
relname | test
seq_scan | 8
seq_tup_read | 111541821
idx_scan | 149240169
idx_tup_fetch | 117901695
n_tup_ins | 30280175
n_tup_upd | 0
n_tup_del | 0
n_tup_hot_upd | 0
n_live_tup | 30264431
n_dead_tup | 0
last_vacuum |
last_autovacuum |
last_analyze |
last_autoanalyze | 2012-01-07 12:38:49.593651+01
vacuum_count | 0
autovacuum_count | 0
analyze_count | 0
autoanalyze_count | 31
The sequential scans were from some 'select count(*)' in between.
HTH.
--
Jochen Erwied | home: jochen@xxxxxxxxx +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 | work: joe@xxxxxxxxxxxxxxx +49-2151-7294-24, FAX: -50
D-45470 Muelheim | mobile: jochen.erwied@xxxxxxxxxxx +49-173-5404164
--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance