Re: Duplicate deletion optimizations

Jochen Erwied <jochen@xxxxxxxxxxxxxxxxxxxxxxxxxxx> · Sat, 7 Jan 2012 12:57:26 +0100

Friday, January 6, 2012, 4:21:06 PM you wrote:

>> Every 5 minutes, a process have to insert a few thousand of rows in this
>> table, but sometime, the process have to insert an already existing row
>> (based on values in the triplet (t_value, t_record, output_id). In this
>> case, the row must be updated with the new count value. I've tried some
>> solution given on this stackoverflow question [1] but the insertion rate
>> is always too low for my needs.

I did check the following in a loop, starting with an empty table, and
inserting/updating 50000 random unique entries. After 15 minutes I've got
about 10 million records, each loop takes about 3 seconds. After 30 minutes
the table contains approx. 18 million entries, time per loop only slightly
increased. After 90 minutes the database has about 30 million entries. The
speed has dropped to about 15-20 seconds per loop, but the server is doing
lots of other queries in parallel, so with an unloaded server the updates
should still take less than 10 seconds.

The generator runs in perl, and generates records for a maximum of 100 
million different entries:

use strict;

srand time;
my $i = 0;
open FD, ">data.in";
for (1..50000)
{
        $i += rand(2000);
        print FD sprintf("%d\t%d\t%d\t%d\n", $i/65536, ($i/256)%255, $i%255, rand(1000));
}
close FD;

The SQL-script looks like this:

\timing on
begin;
create temp table t_imp(id bigint,t_value integer,t_record integer,output_id integer,count bigint);
\copy t_imp (t_value, t_record, output_id, count) from 'data.in'
--an index is not really needed, table is in memory anyway
--create index t_imp_ix on t_imp(t_value,t_record,output_id);

-- find matching rows
update t_imp
       set id=test.id
       from test
       where (t_imp.t_value,t_imp.t_record,t_imp.output_id)=(test.t_value,test.t_record,test.output_id);
-- update matching rows using primary key
update test
       set count=t_imp.count
       from t_imp
       where t_imp.id is null and test.id=t_imp.id;
-- insert missing rows
insert into test(t_value,t_record,output_id,count)
       select t_value,t_record,output_id,count
              from t_imp
              where id is null;
commit;

Advantages of this solution:

- all updates are done in-place, no index modifications (except for the 
  inserts, of course)
- big table only gets inserts
- no dead tuples from deletes
- completely avoids sequential scans on the big table

Tested on my home server (8GB RAM, 3GB shared memory, Dual-Xeon 5110, 1.6 
GHz, table and indices stored on a SSD)

Table statistics:

relid             | 14332525
schemaname        | public
relname           | test
seq_scan          | 8
seq_tup_read      | 111541821
idx_scan          | 149240169
idx_tup_fetch     | 117901695
n_tup_ins         | 30280175
n_tup_upd         | 0
n_tup_del         | 0
n_tup_hot_upd     | 0
n_live_tup        | 30264431
n_dead_tup        | 0
last_vacuum       |
last_autovacuum   |
last_analyze      |
last_autoanalyze  | 2012-01-07 12:38:49.593651+01
vacuum_count      | 0
autovacuum_count  | 0
analyze_count     | 0
autoanalyze_count | 31

The sequential scans were from some 'select count(*)' in between.

HTH.

-- 
Jochen Erwied     |   home: jochen@xxxxxxxxx     +49-208-38800-18, FAX: -19
Sauerbruchstr. 17 |   work: joe@xxxxxxxxxxxxxxx  +49-2151-7294-24, FAX: -50
D-45470 Muelheim  | mobile: jochen.erwied@xxxxxxxxxxx       +49-173-5404164

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance