Continuing with my efforts to get functionality similar to mysql's mysqlimport --replace, I want to ask for the list's opinion on which approach is better.

What currently happens:

1. select from mssql (into CSV via Perl DBI)
2. psql \copy into PG
3. PG chokes on duplicate pkeys, as there's no --replace option

Alternatives:

1. pgloader
   --> does not help resolve duplicate pkeys. (My limited testing also found that it doesn't write the rejects to a specified location.) (See previous posts.)

2. Rules on the tables (create rule replace as on insert to ... where exists (select 1 from xxx ...) do instead update xxx ...) to switch from insert to update, loading from a temp_table (\copy to temp_table).
   --> Does not help, because \copy bypasses rules.
   --> Harder to maintain due to the complexity of rules, and based on what I've read/heard (IRC) it is not as good an option. (See previous posts.)

3. \copy to temp_table, then:
   --> update main_table set value = t2.value from temp_table t2 where main_table.pkey = t2.pkey
       (an alternative, better way per depesz is to append "and (main_table.value is distinct from t2.value)", which touches only rows whose values actually differ)
   --> insert into main_table select * from temp_table t2 where not exists (select * from main_table where main_table.pkey = t2.pkey)

4. \copy to temp_table, then:
   --> delete from main_table where main_table.pkey in (select t2.pkey from temp_table t2)
   --> insert into main_table select * from temp_table

Btw, the pkey is actually a multi-column composite key, which I've concatenated into a single varchar to simplify the insert/update/delete, e.g.:

   select a + cast(b as varchar) + cast(c as varchar)
          + cast(datediff(s, '19700101', timestamp) as varchar) as pkey
   from mssql_server_table

Options 1 and 2 are already out, which leaves options 3 and 4. Option 3 seems better because it leaves fewer dead tuples under MVCC, but the update statement will be a bit of a maintenance nightmare: it has to be edited each time a new column is added (and columns are added quite frequently, alas!).
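In case it helps the discussion, here is a minimal sketch of option 3 as I understand it (update-then-insert from a staging table), assuming hypothetical names main_table(pkey varchar, value varchar) and a temp_table of the same shape:

```sql
-- Stage the CSV first, from psql:
--   \copy temp_table from 'data.csv' with csv

BEGIN;

-- Update rows that already exist, but only where something actually
-- changed: the IS DISTINCT FROM filter (per depesz) skips no-op
-- updates, and so avoids creating dead tuples for unchanged rows.
UPDATE main_table
   SET value = t2.value
  FROM temp_table t2
 WHERE main_table.pkey = t2.pkey
   AND main_table.value IS DISTINCT FROM t2.value;

-- Insert the rows that don't exist yet.
INSERT INTO main_table
SELECT *
  FROM temp_table t2
 WHERE NOT EXISTS (SELECT 1
                     FROM main_table m
                    WHERE m.pkey = t2.pkey);

COMMIT;
```

The maintenance pain is that every non-key column has to appear in the SET list (and in the join's change filter), so the statement must be edited whenever a column is added.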
Option 4 is simpler: delete all the duplicate pkeys from main_table before inserting the newest data (temp_table will _always_ have the newest unique data for that table).

The only thing that worries me about option 4 is MVCC: the deletes leave dead tuples behind and take additional space. Vacuum often?

Please, I would like your comments on this. Thanks in advance.

On average, each day brings between 5K and 250K unique rows.
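For comparison, option 4 under the same assumed schema. The statements are schema-agnostic (no column list), so adding columns needs no edits, at the cost of one dead tuple per replaced row:

```sql
BEGIN;

-- Remove every row that is about to be replaced ...
DELETE FROM main_table
 WHERE pkey IN (SELECT t2.pkey FROM temp_table t2);

-- ... then insert the fresh copies wholesale.
INSERT INTO main_table
SELECT * FROM temp_table;

COMMIT;

-- Reclaim the dead tuples afterwards; at 5K-250K rows/day a
-- post-load VACUUM ANALYZE main_table is presumably sufficient.
```

Doing the delete and insert in one transaction also means readers never see the rows disappear between the two steps.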