Hello,
Thanks for the many detailed answers!
For those who have asked for more information about the process and
hardware:
The goal of the process is to compute data from a NoSQL cluster and
write the results to a PostgreSQL database. This process is triggered
every 5 minutes for the latest 5 minutes of data. 80% of the data can be
written to the database with a simple COPY, which is the fastest
solution we found for bulk insertion. But for some data, duplicates are
possible (though very unusual), and the new data must replace the old
rows in the database. I'm looking for the fastest way to do this upsert.
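To illustrate the fast path for the 80% case, this is roughly what we do
today (the table, columns and file path below are made up, just for
illustration):

  COPY metrics (sensor_id, ts, value)
  FROM '/tmp/latest_batch.csv'
  WITH (FORMAT csv);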
About the hardware:
The PostgreSQL database runs on a KVM virtual machine configured with
8GB of RAM and 4 cores of an L5640 CPU. The hypervisor has two standard
7.2k SAS disks in Linux software RAID 1. The disks are shared among VMs,
but obviously this PostgreSQL VM doesn't share its hypervisor with any
other "write-intensive" VM.
Also, this database is dedicated to storing the output of this process,
so I'm completely free in its configuration and tuning. I also plan to
add a replicated slave database for read operations, and maybe partition
the data if needed.
To summarize your suggestions:
- Adding an "order by" clause to my initial query can help the planner
use the index.
- Use a temporary table: COPY the new data into it, then merge it into
the main table (you proposed different ways to do the merge; see the
sketch after this list).
- Use an EXISTS clause in the DELETE (though another reply recommended
against it).
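Here is a minimal sketch of the temporary-table variant I plan to test,
using the "delete then insert" style of merge (again, the table and
column names are hypothetical):

  BEGIN;

  CREATE TEMP TABLE metrics_staging (LIKE metrics INCLUDING ALL)
    ON COMMIT DROP;

  COPY metrics_staging (sensor_id, ts, value)
  FROM '/tmp/latest_batch.csv'
  WITH (FORMAT csv);

  -- Remove the rare duplicates so the new rows replace the old ones.
  DELETE FROM metrics m
  USING metrics_staging s
  WHERE m.sensor_id = s.sensor_id
    AND m.ts = s.ts;

  INSERT INTO metrics (sensor_id, ts, value)
  SELECT sensor_id, ts, value
  FROM metrics_staging;

  COMMIT;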
I'll try your ideas this week and report back with the results.
Antoine.