The deduplication process requires so many programmed procedures that it
runs on the client. Most of the de-dupe lookups are not "straight" lookups,
but calculated ones employing fuzzy logic. This is because we cannot dictate
the format of our input data and must deduplicate with what we get.

The server-side programming options were one of the reasons I went with
PostgreSQL in the first place. However, I saw incredible performance hits
when running processes on the server, and I partially abandoned the idea
(some custom-built name-comparison functions still run on the server).

I am using Tcl on both the server and the client. I'm not a fan of Tcl, but
it appears to be quite well implemented and feature-rich in PostgreSQL. I
find PL/pgSQL awkward, even compared to Tcl. (After all, I'm just a
programmer... we do tend to be a little limited.)

The import program actually runs on the server box as a db client and
involves about 3000 lines of code (and it will certainly grow steadily as we
add compatibility with more import formats). Could a process involving that
much logic run on the db server, and would there really be a benefit?

Carlo

"Jim C. Nasby" <jim@xxxxxxxxx> wrote in message
news:20060928184538.GV34238@xxxxxxxxxxxx
> On Thu, Sep 28, 2006 at 01:53:22PM -0400, Carlo Stonebanks wrote:
>> > are you using the 'copy' interface?
>>
>> Straightforward inserts - the import data has to be transformed,
>> normalised and de-duped by the import program. I imagine the copy
>> interface is for more straightforward data importing. These are - by
>> necessity - single row inserts.
>
> BTW, stuff like de-duping is something you really want the database -
> not an external program - to be doing. Think about loading the data into
> a temporary table and then working on it from there.
> --
> Jim Nasby                                    jim@xxxxxxxxx
> EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
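
For illustration only, here is a minimal sketch of the "load into a temporary
table, then work on it from there" idea from the quoted message, combined with
a fuzzy name comparison that is callable from SQL. It is not the import
program discussed above. Everything in it is an assumption made for the
example: SQLite (via Python's sqlite3 module) stands in for PostgreSQL so the
sketch is self-contained, the person and staging tables are hypothetical, and
name_similarity is a hypothetical stand-in for the custom name-comparison
functions, which in PostgreSQL would be written in PL/Tcl or another
server-side language.

```python
import sqlite3
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Return a 0..1 similarity ratio, ignoring case and repeated spaces."""
    norm_a = " ".join(a.lower().split())
    norm_b = " ".join(b.lower().split())
    return SequenceMatcher(None, norm_a, norm_b).ratio()


def import_batch(conn, names, threshold=0.85):
    """Stage raw names, then insert only those with no fuzzy match in person."""
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS staging (name TEXT)")
    conn.execute("DELETE FROM staging")
    conn.executemany("INSERT INTO staging (name) VALUES (?)",
                     [(n,) for n in names])
    staged = conn.execute("SELECT name FROM staging ORDER BY rowid").fetchall()
    inserted = 0
    # Row by row, so each accepted name is visible when checking the next one.
    for (name,) in staged:
        duplicate = conn.execute(
            "SELECT 1 FROM person WHERE name_similarity(name, ?) >= ? LIMIT 1",
            (name, threshold),
        ).fetchone()
        if duplicate is None:
            conn.execute("INSERT INTO person (name) VALUES (?)", (name,))
            inserted += 1
    return inserted


conn = sqlite3.connect(":memory:")
# Make the Python comparison callable from SQL (hypothetical function name).
conn.create_function("name_similarity", 2, name_similarity)
conn.execute("CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO person (name) VALUES ('Carlo Stonebanks')")

count = import_batch(conn, ["carlo  stonebanks", "Jim Nasby", "Jim  Nasby"])
total = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]
```

The comparison here scans every existing row for each staged name, which is
fine for a sketch but would likely need an index-friendly pre-filter on a
large table. Whether this kind of logic performs acceptably on the server is
exactly the open question raised above, so the sketch shows the shape of the
approach rather than an answer to it.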