On 9/28/06, Carlo Stonebanks <stonec.register@xxxxxxxxxxxx> wrote:
The deduplication process requires so many programmed procedures that it runs on the client. Most of the de-dupe lookups are not "straight" lookups, but calculated ones employing fuzzy logic. This is because we cannot dictate the format of our input data and must deduplicate with what we get. This was one of the reasons why I went with PostgreSQL in the first place: the server-side programming options. However, I saw incredible performance hits when running processes on the server, and I partially abandoned the idea (some custom-built name-comparison functions still run on the server).
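[As an aside, one common way to express that kind of fuzzy name comparison server-side is with the contrib/fuzzystrmatch module. The sketch below is only illustrative and assumes that module is installed; the table and column names (people, full_name) are hypothetical, not from the original post.]

```sql
-- Hedged sketch: fuzzy duplicate-candidate lookup on the server,
-- assuming contrib/fuzzystrmatch is installed.
-- Table "people" and column "full_name" are hypothetical.
CREATE OR REPLACE FUNCTION find_dupe_candidates(p_name text)
RETURNS SETOF people AS $$
  SELECT *
  FROM people
  WHERE levenshtein(lower(full_name), lower(p_name)) <= 2;
$$ LANGUAGE sql STABLE;
```

Note this scans the whole table per lookup; whether it beats a client-side comparison depends heavily on table size and round-trip costs, which may explain the performance hits described above.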
imo, the key to high performance big data movements in postgresql is mastering sql and pl/pgsql, especially the latter. once you get good at it, your net time of copy+plpgsql is going to be less than insert+tcl.

merlin
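[For readers unfamiliar with the copy+plpgsql pattern Merlin describes: the usual shape is to bulk-load raw rows into a staging table with COPY, then merge them into the real table in a single server-side pass. A minimal sketch, assuming hypothetical tables staging_people and people with a full_name column:]

```sql
-- Hedged sketch of the copy+plpgsql pattern.
-- All table, column, and file names here are hypothetical.

-- Step 1: bulk-load raw input with COPY (much faster than row-by-row INSERT).
COPY staging_people (full_name, phone) FROM '/tmp/input.csv' WITH CSV;

-- Step 2: merge staged rows into the main table in one server-side pass.
CREATE OR REPLACE FUNCTION merge_staging() RETURNS integer AS $$
DECLARE
  n integer := 0;
BEGIN
  INSERT INTO people (full_name, phone)
  SELECT s.full_name, s.phone
  FROM staging_people s
  WHERE NOT EXISTS (
    SELECT 1 FROM people p WHERE p.full_name = s.full_name
  );
  GET DIAGNOSTICS n = ROW_COUNT;  -- how many new rows were inserted
  TRUNCATE staging_people;
  RETURN n;
END;
$$ LANGUAGE plpgsql;
```

The win comes from doing one COPY plus one set-based statement instead of a network round trip per row; the exact match condition in the WHERE NOT EXISTS clause would be replaced by whatever fuzzy comparison the application actually needs.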