I'm fairly new to Postgres and have some basic design questions. My problem is that I want to do large bulk imports (millions of rows) into a large DB (billions of rows) where some of the incoming rows may already exist and need to be updated rather than inserted. I read a similar post a few days ago, but I still have a couple of questions.

The DB stores something similar to Cisco NetFlow records, i.e. source IP, source port, dest IP, dest port, plus a bit more information about the network flow: two timestamps, a count field, and a foreign key into a small table (<100 rows). The schema is below. I will be importing the data nightly (which could number in the millions of rows) and, since the DB is not quite production, I can do almost whatever I want, as long as the DB is in a query-able state by morning. The nightly imports are the only modifications to the table; otherwise the table is read-only and accessed via a web application. Needless to say, the table will get huge (I'm estimating billions of rows within a month), although growth should level off as I accumulate more and more duplicates.

CREATE TABLE flow (
    source_ip   INET,
    -- SMALLINT is not normally big enough for a port (signed vs. unsigned),
    -- but I convert in application space
    source_port SMALLINT,
    dest_ip     INET,
    dest_port   SMALLINT,
    comment     VARCHAR(128),
    count       INTEGER,
    first_seen  DATE,
    last_seen   DATE,
    fk          INTEGER NOT NULL REFERENCES small_table(small_id)
);

CREATE INDEX flow_source_ip_idx ON flow (source_ip);
CREATE INDEX flow_dest_ip_idx ON flow (dest_ip);

When I import the data (thousands to millions of rows), I want to check whether a row already exists with the same source_ip, dest_ip, and comment. If it does, I want to increment its count and update the first_seen and last_seen dates if need be; otherwise, I just insert a new row with a count of 1.

Basic question: what's the best way to go about this? From what I have read, the consensus seems to be to import (COPY) the new data into a temporary table and then work on the inserts and updates from there (I've put a rough sketch of what I'm picturing at the end of this mail). I also read some suggestions involving triggers. Which way would be best given my dataset?

I've also thought about doing it in application space (Perl): query out all the rows that need to be updated, delete those rows, drop the indexes, and then do a bulk COPY of any new rows plus the modified old rows. Does this sound like a good or bad idea?

Some other basic questions:

- Should I drop and recreate the indexes at any point during the load, given that the updates and selects will use them extensively to find matching rows in the existing flow table? Perhaps create a new index on (source_ip, dest_ip)?

- What do you think of the schema for storing network flow data, considering that I'm only storing single IPs and I do need the comment field? Is the INET data type the best thing to use here (12 bytes, I think), or should I have my application convert the IPs to INTEGER equivalents (4 bytes)? Perhaps create a new Postgres data type? I have another table which actually uses INET networks, though, and I'd like to be able to join the two, so a new data type might be more work than I initially thought (although I'm up for it) :)

- How many rows at a time can I practically insert with the COPY command? I've read about people doing millions, but is that realistic, or should I break my load into X-record chunks/transactions?

Any suggestions/comments/flames would be appreciated.
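In case it helps to see what I mean by the staging-table route, here's the rough shape of what I'm picturing (untested; the staging table name and the file path are just placeholders, and I'm assuming the nightly file is already collapsed to one row per (source_ip, dest_ip, comment) with its own count and first/last dates; if it's raw per-flow data I'd GROUP BY in the staging table first):

-- staging table with the same layout as flow (name is just my placeholder)
CREATE TEMP TABLE flow_staging (LIKE flow);

-- '/path/to/nightly.csv' is a placeholder; I'd probably use \copy from psql
COPY flow_staging FROM '/path/to/nightly.csv' WITH CSV;

BEGIN;

-- bump the rows that already exist and widen their seen dates
UPDATE flow f
   SET count      = f.count + s.count,
       first_seen = LEAST(f.first_seen, s.first_seen),
       last_seen  = GREATEST(f.last_seen, s.last_seen)
  FROM flow_staging s
 WHERE f.source_ip = s.source_ip
   AND f.dest_ip   = s.dest_ip
   AND f.comment   = s.comment;
   -- if comment can be NULL these '=' matches will miss;
   -- I'd probably make it NOT NULL DEFAULT ''

-- insert the ones that didn't match anything
INSERT INTO flow (source_ip, source_port, dest_ip, dest_port,
                  comment, count, first_seen, last_seen, fk)
SELECT s.source_ip, s.source_port, s.dest_ip, s.dest_port,
       s.comment, s.count, s.first_seen, s.last_seen, s.fk
  FROM flow_staging s
 WHERE NOT EXISTS (SELECT 1
                     FROM flow f
                    WHERE f.source_ip = s.source_ip
                      AND f.dest_ip   = s.dest_ip
                      AND f.comment   = s.comment);

COMMIT;

My thinking was that doing the UPDATE and INSERT in one transaction keeps the table query-able for the web application while the load runs, but I don't know whether that's a sane starting point at this scale or whether a trigger-based approach would work out better.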
Thanks,
-Mike