On Mon, 2007-01-08 at 14:58, George Nychis wrote:
> Hi,
>
> I have approximately 2 billion data entries that I would like to insert into a database.
> Each entry consists of:
> INT BOOLEAN INT BOOLEAN
>
> I want to populate a table such that it only contains the unique rows, all other data
> should be thrown out. I would say a significant amount of the insertions are going to
> fail due to unique constraints. The unique constraint is on the two integers, not on the
> booleans.
>
> Using mysql, I was able to do this with the following query, for all data files (25574
> data files total):
> mysql -e "use connectivity; LOAD DATA INFILE '/path/to/file' IGNORE INTO TABLE ipconn
> FIELDS TERMINATED BY ' ';"

A quick question: could you run selects or other inserts on that table while the LOAD
DATA INFILE was running? Because I'm guessing it basically locked the whole table while
it ran.

> What I *think* mysql did was sort each data file and do a sort of merge sort between the
> data I was inserting and the data in the database. It would insert the first unique
> instance of a row it saw, and reject all other insertions that violated the unique
> constraint due to the "IGNORE".

That's my guess too, and it would require one big lock on the table, which means no
parallel access. It's also likely that it used a temp table, which doubled the size of
the database while you were inserting.

> From what I understand, this functionality is not in postgresql. Fine, I certainly can't
> change that. But I am looking for a comparable solution for the size of my data.
>
> One solution is to have a temporary table, insert all 2 billion rows, and then copy the
> distinct entries to another table. This would be like one massive sort?
>
> Is this the only/best solution using postgresql?

TANSTAAFL. PostgreSQL is designed so that you can run an import process on that table
while 100 other users still access it at the same time. Because of that, you don't get
to do the dirty, nasty things under the sheets that allow for super easy data loading
and merging like you got with MySQL. Apples and oranges.

Assuming you're loading into an empty table, loading into a temp table and then selecting
the distinct rows out into the final table seems reasonable, and it should run reasonably
fast. If you need to load into an existing table, it might get a little more complex.
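
For what it's worth, here's a rough sketch of that approach. The table and column names
(ipconn, ipconn_stage, src, src_flag, dst, dst_flag) are just guesses at your layout, so
adjust to whatever your schema actually looks like:

    -- Real table: unique on the two integers, not the booleans.
    CREATE TABLE ipconn (
        src      integer NOT NULL,
        src_flag boolean NOT NULL,
        dst      integer NOT NULL,
        dst_flag boolean NOT NULL,
        UNIQUE (src, dst)
    );

    -- Staging table with no constraints, so the bulk load never fails on dups.
    CREATE TEMP TABLE ipconn_stage (
        src      integer,
        src_flag boolean,
        dst      integer,
        dst_flag boolean
    );

    -- One COPY per data file (server-side COPY needs the file readable by the
    -- backend; from psql you can use \copy instead).
    COPY ipconn_stage FROM '/path/to/file' WITH DELIMITER ' ';

    -- Keep one row per (src, dst) pair and move it into the real table.
    INSERT INTO ipconn (src, src_flag, dst, dst_flag)
    SELECT DISTINCT ON (src, dst) src, src_flag, dst, dst_flag
    FROM ipconn_stage;

Note that DISTINCT ON without an ORDER BY keeps an arbitrary row from each duplicate
group, which is about what LOAD DATA ... IGNORE gave you anyway. If the target table
already has data in it, filter the staging rows against it before inserting, something
like:

    INSERT INTO ipconn (src, src_flag, dst, dst_flag)
    SELECT DISTINCT ON (s.src, s.dst) s.src, s.src_flag, s.dst, s.dst_flag
    FROM ipconn_stage s
    WHERE NOT EXISTS (
        SELECT 1 FROM ipconn i WHERE i.src = s.src AND i.dst = s.dst
    );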