Scott Marlowe wrote:
> On Mon, 2007-01-08 at 15:52, George Nychis wrote:
>> Scott Marlowe wrote:
>>> On Mon, 2007-01-08 at 14:58, George Nychis wrote:
>>>> Hi,
>>>>
>>>> I have approximately 2 billion data entries that I would like to insert into a database.
>>>> Each entry consists of:
>>>> INT BOOLEAN INT BOOLEAN
>>>>
>>>> I want to populate a table such that it only contains the unique rows; all other data
>>>> should be thrown out. I would say a significant number of the insertions are going to
>>>> fail due to unique constraints. The unique constraint is on the two integers, not on the
>>>> booleans.
>>>>
>>>> Using MySQL, I was able to do this with the following command, for all data files (25574
>>>> data files total):
>>>> mysql -e "use connectivity; LOAD DATA INFILE '/path/to/file' IGNORE INTO TABLE ipconn
>>>> FIELDS TERMINATED BY ' ';"
>>> A quick question. Could you run selects or other inserts on that table
>>> while the LOAD DATA INFILE was running? Because I'm guessing that it
>>> basically locked the whole table while running.
>> What does this have to do with my question? I don't need to run selects or inserts on the
>> table while the load data is running...
>>
>>>> What I *think* MySQL did was sort each data file and do a sort of merge sort between the
>>>> data I was inserting and the data in the database. It would insert the first unique
>>>> instance of a row it saw, and reject all other insertions that violated the unique
>>>> constraint, due to the "IGNORE".
>>> Me too. Which would require "one big lock" on the table, which would
>>> mean no parallel access.
>> That's fine; it doesn't matter.
>>
>>> It's also likely that it used a temp table, which doubled the size of the
>>> database while you were inserting.
>>>
>>>> From what I understand, this functionality is not in PostgreSQL. Fine, I certainly can't
>>>> change that. But I am looking for a comparable solution for the size of my data.
>>>>
>>>> One solution is to have a temporary table, insert all 2 billion rows, and then copy the
>>>> distinct entries to another table. This would be like one massive sort?
>>>>
>>>> Is this the only/best solution using PostgreSQL?
>>> TANSTAAFL. PostgreSQL is designed so that you can run an import process
>>> on that table while 100 other users still access it at the same time.
>>> Because of that, you don't get to do dirty, nasty things under the
>>> sheets that allow for super-easy data loading and merging like you got
>>> with MySQL. Apples and oranges.
>>>
>>> Assuming you're loading into an empty table, the load-to-temp, select
>>> distinct out into the final table approach seems reasonable and should run
>>> reasonably fast. If you need to load into an existing table, it might get
>>> a little more complex.
>>>
>> The goal is not to run queries while the data is being inserted... I am wondering whether the
>> PostgreSQL method I mentioned for inserting and keeping only the distinct values is the most
>> efficient one; it would produce the same results as the method I described for MySQL.
>
> Did I fail to answer your question?
>
> Sorry if I gave you more information than you needed. Please feel free
> to ask someone else next time.
>

Ahhh, I missed your last paragraph... so much text. Actually, yeah, that answers my question, thank you. I guess it's more a single run through than the MySQL method, which was piece-wise.

Thanks for the help/response.

- George
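
For reference, a minimal sketch of the load-to-temp, SELECT DISTINCT approach Scott suggests. The table name ipconn comes from the MySQL command above; the column names (src, src_flag, dst, dst_flag) are assumptions, since the thread never names them, and the unique pair is taken to be the two integers (src, dst):

    -- Staging table with no constraints, so the bulk load never hits a
    -- uniqueness failure.
    CREATE TEMP TABLE ipconn_staging (
        src      INT,
        src_flag BOOLEAN,
        dst      INT,
        dst_flag BOOLEAN
    );

    -- Bulk-load each space-delimited data file (repeat for all 25574 files,
    -- or concatenate them first); the path is the same placeholder used above.
    COPY ipconn_staging FROM '/path/to/file' WITH DELIMITER ' ';

    -- Keep one row per (src, dst) pair and move it into the real table.
    -- Without an ORDER BY, DISTINCT ON keeps an arbitrary row for each pair,
    -- roughly matching the "first one in wins, rest ignored" behavior.
    INSERT INTO ipconn (src, src_flag, dst, dst_flag)
    SELECT DISTINCT ON (src, dst) src, src_flag, dst, dst_flag
    FROM ipconn_staging;

If ipconn starts out empty, it is usually cheaper to create the unique index on (src, dst) after this insert rather than before, so the load doesn't pay a per-row constraint check.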