I haven't tested, but this is what I would do (it uses arrays, which are handy when you need them), with the names changed to protect the innocent:

begin;

-- create a table with some duplicates in one of the columns (y is the ck)
wsprague=# select x, x%4 as y into temp fbar from generate_series(1,10) as x(x);
SELECT
wsprague=# select * from fbar;
 x  | y
----+---
  1 | 1
  2 | 2
  3 | 3
  4 | 0
  5 | 1
  6 | 2
  7 | 3
  8 | 0
  9 | 1
 10 | 2
(10 rows)

-- create a table with the ck, an array of the pks that share it, and the length of that array
-- (array_accum is the custom aggregate example from the PostgreSQL docs, not a built-in)
select y, x_list, array_upper(x_list, 1) as x_len into baz
from (select y, array_accum(x) as x_list from fbar group by y) a;

wsprague=# select * from baz;
 y |  x_list  | x_len
---+----------+-------
 3 | {3,7}    |     2
 2 | {2,6,10} |     3
 1 | {1,5,9}  |     3
 0 | {4,8}    |     2
(4 rows)

-- delete every row whose pk is not the first element of its group's pk list
wsprague=# delete from fbar where not exists (select 1 from baz where fbar.x=baz.x_list[1]);
DELETE 6
wsprague=# select * from fbar;
 x | y
---+---
 1 | 1
 2 | 2
 3 | 3
 4 | 0
(4 rows)

commit;

I hope that makes sense. It should be easy to make into a function, like clean_table(table_name text, pk_name text, dup_key_name text). I don't know how well it will work with indexes. You could probably even write a function to do the entire import inside postgres, starting with a COPY, moving on to merging tables, and ending with some consistency checks, and thus benefit from transaction isolation for the whole process.

HTH
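PS: a rough, untested sketch of that clean_table() idea. Identifiers are passed as text and quoted with quote_ident; instead of keeping the first element of the pk array it keeps min(pk) within each duplicate-key group, which comes to the same thing when the pk is assigned in load order:

create or replace function clean_table(table_name text, pk_name text, dup_key_name text)
returns void as $$
begin
    -- for each value of the duplicate-key column, keep only the row
    -- with the smallest pk and delete the rest
    execute 'delete from ' || quote_ident(table_name)
         || ' where ' || quote_ident(pk_name) || ' not in'
         || ' (select min(' || quote_ident(pk_name) || ')'
         || ' from ' || quote_ident(table_name)
         || ' group by ' || quote_ident(dup_key_name) || ')';
end;
$$ language plpgsql;

-- e.g. select clean_table('fbar', 'x', 'y');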
On Sat, Mar 8, 2008 at 9:42 PM, brian <brian@xxxxxxxxxxxxxxxx> wrote:
> I have ~350K rows of sample data that has come to me in 64 text files (CSV) that I'd like to import into a new database. Each file can be considered its own category and is so named. That is, each file name will be inserted into a separate categories table. I'd like to relate each row to its category. However, while the rows are unique in each file, some rows are found in more than one file.
>
> I also must parse some of the fields into separate columns, but that's no big deal. But it means that I must do some pre-processing on these files, in any case.
>
> After some thought, I thought I might brute-force the problem with Perl by reading each line of each file into an assoc. array unless it's already in it, in which case I'd append the "key" based on the list of categories that line is found in (I'd still need to parse out the keys later but I'd get rid of the dupes). Each array row would be like so:
>
> 'key' => '1,4,etc.', 'text' => 'a line'
>
> Aside from the fact that the array search would become ridiculously large after just a couple of files, I realised that this is a non-starter for the simple fact that this data comprises less than 25% of the eventual total. So refactoring it in this way would be a waste of time (there will probably be dupes in the files to come).
>
> So, I'd like to instead parse out my columns properly and write each line (along with its category key) to a new, single file to be copied into a working table. Once I've done so, is there an efficient method I can use to select all duplicates (save for the category key) into a set from which I could then select into the final table (and insert the keys into the category join table)?
>
> For example (pk is the PK from the working table and ck is the category key), my dupes query on the working table would give the following set:
>
> pk  ck
>  1   1  a a a a
>  2   3  a a a a
>  3   3  b b b b
>  4   7  b b b b
>  5   4  a a a a
>
> I would then want to insert just the unique rows into the final table yet add all of the PKs and CKs to the category join table. After that was done, I'd delete all of these from the working table and then move the unique rows that are left to the final table (and insert the keys into the join table).
>
> I hope that makes sense. I'm not looking for anyone to do my homework for me; I'm sure I could fix up a tasty function for this (the data is destined for MySQL, alas, but I'll be damned if I won't use PG for the heavy lifting). What I'm really looking for is a handy way to grab all of those dupes.
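On the "handy way to grab all of those dupes" question specifically: group on the data columns, i.e. everything except pk and ck. The table and column names below (work, col1 ... col4) are made up for illustration, and this is untested, but something along these lines gives one output row per duplicated data row, with the pk worth keeping plus arrays of every pk and ck it showed up under:

-- rows that occur more than once (ignoring pk and ck), each with
-- the pk to keep and the full list of pks/cks it duplicates
select min(pk)         as keep_pk,
       array_accum(pk) as all_pks,
       array_accum(ck) as all_cks,
       col1, col2, col3, col4
  from work
 group by col1, col2, col3, col4
having count(*) > 1;

From that set you can fill the category join table (one (keep_pk, ck) pair per element of all_cks), insert a single copy of each row into the final table, delete those pks from the working table, and then move whatever is left over as-is.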