Webb Sprague wrote:
I haven't tested but this is what I would do (uses arrays, which are
handy when you need them), with the names changed to protect the
innocent:
begin;
-- create a table with some duplicates in one of the columns (y is ck);
wsprague=# select x, x%4 as y into temp fbar from generate_series(1,10) as x(a);
SELECT
wsprague=# select * from fbar;
x | y
----+---
1 | 1
2 | 2
3 | 3
4 | 0
5 | 1
6 | 2
7 | 3
8 | 0
9 | 1
10 | 2
(10 rows)
-- create a table with the pk, an array of the duplicate keys, and the
length of that array
select y, x_list, array_upper(x_list, 1) as x_len into baz from
(select y, array_accum(x) as x_list from fbar group by y ) a ;
wsprague=# select * from baz;
y | x_list | x_len
---+----------+-------
3 | {3,7} | 2
2 | {2,6,10} | 3
1 | {1,5,9} | 3
0 | {4,8} | 2
(4 rows)
-- delete all rows that don't have ck in the first element of the pk list
wsprague=# delete from fbar where not exists (select 1 from baz where
fbar.x=baz.x_list[1]);DELETE 6
wsprague=# select * from fbar;
x | y
---+---
1 | 1
2 | 2
3 | 3
4 | 0
(4 rows)
commit;
I hope that makes sense. It should be easy to make into a function
(like clean_table(table_name text, pk_name text, dup_key_name text).
I don't know how well it will work with indexes. You could probably
even write a function to do the entire import inside postgres,
starting with a copy and moving to merging tables and ending with some
consistency checks, and thus benefit from transaction isolation of
the whole process....
Hey, neat idea. I changed things a little:
test=# SELECT * FROM table1;
id | ck | content
----+----+---------
1 | 1 | string1
2 | 2 | string2
3 | 2 | string3
4 | 4 | string1
5 | 5 | string1
6 | 1 | string3
7 | 2 | string1
(7 rows)
test=# SELECT key_list, array_upper(key_list, 1) AS key_num, content
test-# INTO table2 FROM (
test-# SELECT array_accum(ck) AS key_list, content FROM table1 GROUP BY
content
test=# ) a;
SELECT
test=# SELECT * FROM table2;
key_list | key_num | content
-----------+---------+---------
{2,1} | 2 | string3
{1,4,5,2} | 4 | string1
{2} | 1 | string2
(3 rows)
I had to modify your DELETE suggestion a bit because the table1.id (your
x) is non-unique (although it must be within a given key_list). It'll be
a performance hog though because it's comparing each string.
test-# DELETE FROM table1 WHERE NOT EXISTS (
test-# SELECT 1 FROM table2
test-# WHERE table1.content = table2.content
test-# AND table1.ck = table2.key_list[1]
test-# );
DELETE 4
test=# SELECT * FROM table1;
id | ck | content
----+----+---------
1 | 1 | string1
2 | 2 | string2
3 | 2 | string3
(3 rows)
From here I can drop the table1.ck column. But, in order to place the
table2.key_list ints into a join table, I need the PK from table1. I
have the content column with which to fetch that from table1 but that
hardly seems an efficient method. So, I modified your idea for gathering
the dupe CKs a little:
test-# SELECT key_list, array_upper(key_list, 1) AS key_num, content
test-# INTO table2 FROM (
test-# SELECT array_cat(ARRAY[array_accum(id)], ARRAY[array_accum(ck)])
AS key_list, content
test-# FROM table1 GROUP BY content) a;
SELECT
test=# select * from table2;
key_list | key_num | content
-----------------------+---------+---------
{{3,6},{2,1}} | 2 | string3
{{1,4,5,7},{1,4,5,2}} | 2 | string1
{{2},{2}} | 2 | string2
(3 rows)
Huzzah! IOW, I accumulate both the PKs and ck from table1 for each
string. Then, I can ditch all but the first PK, which should make
transferring these values in to the join table a snap.
# lightbulb comes on ...
Actually, what I'll do is remove the extraneous PKs and use them to
DELETE FROM table1. Much more efficient, that.
Thanks for your ideas. And if anyone else has any suggestion please feel
free to chime in. When I get the rest of the data I'll post back here on
how it went in case google points someone here in the future.
b
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general