Search Postgresql Archives

Re: data import - duplicates

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 





Webb Sprague wrote:
I haven't tested but this is what I would do (uses arrays, which are
handy when you need them), with the names changed to protect the
innocent:

begin;
-- create a table with some duplicates in one of the columns (y is ck);
wsprague=# select x, x%4 as y into temp fbar from generate_series(1,10) as x(a);
SELECT

wsprague=# select * from fbar;
 x  | y
----+---
  1 | 1
  2 | 2
  3 | 3
  4 | 0
  5 | 1
  6 | 2
  7 | 3
  8 | 0
  9 | 1
 10 | 2
(10 rows)

-- create a table with the pk, an array of the duplicate keys, and the
length of that array
select y, x_list, array_upper(x_list, 1) as x_len into baz from
(select y, array_accum(x) as x_list from fbar group by y ) a ;

wsprague=# select * from baz;
 y |  x_list  | x_len
---+----------+-------
 3 | {3,7}    |     2
 2 | {2,6,10} |     3
 1 | {1,5,9}  |     3
 0 | {4,8}    |     2
(4 rows)

-- delete all rows that don't have ck in the first element of the pk list
wsprague=# delete from fbar where not exists (select 1 from baz where
fbar.x=baz.x_list[1]);DELETE 6
wsprague=# select * from fbar;
 x | y
---+---
 1 | 1
 2 | 2
 3 | 3
 4 | 0
(4 rows)

commit;

I hope that makes sense.  It should be easy to make into a function
(like clean_table(table_name text, pk_name text, dup_key_name text).
I don't know how well it will work with indexes.  You could probably
even write a function to do the entire import inside postgres,
starting with a copy and moving to merging tables and ending with some
consistency checks, and thus benefit from  transaction isolation of
the whole process....



Hey, neat idea. I changed things a little:

test=# SELECT * FROM table1;
 id | ck | content
----+----+---------
  1 |  1 | string1
  2 |  2 | string2
  3 |  2 | string3
  4 |  4 | string1
  5 |  5 | string1
  6 |  1 | string3
  7 |  2 | string1
(7 rows)

test=# SELECT key_list, array_upper(key_list, 1) AS key_num, content
test-# INTO table2 FROM (
test-# SELECT array_accum(ck) AS key_list, content FROM table1 GROUP BY content
test=# ) a;
SELECT

test=# SELECT * FROM table2;
 key_list  | key_num | content
-----------+---------+---------
 {2,1}     |       2 | string3
 {1,4,5,2} |       4 | string1
 {2}       |       1 | string2
(3 rows)


I had to modify your DELETE suggestion a bit because the table1.id (your x) is non-unique (although it must be within a given key_list). It'll be a performance hog though because it's comparing each string.


test-# DELETE FROM table1 WHERE NOT EXISTS (
test-# SELECT 1 FROM table2
test-# WHERE table1.content = table2.content
test-# AND table1.ck = table2.key_list[1]
test-# );
DELETE 4

test=# SELECT * FROM table1;
 id | ck | content
----+----+---------
  1 |  1 | string1
  2 |  2 | string2
  3 |  2 | string3
(3 rows)

From here I can drop the table1.ck column. But, in order to place the table2.key_list ints into a join table, I need the PK from table1. I have the content column with which to fetch that from table1 but that hardly seems an efficient method. So, I modified your idea for gathering the dupe CKs a little:

test-# SELECT key_list, array_upper(key_list, 1) AS key_num, content
test-# INTO table2 FROM (
test-# SELECT array_cat(ARRAY[array_accum(id)], ARRAY[array_accum(ck)]) AS key_list, content
test-# FROM table1 GROUP BY content) a;
SELECT

test=# select * from table2;
       key_list        | key_num | content
-----------------------+---------+---------
 {{3,6},{2,1}}         |       2 | string3
 {{1,4,5,7},{1,4,5,2}} |       2 | string1
 {{2},{2}}             |       2 | string2
(3 rows)

Huzzah! IOW, I accumulate both the PKs and ck from table1 for each string. Then, I can ditch all but the first PK, which should make transferring these values in to the join table a snap.

# lightbulb comes on ...

Actually, what I'll do is remove the extraneous PKs and use them to DELETE FROM table1. Much more efficient, that.

Thanks for your ideas. And if anyone else has any suggestion please feel free to chime in. When I get the rest of the data I'll post back here on how it went in case google points someone here in the future.

b

--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux