I haven't tested, but this is what I would do (it uses arrays, which are handy when you need them), with the names changed to protect the innocent:

begin;

-- create a table with some duplicates in one of the columns (y is the ck)
wsprague=# select x, x%4 as y into temp fbar from generate_series(1,10) as x(x);
SELECT
wsprague=# select * from fbar;
 x  | y
----+---
  1 | 1
  2 | 2
  3 | 3
  4 | 0
  5 | 1
  6 | 2
  7 | 3
  8 | 0
  9 | 1
 10 | 2
(10 rows)

-- create a table with the ck, an array of the pks that share it, and the length of that array
-- (array_accum is the custom aggregate example from the PostgreSQL docs, not a built-in)
select y, x_list, array_upper(x_list, 1) as x_len into baz
from (select y, array_accum(x) as x_list from fbar group by y) a;

wsprague=# select * from baz;
 y |  x_list  | x_len
---+----------+-------
 3 | {3,7}    |     2
 2 | {2,6,10} |     3
 1 | {1,5,9}  |     3
 0 | {4,8}    |     2
(4 rows)

-- delete every row whose pk is not the first element of its group's pk list
wsprague=# delete from fbar where not exists (select 1 from baz where fbar.x=baz.x_list[1]);
DELETE 6
wsprague=# select * from fbar;
 x | y
---+---
 1 | 1
 2 | 2
 3 | 3
 4 | 0
(4 rows)

commit;

I hope that makes sense. It should be easy to make into a function, like clean_table(table_name text, pk_name text, dup_key_name text). I don't know how well it will work with indexes. You could probably even write a function to do the entire import inside postgres, starting with a COPY, moving on to merging tables, and ending with some consistency checks, and thus benefit from transaction isolation for the whole process.

HTH
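PS: a rough, untested sketch of that clean_table() idea. Identifiers are passed as text and quoted with quote_ident; instead of keeping the first element of the pk array it keeps min(pk) within each duplicate-key group, which comes to the same thing when the pk is assigned in load order:

create or replace function clean_table(table_name text, pk_name text, dup_key_name text)
returns void as $$
begin
    -- for each value of the duplicate-key column, keep only the row
    -- with the smallest pk and delete the rest
    execute 'delete from ' || quote_ident(table_name)
         || ' where ' || quote_ident(pk_name) || ' not in'
         || ' (select min(' || quote_ident(pk_name) || ')'
         || ' from ' || quote_ident(table_name)
         || ' group by ' || quote_ident(dup_key_name) || ')';
end;
$$ language plpgsql;

-- e.g. select clean_table('fbar', 'x', 'y');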
On Sat, Mar 8, 2008 at 9:42 PM, brian <brian@xxxxxxxxxxxxxxxx> wrote:
> I have ~350K rows of sample data that has come to me in 64 text files (CSV) that I'd like to import into a new database. Each file can be considered its own category and is so named. That is, each file name will be inserted into a separate categories table. I'd like to relate each row to its category. However, while the rows are unique in each file, some rows are found in more than one file.
>
> I also must parse some of the fields into separate columns, but that's no big deal. But it means that I must do some pre-processing on these files, in any case.
>
> After some thought, I thought I might brute-force the problem with Perl by reading each line of each file into an assoc. array unless it's already in it, in which case I'd append the "key" based on the list of categories that line is found in (I'd still need to parse out the keys later but I'd get rid of the dupes). Each array row would be like so:
>
> 'key' => '1,4,etc.', 'text' => 'a line'
>
> Aside from the fact that the array search would become ridiculously large after just a couple of files, I realised that this is a non-starter for the simple fact that this data comprises less than 25% of the eventual total. So refactoring it in this way would be a waste of time (there will probably be dupes in the files to come).
>
> So, I'd like to instead parse out my columns properly and write each line (along with its category key) to a new, single file to be copied into a working table. Once I've done so, is there an efficient method I can use to select all duplicates (save for the category key) into a set from which I could then select into the final table (and insert the keys into the category join table)?
>
> For example (pk is the PK from the working table and ck is the category key), my dupes query on the working table would give the following set:
>
> pk  ck
>  1   1  a a a a
>  2   3  a a a a
>  3   3  b b b b
>  4   7  b b b b
>  5   4  a a a a
>
> I would then want to insert just the unique rows into the final table yet add all of the PKs and CKs to the category join table. After that was done, I'd delete all of these from the working table and then move the unique rows that are left to the final table (and insert the keys into the join table).
>
> I hope that makes sense. I'm not looking for anyone to do my homework for me; I'm sure I could fix up a tasty function for this (the data is destined for MySQL, alas, but I'll be damned if I won't use PG for the heavy lifting). What I'm really looking for is a handy way to grab all of those dupes.
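On the "handy way to grab all of those dupes" question specifically: group on the data columns, i.e. everything except pk and ck. The table and column names below (work, col1 ... col4) are made up for illustration, and this is untested, but something along these lines gives one output row per duplicated data row, with the pk worth keeping plus arrays of every pk and ck it showed up under:

-- rows that occur more than once (ignoring pk and ck), each with
-- the pk to keep and the full list of pks/cks it duplicates
select min(pk)         as keep_pk,
       array_accum(pk) as all_pks,
       array_accum(ck) as all_cks,
       col1, col2, col3, col4
  from work
 group by col1, col2, col3, col4
having count(*) > 1;

From that set you can fill the category join table (one (keep_pk, ck) pair per element of all_cks), insert a single copy of each row into the final table, delete those pks from the working table, and then move whatever is left over as-is.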