data import - duplicates

I have ~350K rows of sample data that has come to me in 64 text files (CSV) that I'd like to import into a new database. Each file can be considered its own category and is so named. That is, each file name will be inserted into a separate categories table. I'd like to relate each row to its category. However, while the rows are unique in each file, some rows are found in more than one file.
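
For reference, the sort of schema I have in mind looks roughly like this (table and column names are only placeholders, and I've pretended each line parses into three text columns):

CREATE TABLE categories (
    id   serial PRIMARY KEY,
    name text NOT NULL UNIQUE        -- one row per source file
);

CREATE TABLE items (
    id   serial PRIMARY KEY,
    col1 text,                       -- the fields parsed out of each CSV line
    col2 text,
    col3 text
);

CREATE TABLE items_categories (
    item_id     integer NOT NULL REFERENCES items,
    category_id integer NOT NULL REFERENCES categories,
    PRIMARY KEY (item_id, category_id)
);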

I also must parse some of the fields into separate columns, but that's no big deal. It does mean, though, that I have to do some pre-processing on these files in any case.

After some thought, I figured I might brute-force the problem with Perl by reading each line of each file into an associative array unless it's already in it, in which case I'd append to its "key" the additional category that line is found in (I'd still need to parse out the keys later, but I'd get rid of the dupes). Each array entry would look like so:

'key' => '1,4,etc.', 'text' => 'a line'

Aside from the fact that the array (and every lookup against it) would become ridiculously large after just a couple of files, I realised that this is a non-starter for the simple reason that this data comprises less than 25% of the eventual total. Refactoring it this way would therefore be a waste of time (there will probably be dupes in the files still to come).

So I'd like to instead parse out my columns properly and write each line (along with its category key) to a new, single file to be copied into a working table. Once I've done so, is there an efficient method I can use to select all duplicates (ignoring the category key) into a set from which I could then select into the final table (and insert the keys into the category join table)?
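
Something like this is what I'm picturing for the working table and the dupes query (same placeholder names as above; the GROUP BY ... HAVING subquery is just one way I know of to isolate rows whose data columns occur more than once):

-- Working table: one row per line of the combined, pre-parsed file.
CREATE TABLE working (
    pk   serial PRIMARY KEY,
    ck   integer NOT NULL,           -- category key
    col1 text,
    col2 text,
    col3 text
);

-- Load the single pre-processed file (path made up).
COPY working (ck, col1, col2, col3) FROM '/path/to/combined.csv' CSV;

-- All rows whose data columns (ignoring pk and ck) appear more than once.
SELECT w.pk, w.ck, w.col1, w.col2, w.col3
FROM working w
JOIN (SELECT col1, col2, col3
      FROM working
      GROUP BY col1, col2, col3
      HAVING count(*) > 1) d USING (col1, col2, col3)
ORDER BY w.col1, w.col2, w.col3, w.pk;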

For example (pk is the PK from the working table and ck is the category key), my dupes query on the working table would give the following set:

pk  ck  row data
1   1   a  a  a  a
2   3   a  a  a  a
3   3   b  b  b  b
4   7   b  b  b  b
5   4   a  a  a  a

I would then want to insert just one copy of each duplicated row into the final table, yet add all of the PKs and CKs to the category join table. After that was done, I'd delete all of these from the working table and then move the unique rows that are left to the final table (inserting their keys into the join table as well).
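
In fact, I suspect the whole thing could collapse into a couple of statements that handle the dupes and the leftover unique rows in one pass (a sketch only, same placeholder names as above, and assuming the data columns are NOT NULL so a plain equality join works):

-- One copy of every distinct data row, duplicated or not.
INSERT INTO items (col1, col2, col3)
SELECT DISTINCT col1, col2, col3
FROM working;

-- Every (item, category) pairing, joining back on the data columns.
INSERT INTO items_categories (item_id, category_id)
SELECT DISTINCT i.id, w.ck
FROM working w
JOIN items i ON i.col1 = w.col1 AND i.col2 = w.col2 AND i.col3 = w.col3;

-- The working table can then be emptied for the next batch of files.
TRUNCATE working;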

I hope that makes sense. I'm not looking for anyone to do my homework for me; I'm sure I could fix up a tasty function for this (the data is destined for MySQL, alas, but I'll be damned if I won't use PG for the heavy lifting). What I'm really looking for is a handy way to grab all of those dupes.
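
(One other way I've seen to grab the dupes, if the server is 8.4 or later, is a window function that tags every row with the size of its group; filtering on that count gives the same set as the GROUP BY version above:)

-- Alternative dupe query using a window function (8.4+); same placeholder names.
SELECT pk, ck, col1, col2, col3
FROM (SELECT pk, ck, col1, col2, col3,
             count(*) OVER (PARTITION BY col1, col2, col3) AS copies
      FROM working) s
WHERE copies > 1
ORDER BY col1, col2, col3, pk;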

