On Fri, Nov 25, 2011 at 11:05 AM, Rich Shepard <rshepard@xxxxxxxxxxxxxxx> wrote:
The data originated in a spreadsheet and, based on my experience, contains
duplicate records. After reformatting there are 143,260 rows to insert in
the table. The approach I tried seems to have problems (explained below) and
I would like to learn the proper way to insert rows in either an empty table
or one with existing rows since I'll need to do this procedure for my
projects.
The table was created with the primary key and I used INSERT INTO ... to
load the data. Many duplicate records, so I split the file into smaller ones
and re-ran the command to load them. I'd then remove the reported duplicate
rows from the source (text) file. Between runs, I'd issue the DROP TABLE
command within psql and check it was empty using 'select count(*) from
waterchem;'.
It appeared that I removed duplicates from the first couple of smaller
files so I combined them into one file named ok.sql. But, when I tested the
combined file it, too, reported many duplicate records. Something wrong
here.
Perhaps a better approach is to put the CREATE TABLE command above the
INSERT INTO section of the file (without specifying a PK), load that using
the command 'psql -d <database> -f waterchem.sql', then add the PK and
remove duplicates as postgres presents them. Or, perhaps there is a much
more efficient way to accomplish this task.
Rather than my flailing around and spending a lot of time failing to load
all non-duplicate rows into the table I'd like to learn the proper way to
accomplish this task. Suggestions, recommendations, and your experiences are
requested.
Why don't you first load the data into a table (no primary key), then use SQL to find your dups?
once loaded:
SELECT <primary_key_column>, count(1) from <table> group by 1 having count(1) > 1;
At least then, you'll really know what you're in for. You can either script a DELETE or... whatever you want to do, once clean, you can add the PK.
--
Scott
TIA,
Rich
--
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general