Re: How to skip duplicate records while copying from CSV to table in Postgresql using "COPY"

Francisco Olarte <folarte@xxxxxxxxxxxxxx> · Sun, 24 May 2015 21:08:18 +0200

Hi Arup

On Sun, May 24, 2015 at 12:26 PM, Arup Rakshit
<aruprakshit@xxxxxxxxxxxxxx> wrote:
> I am copying data from a CSV file to a table using the "COPY" command. The one thing I am stuck on is how to skip duplicate records while copying from the CSV into the table. Looking at the documentation, it seems PostgreSQL doesn't have any built-in tool to handle this with the "COPY" command. From Googling I found the idea below of using a temp table.
> http://stackoverflow.com/questions/13947327/to-ignore-duplicate-keys-during-copy-from-in-postgresql
> I am also wondering whether I could let the records get inserted and then delete the duplicate records from the table, as this post suggested - http://www.postgresql.org/message-id/37013500.DFF0A64A@xxxxxxxxxxxxxxxxxxxx.
> Both solutions look like doing double work, and I am not sure which is the better one here. Can anybody suggest which approach I should adopt? Or if you have any better ideas for this task, please share.

Without looking at your SO answers (I suppose both consist of copying
to a temp table, then doing either an 'insert into main table select *
from temp where not exists...' or a delete from temp followed by an
insert into main, which is what I would suggest), I would point out
that COPY is not the tool for this kind of job. COPY is normally used
for bulk loading of correct data, and it is great at that. If you need
to preclean the data, you are not doing double work by using one of
those solutions.

Many ways of doing it have been pointed out; the exact one depends on
your data. You've said there are duplicate rows; if that is the case,
you can easily remove them using text tools on the dump, or using the
temporary table. There is also the solution of just turning your CSV,
with your favorite text handling tool, into a set of insert lines,
each conditioned on the nonexistence of a matching row for an
appropriately indexed set of fields in the table (which I've done but
cannot quote exactly; it was an experiment and got quite hairy).

From my experience, go for the temporary table plus insert-select or
delete-then-insert route: it's easier, it is normally faster
(supposing you have appropriate indexes), and it really is no double
work. You have dirty data (duplicates), so you have to clean it before
inserting.
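
To make the staging-table route concrete, here is a minimal sketch. It
is not taken from the thread: it uses Python's built-in sqlite3 module
purely as a stand-in so it runs without a server, and the table names
(main_table, staging), the columns (id, name) and the sample CSV are
all hypothetical. In PostgreSQL the staging table would be a temp
table filled with COPY, and the insert-select would look much the same.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content; ids 1 and 2 each appear twice.
CSV_TEXT = """id,name
1,alice
2,bob
2,bob
3,carol
1,alice
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE main_table (id INTEGER PRIMARY KEY, name TEXT)")
# Pretend id 3 was loaded by an earlier run.
conn.execute("INSERT INTO main_table VALUES (3, 'carol')")

# Staging table without constraints, so the raw load cannot fail on
# duplicates. (Assumption: in PostgreSQL this is CREATE TEMP TABLE
# followed by COPY staging FROM ...; here rows are inserted directly.)
conn.execute("CREATE TEMP TABLE staging (id INTEGER, name TEXT)")
rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
conn.executemany("INSERT INTO staging VALUES (:id, :name)", rows)

# Insert-select: DISTINCT collapses fully identical rows within the
# file, NOT EXISTS skips keys already present in the main table.
# Rows sharing an id but differing in name would still collide and
# need their own cleanup rule.
conn.execute("""
    INSERT INTO main_table (id, name)
    SELECT DISTINCT s.id, s.name
    FROM staging AS s
    WHERE NOT EXISTS (SELECT 1 FROM main_table AS m WHERE m.id = s.id)
""")
conn.commit()

loaded = conn.execute("SELECT id, name FROM main_table ORDER BY id").fetchall()
print(loaded)
```

The point of the design is that the load step and the cleaning step
are separate: the staging table accepts everything, and a single
set-based statement decides what reaches the real table.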

Or, if you have some unique constraint on the table, you could turn the
file into a series of plain inserts and feed it to psql in autocommit
mode, so that it errors out on every duplicate while the other rows go
in. It's not pretty, but it should work. I still recommend the temp
table approach.
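
The row-by-row alternative can be sketched the same way. Again this is
only an illustration under assumptions: sqlite3 in autocommit mode
stands in for psql, and the table, columns and sample rows are
hypothetical. Each insert is its own transaction, so a duplicate fails
alone and the remaining rows still go in.

```python
import sqlite3

# isolation_level=None puts the sqlite3 module in autocommit mode,
# standing in for psql's default autocommit behaviour.
conn2 = sqlite3.connect(":memory:", isolation_level=None)
conn2.execute("CREATE TABLE main_table (id INTEGER PRIMARY KEY, name TEXT)")

# Hypothetical rows as they would come out of the CSV file.
csv_rows = [(1, "alice"), (2, "bob"), (2, "bob"), (3, "carol"), (1, "alice")]

inserted = 0
rejected = 0
for row in csv_rows:
    try:
        conn2.execute("INSERT INTO main_table (id, name) VALUES (?, ?)", row)
        inserted += 1
    except sqlite3.IntegrityError:
        # The primary key rejects the duplicate; only this row is lost.
        rejected += 1

final_rows = conn2.execute("SELECT id, name FROM main_table ORDER BY id").fetchall()
print(inserted, rejected, final_rows)
```

One statement and one error per duplicate is what makes this slow and
noisy on large files, which fits the recommendation to prefer the temp
table.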

Francisco Olarte.

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)