On 6/17/19 8:14 AM, Leandro Guimarães wrote:
Hi Adrian,
You are right, these fields are in CHECK CONSTRAINTS and they are
not formally defined as primary keys.
Alright. Two things:
1) If you are thinking of them as keys, why not make them a PK or a
UNIQUE index? (See the sketch below.)
2) It is still not clear to me whether you are looking for duplicated
information within a row or between rows.
To put it another way, what are the CHECK constraints doing?
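For example, a minimal sketch of option 1 (assuming the four columns you
list below are the intended key; the table and constraint names here are
placeholders, and note that with 9.4's inheritance-based partitioning a
unique constraint is only enforced within each child table):

    -- hypothetical child partition; repeat for each partition
    ALTER TABLE final_table_20190614
        ADD CONSTRAINT final_table_20190614_key
        UNIQUE (customer_id, date_time, indicator_id, element_id);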
Thanks!
Leandro Guimarães
On Sat, Jun 15, 2019 at 10:45 AM Adrian Klaver
<adrian.klaver@xxxxxxxxxxx> wrote:
On 6/14/19 7:24 PM, Leandro Guimarães wrote:
> Hi Tim, thanks for your answer!
>
> The columns were just examples, but let me explain the database
> structure; the fields in *bold* are the keys:
>
> *customer_id integer*
> *date_time timestamp*
> *indicator_id integer*
> *element_id integer*
> indicator_value double precision
Huh, earlier you said you had a check constraint that was causing issues.
Does that still exist, or were you referring to the keys above?
Are the keys above formally defined as the PRIMARY KEY?
>
> The table is partitioned per day and customer_id (it works great); the
> problem is just the duplicated key situation that I'm really worried about.
>
> I populate the database via a Java Application with JDBC.
>
> Maybe this info could help to provide some light!
>
> Thanks Again!
>
> Leandro Guimarães
>
>
>
> On Fri, Jun 14, 2019 at 7:39 PM Tim Cross <theophilusx@xxxxxxxxx> wrote:
>
>
> Leandro Guimarães <leo.guimaraes@xxxxxxxxx> writes:
>
> > Hi,
> >
> > I have a scenario with a large table and I'm trying to insert into it
> > via a COPY command with a csv file.
> >
> > Everything works, but sometimes my source .csv file has data that is
> > duplicated in the previously filled table. If I add a check constraint
> > and try to run the COPY command, I get an error that stops the whole
> > insertion.
> >
> > I've tried to put the data in a tmp table and fill the main table
> > using DISTINCT this way (the fields and names are just examples):
> >
> > INSERT INTO final_table (name, document)
> > SELECT DISTINCT name, document
> > FROM tmp_table t1
> > WHERE NOT EXISTS (
> >     SELECT 1 FROM final_table t2
> >     WHERE (t2.name, t2.document)
> >           IS NOT DISTINCT FROM (t1.name, t1.document))
> >
> > The problem is that my final_table is a large (and partitioned) table
> > and this query is taking a long time to execute.
> >
> > Does anyone have any idea (really guys, anything would be great) how
> > to solve this situation? I need to ignore duplicates instead of
> > getting an error.
> >
> > I'm using *PostgreSQL 9.4*, so I can't use "ON CONFLICT", and an
> > upgrade is not an option.
> >
>
> Explain plan would probably shed some light, but I suspect your
> performance is being heavily hit by the subquery. Distinct is an
> expensive operation and you are performing it once for every distinct
> row in your temp table.
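> For instance, something along these lines (a sketch reusing the example
> query; note that EXPLAIN ANALYZE actually executes the insert, so wrap
> it in a transaction and roll it back):
>
>     BEGIN;
>     EXPLAIN (ANALYZE, BUFFERS)
>     INSERT INTO final_table (name, document)
>     SELECT DISTINCT name, document
>     FROM tmp_table t1
>     WHERE NOT EXISTS (
>         SELECT 1 FROM final_table t2
>         WHERE (t2.name, t2.document)
>               IS NOT DISTINCT FROM (t1.name, t1.document));
>     ROLLBACK;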
>
> It isn't clear what the primary key is for your final table - name +
> document seems suspicious given these seem to be the only two columns
> you're inserting as well. You don't indicate what the data types are
> either - if document is something like 'text' then using it in a
> distinct clause is likely to have a huge performance impact.
>
> The first thing I'd do is to eliminate duplicates from your temp table
> as a separate statement, or by pre-filtering the CSV before import. I
> would then try something like an outer join to identify rows in your
> temp table which don't exist in your final table and select from there
> to insert into the final table (see the sketch below). You don't really
> need the distinct in the subquery, as all you really need to know is
> whether (name, document) exists - it doesn't matter if more than one
> exists (for this test).
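> A rough sketch of both steps, reusing the example names from the query
> above (the plain equality join below won't treat NULLs as matches, so
> it assumes name and document are not null):
>
>     -- 1. de-duplicate the temp table as a separate statement
>     CREATE TEMP TABLE tmp_dedup AS
>     SELECT DISTINCT name, document FROM tmp_table;
>
>     -- 2. outer join: keep only rows with no match in final_table
>     INSERT INTO final_table (name, document)
>     SELECT d.name, d.document
>     FROM tmp_dedup d
>     LEFT JOIN final_table f
>            ON f.name = d.name AND f.document = d.document
>     WHERE f.name IS NULL;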
>
> If you really don't have something more specific for a primary key,
> then depending on what data type 'document' is and how large it is, you
> may find adding a column which is a checksum of your 'document' field a
> useful addition. I have done this in the past for an application where
> name was not unique and we only wanted distinct instances of 'document'
> (document was a fairly large XML document in that case).
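> Something like this, as a sketch (it assumes document is of type text
> so the built-in md5() applies; the names are placeholders, and on an
> inheritance-partitioned 9.4 table the unique index would have to be
> created on each child):
>
>     ALTER TABLE final_table ADD COLUMN document_md5 text;
>     UPDATE final_table SET document_md5 = md5(document);
>
>     -- compare and index the short checksum, not the full document
>     CREATE UNIQUE INDEX final_table_name_md5_idx
>         ON final_table (name, document_md5);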
>
> --
> Tim Cross
>
>
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx