
Re: Copy Bulk Ignore Duplicated

On 6/14/19 7:24 PM, Leandro Guimarães wrote:
Hi Tim, thanks for your answer!

The columns were just examples, but let me explain the database structure, the fields in *bold are the keys*:

*customer_id integer*
*date_time timestamp*
*indicator_id integer*
*element_id integer*
indicator_value double precision

Huh, earlier you said you had a check constraint that was causing issues.

Does that also exist, or were you referring to the keys above?

Are the keys above formally defined as the PRIMARY KEY?

The table is partitioned per day and customer_id (it works great); the problem is just the duplicate-key situation, which I'm really worried about.

I populate the database via a Java Application with JDBC.

Maybe this info helps shed some light!

Thanks Again!

Leandro Guimarães



On Fri, Jun 14, 2019 at 7:39 PM Tim Cross <theophilusx@xxxxxxxxx> wrote:


    Leandro Guimarães <leo.guimaraes@xxxxxxxxx> writes:

     > Hi,
     >
     >    I have a scenario with a large table and I'm trying to insert
    it via a
     > COPY command with a csv file.
     >
     >    Everything works, but sometimes my source .csv file has
    duplicated data
     > in the previously fulfilled table. If I add a check constraint
    and try to
     > run the COPY command I have an error that stops the whole insertion.
     >
     >   I've tried to put the data in a tmp table and fill the main using
     > distinct this way (the fields and names are just examples):
     >
     > INSERT INTO final_table (name, document)
     >    SELECT DISTINCT name, document
     >    FROM tmp_TABLE t1
     >    WHERE NOT EXISTS (
     >    SELECT 1 FROM final_table t2
     >    WHERE (t2.name, t2.document)
     >    IS NOT DISTINCT FROM (t1.name, t1.document))
     >
     > The problem is that my final_table is a large (and partitioned)
    table and
     > this query is taking a long time to execute.
     >
     > Someone have any idea (really guys anything would be great) how
    to solve
     > this situation? I need to ignore duplicates instead to have some
    error.
     >
     > I'm using *PostgreSQL 9.4*, so I can't use "ON CONFLICT", and
     > upgrading is not an option.
     >
     >

    An EXPLAIN plan would probably shed some light, but I suspect your
    performance is being heavily hit by the subquery. DISTINCT is an
    expensive operation, and the NOT EXISTS subquery is executed once for
    every row in your temp table.

    It isn't clear what the primary key is for your final table - name +
    document seems suspicious, given these seem to be the only two columns
    you're inserting as well. You don't indicate what the data types are
    either - if document is something like 'text', then using it in a
    distinct clause is likely to have a huge performance impact.

    The first thing I'd do is to eliminate duplicates from your temp table
    as a separate statement, or by pre-filtering the CSV before import. I
    would then try something like an outer join to identify rows in your
    temp table which don't exist in your final table, and select from there
    to insert into the final table. You don't really need the DISTINCT in
    the subquery, as all you really need to know is whether (name, document)
    exists - it doesn't matter if more than one row does (for this test).
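    The outer-join approach above might look roughly like this (a sketch
    only, reusing the example table and column names from this thread;
    untested):

    ```sql
    -- 1. De-duplicate the staging table itself first. ctid is a
    --    physical row identifier, so this keeps exactly one copy
    --    of each (name, document) pair.
    DELETE FROM tmp_table a
           USING tmp_table b
    WHERE  a.ctid < b.ctid
      AND  a.name     IS NOT DISTINCT FROM b.name
      AND  a.document IS NOT DISTINCT FROM b.document;

    -- 2. Anti-join against the final table: no DISTINCT needed,
    --    only an existence test. t2.ctid is never NULL for a
    --    matched row, so "t2.ctid IS NULL" means "no match found".
    INSERT INTO final_table (name, document)
    SELECT t1.name, t1.document
    FROM   tmp_table t1
    LEFT   JOIN final_table t2
           ON  t2.name     IS NOT DISTINCT FROM t1.name
           AND t2.document IS NOT DISTINCT FROM t1.document
    WHERE  t2.ctid IS NULL;
    ```

    The IS NOT DISTINCT FROM comparisons are only needed if the key
    columns can be NULL; with NOT NULL keys, plain equality lets the
    planner use ordinary indexes.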

    If you really don't have something more specific for a primary key,
    depending on what data type 'document' is and how large it is, you may
    find adding a column which is a checksum of your 'document' field a
    useful addition. I have done this in the past where I had an application
    where name was not unique and we only wanted distinct instances of
    'document' (document was a fairly large XML document in this case).
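    A rough sketch of that idea (assuming 'document' is text, that an md5
    digest is an acceptable checksum, and with made-up column and index
    names):

    ```sql
    -- Store a short digest alongside the large document
    ALTER TABLE final_table ADD COLUMN document_md5 text;
    UPDATE final_table SET document_md5 = md5(document);

    -- Index the digest so the existence test is an index lookup
    CREATE INDEX final_table_doc_md5_idx
        ON final_table (name, document_md5);

    -- The duplicate test then compares 32-character hashes
    -- instead of whole documents
    SELECT t1.name, t1.document
    FROM   tmp_table t1
    WHERE  NOT EXISTS (
        SELECT 1
        FROM   final_table t2
        WHERE  t2.name = t1.name
          AND  t2.document_md5 = md5(t1.document));
    ```

    Note that with 9.4-style inheritance partitioning, the index would
    have to be created on each child table, not just the parent.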

    --
    Tim Cross




--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx




