
Re: Copy Bulk Ignore Duplicated

Hi Tim, thanks for your answer!

The columns were just examples, but let me explain the actual database structure; the fields in bold are the keys:

customer_id integer
date_time timestamp
indicator_id integer
element_id integer
indicator_value double precision

The table is partitioned by day and customer_id (it works great); the problem is just the duplicate-key situation, which I'm really worried about.
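
Roughly, the structure looks like this on 9.4 with inheritance-based partitioning (the table and partition names below are just illustrative, and I'm taking the four non-value columns as the key):

CREATE TABLE indicator_values (
    customer_id     integer NOT NULL,
    date_time       timestamp NOT NULL,
    indicator_id    integer NOT NULL,
    element_id      integer NOT NULL,
    indicator_value double precision,
    PRIMARY KEY (customer_id, date_time, indicator_id, element_id)
);

-- 9.4 has no declarative partitioning: one child table per customer_id and day.
CREATE TABLE indicator_values_c42_20190614 (
    CHECK (customer_id = 42
           AND date_time >= '2019-06-14' AND date_time < '2019-06-15')
) INHERITS (indicator_values);

-- constraints and indexes are not inherited, so each child gets its own key;
-- this unique index is what raises the duplicate-key error during COPY.
CREATE UNIQUE INDEX ON indicator_values_c42_20190614
    (customer_id, date_time, indicator_id, element_id);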

I populate the database from a Java application via JDBC.

Maybe this info helps shed some light!

Thanks Again!

Leandro Guimarães 



On Fri, Jun 14, 2019 at 7:39 PM Tim Cross <theophilusx@xxxxxxxxx> wrote:

Leandro Guimarães <leo.guimaraes@xxxxxxxxx> writes:

> Hi,
>
>    I have a scenario with a large table and I'm trying to insert it via a
> COPY command with a csv file.
>
>    Everything works, but sometimes my source .csv file contains data that is
> already present in the previously populated table. If I add a check constraint
> and try to run the COPY command, I get an error that stops the whole insertion.
>
>   I've tried to put the data in a tmp table and fill the main one using
> DISTINCT, like this (the fields and names are just examples):
>
> INSERT INTO final_table (name, document)
>    SELECT DISTINCT name, document
>    FROM tmp_TABLE t1
>    WHERE NOT EXISTS (
>        SELECT 1 FROM final_table t2
>        WHERE (t2.name, t2.document)
>              IS NOT DISTINCT FROM (t1.name, t1.document))
>
> The problem is that my final_table is a large (and partitioned) table and
> this query is taking a long time to execute.
>
> Does anyone have any idea (really, anything would be great) how to solve
> this situation? I need to ignore duplicates instead of getting an error.
>
> I'm using PostgreSQL 9.4, so I can't use "ON CONFLICT", and an upgrade is
> not an option.
>

An explain plan would probably shed some light, but I suspect your
performance is being heavily hit by the subquery. DISTINCT is an
expensive operation, and on top of that the correlated subquery is
executed once for every row in your temp table.
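
For example, something along these lines will show where the time is going,
using your example names (note that EXPLAIN ANALYZE actually executes the
INSERT, so wrap it in a transaction you roll back):

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
INSERT INTO final_table (name, document)
SELECT DISTINCT name, document
FROM tmp_TABLE t1
WHERE NOT EXISTS (
    SELECT 1 FROM final_table t2
    WHERE (t2.name, t2.document) IS NOT DISTINCT FROM (t1.name, t1.document));
ROLLBACK;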

It isn't clear what the primary key is for your final table - name +
document seems suspicious given these seem to be the only two columns
you're inserting as well. You don't indicate what the data types are
either - if document is something like 'text' then using it in a
distinct clause is likely to have a huge performance impact.

The first thing I'd do is eliminate duplicates from your temp table as
a separate statement, or by pre-filtering the CSV before import. I
would then try something like an outer join to identify rows in your
temp table which don't exist in your final table and select from there
to insert into the final table. You don't really need the DISTINCT in
the subquery, as all you need to know is whether (name, document)
exists - it doesn't matter if more than one exists (for this test).
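
As a rough, untested sketch using the example names from your query:

-- 1. de-duplicate the freshly loaded data once, as its own step
CREATE TEMP TABLE tmp_dedup AS
SELECT DISTINCT name, document
FROM tmp_TABLE;

-- 2. anti-join: insert only the rows that have no match in the final table
INSERT INTO final_table (name, document)
SELECT d.name, d.document
FROM tmp_dedup d
LEFT JOIN final_table f
       ON f.name = d.name
      AND f.document = d.document   -- use IS NOT DISTINCT FROM if NULLs can occur
WHERE f.name IS NULL;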

If you really don't have something more specific for a primary key,
depending on what data type 'document' is and how large it is, you may
find it useful to add a column which is a checksum of your 'document'
field. I have done this in the past in an application where name was
not unique and we only wanted distinct instances of 'document'
(document was a fairly large XML document in that case).
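
For example (md5 is just one option, and the column and index names here are
only illustrative):

ALTER TABLE final_table ADD COLUMN document_md5 text;
UPDATE final_table SET document_md5 = md5(document);
CREATE INDEX final_table_document_md5_idx ON final_table (document_md5);

-- the duplicate test can then compare a short, indexable value instead of
-- the full document, e.g. in the anti-join above:
--   ON f.name = d.name AND f.document_md5 = md5(d.document)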

--
Tim Cross


