On 6/17/19 8:14 AM, Leandro Guimarães wrote:
Hi Adrian,
You are right, these fields are in CHECK CONSTRAINTS and they are
not formally defined as primary keys.
Alright. Two things:
1) If you are thinking of them as keys, why not make them a PK or a
UNIQUE index? (See the sketch below.)
2) It is still not clear to me whether you are looking for duplicated
information within a row or between rows.
To put it another way, what are the CHECK constraints doing?
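For example, a minimal sketch of option 1 (assuming the four columns you
list below are the intended key; the table and constraint names here are
placeholders, and note that with 9.4's inheritance-based partitioning a
unique constraint is only enforced within each child table):

    -- hypothetical child partition; repeat for each partition
    ALTER TABLE final_table_20190614
        ADD CONSTRAINT final_table_20190614_key
        UNIQUE (customer_id, date_time, indicator_id, element_id);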
Thanks!
Leandro Guimarães
On Sat, Jun 15, 2019 at 10:45 AM Adrian Klaver
<adrian.klaver@xxxxxxxxxxx> wrote:
On 6/14/19 7:24 PM, Leandro Guimarães wrote:
> Hi Tim, thanks for your answer!
>
> The columns were just examples, but let me explain the database
> structure; the fields in *bold* are the keys:
>
> *customer_id integer*
> *date_time timestamp*
> *indicator_id integer*
> *element_id integer*
> indicator_value double precision
Huh, earlier you said you had a check constraint that was causing issues.
Does that still exist, or were you referring to the keys above?
Are the keys above formally defined as the PRIMARY KEY?
>
> The table is partitioned per day and customer_id (it works great); the
> problem is just the duplicated key situation that I'm really worried about.
>
> I populate the database via a Java Application with JDBC.
>
> Maybe this info could help to provide some light!
>
> Thanks Again!
>
> Leandro Guimarães
>
>
>
> On Fri, Jun 14, 2019 at 7:39 PM Tim Cross <theophilusx@xxxxxxxxx> wrote:
>
>
> Leandro Guimarães <leo.guimaraes@xxxxxxxxx> writes:
>
> > Hi,
> >
> > I have a scenario with a large table and I'm trying to insert into it
> > via a COPY command with a csv file.
> >
> > Everything works, but sometimes my source .csv file has data that is
> > duplicated in the previously filled table. If I add a check constraint
> > and try to run the COPY command, I get an error that stops the whole
> > insertion.
> >
> > I've tried to put the data in a tmp table and fill the main table
> > using DISTINCT this way (the fields and names are just examples):
> >
> > INSERT INTO final_table (name, document)
> > SELECT DISTINCT name, document
> > FROM tmp_table t1
> > WHERE NOT EXISTS (
> >     SELECT 1 FROM final_table t2
> >     WHERE (t2.name, t2.document)
> >           IS NOT DISTINCT FROM (t1.name, t1.document))
> >
> > The problem is that my final_table is a large (and partitioned) table
> > and this query is taking a long time to execute.
> >
> > Does anyone have any idea (really guys, anything would be great) how
> > to solve this situation? I need to ignore duplicates instead of
> > getting an error.
> >
> > I'm using *PostgreSQL 9.4*, so I can't use "ON CONFLICT", and an
> > upgrade is not an option.
> >
>
> Explain plan would probably shed some light, but I suspect your
> performance is being heavily hit by the subquery. Distinct is an
> expensive operation and you are performing it once for every distinct
> row in your temp table.
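> For instance, something along these lines (a sketch reusing the example
> query; note that EXPLAIN ANALYZE actually executes the insert, so wrap
> it in a transaction and roll it back):
>
>     BEGIN;
>     EXPLAIN (ANALYZE, BUFFERS)
>     INSERT INTO final_table (name, document)
>     SELECT DISTINCT name, document
>     FROM tmp_table t1
>     WHERE NOT EXISTS (
>         SELECT 1 FROM final_table t2
>         WHERE (t2.name, t2.document)
>               IS NOT DISTINCT FROM (t1.name, t1.document));
>     ROLLBACK;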
>
> It isn't clear what the primary key is for your final table - name +
> document seems suspicious given these seem to be the only two columns
> you're inserting as well. You don't indicate what the data types are
> either - if document is something like 'text' then using it in a
> distinct clause is likely to have a huge performance impact.
>
> The first thing I'd do is to eliminate duplicates from your temp table
> as a separate statement, or by pre-filtering the CSV before import. I
> would then try something like an outer join to identify rows in your
> temp table which don't exist in your final table and select from there
> to insert into the final table (see the sketch below). You don't really
> need the distinct in the subquery, as all you really need to know is
> whether (name, document) exists - it doesn't matter if more than one
> exists (for this test).
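> A rough sketch of both steps, reusing the example names from the query
> above (the plain equality join below won't treat NULLs as matches, so
> it assumes name and document are not null):
>
>     -- 1. de-duplicate the temp table as a separate statement
>     CREATE TEMP TABLE tmp_dedup AS
>     SELECT DISTINCT name, document FROM tmp_table;
>
>     -- 2. outer join: keep only rows with no match in final_table
>     INSERT INTO final_table (name, document)
>     SELECT d.name, d.document
>     FROM tmp_dedup d
>     LEFT JOIN final_table f
>            ON f.name = d.name AND f.document = d.document
>     WHERE f.name IS NULL;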
>
> If you really don't have something more specific for a primary key,
> then depending on what data type 'document' is and how large it is, you
> may find adding a column which is a checksum of your 'document' field a
> useful addition. I have done this in the past for an application where
> name was not unique and we only wanted distinct instances of 'document'
> (document was a fairly large XML document in that case).
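> Something like this, as a sketch (it assumes document is of type text
> so the built-in md5() applies; the names are placeholders, and on an
> inheritance-partitioned 9.4 table the unique index would have to be
> created on each child):
>
>     ALTER TABLE final_table ADD COLUMN document_md5 text;
>     UPDATE final_table SET document_md5 = md5(document);
>
>     -- compare and index the short checksum, not the full document
>     CREATE UNIQUE INDEX final_table_name_md5_idx
>         ON final_table (name, document_md5);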
>
> --
> Tim Cross
>
>
--
Adrian Klaver
adrian.klaver@xxxxxxxxxxx