Search Postgresql Archives

Re: Most efficient way to insert without duplicates

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

 



Le 2013-04-16 à 22:51, François Beausoleil a écrit :

> Hi all!
> 
> I track Twitter followers in my database. I have the following table:
> 
> # \d persona_followers
>           Table "public.persona_followers"
>   Column    |            Type             | Modifiers
> -------------+-----------------------------+-----------
> service_id  | bigint                      | not null
> follower_id | bigint                      | not null
> valid_at    | timestamp without time zone |
> Indexes:
>    "persona_followers_pkey" PRIMARY KEY, btree (service_id, follower_id)
> 
> The table IS NOT partitioned.

A run with the following query:

INSERT INTO "persona_followers"
  SELECT "service_id", "follower_id", now()
  FROM (
    SELECT *
    FROM (
        SELECT DISTINCT "service_id", "follower_id"
        FROM "persona_followers_import"
        WHERE "service_id" IS NOT NULL OR "follower_id" IS NOT NULL
      EXCEPT
         SELECT "service_id", "follower_id" FROM "persona_followers") AS "t1") AS "t1"

results in http://explain.depesz.com/s/Y1c

 Insert on public.persona_followers  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=4729255.535..4729255.535 rows=0 loops=1)
   Buffers: shared hit=33135295 read=4776921
   ->  Subquery Scan on t1  (cost=139261.12..20483497.65 rows=6256498 width=16) (actual time=562265.156..578844.999 rows=6819520 loops=1)
         Output: t1.service_id, t1.follower_id, now()
         Buffers: shared hit=36891 read=3572263
         ->  HashSetOp Except  (cost=139261.12..20389650.18 rows=6256498 width=16) (actual time=562265.127..566513.759 rows=6819520 loops=1)
               Output: "*SELECT* 1".service_id, "*SELECT* 1".follower_id, (0)
               Buffers: shared hit=36891 read=3572263
               ->  Append  (cost=139261.12..17054024.97 rows=667125042 width=16) (actual time=4090.462..320879.545 rows=667689321 loops=1)
                     Buffers: shared hit=36891 read=3572263
                     ->  Subquery Scan on "*SELECT* 1"  (cost=139261.12..264391.09 rows=6256498 width=16) (actual time=4090.461..7798.334 rows=6820784 loops=1)
                           Output: "*SELECT* 1".service_id, "*SELECT* 1".follower_id, 0
                           Buffers: shared hit=36891
                           ->  HashAggregate  (cost=139261.12..201826.11 rows=6256498 width=16) (actual time=4090.459..6795.009 rows=6820784 loops=1)
                                 Output: persona_followers_import.service_id, persona_followers_import.follower_id
                                 Buffers: shared hit=36891
                                 ->  Seq Scan on francois.persona_followers_import  (cost=0.00..105137.75 rows=6824675 width=16) (actual time=0.017..1344.916 rows=6824700 loops=1)
                                       Output: persona_followers_import.service_id, persona_followers_import.follower_id
                                       Filter: ((persona_followers_import.service_id IS NOT NULL) OR (persona_followers_import.follower_id IS NOT NULL))
                                       Buffers: shared hit=36891
                     ->  Subquery Scan on "*SELECT* 2"  (cost=0.00..16789633.88 rows=660868544 width=16) (actual time=6.694..238761.499 rows=660868537 loops=1)
                           Output: "*SELECT* 2".service_id, "*SELECT* 2".follower_id, 1
                           Buffers: shared read=3572263
                           ->  Seq Scan on public.persona_followers  (cost=0.00..10180948.44 rows=660868544 width=16) (actual time=6.693..137929.808 rows=660868537 loops=1)
                                 Output: public.persona_followers.service_id, public.persona_followers.follower_id
                                 Buffers: shared read=3572263
 Total runtime: 4729338.157 ms

1h20m for 6.8 million rows inserted. The estimates are spot on, since I had just run VACUUM ANALYZE on both tables prior to doing this test. Running the original query now, SELECT * FROM ... WHERE NOT EXISTS(...).

Bye,
François

<<attachment: smime.p7s>>


[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[Index of Archives]     [Postgresql Jobs]     [Postgresql Admin]     [Postgresql Performance]     [Linux Clusters]     [PHP Home]     [PHP on Windows]     [Kernel Newbies]     [PHP Classes]     [PHP Books]     [PHP Databases]     [Postgresql & PHP]     [Yosemite]
  Powered by Linux