
Most efficient way to insert without duplicates

Hi all!

I track Twitter followers in my database. I have the following table:

# \d persona_followers
           Table "public.persona_followers"
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
 service_id  | bigint                      | not null
 follower_id | bigint                      | not null
 valid_at    | timestamp without time zone |
Indexes:
    "persona_followers_pkey" PRIMARY KEY, btree (service_id, follower_id)

The table IS NOT partitioned.

I have a list of Twitter accounts I track - brands, actors, those kinds of Twitter accounts. They often have thousands, if not hundreds of thousands, of followers. I fetch the followers of these accounts about once a day. When it's time to insert into the database, I use the following algorithm:

CREATE TEMP TABLE import( service_id bigint, follower_id bigint );
COPY import FROM STDIN;
...
\.

INSERT INTO persona_followers(service_id, follower_id, valid_at)
  SELECT service_id, follower_id, NOW()
  FROM (SELECT DISTINCT service_id, follower_id FROM import) AS import
  WHERE NOT EXISTS(
    SELECT * FROM persona_followers
    WHERE import.service_id = persona_followers.service_id
      AND import.follower_id = persona_followers.follower_id);
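
The same NOT EXISTS check can also be written as an anti-join; whether the planner handles one form better than the other here is exactly the kind of thing I would have to benchmark. A sketch, using the same tables as above:

INSERT INTO persona_followers(service_id, follower_id, valid_at)
  SELECT i.service_id, i.follower_id, NOW()
  FROM (SELECT DISTINCT service_id, follower_id FROM import) AS i
  LEFT JOIN persona_followers pf
         ON pf.service_id = i.service_id
        AND pf.follower_id = i.follower_id
  WHERE pf.service_id IS NULL;  -- keep only pairs with no existing match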

I currently have 660 million rows in persona_followers (47 GB). A test import is 13.5 million rows (571 MB). The real daily import will be at least 10x more. In a 24 hour period, I will have at most a few thousand *new* rows - the rest will already exist in persona_followers.

How do I most efficiently eliminate the duplicates? Should I delete the duplicates in import? Or should I bite the bullet and EXCEPT against the final table? Should I insert in much smaller batches? Or is the above already the most efficient way? What other completely different data structure could I use to achieve my goal? I truly need the exhaustive list of followers because we do reach calculations (number of unique accounts that received a particular tweet).
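
For reference, the EXCEPT variant mentioned above would look roughly like this (again only a sketch, not benchmarked; new_rows is just an illustrative alias):

INSERT INTO persona_followers(service_id, follower_id, valid_at)
  SELECT service_id, follower_id, NOW()
  FROM (SELECT service_id, follower_id FROM import
        EXCEPT
        SELECT service_id, follower_id FROM persona_followers) AS new_rows;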

The true answer is probably "benchmark on your own servers", but I'm looking for guidelines, people with the same kind of experience.

Thanks!
François


