Hi! Le 2013-04-17 à 14:15, Jeff Janes a écrit :
Insert on public.persona_followers (cost=149905.33..149906.58 rows=100 width=24) (actual time=19.837..19.837 rows=0 loops=1) Buffers: shared hit=206, local hit=1 read=105 -> Sort (cost=149905.33..149905.58 rows=100 width=24) (actual time=19.534..19.536 rows=6 loops=1) Output: persona_followers_import.service_id, persona_followers_import.follower_id, (min(persona_followers_import.valid_at)) Sort Key: persona_followers_import.follower_id Sort Method: quicksort Memory: 25kB Buffers: shared hit=176, local hit=1 read=105 -> HashAggregate (cost=149901.01..149902.01 rows=100 width=24) (actual time=19.514..19.526 rows=6 loops=1) Output: persona_followers_import.service_id, persona_followers_import.follower_id, min(persona_followers_import.valid_at) Buffers: shared hit=176, local hit=1 read=105 -> Bitmap Heap Scan on pg_temp_35.persona_followers_import (cost=93051.86..149734.25 rows=22234 width=24) (actual time=14.350..19.505 rows=6 loops=1) Output: persona_followers_import.service_id, persona_followers_import.follower_id, persona_followers_import.valid_at Recheck Cond: ((persona_followers_import.service_id = 362513855) AND (persona_followers_import.follower_id IS NOT NULL)) Filter: (NOT (hashed SubPlan 1)) Buffers: shared hit=176, local hit=1 read=105 -> Bitmap Index Scan on persona_followers_import_service_id (cost=0.00..1134.32 rows=44469 width=0) (actual time=1.752..1.752 rows=10000 loops=1) Index Cond: ((persona_followers_import.service_id = 362513855) AND (persona_followers_import.follower_id IS NOT NULL)) Buffers: local hit=1 read=40 SubPlan 1 -> Bitmap Heap Scan on public.persona_followers (cost=661.54..91851.35 rows=24252 width=8) (actual time=2.309..6.400 rows=14317 loops=1) Output: public.persona_followers.follower_id Recheck Cond: (public.persona_followers.service_id = 362513855) Buffers: shared hit=176 -> Bitmap Index Scan on persona_followers_pkey (cost=0.00..655.48 rows=24252 width=0) (actual time=2.284..2.284 rows=14317 loops=1) Index Cond: (public.persona_followers.service_id = 362513855) Buffers: shared hit=88 Total runtime: 19.917 ms Runtime is under 20 milliseconds, per imported service_id. I have a few thousand such items per day, and that's fine. The final script looks like this: CREATE TEMPORARY TABLE persona_followers_import( service_id bigint, follower_id bigint ); COPY TO persona_followers_import FROM stdin; ... \. CREATE INDEX index_persona_followers_import_on_service_id ON persona_followers_import(service_id, follower_id); service_ids := SELECT DISTINCT service_id FROM persona_followers_import; for each service_id in service_ids: EXPLAIN ( ANALYZE, VERBOSE, COSTS, BUFFERS ) INSERT INTO persona_followers(service_id, follower_id, valid_at) SELECT service_id, follower_id, MIN(valid_at) FROM persona_followers_import WHERE follower_id IS NOT NULL AND follower_id NOT IN (SELECT follower_id FROM persona_followers WHERE service_id = :service_id) AND service_id = :service_id GROUP BY service_id, follower_id ORDER BY follower_id This seems to give me the best possible throughput. I was able to import days of data in an hour, compared to hours of work for one day of data. Thanks for all suggestions, and PostgreSQL rocks! François Beausoleil |
<<attachment: smime.p7s>>