>I have an events table that records page views and purchases (type = 'viewed' or type='purchased'). I have a query that figures out "people who bought/viewed this also bought/viewed that". > >It worked fine, taking about 0.1 seconds to complete, until a few hours ago when it started taking hours to complete. Vacuum/analyze didn't help. Turned out there was one session_id that had 400k rows in the system. Deleting that made the query performant again. > >Is there anything I can do to make the query work better in cases like that? Missing index, or better query? > >This is on 9.3.5. > >The below is reproduced at the following URL if it's not formatted correctly in the email. https://gist.githubusercontent.com/joevandyk/cb8f4afdb6c1b178c606/raw/9940bbe033ebd56d38caa46e33c1ddfd9df36eda/gistfile1.txt Hello, here are 2 variations that should be somewhat faster. It seems you may have duplicate (site_id,session_id,product_id) which would false the result. In that case you'll need some more logic in the query. select '82503' as product_id, e2.site_id, e2.product_id, count(nullif(e2.type='viewed', false)) view_count, count(nullif(e2.type='purchased', false)) purchase_count from events e1 join events e2 on e1.session_id = e2.session_id and e1.type = e2.type where e1.product_id = '82503' and e2.product_id != '82503' group by e2.product_id, e2.site_id; OR: WITH SALL as( select e2.site_id, e2.product_id, count(nullif(e2.type='viewed', false)) view_count, count(nullif(e2.type='purchased', false)) purchase_count from events e1 join events e2 on e1.session_id = e2.session_id and e1.type = e2.type where e1.product_id = '82503' group by e2.product_id, e2.site_id ) SELECT '82503' as product_id_1, site_id, product_id, view_count, purchase_count FROM SALL WHERE product_id != '82503'; regards, Marc Mamin >explain select > e1.product_id, > e2.site_id, > e2.product_id, > count(nullif(e2.type='viewed', false)) view_count, > count(nullif(e2.type='purchased', false)) purchase_count > from events e1 > join events e2 on e1.session_id = e2.session_id and e1.type = e2.type > where > e1.product_id = '82503' and > e1.product_id != e2.product_id > group by e1.product_id, e2.product_id, e2.site_id; > QUERY PLAN >---------------------------------------------------------------------------------------------------------------------------- > GroupAggregate (cost=828395.67..945838.90 rows=22110 width=19) > -> Sort (cost=828395.67..840117.89 rows=4688885 width=19) > Sort Key: e1.product_id, e2.product_id, e2.site_id > -> Nested Loop (cost=11.85..20371.14 rows=4688885 width=19) > -> Bitmap Heap Scan on events e1 (cost=11.29..1404.31 rows=369 width=49) > Recheck Cond: (product_id = '82503'::citext) > -> Bitmap Index Scan on events_product_id_site_id_idx (cost=0.00..11.20 rows=369 width=0) > Index Cond: (product_id = '82503'::citext) > -> Index Scan using events_session_id_type_product_id_idx on events e2 (cost=0.56..51.28 rows=12 width=51) > Index Cond: ((session_id = e1.session_id) AND (type = e1.type)) > Filter: (e1.product_id <> product_id) >(11 rows) > >recommender_production=> \d events > Table "public.events" > Column | Type | Modifiers >-------------+--------------------------+----------------------------------------------------- > id | bigint | not null default nextval('events_id_seq'::regclass) > user_id | citext | > session_id | citext | not null > product_id | citext | not null > site_id | citext | not null > type | text | not null > happened_at | timestamp with time zone | not null > created_at | timestamp with time zone | not null >Indexes: > "events_pkey" PRIMARY KEY, btree (id) > "events_product_id_site_id_idx" btree (product_id, site_id) > "events_session_id_type_product_id_idx" btree (session_id, type, product_id) >Check constraints: > "events_session_id_check" CHECK (length(session_id::text) < 255) > "events_type_check" CHECK (type = ANY (ARRAY['purchased'::text, 'viewed'::text])) > "events_user_id_check" CHECK (length(user_id::text) < 255) > > > > |