I have a problem with query optimization. Since it's a bit hard to explain, I'll describe the database structure and list the queries first. The database holds contacts information for a call center. The general idea is to have operators call a random contact and record the results. Rules for picking a contact are: - contacts that have an open call shall not be called - contacts that have been called and refused to participate in current poll shall not be called again - contacts that have been called and did not answer or line was busy will be called again after 4 hours - contact shall not be attempted more than max_call_times per poll Contacts are joined in groups and groups are assigned to poll samples. Each poll has several groups assigned, and each usergroup can have different sample groups. Tables sizes at this time are roughly: contact - 1.5M call - 0.5M group - 10K poll - 10 contact_group - 5M (each contact belongs to ~ 3 groups) Tables last_call and num_calls have just the last call to a contact per poll , and number of calls to a person per poll respectively. All tables have foreign keys and appropriate indices. SELECT INTO s_contact contact.id_contact FROM contact JOIN contact_group ON contact.id_contact = contact_group.id_contact JOIN group_poll ON contact_group.id_group = group_poll.id_group AND group_poll.id_poll = $$s_poll$$ JOIN group ON group.id_group = group_poll.id_group JOIN user_role ON group_poll.id_role = user_role.id_role AND user_role.id_user = $$s_user$$ LEFT OUTER JOIN last_call ON last_call.id_contact = contact.id_contact LEFT OUTER JOIN num_calls ON num_calls.id_contact = contact.id_contact AND num_calls.id_poll = $$s_poll$$ WHERE -- temporal limitations ( ( group_poll.valid_from IS NULL ) OR ( group_poll.valid_from <= NOW() ) ) AND ( ( group_poll.valid_to IS NULL ) OR ( group_poll.valid_to >= NOW() ) ) AND ( ( group.valid_from IS NULL ) OR ( group.valid_from <= NOW() ) ) AND ( ( group.valid_to IS NULL ) OR ( group.valid_to >= NOW() ) ) AND ( ( contact.valid_from IS NULL ) OR ( contact.valid_from <= NOW() ) ) AND ( ( contact.valid_to IS NULL ) OR ( contact.valid_to >= NOW() ) ) AND ( -- no call last_call.id_call IS NULL OR ( -- no call in progress and not processed NOT(last_call.id_call_status = call_in_progress OR (last_call.sif_anketa = s_poll AND last_call.id_call_status IN ( call_completed, call_aborted, will_not_participate )) ) AND last_call.end_time < NOW()-'240 minutes'::interval ) ) AND ( num_calls.num_calls IS NULL OR num_calls.num_calls < $$max_num_calls$$ ) AND -- randomize contact.random_number <= $$random_key$$ ORDER BY contact.random_number DESC LIMIT 1 FOR UPDATE OF contact; The randomization is performed with a random number column in contacts table. Variables with $$ are inputs to functions. This query has an interesting property that it finishes with lightning speed, seemingly if there is a lot of contacts that cannot be joined with last call and num_calls. But it also goes to the other extreme and take up to 30s to execute. Limit (cost=0.00..17.64 rows=1 width=18) (actual time=0.205..0.205 rows=1 loops=1) -> Nested Loop (cost=0.00..1615461.74 rows=91598 width=18) (actual time=0.201..0.201 rows=1 loops=1) -> Nested Loop (cost=0.00..1536111.58 rows=42092 width=22) (actual time=0.179..0.179 rows=1 loops=1) -> Nested Loop Left Join (cost=0.00..1523725.29 rows=42092 width=30) (actual time=0.159..0.159 rows=1 loops=1) Filter: ((num_calls.num_calls IS NULL) OR (num_calls.num_calls < 3)) -> Nested Loop (cost=0.00..1508293.43 rows=48997 width=30) (actual time=0.139..0.139 rows=1 loops=1) -> Nested Loop (cost=0.00..1089736.15 rows=1430554 width=22) (actual time=0.120..0.120 rows=1 loops=1) -> Nested Loop Left Join (cost=0.00..610920.91 rows=635050 width=18) (actual time=0.088..0.088 rows=1 loops=1) Filter: ((last_call.id_poziv IS NULL) OR ((last_call.id_call_status <> 1) AND ((last_call.id_poll <> 1) OR (last_call.id_call_status <> ALL ('{3,8,9}'::integer[]))) AND (last_call.end_time < (now() - '04:00:00'::interval)))) -> Index Scan Backward using ix_contact_random_number on contact (cost=0.00..191996.83 rows=1266570 width=18) (actual time=0.056..0.056 rows=1 loops=1) Index Cond: (random_number <= 0.98::double precision) Filter: (((valid_from IS NULL) OR (valid_from <= now())) AND ((valid_to IS NULL) OR (valid_to
= now())))
-> Index Scan using fki_last_call_2 on last_call (cost=0.00..0.30 rows=1 width=24) (actual time=0.023..0.023 rows=0 loops=1) Index Cond: (last_call.id_contact = contact.id_contact) -> Index Scan using fki_contact_group_2 on contact_group (cost=0.00..0.73 rows=2 width=8) (actual time=0.026..0.026 rows=1 loops=1) Index Cond: (contact.id_contact = contact_group.id_contact) -> Index Scan using ix_group_poll_id_group on group_poll (cost=0.00..0.28 rows=1 width=8) (actual time=0.015..0.015 rows=1 loops=1) Index Cond: (contact_group.id_group = group_poll.id_group) Filter: ((id_poll = 1) AND ((valid_from IS NULL) OR (valid_from <= now())) AND ((valid_to IS NULL) OR (valid_to >= now()))) -> Index Scan using fki_num_calls_1 on num_calls (cost=0.00..0.30 rows=1 width=8) (actual time=0.016..0.016 rows=0 loops=1) Index Cond: (num_calls.id_contact = contact.id_contact) Filter: (id_poll = 1) -> Index Scan using group_pkey on group (cost=0.00..0.28 rows=1 width=4) (actual time=0.015..0.015 rows=1 loops=1) Index Cond: (group.id_group = group_poll.id_group) Filter: (((valid_from IS NULL) OR (valid_from <= now())) AND ((valid_to IS NULL) OR (valid_to >= now()))) -> Index Scan using uq_user_role_id_user_id_role on user_role (cost=0.00..1.87 rows=1 width=4) (actual time=0.018..0.018 rows=1 loops=1) Index Cond: ((user_role.id_user = 1) AND (group_poll.id_role = user_role.id_role)) Total runtime: 0.972 ms The slow plan (most of the contacts have already been attempted, moving to 2nd and 3rd calls, group size 20K): Limit (cost=10.19..313.21 rows=1 width=18) (actual time=166.575..166.576 rows=1 loops=1) -> Nested Loop (cost=10.19..1156643.83 rows=3817 width=18) (actual time=166.571..166.571 rows=1 loops=1) -> Nested Loop Left Join (cost=10.19..1155520.62 rows=3817 width=26) (actual time=166.553..166.553 rows=1 loops=1) Filter: ((num_calls.num_calls IS NULL) OR (num_calls.num_calls < 3)) -> Nested Loop (cost=10.19..1154121.27 rows=4443 width=26) (actual time=24.318..166.259 rows=15 loops=1) Join Filter: (contact_group.id_group = group_poll.id_group) -> Nested Loop (cost=0.00..1089736.15 rows=1430554 width=22) (actual time=0.082..124.821 rows=8567 loops=1) -> Nested Loop Left Join (cost=0.00..610920.91 rows=635050 width=18) (actual time=0.060..56.886 rows=3768 loops=1) Filter: ((last_call.id_poziv IS NULL) OR ((last_call.id_call_status <> 1) AND ((last_call.id_poll <> 1) OR (last_call.id_call_status <> ALL ('{3,8,9}'::integer[]))) AND (last_call.end_time < (now() - '04:00:00'::interval)))) -> Index Scan Backward using ix_contact_random_number on contact (cost=0.00..191996.83 rows=1266570 width=18) (actual time=0.037..15.929 rows=4072 loops=1) Index Cond: (random_number <= 0.98::double precision) Filter: (((valid_from IS NULL) OR (valid_from <= now())) AND ((valid_to IS NULL) OR (valid_to >= now()))) -> Index Scan using fki_last_call_2 on last_call (cost=0.00..0.30 rows=1 width=24) (actual time=0.007..0.007 rows=0 loops=4072) Index Cond: (last_call.id_contact = contact.id_contact) -> Index Scan using fki_contact_group_2 on contact_group (cost=0.00..0.73 rows=2 width=8) (actual time=0.010..0.014 rows=2 loops=3768) Index Cond: (contact.id_contact = contact_group.id_contact) -> Materialize (cost=10.19..10.21 rows=2 width=4) (actual time=0.000..0.002 rows=3 loops=8567) -> Hash Join (cost=8.28..10.19 rows=2 width=4) (actual time=0.106..0.132 rows=3 loops=1) Hash Cond: (group_poll.id_role = user_role.id_role) -> Seq Scan on group_poll (cost=0.00..1.81 rows=20 width=8) (actual time=0.026..0.071 rows=25 loops=1) Filter: ((id_poll = 1) AND ((valid_from IS NULL) OR (valid_from <= now())) AND ((valid_to IS NULL) OR (valid_to >= now()))) -> Hash (cost=8.27..8.27 rows=1 width=4) (actual time=0.017..0.017 rows=1 loops=1) -> Index Scan using fki_user_role_1 on user_role (cost=0.00..8.27 rows=1 width=4) (actual time=0.011..0.013 rows=1 loops=1) Index Cond: (id_user = 265) -> Index Scan using fki_num_calls_1 on num_calls (cost=0.00..0.30 rows=1 width=8) (actual time=0.014..0.015 rows=1 loops=15) Index Cond: (num_calls.id_contact = contact.id_contact) Filter: (id_poll = 1) -> Index Scan using group_pkey on group (cost=0.00..0.28 rows=1 width=4) (actual time=0.012..0.012 rows=1 loops=1) Index Cond: (group.id_group = group_poll.id_group) Filter: (((valid_from IS NULL) OR (valid_from <= now())) AND ((valid_to IS NULL) OR (valid_to >= now()))) Total runtime: 166.963 ms This can also be slower, up to 10 seconds.. Query executes within a function, which is synchronized from the application, so only one such user has access to it. So, the question is - what can cause such slowdown? Working set is round 380M, shared_mem has been set to 400M, and work_mem to 256M, when testing (single-user), there is no swapping by postgres. When several users access the database (avg 30), sometimes it slows down, but it's hard to tell why. Nothing in the logs. Autovacuum has been enables, statistics (row, block) also. Db is also slow in some string operations (split_part() can sometimes slow down to 100 rows per second). Tried several changes to the query to make it correct in all respects - cannot match the speed of this one in case there is lots of free contacts. I thought of having two queries and switch between them when the poll comes close to an end, but in some cases there is no slowdown. Any pointers on what to look for? The machine is 2x Xeon 3GHz, 2G RAM, Win 2003, postgresql 8.2.3 (installer from postgresql.org). Running apps - just postgresql and asp.net. Encoding - UTF8. Clients use Npgsql to connect via LLBLGen O/R mapper. To anyone who got this far, a sincere thanks. Suggestions and comments are a bonus. Cheers, t.n.a.