Thanks everyone, the problem just solved itself. After the ANALYZE had finished, postgres started doing what I wanted it to do all along: EXPLAIN SELECT p.ItemID FROM Papers AS p, PaperReferences AS r WHERE p.itemID = r.ItemID AND p.issueID = 'A1983PW823'; "Nested Loop (cost=0.00..4515980.97 rows=2071811 width=16)" " -> Index Scan using idx_papers_issueid on papers p (cost=0.00..274.53 rows=508 width=16)" " Index Cond: (issueid = 'A1983PW823'::bpchar)" " -> Index Scan using paperreferences_pkey on paperreferences r (cost=0.00..8838.21 rows=4078 width=16)" " Index Cond: (r.itemid = p.itemid)" So thanks again. I'm starting to grasp the postgres quirks :) Jann Am 24.08.10 15:03, schrieb Jann Röder: > So that took a while... I'm currently running ANALYZE on the > PaperReferences table again (the one where I changed the data type). > > The plan however is still the same: > "Hash Join (cost=280.88..24330800.08 rows=670602240 width=16)" > " Hash Cond: (r.itemid = p.itemid)" > " -> Seq Scan on paperreferences r (cost=0.00..15109738.40 > rows=670602240 width=64)" > " -> Hash (cost=274.53..274.53 rows=508 width=16)" > " -> Index Scan using idx_papers_issueid on papers p > (cost=0.00..274.53 rows=508 width=16)" > " Index Cond: (issueid = 'A1983PW823'::bpchar)" > > But I can now force it to use an index scan instead of a seqScan: > "Merge Join (cost=0.00..2716711476.57 rows=670602240 width=16)" > " Merge Cond: (p.itemid = r.itemid)" > " -> Index Scan using papers_pkey on papers p (cost=0.00..21335008.47 > rows=508 width=16)" > " Filter: (issueid = 'A1983PW823'::bpchar)" > " -> Index Scan using paperreferences_pkey on paperreferences r > (cost=0.00..2686993938.83 rows=670602240 width=64)" > > Unfortunately this is not faster than the other one. I did not wait > until it returned because I want this query to take less than 5 seconds > or so. > > Here is my query again: > SELECT p.ItemID FROM Papers AS p, PaperReferences AS r WHERE p.itemID = > r.ItemID AND p.issueID = 'A1983PW823'; > > I can also write it as: > SELECT ItemID FROM PaperReferences WHERE ItemID IN (SELECT ItemID FROM > Papers WHERE IssueID = 'A1983PW823') > > Which is more what I would do if I was the database. Unfortunately this > is not fast either: > > "Hash Semi Join (cost=280.88..24330800.08 rows=670602240 width=64)" > " Hash Cond: (paperreferences.itemid = papers.itemid)" > " -> Seq Scan on paperreferences (cost=0.00..15109738.40 > rows=670602240 width=64)" > " -> Hash (cost=274.53..274.53 rows=508 width=16)" > " -> Index Scan using idx_papers_issueid on papers > (cost=0.00..274.53 rows=508 width=16)" > " Index Cond: (issueid = 'A1983PW823'::bpchar)" > > The sub-query SELECT ItemID FROM Papers WHERE IssueID = 'A1983PW823' is > really fast, though and returns 16 rows. If I unroll the query by hand > like this: > SELECT ItemID FROM PaperReferences WHERE > (ItemID = 'A1983PW82300001' OR > ItemID = 'A1983PW82300002' OR > ItemID = 'A1983PW82300003' OR > ItemID = 'A1983PW82300004' OR > ItemID = 'A1983PW82300005' OR > ItemID = 'A1983PW82300006' OR > ...) > > (All the ORed stuff is the result of the sub-query) I get my result > really fast. So what I need now is a way to tell postgres to do it that > way automatically. If everything else fails I will have to put that > logic into my application in java code, which I don't want to do because > then I will also have to remove my constraints so I can delete stuff at > a reasonable speed. > > Thanks, > Jann > > > Am 23.08.10 15:33, schrieb Kevin Grittner: >> Jann Röder<roederja@xxxxxxx> wrote: >> >>> Table "public.papers" >>> Column | Type | Modifiers >>> ------------------+-------------------------+----------- >>> itemid | character(15) | not null >> >>> wos-db=> \d PaperReferences >>> Table "public.paperreferences" >>> Column | Type | Modifiers >>> --------------------+-----------------------+----------- >>> itemid | character varying(15) | not null >> >>> I just noticed that PaperReferences uses character varying (15) >>> and Papers uses character(15). Stupid mistake of mine. Do you >>> think this might cause the bad query planning? >> >> Absolutely. These are *not* the same type and don't compare all >> that well. >> >>> I will alter the table to use character(15) in both cases and see >>> if that helps. >> >> I suspect that making them the same will cure the problem, but I >> would recommend you make any character(n) columns character >> varying(n) instead of the other way around. The the character(n) >> data type has many surprising behaviors and tends to perform worse. >> Avoid using it if possible. >> >>> postgresql.conf: >>> max_connections = 20 >>> shared_buffers = 256MB >>> work_mem = 10MB >>> maintenance_work_mem = 128MB >>> max_stack_depth = 4MB >>> synchronous_commit = off >>> wal_buffers = 1MB >>> checkpoint_segments = 10 >>> effective_cache_size = 768MB >>> default_statistics_target = 200 >>> datestyle = 'iso, mdy' >>> lc_messages = 'C' >>> lc_monetary = 'C' >>> lc_numeric = 'C' >>> lc_time = 'C' >>> default_text_search_config = 'pg_catalog.simple' >> >>> Do you need an EXPLAIN ANALYZE output? Since it takes so long I >>> can't easily post one right now. But maybe I can get one over >>> night. >> >> Not necessary; you've already identified the cause and the fix. >> >>> My Hardware is an iMac running OS X 10.6.4 with 1.5 GB RAM and a >>> 2.1 GHz (or so) core 2 Duo processor. >> >> OK. If you still don't get a good plan, you might want to try >> edging up effective_cache_size, if the sum of your shared_buffers >> and OS cache is larger than 768MB (which I would expect it might >> be). If the active part of your database (the part which is >> frequently referenced) fits within cache space, or even a >> significant portion of it fits, you might need to adjust >> random_page_cost and perhaps seq_page_cost to reflect the lower >> average cost of fetching from cache rather than disk -- but you want >> to fix your big problem (the type mismatch) first, and then see if >> you need further adjustments. >> >> -Kevin -- Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-performance