On 06/25/2013 00:18, Ben wrote:
hello postgresql experts -- I have a strange row estimate problem, which unfortunately I have trouble reproducing for anything but very large tables, and that makes boiling it down to a simple example hard. I'm using version 9.1.1, and all tables have had ANALYZE run on them.

Here's the example: I have a large table (billions of rows) with a five-column primary key and a sixth value column. For confidentiality purposes I can't reproduce the exact schema here, but it is essentially:

    create table bigtable (
      id1 integer not null,
      id2 date not null,
      id3 integer not null,
      id4 time not null,
      id5 integer not null,
      value real not null,
      primary key (id1, id2, id3, id4, id5)
    );

For various reasons there is only one id1 in the table right now, though in the future there will be more; also, the primary key was added after table creation with ALTER TABLE, though that shouldn't matter.

I need to select a somewhat arbitrary collection of rows out of bigtable. To do so I generate a temporary table:

    create table jointable (
      id1 integer not null,
      id2 date not null,
      id3 integer not null,
      id4 time not null,
      id5 integer not null
    );

and then perform a join against this table. If jointable doesn't have many rows, the planner picks a nested loop over jointable and a primary-key index lookup on bigtable. In the following, for expository purposes, jointable has 10 rows, and we can see the planner knows this:

    explain select * from bigtable join jointable using (id1, id2, id3, id4, id5);

     Nested Loop  (cost=0.00..6321.03 rows=145 width=28)
       ->  Seq Scan on jointable  (cost=0.00..1.10 rows=10 width=24)
       ->  Index Scan using bigtable_pkey on bigtable  (cost=0.00..631.97 rows=1 width=28)
             Index Cond: ((id1 = jointable.id1) AND (id2 = jointable.id2) AND (id3 = jointable.id3) AND (id4 = jointable.id4) AND (id5 = jointable.id5))
    (4 rows)

As you increase the number of rows in jointable, the planner switches to a sort + merge join. In this case jointable has roughly 2 million rows:
     Merge Join  (cost=727807979.29..765482193.16 rows=18212633 width=28)
       Merge Cond: ((bigtable.id1 = jointable.id1) AND (bigtable.id2 = jointable.id2) AND (bigtable.id3 = jointable.id3) AND (bigtable.id4 = jointable.id4) AND (bigtable.id5 = jointable.id5))
       ->  Sort  (cost=727511795.16..735430711.00 rows=3167566336 width=28)
             Sort Key: bigtable.id3, bigtable.id1, bigtable.id2, bigtable.id4, bigtable.id5
             ->  Seq Scan on bigtable  (cost=0.00..76085300.36 rows=3167566336 width=28)
       ->  Materialize  (cost=295064.70..305399.26 rows=2066911 width=24)
             ->  Sort  (cost=295064.70..300231.98 rows=2066911 width=24)
                   Sort Key: jointable.id3, jointable.id1, jointable.id2, jointable.id4, jointable.id5
                   ->  Seq Scan on jointable  (cost=0.00..35867.11 rows=2066911 width=24)
    (9 rows)
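[Editor's note: a diagnostic sketch, not part of the original thread. One way to check whether the merge join is genuinely cheaper or merely mis-costed is to disable it for a single transaction and compare EXPLAIN ANALYZE timings; `enable_mergejoin` is a standard planner setting, and SET LOCAL keeps the change scoped to the transaction.]

    -- Diagnostic sketch: force the planner away from the merge join for one
    -- transaction and compare actual runtimes against the default plan.
    BEGIN;
    SET LOCAL enable_mergejoin = off;
    EXPLAIN ANALYZE
    SELECT * FROM bigtable JOIN jointable USING (id1, id2, id3, id4, id5);
    COMMIT;  -- SET LOCAL reverts automatically at transaction end

If the nested-loop plan wins by a wide margin here, that points at a costing/estimation problem rather than the merge join actually being the better strategy.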
Can you show us the EXPLAIN ANALYZE version?
The choice of sort + merge is really bad here, given the size of bigtable (3 billion rows and counting). Some questions:

1 - First off, why isn't the sort happening on the primary key, so that bigtable does not have to be sorted?

2 - More importantly, since we are joining on the primary key, shouldn't the row estimate coming out of the join be limited by the number of rows in jointable? For example, it is strange to see that if I put in a non-limiting LIMIT clause (something bigger than the number of rows in jointable) it switches back to a nested loop + index scan:

    explain select * from bigtable join jointable using (id1, id2, id3, id4, id5) limit 2500000;

     Limit  (cost=0.00..178452647.11 rows=2500000 width=28)
       ->  Nested Loop  (cost=0.00..1306127545.35 rows=18297957 width=28)
             ->  Seq Scan on jointable  (cost=0.00..35867.11 rows=2066911 width=24)
             ->  Index Scan using bigtable_pkey on bigtable  (cost=0.00..631.88 rows=1 width=28)
                   Index Cond: ((id1 = jointable.id1) AND (id2 = jointable.id2) AND (id3 = jointable.id3) AND (id4 = jointable.id4) AND (id5 = jointable.id5))
    (5 rows)

Am I not understanding the query planner, or is this a known issue in the query planner, or have I stumbled onto something amiss? Unfortunately, any synthetic examples I was able to make (up to 10 million rows) did not exhibit this behavior, which makes it hard to test.

best regards, ben
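[Editor's note: a tuning sketch, not advice from the original thread. In 9.1 the join selectivity for a multi-column equality join is estimated by multiplying per-column selectivities (there are no cross-column statistics), so the estimate depends heavily on per-column ndistinct values. Raising the statistics targets on the join columns and re-analyzing can sometimes move the estimate enough to flip the plan; ALTER TABLE ... SET STATISTICS and ANALYZE are standard commands.]

    -- Tuning sketch: raise per-column statistics on the join columns and
    -- re-analyze, giving the planner better ndistinct estimates to work with.
    ALTER TABLE bigtable ALTER COLUMN id3 SET STATISTICS 1000;
    ALTER TABLE bigtable ALTER COLUMN id5 SET STATISTICS 1000;
    ANALYZE bigtable;
    -- Freshly built staging tables are easy to forget: make sure jointable is
    -- analyzed *after* it is populated, or its statistics are defaults/guesses.
    ANALYZE jointable;
    -- Then re-check the plan:
    EXPLAIN SELECT * FROM bigtable JOIN jointable USING (id1, id2, id3, id4, id5);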
--
No trees were killed in the creation of this message. However, many electrons were terribly inconvenienced.

--
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance