Hi everyone, I've just noticed a strange behaviour when estimating row counts (I'm running 9.0.1). A small demonstration - let's create table with two columns, and fill it with data so that the columns are not independent: ===================================================================== -- table with two columns create table test_table (col_a int, col_b int); -- fill it with correlated data (0..99, 0..199) insert into test_table select i%100, i%200 from generate_series(1, 1000000) s(i); -- update statistics analyze test_table; ===================================================================== OK, not let's run a few simple queries, producing exactly the same output but very different plans: A) SELECT * FROM test_table WHERE col_a = 33 AND col_b = 33; Seq Scan on test_table (cost=0.00..19425.00 rows=47 width=8) (actual time=0.025..216.273 rows=5000 loops=1) Filter: ((col_a = 33) AND (col_b = 33)) Total runtime: 221.676 ms B) SELECT * FROM test_table WHERE (col_a, col_b) = (33, 33); plan is exactly the same as (A) C) SELECT * FROM test_table WHERE (col_a BETWEEN 33 AND 33) AND (col_b BETWEEN 33 AND 33); Seq Scan on test_table (cost=0.00..24425.00 rows=1 width=8) (actual time=0.025..282.725 rows=5000 loops=1) Filter: ((col_a >= 33) AND (col_a <= 33) AND (col_b >= 33) AND (col_b <= 33)) Total runtime: 288.127 ms D) SELECT * FROM test_table WHERE (col_a, col_b) BETWEEN (33, 33) AND (33, 33); Seq Scan on test_table (cost=0.00..24425.00 rows=227232 width=8) (actual time=0.022..238.958 rows=5000 loops=1) Filter: ((ROW(col_a, col_b) >= ROW(33, 33)) AND (ROW(col_a, col_b) <= ROW(33, 33))) Total runtime: 244.353 ms ===================================================================== So the estimated number of rows is this A) 47 B) 47 C) 1 D) 227232 Results from (A) and (B) seem strange to me because AFAIK there are no multi-column statistics available, and accoring to this thread http://archives.postgresql.org/pgsql-hackers/2009-03/msg00052.php the single-column estimates are not multiplied (which would be OK only in case of statistically independent columns). Yet the estimates somehow match the product: 1.000.000 * (1/200) * (1/100) = 1.000.000 / 20.000 = 50 I'm not quite sure why (C) has an estimate of 1. The col_a has only 100 distinct values, so it uses most_common_vals/most_common_freqs, and all the values will be there (statistics target is 100) along with frequencies. This gives about 1% selectivity. Column col_b has 200 distinct values, uniformly distributed, so the estimates are based on a histogram - there are 100 bins, the range fits into a single bin, giving 1% selectivity. But no matter what I do, I'm not sure how to combine these two estimates into 0.0001% (1 row out of a million). And I do have exactly the same problem with the estimate in (D). Where the heck did 227232 come from? regards Tomas -- Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-general