On Tue, Jun 5, 2012 at 6:24 PM, Mike Christensen <mike@xxxxxxxxxxxxx> wrote:
Hi -
I'm trying to increase my general knowledge about how indexes work in
databases. Though my questions are probably general and implemented
in a similar way across major relational DBs, I'm also curious as to
how they're implemented in Postgres specifically (mainly because I
like PG, and am always interested in knowing if PG does things in some
cool and interesting way).
Quick! Create some test data!
drop table if exists foobar;
create table foobar
( a int not null primary key
, b int null
, c int null
, d int null
);
create index b_idx on foobar (b);
create index c_idx on foobar (c);
create index d_idx on foobar (d);
create or replace function generate_test_data() returns void as $$
declare
i integer;
begin
for i in 1..100000 loop
insert into foobar (a, b, c, d) values (i, i%2, i%10, i%1000);
end loop;
end;
$$ language plpgsql;
select generate_test_data();
vacuum analyze foobar;
I know the basics of how binary trees work, so I understand a query such as :
select * from Table where Id = 5;
Provided Id has a btree index on it.
sometimes. Sometimes not.
explain analyze
select * from foobar
where a = 1234;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Index Scan using foobar_pkey on foobar (cost=0.00..8.38 rows=1 width=16) (actual time=0.008..0.008 rows=1 loops=1)
Index Cond: (a = 1234)
Total runtime: 0.030 ms
(3 rows)
explain analyze
select *
from foobar
where b = 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Seq Scan on foobar (cost=0.00..1791.00 rows=50270 width=16) (actual time=0.011..13.603 rows=50000 loops=1)
Filter: (b = 1)
Rows Removed by Filter: 50000
Total runtime: 16.005 ms
(4 rows)
I'm curious as to how indexes
are used with OR and AND clauses.
Something like:
select * from Table where X = 5 or y = 3;
It seems to me both the index of X would be scanned and those rows
would be loaded into memory, and then the index of Y would be scanned
and loaded. Then, Postgres would have to merge both sets into a set
of unique rows. Is this pretty much what's going on? Let's ignore
table stats for now.
The right answer is "sometimes". Here's a query that is solved the way you expect:
explain analyze
select *
from foobar
where c = 1
or d = 1;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foobar (cost=226.47..920.65 rows=10202 width=16) (actual time=1.284..3.784 rows=10000 loops=1)
Recheck Cond: ((c = 1) OR (d = 1))
-> BitmapOr (cost=226.47..226.47 rows=10212 width=0) (actual time=1.174..1.174 rows=0 loops=1)
-> Bitmap Index Scan on c_idx (cost=0.00..216.23 rows=10113 width=0) (actual time=1.157..1.157 rows=10000 loops=1)
Index Cond: (c = 1)
-> Bitmap Index Scan on d_idx (cost=0.00..5.13 rows=98 width=0) (actual time=0.016..0.016 rows=100 loops=1)
Index Cond: (d = 1)
Total runtime: 4.452 ms
(8 rows)
And here is a query that is not solved the way you expect, because the index on B is not useful.
explain analyze
select *
from foobar
where c = 1
or b = 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Seq Scan on foobar (cost=0.00..2041.00 rows=55299 width=16) (actual time=0.007..14.922 rows=50000 loops=1)
Filter: ((c = 1) OR (b = 1))
Rows Removed by Filter: 50000
Total runtime: 17.002 ms
(4 rows)
Then, something like:
select * from Table where X = 5 AND y = 3;
I would imagine the same thing is going on, only Postgres would find
rows that appear in both sets. I also imagine Postgres might create a
hash table from the larger set, and then iterate through the smaller
set looking for rows that were in that hash table
and you would be largely right about that. Interestingly, on an earlier run of this, I got a BitmapAnd strategy, rather than applying the c=1 filter to the rows after identifying all the d=1 rows.
explain analyze
select *
from foobar
where c = 1
and d = 1;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foobar (cost=5.13..256.48 rows=10 width=16) (actual time=0.046..0.150 rows=100 loops=1)
Recheck Cond: (d = 1)
Filter: (c = 1)
-> Bitmap Index Scan on d_idx (cost=0.00..5.13 rows=98 width=0) (actual time=0.026..0.026 rows=100 loops=1)
Index Cond: (d = 1)
Total runtime: 0.179 ms
(6 rows)
Lastly, If you had a query such as:
select * from Table where X IN (1,2,3,4,5,6,7);
I would imagine Postgres would parse that query as a bunch of OR
clauses. Does this mean the index for X would be scanned 7 times and
merged into a set of unique results? Though, obviously if Postgres
estimated this would return the majority of the rows in the table, it
would probably just ignore the index completely.
and you would be right on both counts
explain analyze
select *
from foobar
where c in (1, 2, 3);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on foobar (cost=609.14..1562.18 rows=29967 width=16) (actual time=3.456..7.589 rows=30000 loops=1)
Recheck Cond: (c = ANY ('{1,2,3}'::integer[]))
-> Bitmap Index Scan on c_idx (cost=0.00..601.64 rows=29967 width=0) (actual time=3.342..3.342 rows=30000 loops=1)
Index Cond: (c = ANY ('{1,2,3}'::integer[]))
Total runtime: 9.083 ms
(5 rows)
explain analyze
select *
from foobar
where c in (1, 2, 3, 4, 5, 6);
QUERY PLAN
-------------------------------------------------------------------------------------------------------------
Seq Scan on foobar (cost=0.00..2291.00 rows=59723 width=16) (actual time=0.005..18.450 rows=60000 loops=1)
Filter: (c = ANY ('{1,2,3,4,5,6}'::integer[]))
Rows Removed by Filter: 40000
Total runtime: 21.181 ms
(4 rows)
Thanks!
Mike