Re: I'd like to learn a bit more about how indexes work

Mike Christensen <mike@xxxxxxxxxxxxx> · Tue, 5 Jun 2012 22:50:16 -0700

On Tue, Jun 5, 2012 at 7:20 PM, Chris Curvey <chris@xxxxxxxxxxxxxxx> wrote:
>
>
> On Tue, Jun 5, 2012 at 6:24 PM, Mike Christensen <mike@xxxxxxxxxxxxx> wrote:
>>
>> Hi -
>>
>> I'm trying to increase my general knowledge about how indexes work in
>> databases.  Though my questions are probably general and implemented
>> in a similar way across major relational DBs, I'm also curious as to
>> how they're implemented in Postgres specifically (mainly because I
>> like PG, and am always interested in knowing if PG does things in some
>> cool and interesting way).
>
>
> Quick!   Create some test data!
>
> drop table if exists foobar;
> create table foobar
> (  a int not null primary key
> ,  b int null
> ,  c int null
> ,  d int null
> );
> create index b_idx on foobar (b);
> create index c_idx on foobar (c);
> create index d_idx on foobar (d);
>
> create or replace function generate_test_data() returns void as $$
> declare
>   i integer;
> begin
>   for i in 1..100000 loop
>      insert into foobar (a, b, c, d) values (i, i%2, i%10, i%1000);
>   end loop;
> end;
> $$ language plpgsql;
>
> select generate_test_data();
>
> vacuum analyze foobar;
>
>>
>>
>> I know the basics of how binary trees work, so I understand a query such
>> as :
>>
>> select * from Table where Id = 5;
>>
>> Provided Id has a btree index on it.
>
>
> sometimes.  Sometimes not.
>
> explain analyze
> select * from foobar
> where a = 1234;
>                                                      QUERY PLAN
>
> ---------------------------------------------------------------------------------------------------------------------
>  Index Scan using foobar_pkey on foobar  (cost=0.00..8.38 rows=1 width=16)
> (actual time=0.008..0.008 rows=1 loops=1)
>    Index Cond: (a = 1234)
>  Total runtime: 0.030 ms
> (3 rows)
>
> explain analyze
> select *
> from foobar
> where b = 1;
>                                                  QUERY PLAN
>
> -------------------------------------------------------------------------------------------------------------
>  Seq Scan on foobar  (cost=0.00..1791.00 rows=50270 width=16) (actual
> time=0.011..13.603 rows=50000 loops=1)
>    Filter: (b = 1)
>    Rows Removed by Filter: 50000
>  Total runtime: 16.005 ms
> (4 rows)
>
>>
>>  I'm curious as to how indexes
>> are used with OR and AND clauses.
>>
>> Something like:
>>
>> select * from Table where X = 5 or y = 3;
>>
>> It seems to me both the index of X would be scanned and those rows
>> would be loaded into memory, and then the index of Y would be scanned
>> and loaded.  Then, Postgres would have to merge both sets into a set
>> of unique rows.  Is this pretty much what's going on?  Let's ignore
>> table stats for now.
>
>
> The right answer is "sometimes".  Here's a query that is solved the way you
> expect:
>
> explain analyze
> select *
> from foobar
> where c = 1
> or d = 1;
>                                                           QUERY PLAN
>
> ------------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on foobar  (cost=226.47..920.65 rows=10202 width=16)
> (actual time=1.284..3.784 rows=10000 loops=1)
>    Recheck Cond: ((c = 1) OR (d = 1))
>    ->  BitmapOr  (cost=226.47..226.47 rows=10212 width=0) (actual
> time=1.174..1.174 rows=0 loops=1)
>          ->  Bitmap Index Scan on c_idx  (cost=0.00..216.23 rows=10113
> width=0) (actual time=1.157..1.157 rows=10000 loops=1)
>                Index Cond: (c = 1)
>          ->  Bitmap Index Scan on d_idx  (cost=0.00..5.13 rows=98 width=0)
> (actual time=0.016..0.016 rows=100 loops=1)
>                Index Cond: (d = 1)
>  Total runtime: 4.452 ms
> (8 rows)
>
> And here is a query that is not solved the way you expect, because the index
> on B is not useful.
>
> explain analyze
> select *
> from foobar
> where c = 1
> or b = 1;
>                                                  QUERY PLAN
>
> -------------------------------------------------------------------------------------------------------------
>  Seq Scan on foobar  (cost=0.00..2041.00 rows=55299 width=16) (actual
> time=0.007..14.922 rows=50000 loops=1)
>    Filter: ((c = 1) OR (b = 1))
>    Rows Removed by Filter: 50000
>  Total runtime: 17.002 ms
> (4 rows)
>
>
>>
>>
>> Then, something like:
>>
>> select * from Table where X = 5 AND y = 3;
>>
>> I would imagine the same thing is going on, only Postgres would find
>> rows that appear in both sets.  I also imagine Postgres might create a
>> hash table from the larger set, and then iterate through the smaller
>> set looking for rows that were in that hash table
>
>
> and you would be largely right about that.  Interestingly, on an earlier run
> of this, I got a BitmapAnd strategy, rather than applying the c=1 filter to
> the rows after identifying all the d=1 rows.
>
> explain analyze
> select *
> from foobar
> where c = 1
> and d = 1;
>                                                    QUERY PLAN
>
> -----------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on foobar  (cost=5.13..256.48 rows=10 width=16) (actual
> time=0.046..0.150 rows=100 loops=1)
>    Recheck Cond: (d = 1)
>    Filter: (c = 1)
>    ->  Bitmap Index Scan on d_idx  (cost=0.00..5.13 rows=98 width=0) (actual
> time=0.026..0.026 rows=100 loops=1)
>          Index Cond: (d = 1)
>  Total runtime: 0.179 ms
> (6 rows)
>
>
>>
>>
>> Lastly, If you had a query such as:
>>
>> select * from Table where X IN (1,2,3,4,5,6,7);
>>
>> I would imagine Postgres would parse that query as a bunch of OR
>> clauses.  Does this mean the index for X would be scanned 7 times and
>> merged into a set of unique results?  Though, obviously if Postgres
>> estimated this would return the majority of the rows in the table, it
>> would probably just ignore the index completely.
>
>
>  and you would be right on both counts
>
> explain analyze
> select *
> from foobar
> where c in (1, 2, 3);
>                                                        QUERY PLAN
>
> ------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on foobar  (cost=609.14..1562.18 rows=29967 width=16)
> (actual time=3.456..7.589 rows=30000 loops=1)
>    Recheck Cond: (c = ANY ('{1,2,3}'::integer[]))
>    ->  Bitmap Index Scan on c_idx  (cost=0.00..601.64 rows=29967 width=0)
> (actual time=3.342..3.342 rows=30000 loops=1)
>          Index Cond: (c = ANY ('{1,2,3}'::integer[]))
>  Total runtime: 9.083 ms
> (5 rows)
>
> explain analyze
> select *
> from foobar
> where c in (1, 2, 3, 4, 5, 6);
>                                                  QUERY PLAN
>
> -------------------------------------------------------------------------------------------------------------
>  Seq Scan on foobar  (cost=0.00..2291.00 rows=59723 width=16) (actual
> time=0.005..18.450 rows=60000 loops=1)
>    Filter: (c = ANY ('{1,2,3,4,5,6}'::integer[]))
>    Rows Removed by Filter: 40000
>  Total runtime: 21.181 ms
> (4 rows)

Thanks!  One thing that still confuses me is the difference between IN
and OR.  With this query:

explain analyze
select *
from foobar
where d in (500, 750);

It scans the d index only once:

'Bitmap Heap Scan on foobar  (cost=10.03..400.63 rows=196 width=16)
(actual time=0.128..0.489 rows=200 loops=1)'
'  Recheck Cond: (d = ANY ('{500,750}'::integer[]))'
'  ->  Bitmap Index Scan on d_idx  (cost=0.00..9.98 rows=196 width=0)
(actual time=0.086..0.086 rows=200 loops=1)'
'        Index Cond: (d = ANY ('{500,750}'::integer[]))'
'Total runtime: 0.592 ms'

And if I change it to:

explain analyze
select *
from foobar
where d = 500 or d = 750;

Then the d index is scanned twice and it uses a BItmapOr:

'Bitmap Heap Scan on foobar  (cost=10.10..401.18 rows=196 width=16)
(actual time=0.118..0.480 rows=200 loops=1)'
'  Recheck Cond: ((d = 500) OR (d = 750))'
'  ->  BitmapOr  (cost=10.10..10.10 rows=196 width=0) (actual
time=0.079..0.079 rows=0 loops=1)'
'        ->  Bitmap Index Scan on d_idx  (cost=0.00..5.00 rows=98
width=0) (actual time=0.047..0.047 rows=100 loops=1)'
'              Index Cond: (d = 500)'
'        ->  Bitmap Index Scan on d_idx  (cost=0.00..5.00 rows=98
width=0) (actual time=0.030..0.030 rows=100 loops=1)'
'              Index Cond: (d = 750)'
'Total runtime: 0.611 ms'

Shouldn't these two queries be the same?  I'm curious as to why the OR
operator will do two Bitmap Index Scans, then OR then together - while
the IN clause will do a single Index scan that gets all 200 rows at
once.  Probably the answer is just "Because that's the way it was
written, deal."

Anyway, I don't want to be too much of a vampire.  I'll keep reading
stuff online as well.  I found some old Tom Lane posts that are
helpful, but a lot of it is over my head still.

Mike

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general