Re: Confusion about composite indexes

Merlin Moncure <mmoncure@xxxxxxxxx> · Mon, 21 May 2012 15:07:02 -0500

On Mon, May 21, 2012 at 2:34 PM, Bill Mitchell <bill@xxxxxxxxxxxxxxx> wrote:
> I am searching for some logic behind the selection of an index in postgres
> -- it seems that if I have a composite index based on both columns in a join
> table, it's only referenced if I query on the first term in the composite
> index.  I've read
> http://www.postgresql.org/docs/9.1/static/indexes-multicolumn.html over and
> over and think that this is the same scenario as what I face.
>
> As an example:
> OUTLET:  has OUTLET_ID as a primary key, consisting of about a million rows
> MEDIA: has MEDIA_ID as a primary key, and table consists of only 10 rows
> OUTLET_MEDIA: a join table used to correlate, and this has about a million
> rows
>
> Each outlet may have 1+ Media (technically, 0 or more, in this schema)
>
>   Table "public.outlet_media"
>   Column   |  Type  | Modifiers
> -----------+--------+-----------
>  outlet_id | bigint | not null
>  media_id  | bigint | not null
> Indexes:
>     "outlet_media_pkey" PRIMARY KEY, btree (outlet_id, media_id)
> Foreign-key constraints:
>     "fkfde1d912281e6fbf" FOREIGN KEY (media_id) REFERENCES media(media_id)
>     "fkfde1d9125014e32a" FOREIGN KEY (outlet_id) REFERENCES
> outlet(outlet_id)
>
> When I test performance, using an OUTLET_ID, the query uses the
> outlet_media_pkey index
>
> # explain analyze select * from outlet_media where outlet_id in (select
> outlet_id from outlet order by random() limit 50);
>                                                                 QUERY
> PLAN
> ------------------------------------------------------------------------------------------------------------------------------------------
>  Nested Loop  (cost=67625.64..68048.50 rows=50 width=16) (actual
> time=841.115..884.669 rows=50 loops=1)
>    ->  HashAggregate  (cost=67625.64..67626.14 rows=50 width=8) (actual
> time=841.048..841.090 rows=50 loops=1)
>          ->  Limit  (cost=67624.89..67625.01 rows=50 width=8) (actual
> time=840.980..841.011 rows=50 loops=1)
>                ->  Sort  (cost=67624.89..70342.66 rows=1087110 width=8)
> (actual time=840.978..840.991 rows=50 loops=1)
>                      Sort Key: (random())
>                      Sort Method: top-N heapsort  Memory: 27kB
>                      ->  Seq Scan on outlet  (cost=0.00..31511.88
> rows=1087110 width=8) (actual time=6.693..497.383 rows=1084628 loops=1)
>    ->  Index Scan using outlet_media_pkey on outlet_media  (cost=0.00..8.43
> rows=1 width=16) (actual time=0.869..0.870 rows=1 loops=50)
>          Index Cond: (outlet_id = outlet.outlet_id)
>  Total runtime: 884.759 ms
> (10 rows)
>
> However if I try the reverse, to search using the MEDIA_ID
> # explain analyze select * from outlet_media where media_id in (select
> media_id from media where media_name='Online News');
>                                                       QUERY
> PLAN
> -----------------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=1.19..21647.53 rows=362125 width=16) (actual
> time=0.034..0.034 rows=0 loops=1)
>    Hash Cond: (outlet_media.media_id = media.media_id)
>    ->  Seq Scan on outlet_media  (cost=0.00..16736.76 rows=1086376 width=16)
> (actual time=0.012..0.012 rows=1 loops=1)
>    ->  Hash  (cost=1.18..1.18 rows=1 width=8) (actual time=0.013..0.013
> rows=0 loops=1)
>          Buckets: 1024  Batches: 1  Memory Usage: 0kB
>          ->  HashAggregate  (cost=1.17..1.18 rows=1 width=8) (actual
> time=0.013..0.013 rows=0 loops=1)
>                ->  Seq Scan on media  (cost=0.00..1.16 rows=1 width=8)
> (actual time=0.012..0.012 rows=0 loops=1)
>                      Filter: ((media_name)::text = 'Online News'::text)
>  Total runtime: 0.084 ms
> (9 rows)
>
>
> Thanks in advance for whatever light can be shed.  If it's safer for me to
> just create individual indexes on each of the two columns  (" Multicolumn
> indexes should be used sparingly. In most situations, an index on a single
> column is sufficient and saves space and time")

There are a couple of things going on here.  First of all, 'order by
random() limit..' guarantees a full seq scan and a short of whatever
you're ordering, so it's never going to be very efficient.  When you
wen the other way, you supplied a hard search term which can be fed to
the index and optimized through.

Secondly, if you have a mapping table that has to support searches in
both directions, it's pretty typical to do something like this:

create table map(a int references ..., b int references..., primary key(a,b));
create index on map(b);

So you can get fully index lookups on all of a, b, ab, and ba.  the
primary key can't optimize ba because indexes only fully match if
candidate fields are supplied from left to right order.  They can
still help somewhat, but to a lesser degree.

merlin

-- 
Sent via pgsql-general mailing list (pgsql-general@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general