Re: Why is GIN index slowing down my query?

Tom Lane <tgl@xxxxxxxxxxxxx> · Sun, 01 Feb 2015 18:34:13 -0500

AlexK987 <alex.cue.987@xxxxxxxxx> writes:
> I've created a GIN index on an INT[] column, but it slows down the selects.
> Here is my table:

> create table talent(person_id INT NOT NULL,
> skills INT[] NOT NULL);

> insert into talent(person_id, skills)
> select generate_series, array[0, 1] || generate_series
> from generate_series(3, 1048575);

> create index talent_skills on talent using gin(skills);

> analyze talent;

> Here is my select:

> explain analyze 
> select * from talent 
> where skills <@ array[1, 15]

Well, that's pretty much going to suck given that data distribution.
Since "1" is a member of every last entry, the GIN scan will end up
examining every entry, and then rejecting all of them as not being
true subsets of [1,15].  I'm not sure whether it'd be practical to
teach GIN about negative proofs, ie noticing that rows containing "0"
could be eliminated based on the index contents.  But in any case it
does not know that today.

Another problem, at least with the default statistics target, is that
the entries for "0" and "1" swamp everything else so that the planner
doesn't know that eg "15" is really rare.

You'd be much better off if you could refactor the data representation so
that whatever you mean by "0" and "1" is stored separately from whatever
you mean by the other entries, ie, don't keep both extremely common and
extremely rare entries in the same array.

Also ... perhaps I'm reading too much into the name you chose for the
column, but I'm finding it hard to imagine why you'd care about the
performance of this query as opposed to "where skills @> array[1, 15]".
That is, wouldn't you typically be searching for people with at least
certain specified skills, rather than at most certain specified skills?

Another thing that maybe is a question for -hackers is why we consider
arraycontained to be an indexable operator at all.  As this example
demonstrates, the current GIN infrastructure isn't really capable
of coping efficiently, at least not in the general case.  It might be
all right in specific cases, but the planner doesn't have the smarts
to tell which is which.

			regards, tom lane

-- 
Sent via pgsql-performance mailing list (pgsql-performance@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance