Re: Slow "not in array" operation

Tom Lane <tgl@xxxxxxxxxxxxx> · Tue, 12 Nov 2019 15:50:51 -0500

Marco Colli <collimarco91@xxxxxxxxx> writes:
> 3) Here's the query plan that I get after disabling the seq scan:

>  Finalize Aggregate  (cost=2183938.89..2183938.90 rows=1 width=8) (actual
> time=94972.253..94972.254 rows=1 loops=1)

So, this is slower than the seqscan, which means the planner made the
right choice.

You seem to be imagining that there's some way the index could be used
with the NOT clause, but there isn't.  Indexable clauses are of the form
	indexed_column indexable_operator constant
and there's no provision for a NOT in that.  If we had a "not contained
in" array operator, the NOT could be folded to be of this syntactic form,
but it's highly unlikely that any index operator class would consider such
an operator to be a supported indexable operator.  It doesn't lend itself
to searching an index.

So the planner is doing the best it can, which in this case is a
full-table scan.

A conceivable solution, if the tags array is a lot smaller than
the table as a whole and the table is fairly static, is that you could
make a btree index on the tags array and let the planner fall back
to an index-only scan that is just using the index as a cheaper
source of the array data.  (This doesn't work for your existing GIST
index because GIST can't reconstruct the original arrays on-demand.)
I suspect though that this wouldn't win much, even if you disregard
the maintenance costs for the extra index.  The really fundamental
problem here is that a large fraction of the table satisfies the
NOT-in condition, and no index is going to beat a seqscan by all that
much when that's true.  Indexes are good at retrieving small portions
of tables.

			regards, tom lane