Greetings,

* Tom Lane (tgl@xxxxxxxxxxxxx) wrote:
> Stephen Frost <sfrost@xxxxxxxxxxx> writes:
> > I do wonder if we are maybe missing a bet at times though, considering
> > that I'm pretty sure we'll always go through the index in order, and
> > therefore randomly, even when we don't actually need the results in
> > order..?  Has there been much consideration for just opening an index
> > and sequentially scanning it in cases like this where we have to go
> > through all of the index anyway and don't need the results in order?
>
> As I recall, it's unsafe to do so because of consistency considerations,
> specifically there's a risk of missing or double-visiting some entries
> due to concurrent index page splits.  VACUUM has some way around that,
> but it doesn't work for regular data-fetching cases.  (nbtree/README
> has more about this, but I don't feel like looking it up for you.)

That README isn't exactly small, but the mention of VACUUM having a trick
there helped me find this:

-------
The tricky part of this is to avoid missing any deletable tuples in the
presence of concurrent page splits: a page split could easily move some
tuples from a page not yet passed over by the sequential scan to a
lower-numbered page already passed over.  (This wasn't a concern for the
index-order scan, because splits always split right.)  To implement this,
we provide a "vacuum cycle ID" mechanism that makes it possible to
determine whether a page has been split since the current btbulkdelete
cycle started.  If btbulkdelete finds a page that has been split since it
started, and has a right-link pointing to a lower page number, then it
temporarily suspends its sequential scan and visits that page instead.
It must continue to follow right-links and vacuum dead tuples until
reaching a page that either hasn't been split since btbulkdelete started,
or is above the location of the outer sequential scan.  Then it can
resume the sequential scan.  This ensures that all tuples are visited.
-------

So the issue is that a concurrent page split can move tuples to an
earlier leaf page, meaning a sequential scan of the index could miss
tuples it should have returned.  The trick that VACUUM uses seems pretty
involved, and it would be more complicated to use here since it's not ok
to return the same tuple multiple times (though perhaps in a
BitmapIndexScan we could handle that..).

Then again, maybe the skipping scan mechanism that's been talked about
recently would let us avoid having to scan the entire index even in cases
where the conditional doesn't include the initial index columns, since it
looks like that might be what we're doing now.
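(As an aside, for anyone who needs that behavior today: a skip scan can
be emulated from SQL with the usual recursive-CTE trick.  Just a sketch,
against a hypothetical table t(a, b) with a btree index on (a, b),
nothing to do with the OP's schema:

    -- Walk the distinct values of the leading index column, doing one
    -- index descent per value instead of scanning the whole index.
    with recursive skip as (
        (select a from t order by a limit 1)
      union all
        select (select t.a from t where t.a > skip.a order by t.a limit 1)
          from skip
         where skip.a is not null
    )
    select a from skip where a is not null;

Each iteration re-descends the index to find the next distinct value of
the leading column, so the cost is proportional to the number of distinct
values rather than to the size of the index.)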
> My guess based on your results is that the OP's table *isn't*
> all-visible, or at least the planner doesn't know it is.

Hrmpf, even then I seem to end up with an IndexOnlyScan:

=# select * from pg_visibility_map('entidad') where all_visible;
 blkno | all_visible | all_frozen
-------+-------------+------------
(0 rows)

=# analyze entidad;
ANALYZE

=# select relallvisible from pg_class where relname = 'entidad';
 relallvisible
---------------
             0
(1 row)

=# explain (analyze, buffers) select * from entidad where cod_tabla = 4;
                                                                      QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
 Index Only Scan using idx_entidad_tabla_4 on entidad  (cost=0.43..170908.14 rows=1657114 width=24) (actual time=0.312..3511.629 rows=1720668 loops=1)
   Heap Fetches: 3441336
   Buffers: shared hit=6444271 read=469499
 Planning Time: 2.831 ms
 Execution Time: 3563.413 ms
(5 rows)

I'm pretty suspicious that they've made some odd planner configuration
changes or something along those lines to end up with the plan they got,
or that there's some reason we don't think we can use the partial index.
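If the planner config is the question, something like the below would
show anything that's been changed away from the defaults, along with the
definition (and predicate) of the partial index; and a plain VACUUM would
set the visibility-map bits, which should make the heap fetches
disappear.  (Just a sketch, reusing the table from my example above:)

=# select name, setting, source from pg_settings where source <> 'default';
=# select indexdef from pg_indexes where tablename = 'entidad';
=# vacuum entidad;
=# select relallvisible from pg_class where relname = 'entidad';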
Thanks,

Stephen