On 10/11/22 22:35, Julien Rouhaud wrote:
On Tue, Oct 11, 2022 at 07:42:55PM +0200, Guillaume Lelarge wrote:
On Tue, Oct 11, 2022 at 18:27, Alvaro Herrera <alvherre@xxxxxxxxxxxxxx>
wrote:
On 2022-Oct-11, Tom Lane wrote:
Are there any tables in this query where extremal values of the join
key are likely to be in recently-added or recently-dead rows? Does
VACUUM'ing on the primary help?
I remember having a hypothesis, upon getting a report of this exact
problem on a customer system once, that it could be due to killtuple not
propagating to standbys except by FPIs. I do not remember if we proved
that true or not. I do not remember observing that tables were being
read, however.
Thanks for your answers.
The last predicate is "and c3>='2020-10-10' and c3<='2022-10-10'". I have no
idea of the actual use of c3, but rows with c3 at '2022-10-10' (which is
yesterday) are most probably recently added. I can ask my customer if you
want, but this looks like a pretty safe bet.
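If it's worth verifying, something like this could show whether the rows at
the top of the c3 range were inserted recently (the table name is a
placeholder, since I don't have the real schema at hand):

  -- small age(xmin) values mean the row was written by a recent transaction
  SELECT age(xmin) AS xid_age, c3
  FROM the_big_table            -- placeholder name
  WHERE c3 <= '2022-10-10'
  ORDER BY c3 DESC
  LIMIT 10;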
On the VACUUM question: I didn't say, but we're wondering if the database was
lacking a recent-enough VACUUM. So they're doing a VACUUM tonight on the
database (and especially on the 1.6TB table which is part of the query).
I'm skeptical, though, because if a lack of VACUUM hurt the standbys, it
should hurt the primary just as much.
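For what it's worth, pg_stat_user_tables on the primary should tell whether
the table really accumulated dead rows and when it was last vacuumed (again
with a placeholder table name):

  SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
  FROM pg_stat_user_tables
  WHERE relname = 'the_big_table';  -- placeholder name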
Actually, there are two things that really bug me:
* why the difference between primary and both standbys?
* why now? (it worked great before this weekend, and the only thing I know
happened before is a batch delete on Sunday... which may be a good-enough
reason for things to get screwed up, but once again, why only on the standbys?)
Julien Rouhaud also told me about killtuples, but I have no idea what they
are. I suppose this is different from dead tuples. Anyway, if you can
enlighten me, I'll be happy :)
That's an optimisation where an index scan can mark an index entry as dead
(LP_DEAD) if it tries to fetch some data from the heap that turns out to be all
dead, so further scans won't have to check again (you can grep kill_prior_tuple
in the source for more details). As that's a hint bit, it may not be
replicated unless you enable wal_log_hints or data_checksums (or unless it
happens to be written as part of an FPI), which could explain the discrepancy
between the primary (after a first slow index scan) and the standby nodes.
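If you want to check those settings on a given node, both are visible with
SHOW (wal_log_hints needs a restart to change, and data_checksums is normally
fixed when the cluster is initialized):

  SHOW wal_log_hints;    -- 'on' means hint bit changes are WAL-logged
  SHOW data_checksums;   -- 'on' also forces hint bit changes to be WAL-logged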
But since your customer recreated their standbys from scratch *after* that
delete, all the nodes should have those hint bits set (Guillaume confirmed
off-list that they used a fresh BASE_BACKUP). Note that Guillaume also
confirmed off-list that the customer has checksums enabled, which means that
MarkBufferDirtyHint() should be guaranteed to mark the buffers as dirty, so I'm
out of ideas to explain the different behavior on the standbys.
Would EXPLAIN (VERBOSE, COSTS, FORMAT JSON) run on both nodes help show any
differences?
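For instance, something like this on each node, substituting the real query
for the stand-in below (which is only built from the c3 predicate mentioned
earlier), and then diffing the JSON output:

  EXPLAIN (VERBOSE, COSTS, FORMAT JSON)
  SELECT count(*)
  FROM the_big_table            -- placeholder for the 1.6TB table
  WHERE c3 >= '2020-10-10'
    AND c3 <= '2022-10-10';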
--
Angular momentum makes the world go 'round.