There are multiple ways see this problem. One way I am seeing is : how system will auto-recover from this particular state.
So ideally if st_procpid is set to zero it means this process is already terminated, however it might be have left some corrupted information in memory. So when other components tries to read beentry, they should also check if st_procpid is already set to zero, if yes it means this process is gone and no need to consider this process any more.
Agree this is solving particular issue about pg_stat_activity however don't see any harm in adding that check.
On Thu, May 9, 2019 at 9:29 PM Tom Lane <tgl@xxxxxxxxxxxxx> wrote:
neeraj kumar <neeru.cse@xxxxxxxxx> writes:
> Tom, may be I didn't make my point clear.
> There are two issues :
> 1) Why this value was left as odd
Because a function called by pgstat_bestart threw an error, is what
I'm guessing.
> 2) Why backend entry is still pending in beentry for backend process even
> after it was killed/terminated.
> I am talking about 2nd issue. My understanding is query on pg_stat_activity
> goes via all backend entries via beentry and it finds this wrong/corrupted
> entry. When a process terminates, ideally this backend entry into beentery
> should have also been cleaned. But why this still there? Whose
> responsibility it is to remove entry from beentry when process terminates ?
If you're imagining that something takes an electronic hacksaw to shared
memory and physically removes that array entry, you're mistaken. It's
still going to be there. Now, there *is* cleanup code --- this bit of
pgstat_beshutdown_hook is supposed to mark the process's entry as
not-in-use:
/*
* Clear my status entry, following the protocol of bumping st_changecount
* before and after. We use a volatile pointer here to ensure the
* compiler doesn't try to get cute.
*/
pgstat_increment_changecount_before(beentry);
beentry->st_procpid = 0; /* mark invalid */
pgstat_increment_changecount_after(beentry);
However, if something had left st_changecount odd before we got to this,
it'd still be odd afterwards. Sure, st_procpid is now zero, but that
doesn't help readers because they're not supposed to believe st_procpid
is valid unless the changecount is even.
You could maybe argue that this cleanup code should be trying to ensure
that the changecount is left even, but I don't think that's an appropriate
response. If it's not even when we get here, we already screwed up very
badly, because we were breaking the protocol for (potentially) as long as
the process has been in existence. Moreover, if you're worried about
corner cases where we did mess that up, what of corner cases where process
exit fails before getting here? I think the right answer is to bring the
hammer down as soon as we mess up, which is what the critical-section
mechanism is for.
regards, tom lane
-------------------------------------
Thanks
Neeraj Kumar,
+1 (206) 427-7267
Thanks
Neeraj Kumar,
+1 (206) 427-7267