On Sat, May 28, 2016 at 11:32 AM, hubert depesz lubaczewski
<depesz@xxxxxxxxxx> wrote:
> On Sat, May 28, 2016 at 10:32:15AM -0700, Jeff Janes wrote:
>> If that wasn't informative, I'd attach to one of the processes with
>> the gdb debugger and get a backtrace.  (You might want to do that a
>> few times, just in case the first one accidentally caught the code
>> during a part of its execution which was not in the bottlenecked
>> spot.)
>
> I did:
> for a in $( ps uww -U postgres | grep BIND | awk '{print $2}' ); do echo "bt" | gdb -p $a > $a.bt.log 2>&1; done
>
> Since there is lots of output, I made a tarball with it, and put it on
> https://depesz.com/various/all.bt.logs.tar.gz
>
> The file is ~ 19kB.

If you look at the big picture, it is what I thought: the planner
probing the index end points when planning a merge join.  Although I
don't know how that turns into the low-level specifics you are seeing.

It looks like everyone becomes interested in the same disk page at the
same time.  One process starts reading in that page, and all the others
queue up on a lock waiting for that one to finish.  So what you see is
1 disk wait and N-1 semop waits.

But if the page is that popular, why is it not staying in cache?
Either which page is popular is moving around quickly (which is hard to
see how that would be plausible if it represents the index end-points),
or there are so many simultaneously popular pages that they can't all
fit in cache.

So my theory is that you deleted a huge number of entries off either
end of the index, that transaction committed, and the commit became
visible to all.  Planning a merge join needs to dig through all those
tuples to probe the true end-point.  On the master, the index entries
quickly get marked as LP_DEAD so future probes don't have to do all
that work, but on the replicas those index hint bits are, for some
reason unknown to me, not getting set.  So the planner has to scour all
the heap pages which might hold the smallest/largest tuple on every
planning cycle, and that list of pages is very large, leading to
occasional IO stalls.

Or perhaps the master realizes the deleting transaction is committed
and visible to all, but the replicas believe there are still some
transactions which could care about those tuples, and that is the
reason they are not getting hinted?

>> > So far we've:
>> > 1. ruled out IO problems (enough io both in terms of bandwidth and iops)
>>
>> Are you saying that you are empirically not actually doing any IO
>> waits, or just that the IO capacity is theoretically sufficient?
>
> there are no iowaits per what iostat returns. Or, there are but very low.

If each IO wait has a pile-up of processes waiting behind it on semops,
then it could have a much larger effect than the raw numbers would
indicate.

Cheers,

Jeff
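
P.S. A minimal sketch of how one might reproduce the theory above.
Table names and row counts here are invented for illustration, and it
assumes autovacuum has not yet cleaned the table; it is not a claim
about your actual schema:

    -- Two tables joined on an indexed column, so the planner will
    -- cost (and here, choose) a merge join.
    CREATE TABLE big   (id bigint PRIMARY KEY);
    CREATE TABLE small (id bigint PRIMARY KEY);
    INSERT INTO big   SELECT g FROM generate_series(1, 10000000) g;
    INSERT INTO small SELECT g FROM generate_series(1, 1000) g;
    ANALYZE big;
    ANALYZE small;

    -- Delete a huge run of entries off the low end of big's index,
    -- and let the commit become visible to all.
    DELETE FROM big WHERE id < 9000000;

    -- Planning this join probes the true endpoint of big.id, wading
    -- through the dead index entries.  On the master the first probe
    -- should mark them LP_DEAD, making later probes cheap; per the
    -- theory, on a replica the hint bits are not set, so every
    -- planning cycle repeats the heap visits.  Plain EXPLAIN is
    -- enough, since the cost is paid at plan time.
    SET enable_hashjoin = off;
    SET enable_nestloop = off;
    EXPLAIN SELECT * FROM big JOIN small USING (id);

Repeating the EXPLAIN and comparing its reported planning time on the
master versus a replica should show whether the probes are being
repeated.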