On Sat, May 28, 2016 at 11:32 AM, hubert depesz lubaczewski
<depesz@xxxxxxxxxx> wrote:
> On Sat, May 28, 2016 at 10:32:15AM -0700, Jeff Janes wrote:
>> If that wasn't informative, I'd attach to one of the processes with
>> the gdb debugger and get a backtrace.  (You might want to do that a
>> few times, just in case the first one accidentally caught the code
>> during a part of its execution which was not in the bottlenecked
>> spot.)
>
> I did:
> for a in $( ps uww -U postgres | grep BIND | awk '{print $2}' ); do echo "bt" | gdb -p $a > $a.bt.log 2>&1; done
>
> Since there is lots of output, I made a tarball with it, and put it on
> https://depesz.com/various/all.bt.logs.tar.gz
>
> The file is ~ 19kB.

If you look at the big picture, it is what I thought: the planner
probing the index end points when planning a merge join.  Although I
don't know how that turns into the low-level specifics you are seeing.

It looks like everyone becomes interested in the same disk page at the
same time.  One process starts reading in that page, and all the others
queue up on a lock waiting for that one to finish.  So what you see is
1 disk wait and N-1 semop waits.

But if the page is that popular, why is it not staying in cache?
Either which page is popular is moving around quickly (which is hard to
see how that would be plausible if it represents the index end-points),
or there are so many simultaneously popular pages that they can't all
fit in cache.

So my theory is that you deleted a huge number of entries off either
end of the index, that transaction committed, and the commit became
visible to all.  Planning a merge join needs to dig through all those
tuples to probe the true end-point.  On the master, the index entries
quickly get marked as LP_DEAD so future probes don't have to do all
that work, but on the replicas those index hint bits are, for some
reason unknown to me, not getting set.  So the planner has to scour all
the heap pages which might hold the smallest/largest tuple on every
planning cycle, and that list of pages is very large, leading to
occasional IO stalls.

Or perhaps the master realizes the deleting transaction is committed
and visible to all, but the replicas believe there are still some
transactions which could care about those tuples, and that is the
reason they are not getting hinted?

>> > So far we've:
>> > 1. ruled out IO problems (enough io both in terms of bandwidth and iops)
>>
>> Are you saying that you are empirically not actually doing any IO
>> waits, or just that the IO capacity is theoretically sufficient?
>
> there are no iowaits per what iostat returns. Or, there are but very low.

If each IO wait has a pile-up of processes waiting behind it on semops,
then it could have a much larger effect than the raw numbers would
indicate.

Cheers,

Jeff
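
P.S. A minimal sketch of how one might reproduce the theory above.
Table names and row counts here are invented for illustration, and it
assumes autovacuum has not yet cleaned the table; it is not a claim
about your actual schema:

    -- Two tables joined on an indexed column, so the planner will
    -- cost (and here, choose) a merge join.
    CREATE TABLE big   (id bigint PRIMARY KEY);
    CREATE TABLE small (id bigint PRIMARY KEY);
    INSERT INTO big   SELECT g FROM generate_series(1, 10000000) g;
    INSERT INTO small SELECT g FROM generate_series(1, 1000) g;
    ANALYZE big;
    ANALYZE small;

    -- Delete a huge run of entries off the low end of big's index,
    -- and let the commit become visible to all.
    DELETE FROM big WHERE id < 9000000;

    -- Planning this join probes the true endpoint of big.id, wading
    -- through the dead index entries.  On the master the first probe
    -- should mark them LP_DEAD, making later probes cheap; per the
    -- theory, on a replica the hint bits are not set, so every
    -- planning cycle repeats the heap visits.  Plain EXPLAIN is
    -- enough, since the cost is paid at plan time.
    SET enable_hashjoin = off;
    SET enable_nestloop = off;
    EXPLAIN SELECT * FROM big JOIN small USING (id);

Repeating the EXPLAIN and comparing its reported planning time on the
master versus a replica should show whether the probes are being
repeated.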