Re: pg_basebackup blocking all queries with horrible performance

Lonni J Friedman <netllama@xxxxxxxxx> · Mon, 11 Jun 2012 10:37:41 -0700

On Fri, Jun 8, 2012 at 7:29 PM, Fujii Masao <masao.fujii@xxxxxxxxx> wrote:
> On Sat, Jun 9, 2012 at 4:30 AM, Lonni J Friedman <netllama@xxxxxxxxx> wrote:
>> On Thu, Jun 7, 2012 at 11:04 PM, Craig Ringer <ringerc@xxxxxxxxxxxxx> wrote:
>>> On 06/08/2012 09:01 AM, Lonni J Friedman wrote:
>>>>
>>>> On Thu, Jun 7, 2012 at 5:07 PM, Jerry Sievers <gsievers19@xxxxxxxxxxx> wrote:
>>>>>
>>>>> You might try stopping pg_basebackup in place with SIGSTOP and check
>>>>> if the problem goes away.  SIGCONT and you should start having
>>>>> sluggishness again.
>>>>>
>>>>> If verified, then any sort of throttling mechanism should work.
>>>>
>>>>
>>>> I'm certain that the problem is triggered only when pg_basebackup is
>>>> running.  It's very predictable, and goes away as soon as pg_basebackup
>>>> finishes running.  What do you mean by a throttling mechanism?
>>>
>>>
>>> Sure, it only happens when pg_basebackup is running. But if you *pause*
>>> pg_basebackup, so it's still running but not currently doing work, does the
>>> problem go away? Does it come back when you unpause pg_basebackup? That's
>>> what Jerry was telling you to try.
>>>
>>> If the problem goes away when you pause pg_basebackup and comes back when
>>> you unpause it, it's probably a system load problem.
>>>
>>> If it doesn't go away, it's more likely to be a locking issue or something
>>> _other_ than simple load.
>>>
>>> SIGSTOP ("kill -STOP") pauses a process, and SIGCONT ("kill -CONT") resumes
>>> it, so on Linux you can use these to try and find out. When you SIGSTOP
>>> pg_basebackup then the postgres backend associated with it should block
>>> shortly afterwards as its buffers fill up and it can't send more data, so
>>> the load should come off the server.
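
The pause/resume test described above can be sketched in Python instead of
two manual "kill" commands. This is only an illustration: the function name
and its parameters are made up for this example, and the PID has to be the
one of the running pg_basebackup process on your system.

```python
import os
import signal
import time


def pause_then_resume(pid, pause_seconds, send=os.kill, sleep=time.sleep):
    """Pause a process with SIGSTOP, wait, then resume it with SIGCONT.

    `send` and `sleep` are injectable only so the sequence can be
    exercised without touching a real process.
    """
    send(pid, signal.SIGSTOP)          # same effect as: kill -STOP <pid>
    try:
        sleep(pause_seconds)           # watch query latency during this window
    finally:
        send(pid, signal.SIGCONT)      # same effect as: kill -CONT <pid>
```

For example, pause_then_resume(12345, 60) would pause a process with the
hypothetical PID 12345 for a minute. If queries recover during that window
and slow down again afterwards, load is the more likely cause.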
>>>
>>> A "throttling mechanism" refers to anything that limits the rate or speed of
>>> a thing. In this case, what you want to do if your problem is system
>>> overload is to limit the speed at which pg_basebackup does its work so other
>>> things can still get work done. In other words you want to throttle it.
>>> Typical throttling mechanisms include the "ionice" and "renice" commands to
>>> change I/O and CPU priority, respectively.
>>>
>>> Note that you may need to change the priority of the *backend* that
>>> pg_basebackup is using, not necessarily the pg_basebackup command itself.
>>> I haven't done enough with Pg's replication to know how that works, so
>>> someone else will have to fill that bit in.
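
As a rough sketch of the throttling idea above, the following Python lowers
both the CPU and the I/O priority of one process, roughly what "renice" and
"ionice" do from the shell. The function name and defaults are assumptions
for illustration. Which PID to pass (pg_basebackup or the backend serving
it) is left open, as in the message above. The I/O class only has an effect
with an I/O scheduler that honours priorities, and changing another user's
process usually needs root.

```python
import os
import subprocess


def throttle(pid, niceness=19, run=subprocess.run, set_priority=os.setpriority):
    """Lower CPU priority (like renice) and I/O priority (via ionice) of pid."""
    # CPU: 19 is the lowest scheduling priority for a normal process.
    set_priority(os.PRIO_PROCESS, pid, niceness)
    # I/O: class 3 is "idle"; the process gets disk time only when nobody
    # else wants it. This shells out to the util-linux ionice command.
    cmd = ["ionice", "-c", "3", "-p", str(pid)]
    run(cmd, check=True)
    return cmd
```

Calling throttle(12345) with a hypothetical PID is about the same as running
"renice -n 19 -p 12345" followed by "ionice -c 3 -p 12345".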
>>
>> Thanks for your reply.  I've confirmed that issuing a SIGSTOP does
>> eliminate the thrashing, and issuing a SIGCONT resumes the thrash.
>>
>> I've looked at iostat output both before & during pg_basebackup runs,
>> and I'm not seeing any indication that the problem is due to disk IO
>> bottlenecks.  The numbers don't vary very much at all between the good
>> & bad times.  This is typical when pg_basebackup is running:
>> ########
>> Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>> md0        0.00    0.00  67.76  68.62   4.42   1.46     88.34      0.00   0.00     0.00     0.00   0.00   0.00
>> ########
>>
>> and this is when the system is ok:
>> ########
>> Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>> md0        0.00    0.00  68.04  68.56   4.44   1.46     88.39      0.00   0.00     0.00     0.00   0.00   0.00
>> ########
>>
>>
>> I looked at vmstat output, but nothing is jumping out at me as being
>> dramatically different when pg_basebackup is running.  swap in and
>> swap out are zero 100% of the time for the good & bad perf cases.  I
>> can post example output if someone is interested, or if there's
>> something specific that I should be looking at as a potential problem,
>> let me know.
>
> Did you set synchronous_standby_names to '*'? If so, the problem you
> encountered can happen.
>
> When synchronous_standby_names is '*', you cannot control which
> standby takes the role of synchronous standby. A standby that you
> expect to run asynchronously might become the synchronous one. So
> my guess is that at first one of your three standbys was running as
> the synchronous standby, and all queries were executed normally. But
> when you started pg_basebackup, pg_basebackup unexpectedly
> took over the role of synchronous standby from another standby. Since
> pg_basebackup doesn't send information about replication
> progress back to the master, all queries (more precisely, transaction
> commits) got stuck and kept waiting for a reply from the synchronous
> standby.
>
> You can avoid this problem by setting synchronous_standby_names
> to the names of your standbys instead of '*'.

I don't have synchronous_standby_names set at all.  I'm only doing
asynchronous replication.
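
For anyone wanting to double-check the same thing on their own master, here
is a minimal sketch that reads the setting out of postgresql.conf text. The
helper names are invented for this example and the parsing is deliberately
naive (it ignores include files and would mishandle a '#' inside a quoted
value). Running "SHOW synchronous_standby_names;" in psql reports the value
the server is actually using and is the more reliable check.

```python
import re

# name, optional "=", then the rest of the line as the value
_SETTING = re.compile(r"([A-Za-z_][\w.]*)\s*=?\s*(.*?)\s*$")


def read_setting(conf_text, name):
    """Return the last value assigned to `name`, or None if it is never set."""
    value = None
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()    # drop comments (naively)
        if not line:
            continue
        m = _SETTING.match(line)
        if m and m.group(1).lower() == name.lower():
            value = m.group(2).strip().strip("'")   # later lines override earlier
    return value


def is_async_only(conf_text):
    """True if synchronous_standby_names is unset or empty."""
    value = read_setting(conf_text, "synchronous_standby_names")
    return value is None or value == ""
```

With the setting commented out or empty, is_async_only() returns True, which
matches the asynchronous-only setup described above.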
