On Tue, Jun 12, 2012 at 2:37 AM, Lonni J Friedman <netllama@xxxxxxxxx> wrote:
> On Fri, Jun 8, 2012 at 7:29 PM, Fujii Masao <masao.fujii@xxxxxxxxx> wrote:
>> On Sat, Jun 9, 2012 at 4:30 AM, Lonni J Friedman <netllama@xxxxxxxxx> wrote:
>>> On Thu, Jun 7, 2012 at 11:04 PM, Craig Ringer <ringerc@xxxxxxxxxxxxx> wrote:
>>>> On 06/08/2012 09:01 AM, Lonni J Friedman wrote:
>>>>>
>>>>> On Thu, Jun 7, 2012 at 5:07 PM, Jerry Sievers <gsievers19@xxxxxxxxxxx>
>>>>> wrote:
>>>>>>
>>>>>> You might try stopping pg_basebackup in place with SIGSTOP and check
>>>>>> if problem goes away. SIGCONT and you should start having
>>>>>> sluggishness again.
>>>>>>
>>>>>> If verified, then any sort of throttling mechanism should work.
>>>>>
>>>>> I'm certain that the problem is triggered only when pg_basebackup is
>>>>> running. It's very predictable, and goes away as soon as pg_basebackup
>>>>> finishes running. What do you mean by a throttling mechanism?
>>>>
>>>> Sure, it only happens when pg_basebackup is running. But if you *pause*
>>>> pg_basebackup, so it's still running but not currently doing work, does the
>>>> problem go away? Does it come back when you unpause pg_basebackup? That's
>>>> what Jerry was telling you to try.
>>>>
>>>> If the problem goes away when you pause pg_basebackup and comes back when
>>>> you unpause it, it's probably a system load problem.
>>>>
>>>> If it doesn't go away, it's more likely to be a locking issue or something
>>>> _other_ than simple load.
>>>>
>>>> SIGSTOP ("kill -STOP") pauses a process, and SIGCONT ("kill -CONT") resumes
>>>> it, so on Linux you can use these to try and find out. When you SIGSTOP
>>>> pg_basebackup then the postgres backend associated with it should block
>>>> shortly afterwards as its buffers fill up and it can't send more data, so
>>>> the load should come off the server.
>>>>
>>>> A "throttling mechanism" refers to anything that limits the rate or speed of
>>>> a thing.
>>>> In this case, what you want to do if your problem is system
>>>> overload is to limit the speed at which pg_basebackup does its work so other
>>>> things can still get work done. In other words you want to throttle it.
>>>> Typical throttling mechanisms include the "ionice" and "renice" commands to
>>>> change I/O and CPU priority, respectively.
>>>>
>>>> Note that you may need to change the priority of the *backend* that
>>>> pg_basebackup is using, not necessarily the pg_basebackup command itself.
>>>> I haven't done enough with Pg's replication to know how that works, so
>>>> someone else will have to fill that bit in.
>>>
>>> Thanks for your reply. I've confirmed that issuing a SIGSTOP does
>>> eliminate the thrashing, and issuing a SIGCONT resumes the thrash.
>>>
>>> I've looked at iostat output both before & during pg_basebackup runs,
>>> and I'm not seeing any indication that the problem is due to disk IO
>>> bottlenecks. The numbers don't vary very much at all between the good
>>> & bad times. This is typical when pg_basebackup is running:
>>> ########
>>> Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>>> md0        0.00    0.00  67.76  68.62   4.42   1.46     88.34      0.00   0.00     0.00     0.00   0.00   0.00
>>> ########
>>>
>>> and this is when the system is ok:
>>> ########
>>> Device:  rrqm/s  wrqm/s    r/s    w/s  rMB/s  wMB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
>>> md0        0.00    0.00  68.04  68.56   4.44   1.46     88.39      0.00   0.00     0.00     0.00   0.00   0.00
>>> ########
>>>
>>> I looked at vmstat output, but nothing is jumping out at me as being
>>> dramatically different when pg_basebackup is running. swap in and
>>> swap out are zero 100% of the time for the good & bad perf cases. I
>>> can post example output if someone is interested, or if there's
>>> something specific that I should be looking at as a potential problem,
>>> let me know.
>>
>> Did you set synchronous_standby_names to '*'?
>> If so, the problem you encountered can happen.
>>
>> When synchronous_standby_names is '*', you cannot control which
>> standby takes the role of synchronous standby. The standby that you
>> expect to run as an asynchronous one might be the synchronous one. So
>> my guess is that at first one of your three standbys was running as
>> the synchronous standby, and all queries were executed normally. But
>> when you started pg_basebackup, pg_basebackup unexpectedly
>> took over the role of synchronous standby from another standby. Since
>> pg_basebackup doesn't send information about replication
>> progress back to the master, all queries (more precisely, transaction
>> commits) got stuck, and kept waiting for the reply from the synchronous
>> standby.
>>
>> You can avoid this problem by setting synchronous_standby_names
>> to the names of your standbys instead of '*'.
>
> I don't have synchronous_standby_names set at all. I'm only doing
> asynchronous replication.

Hmm... I have no idea what happened in your environment, for now.
Could you show me a self-contained test case?

Regards,

-- 
Fujii Masao

-- 
Sent via pgsql-admin mailing list (pgsql-admin@xxxxxxxxxxxxxx)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-admin