On 06/04/2013 06:47 PM, Steve Crawford wrote:
> On 06/04/2013 04:53 AM, Kouber Saparev wrote:
>> Hello,
>>
>> We are using the 9.1 built-in streaming replication.
>>
>> Recently our slave nodes fell behind because of an UPDATE statement.
>> It took about 3 minutes to execute, but it affected half a million
>> records, hence the replication broke with the "requested WAL segment
>> ... has already been removed" series of error messages.
>>
>> The WAL settings we have are:
>>
>> max_wal_senders = 6
>> wal_keep_segments = 60
>> max_standby_archive_delay = 300s
>>
>> I guess increasing the wal_keep_segments value would prevent it from
>> happening in the future, but by how much? What value would be high
>> enough?
>
> You can use WAL shipping to protect against this or set
> wal_keep_segments higher. I set my main server to a tad over 1,000 and
> know I can do a full restore on the master without coming close to
> breaking replication. My xlog dir is 17G. A bit of a waste, perhaps,
> but I've noted no ill effects and it's still only a sliver of the
> total drive capacity.

Well, streaming replication already involves WAL shipping. Do you mean
to archive the WALs somewhere and then scp them with a cron job? I
doubt that would be fast enough either.

>> Also we noticed some strange error message appearing shortly before
>> and after this same statement: "LOG: out of file descriptors: Too
>> many open files; release and retry".
>>
>> Could it be related somehow, and what does it mean exactly?
>
> What is your setup (Linux? Mac? Windows? VM in the cloud? How many
> simultaneous connections? ...) You will find a lot of info in old
> messages on the subject but, annotating the docs: if the kernel is
> enforcing a safe per-process limit, you don't need to worry about this
> setting. But on some platforms (notably, most BSD systems - looking at
> you, Mac), the kernel will allow individual processes to open many
> more files than the system can actually support if many processes all
> try to open that many files....
>
> An old email from Tom Lane notes that you need to make sure your
> kernel can support approximately
> max_connections * (max_files_per_process + max_connections)
> open file handles, plus any requirements imposed by other processes on
> the system, and comments that Mac treats each semaphore as an open
> file.
>
> My interpretation is that if your kernel enforces things properly you
> don't need to worry. If it doesn't, reduce your max_connections and/or
> max_files_per_process as needed.

We are running virtual machines with Ubuntu Server 12.04.1 LTS. The
error looks very confusing given:

root@DB1:~# cat /proc/sys/fs/file-max
1621645

root@DB1:~# lsof | wc -l
7013

It is not normally present in our logs, which is why I suspect it is
correlated with the WAL issue above. Could it be that the master was
not able to properly open the WAL sender for this transaction?

--
Kouber Saparev
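
For reference, the "archive the WALs somewhere" approach discussed above
does not require a cron job or scp loop: on 9.1 the master can archive
each completed segment through archive_command, and a streaming standby
configured with restore_command will fall back to the archive whenever
the segment it needs has already been removed from pg_xlog. A minimal
sketch, assuming an archive directory /mnt/wal_archive that is reachable
from both machines; the path, host name, and user below are placeholders,
not values from this thread:

    # postgresql.conf on the master
    wal_level = hot_standby
    archive_mode = on
    archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'

    # recovery.conf on the standby
    standby_mode = 'on'
    primary_conninfo = 'host=db1.example.com port=5432 user=replicator'
    restore_command = 'cp /mnt/wal_archive/%f %p'

With an archive in place, wal_keep_segments only needs to cover the time
it takes the standby to switch over to restore_command, rather than the
full WAL volume produced by a large UPDATE.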
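
On the "out of file descriptors" message: fs.file-max is the system-wide
ceiling, while the "Too many open files" wording usually corresponds to
the per-process limit (RLIMIT_NOFILE) of the individual backend, so the
per-process limit is worth checking as well. A quick illustrative check,
using a placeholder backend PID and assumed configuration values for the
formula quoted above:

    # per-process limit that applies to a running backend
    # (12345 is a placeholder PID, not from this thread)
    grep 'Max open files' /proc/12345/limits

    # limit inherited by processes started from this shell
    ulimit -n

    # system-wide ceiling (what file-max above reports)
    cat /proc/sys/fs/file-max

    # rough sizing per the formula, with assumed values
    # max_connections = 100, max_files_per_process = 1000:
    #   100 * (1000 + 100) = 110,000 descriptors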