Re: [389-users] Master caught in infinite loop

Rich Megginson <rmeggins@xxxxxxxxxx> · Fri, 18 Nov 2011 11:49:12 -0700



On 11/18/2011 11:46 AM, Daniel Fenert wrote:
> On 2011-11-18 14:42, Rich Megginson wrote:
>> On 11/18/2011 05:08 AM, Daniel Fenert wrote:
>>> Hi,
>>>
>>> I'm using 389ds 1.2.5 with replication, my current setup:
>>>
>>> Master
>>> |     \
>>> L1     L2
>>> | \    |  \
>>> S1 S2 S3  S4
>>>
>>> L* - acting as slave to "master" and master to "S*"
>>> S* - slaves to L*
>>>
>>>
>>> From time to time (usually a few months between problems) we encounter
>>> "master" going into some kind of infinite loop.
>>> After analyzing the access log, it looks like it stops processing queries
>>> but keeps accepting new connections until it runs out of fds.
>>> After that, it won't stop cleanly; only SIGKILL saves the day.
>>>
>>> Workload:
>>> Master is used only for updates, maybe 20 connections/s.
>>> L* are used only for replication.
>>> All binds and search queries are targeted at S*, which are read-only.
>>>
>>> With our previous (less complicated) setup, we've also seen this problem:
>>> Master
>>> |  |  |  \
>>> S1 S2 S3  S4...
>>>
>>> Is there a chance that upgrading to the latest version will fix the
>>> problem?
>>> Were there any fixes in this area? The upgrade will be complex as hell ;)
>>>
>>> Error log from last problem:
>>>   - Not listening for new connections - too many fds open
>> Have you tried increasing the number of fds to 8192?
>
> Yes, but it doesn't make sense: during normal operation the master uses
> no more than 50-60 fds.
Right.  I'm not suggesting this is the root cause of the problem, but
increasing the number of fds could help reduce the occurrence of the problem.
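To tell whether the fd count creeps up over time or jumps all at once, it may help to sample the master's open fds against its limit. This is a minimal sketch, assuming a Linux host with /proc and permission to read the ns-slapd process's entries. The function names and the 80% warning threshold are hypothetical and are not part of 389-ds.

```python
import os


def open_fd_count(pid="self"):
    # Each entry in /proc/<pid>/fd is one open file descriptor (Linux only).
    return len(os.listdir(f"/proc/{pid}/fd"))


def max_open_files(pid="self"):
    # /proc/<pid>/limits contains a line such as:
    # "Max open files            4096                 4096                 files"
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                soft = line.split()[3]  # soft limit column
                return None if soft == "unlimited" else int(soft)
    return None


def fd_usage_report(pid="self", warn_ratio=0.8):
    # Hypothetical helper: returns (open fds, soft limit, "ok" or "WARN").
    used = open_fd_count(pid)
    limit = max_open_files(pid)
    ratio = used / limit if limit else 0.0
    status = "WARN" if ratio >= warn_ratio else "ok"
    return used, limit, status
```

For example, calling `fd_usage_report(pid=<ns-slapd pid>)` from cron every minute and logging the result would show whether the count leaves the normal 50-60 range well before the hang, or only at the moment it happens.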
>
>>>   - slapd shutting down - signaling operation threads
>>>   - slapd shutting down - waiting for 120 threads to terminate
>> Does the server shutdown on its own, or did you shut it down normally 
>> (i.e. service dirsrv stop)?
>
> We have tried to stop it using the init.d scripts.
120 threads?  Did you increase nsslapd-threadnumber?
If not, then I'm very curious about what all those threads are doing.
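As a first, cheap look at those threads, one could tally the scheduler state of every thread of the hung process. This is a minimal sketch, assuming Linux /proc and read access to the process; the function name is hypothetical. It only reports states, not where each thread is blocked.

```python
import os
from collections import Counter


def thread_states(pid="self"):
    """Tally the scheduler state letter (R, S, D, ...) of every thread."""
    states = Counter()
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/stat") as f:
                stat = f.read()
        except FileNotFoundError:
            # The thread exited between listdir() and open().
            continue
        # Format is "tid (comm) state ...". comm may contain spaces or ')',
        # so the state letter is taken from just after the *last* ')'.
        state = stat[stat.rfind(")") + 2]
        states[state] += 1
    return states
```

Many threads in D (uninterruptible sleep) would hint at I/O. S alone says little. A full stack dump of all threads (for example gdb's `thread apply all bt`) is what actually shows where they are stuck.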
>
>>> ... SIGKILL ...
>>>   - 389-Directory/1.2.5 B2010.012.2034 starting up
>>>   - Detected Disorderly Shutdown last time Directory Server was running, recovering database.
>>>   - slapd started.  Listening on All Interfaces port 389 for LDAP requests
>>>
>>> Number of fds: 4096.
>> Since 1.2.5 we have fixed a number of bugs around connection 
>> handling.  You might find that 1.2.9.9 (current stable version) works 
>> much better for you.
>
> OK, we'll try to upgrade.
>
> How should we upgrade such a complex setup?
> Should we try top-to-bottom approach (master first, then L*, then S*) 
> or bottom-to-top (S*, L*, master last)?
bottom to top
> Shutting down all servers is not really an option.
>
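The bottom-to-top order can be made explicit by walking the replication tree and upgrading the deepest consumers first, one server at a time, so the rest of the topology keeps serving. This is a minimal sketch: the dictionary encoding of the topology and the function name are hypothetical, and it only computes an order. It does not touch any server.

```python
from collections import deque

# Hypothetical encoding of the topology from the original post:
# each supplier maps to the consumers it replicates to.
TOPOLOGY = {
    "Master": ["L1", "L2"],
    "L1": ["S1", "S2"],
    "L2": ["S3", "S4"],
}


def upgrade_order(topology, root):
    """Return servers bottom to top: deepest consumers first, root last."""
    depth = {root: 0}
    queue = deque([root])
    while queue:  # breadth-first walk to find each server's depth
        node = queue.popleft()
        for child in topology.get(node, []):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    # Deepest first; ties broken by name for a stable, readable plan.
    return sorted(depth, key=lambda n: (-depth[n], n))


print(upgrade_order(TOPOLOGY, "Master"))
# prints ['S1', 'S2', 'S3', 'S4', 'L1', 'L2', 'Master']
```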

--
389 users mailing list
389-users@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/389-users