Re: [389-users] Master caught in infinite loop

Daniel Fenert <daniel@xxxxxxxxx> · Fri, 18 Nov 2011 20:07:16 +0100

On 2011-11-18 19:49, Rich Megginson wrote:
> On 11/18/2011 11:46 AM, Daniel Fenert wrote:
>> On 2011-11-18 14:42, Rich Megginson wrote:
>>> On 11/18/2011 05:08 AM, Daniel Fenert wrote:
>>>> Hi,
>>>>
>>>> I'm using 389ds 1.2.5 with replication, my current setup:
>>>>
>>>> Master
>>>> |     \
>>>> L1     L2
>>>> | \    |  \
>>>> S1 S2 S3  S4
>>>>
>>>> L* - acting as slave to "master" and master to "S*"
>>>> S* - slaves to L*
>>>>
>>>>
>>>> From time to time (usually a few months between problems) the
>>>> "master" gets stuck in what looks like an infinite loop.
>>>> Judging by the access log, it stops processing queries but keeps
>>>> accepting new connections until it runs out of fds.
>>>> After that it won't shut down cleanly; only SIGKILL saves the day.
>>>>
>>>> Workload:
>>>> Master is used only for updates, maybe 20 connections/s.
>>>> L* are used only for replication.
>>>> All binds and search queries are targeted at S*, which are read-only.
>>>>
>>>> With previous setup (less complicated), we've also seen this problem:
>>>> Master
>>>> |  |  |  \
>>>> S1 S2 S3  S4...
>>>>
>>>> Is there a chance that upgrading to the latest version will fix the
>>>> problem?
>>>> Have there been any fixes in this area? The upgrade will be complex
>>>> as hell ;)
>>>>
>>>> Error log from last problem:
>>>>   - Not listening for new connections - too many fds open
>>> Have you tried increasing the number of fds to 8192?
>>
>> Yes, but it doesn't make sense - during normal operation master uses 
>> no more than 50-60 fd's.
> Right.  I'm not suggesting this is the root cause of the problem, but 
> increasing the number of fds could help reduce the occurrence of the 
> problem.

By the time the number of fds in use started to grow, the server had
already stopped processing queries.
I think giving it more fds would only delay, by a few minutes, the log
message saying that it stopped accepting new connections :)
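
A minimal sketch of how the fd growth described above could be caught
before the limit is hit. It assumes a Linux host where /proc/<pid>/fd is
readable by the user running the check (typically root or the slapd
user). The function names and the factor of 4 are hypothetical; only the
"50-60 fds in normal operation" baseline comes from this thread.

```python
import os


def count_open_fds(pid, proc_root="/proc"):
    """Count the entries in <proc_root>/<pid>/fd, one per open descriptor."""
    return len(os.listdir(os.path.join(proc_root, str(pid), "fd")))


def is_fd_growth_suspicious(open_fds, normal_max=60, factor=4):
    """Flag a count far above the normal ceiling (assumed here: 60 fds).

    The factor is an arbitrary illustration, not a tuned value.
    """
    return open_fds > normal_max * factor
```

Run periodically against the slapd pid, a check like this would report
the climb well before the 4096 limit, leaving time to grab a stack trace
while the process is still in the stuck state.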

>>
>>>>   - slapd shutting down - signaling operation threads
>>>>   - slapd shutting down - waiting for 120 threads to terminate
>>> Does the server shut down on its own, or did you shut it down 
>>> normally (i.e. service dirsrv stop)?
>>
>> We have tried to stop it using init.d scripts.
> 120 threads?  Did you increase nsslapd-threadnumber?
> If not, then I'm very curious about what all those threads are doing.

Yes, we raised the number of threads a long time ago, when the master
was also used for queries and we hit performance problems.
Nowadays these threads just sit idle; I forgot to lower the thread
number again.

>>
>>>> ... SIGKILL ...
>>>>   - 389-Directory/1.2.5 B2010.012.2034 starting up
>>>>   - Detected Disorderly Shutdown last time Directory Server was 
>>>> running,
>>>> recovering database.
>>>>   - slapd started.  Listening on All Interfaces port 389 for LDAP 
>>>> requests
>>>>
>>>> Number of fds: 4096.
>>> Since 1.2.5 we have fixed a number of bugs around connection 
>>> handling.  You might find that 1.2.9.9 (current stable version) 
>>> works much better for you.
>>
>> OK, we'll try to upgrade.
>>
>> How should we upgrade such a complex setup?
>> Should we try top-to-bottom approach (master first, then L*, then S*) 
>> or bottom-to-top (S*, L*, master last)?
> bottom to top

Thanks, we'll try in the next few weeks.
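
To make the bottom-to-top order concrete for the topology in this
thread, here is a rough sketch that walks the replication tree and lists
the servers deepest-first. The dictionary layout and the function name
are assumptions for illustration only; the server names and the tree
shape are taken from the diagram above.

```python
from collections import deque

# Supplier -> consumers, as drawn in the diagram in this thread.
TOPOLOGY = {
    "Master": ["L1", "L2"],
    "L1": ["S1", "S2"],
    "L2": ["S3", "S4"],
}


def upgrade_order(topology, root):
    """Return server names ordered bottom-to-top (deepest level first)."""
    depth = {root: 0}
    queue = deque([root])
    # Breadth-first walk to record how far each server is from the root.
    while queue:
        node = queue.popleft()
        for consumer in topology.get(node, []):
            if consumer not in depth:
                depth[consumer] = depth[node] + 1
                queue.append(consumer)
    # Stable sort: deepest servers first, discovery order within a level.
    return sorted(depth, key=lambda name: -depth[name])


print(upgrade_order(TOPOLOGY, "Master"))
# ['S1', 'S2', 'S3', 'S4', 'L1', 'L2', 'Master']
```

Servers within one level can be upgraded one at a time, so the rest of
that level keeps serving while each one is restarted.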

>> Shutting down all servers is not really an option.
>>
>

--
389 users mailing list
389-users@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/389-users