Re: [PATCH] prevent slapd from hanging under unlikely circumstances

> On 3 Feb 2020, at 23:43, Jay Fenlason <ds389@xxxxxxxxxxxxxxx> wrote:
> 
> On Mon, Feb 03, 2020 at 10:38:59AM +1000, William Brown wrote:
>> 
>> 
>>> On 1 Feb 2020, at 12:10, Jay Fenlason <ds389@xxxxxxxxxxxxxxx> wrote:
>>> 
>>> I have a small FreeIPA deployment of ~6-8 servers running on CentOS
>>> 7.7.  Due to the addition and removal of some of the servers, some
>>> cruft (tombstones, replication conflicts, etc.) has crept into the
>>> directory.  I noticed that when I attempted to delete some of the
>>> cruft entries, ns-slapd would hang, failing to process requests or
>>> even shut down.
> 
>> Can you tell us exactly what entries you noticed and how you attempted to delete them? There are certainly some things like tombstones and such that you shouldn't be touching as they are part of the internal replication state machine.
> 
> No, I don't remember what entries they were.  I was following
> instructions from:
> https://docs.fedoraproject.org/en-US/Fedora/18/html/FreeIPA_Guide/ipa-replica-manage.html
> (or maybe elsewhere) using ldapdelete to remove tombstones for a truly
> deleted server.

I think this advice is quite outdated now; we potentially have better tools to handle this. But keeping ownership of that content, maintaining it, and getting Google to show the latest version is a really difficult problem ...
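
For the record, on current FreeIPA (domain level 1) the supported path is something like this (the hostname is a placeholder), which also does the RUV cleanup for you:

ipa server-del replica2.example.com

and any leftover replica IDs can be inspected and cleaned with:

ipa-replica-manage list-ruv
ipa-replica-manage clean-ruv {replica-id}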

> 
>> Knowing what you did will also help us to create a test case and
>> reproducer to validate your patch.
> 
> I found the bug by doing a series of "ipa-client-install" runs (with
> lots of arguments), followed by
> echo ca_host = {a not-firewalled IPA CA} >> /etc/ipa/default.conf
> echo [global] > /etc/ipa/installer.conf
> echo ca_host = {ditto} >> /etc/ipa/installer.conf
> echo {password} | kinit admin
> ipa hostgroup-add-member ipaservers --hosts $(hostname -f)
> ipa-replica-install --setup-ca --setup-dns --forwarder={ip addr}
> 
> followed by the replica install failing due to network issues,
> misconfigured firewalls, etc, then
> ipa-server-install --uninstall on the host
> and ipa-replica-manage del {failed install host}
> elsewhere in the mesh, sometimes with ldapdelete of the initial
> replication agreement that ipa-replica-manage did not remove.
> 
> Rinse, repeat. . .
> 
> Until ipa-replica-install starts failing because the source LDAP
> server hangs (because of this bug) during the "starting initial
> replication" step.  It was while debugging that failure that I
> discovered that ldapdelete on the tombstone entries also caused the
> LDAP servers to lock up.
> 
> 
>> Thanks for the report :) 
> 
> Incidentally, there's another bug, which I have not investigated,
> where attempting to ldapdelete a problematic tombstone entry
> immediately after restarting the LDAP server returns an error, and
> nothing is deleted on the server.  If you do an ldapsearch, and then
> an ldapdelete, the entry is removed, but then slapd hangs (this bug
> again) and does not respond to searches or deletes (or shutdown
> requests) until you kill -9 it.  I don't know how it relates to this
> bug.

So I think that deleting the tombstones is not the correct (or valid) course of action here. Tombstones are a really important part of the replication lifecycle, so if anything we need to take stronger steps to prevent a client from being able to delete them at all. This makes me question the patch you have provided, because you shouldn't be in a position to delete tombstones in the first place; only the server internally should be purging (deleting) them once replication is known to be in a consistent state. I am happy to explain the functionality of tombstones further if you are interested.
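
If you want to see what tombstones exist without touching them, a read-only search along these lines should work (the suffix is a placeholder):

ldapsearch -D "cn=Directory Manager" -W -b {suffix} "(objectClass=nsTombstone)" nscpEntryDN nsUniqueId

Note that the filter has to name nsTombstone explicitly, otherwise the server hides them. The special entry with nsuniqueid=ffffffff-ffffffff-ffffffff-ffffffff is the RUV tombstone and must never be deleted by a client.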

Deleting a conflict entry, however, is just fine, so that shouldn't have caused the issue.
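
Conflict entries are easy to spot if you want to audit them; something like this should find them (suffix is again a placeholder):

ldapsearch -D "cn=Directory Manager" -W -b {suffix} "(&(objectClass=ldapSubEntry)(nsds5ReplConflict=*))" nsds5ReplConflict

Once you decide which copy wins, the losing entry can be removed with a plain ldapdelete of its nsuniqueid-prefixed DN.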

I wonder if a contributing factor here is that ipa-replica-install is re-using replica IDs, which could cause replication to have a problem.

Perhaps the solution here is to have ipa-replica-install attempt a cleanallruv on any replica ID it's *about* to use, in case that ID has been re-used.
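
For reference, a cleanallruv can be kicked off by hand with a task entry roughly like this (replica ID 7 and the suffix are placeholders):

ldapmodify -D "cn=Directory Manager" -W <<EOF
dn: cn=clean 7,cn=cleanallruv,cn=tasks,cn=config
changetype: add
objectclass: extensibleObject
replica-base-dn: {suffix}
replica-id: 7
replica-force-cleaning: no
cn: clean 7
EOF

On a FreeIPA deployment the same thing is wrapped by "ipa-replica-manage clean-ruv 7".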

My thinking at this point is that there is something else going on, and the issue may lie in the series of interactions in the ipa replica steps you have taken. Have you contacted the freeipa-users group about this at all?

> 
>    -- JF

—
Sincerely,

William Brown

Senior Software Engineer, 389 Directory Server
SUSE Labs
_______________________________________________
389-devel mailing list -- 389-devel@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to 389-devel-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-devel@xxxxxxxxxxxxxxxxxxxxxxx



