I have 6 replicas, two of which are read-only. I ran into
an issue where a DELETE operation failed on one of the servers
with error code 51 (LDAP busy). The access log showed:
[21/Oct/2014:23:44:44 -0400] conn=78160 op=39510 RESULT err=51 tag=107 nentries=0 etime=3 csn=5447282c000300050000
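For reference, RESULT lines like this one can be pulled out of the access log with a short script. This is only a rough sketch: the regular expression is written against the single line quoted above, the function name find_results is made up for illustration, and other log formats or versions may need adjustments.

```python
import re

# Pattern modeled on the one access-log RESULT line quoted above.
# Assumption: other RESULT lines share the same leading fields.
RESULT_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+conn=(?P<conn>-?\d+)\s+op=(?P<op>-?\d+)\s+"
    r"RESULT\s+err=(?P<err>\d+)\b(?P<rest>.*)$"
)
CSN_RE = re.compile(r"\bcsn=(?P<csn>[0-9a-fA-F]+)")


def find_results(lines, err_code=51):
    """Return a list of dicts for RESULT lines whose err equals err_code."""
    found = []
    for line in lines:
        m = RESULT_RE.match(line.strip())
        if m is None or int(m.group("err")) != err_code:
            continue
        csn = CSN_RE.search(m.group("rest"))
        found.append({
            "timestamp": m.group("ts"),
            "conn": int(m.group("conn")),
            "op": int(m.group("op")),
            "err": int(m.group("err")),
            # The csn field is only present on some operations.
            "csn": csn.group("csn") if csn else None,
        })
    return found


if __name__ == "__main__":
    sample = (
        "[21/Oct/2014:23:44:44 -0400] conn=78160 op=39510 RESULT err=51 "
        "tag=107 nentries=0 etime=3 csn=5447282c000300050000"
    )
    for hit in find_results([sample]):
        print(hit)
```

Collecting the csn values of failed operations this way at least gives a list of updates to check for on the other replicas.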
The application retried the delete several times for a couple
of hours (while the server wasn’t getting any other requests)
and the result was always the same (err=51). Each time that
happened, the error log had the following:
[21/Oct/2014:23:44:44 -0400] - Retry count exceeded in delete
My first question is, what would cause a problem like this?
I simply restarted that directory server and then the update
succeeded. However, when the update went to the other 5
servers, they failed in the same way and the same error was
logged in their log files. But the update wasn’t retried. It
was just skipped and future updates via replication succeeded on
those 5 servers.
My second question is, what’s the best way to monitor for
these types of replication errors? In this
case, nsds5replicaLastUpdateStatus did not indicate a problem.
If I had not been looking at the error file on those 5 hosts,
I’m wondering how I would have known that a delete failed to
replicate to them. If the answer is to just have something
monitoring the error log files, are there specific search
strings to look for to separate out updates that have failed and
won’t be retried from other errors (e.g. temporary connection
issues)? Just curious if there is a best practice here.
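To make the log-monitoring option concrete, here is a rough sketch of the kind of check I mean. It only knows about the one message I have actually seen ("Retry count exceeded in delete"). The assumption that the same wording appears for other operation types is mine, and the function name scan_error_log is made up for illustration. It does not answer the real question of which messages mean an update was skipped for good.

```python
import re

# Error-log line shape taken from the single line quoted above:
#   [timestamp] - message
ERROR_LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+-\s+(?P<msg>.*)$")

# Only "in delete" was observed; matching any operation word after "in"
# is an assumption, not something confirmed from the server's messages.
RETRY_RE = re.compile(r"Retry count exceeded in (?P<op>\w+)")


def scan_error_log(lines):
    """Return (timestamp, operation) tuples for retry-exceeded messages."""
    hits = []
    for line in lines:
        m = ERROR_LINE_RE.match(line.strip())
        if m is None:
            continue
        retry = RETRY_RE.search(m.group("msg"))
        if retry:
            hits.append((m.group("ts"), retry.group("op")))
    return hits


if __name__ == "__main__":
    sample = [
        "[21/Oct/2014:23:44:44 -0400] - Retry count exceeded in delete",
        "[21/Oct/2014:23:45:00 -0400] - some unrelated message",
    ]
    for ts, op in scan_error_log(sample):
        print(f"{ts}: retry count exceeded during {op}")
```

Something like this could run from cron on each replica, but it would still need a reliable list of strings to match.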
Thanks!
— Shilen