Re: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock

Thierry Bordaz <tbordaz@xxxxxxxxxx> · Wed, 28 Jul 2021 16:19:29 +0200

On 7/28/21 3:47 PM, Kees Bakker wrote:
When you said:
> You may confirm that with a 'grep 60fe8535001000030000 
<rotte,linge,iparep4>/var/log/dirsrv/<instance>/access*' => err=1

On linge there is one hit with err=1, quickly followed by a hit with 
err=0.
Is that a confirmation that replication succeeded after a retry?

Yes, that was a typo. The update completed successfully everywhere, with
err=0.

On 28-07-2021 14:36, Thierry Bordaz wrote:
Hi Kees,

Rotte successfully processed the problematic update
(60fe8535001000030000), updating the database and recording the update
in the changelog.

Later, Rotte tried to replicate the update to linge, but the update
failed on linge:

[26/Jul/2021:11:44:37.947738548 +0200] - ERR - NSMMReplicationPlugin -
changelog program - _cl5WriteOperationTxn - retry (49) the transaction
(csn=60fe8535001000030000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK:
Locker killed to resolve a deadlock))

Rotte noticed this failure

[26/Jul/2021:11:44:39.055890736 +0200] - WARN - NSMMReplicationPlugin -
repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
(linge:389): Consumer failed to replay change (uniqueid
31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535001000030000):
Operations error (1). Will retry later

As mentioned in the log, Rotte later retried replicating the update,
and this time it succeeded. You said the value was correct on all
replicas. You may confirm that with a 'grep 60fe8535001000030000
<rotte,linge,iparep4>/var/log/dirsrv/<instance>/access*' => err=1
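
The same check can be scripted. The following is only a minimal sketch:
it assumes that the access log lines for a replicated update carry both
an "err=<n>" and a "csn=<csn>" token, and the function names are
hypothetical helpers, not part of 389-ds.

```python
import re

# Assumption: the result code appears as an "err=<number>" token.
ERR_RE = re.compile(r"\berr=(\d+)\b")


def err_codes_for_csn(lines, csn):
    """Return, in log order, the err= codes of lines mentioning the CSN."""
    needle = "csn=" + csn  # assumption: the CSN is logged as "csn=<value>"
    codes = []
    for line in lines:
        if needle in line:
            match = ERR_RE.search(line)
            if match:
                codes.append(int(match.group(1)))
    return codes


def replay_succeeded(codes):
    """True when the last logged attempt for the CSN ended with err=0."""
    return bool(codes) and codes[-1] == 0
```

To use it, open each access* file of a replica, pass the file object as
`lines`, and look at the returned list. A result such as [1, 0] would
mean a failed attempt followed by a successful retry.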

The original replication failure (on linge) is possibly related to the
deadlock policy. By default, when a DB deadlock occurs, DS gives
priority to the youngest transaction and aborts the other transactions
to resolve the deadlock. This default works fine in general, but it is
not optimal for IPA, where updates are very often nested (because of
the many plugin calls). You may try nsslapd-db-deadlock-policy: 6
(priority to writers).
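
One way to apply such a change is an LDIF modify fed to ldapmodify. The
sketch below only builds the LDIF text. The entry DN is an assumption
(the usual ldbm database config entry) and should be checked against
your version; the helper name is hypothetical.

```python
# Assumption: the attribute lives on the ldbm database config entry.
LDBM_CONFIG_DN = "cn=config,cn=ldbm database,cn=plugins,cn=config"


def deadlock_policy_ldif(policy, dn=LDBM_CONFIG_DN):
    """Build an LDIF 'modify' record setting nsslapd-db-deadlock-policy."""
    if not isinstance(policy, int) or policy < 0:
        raise ValueError("policy must be a non-negative integer")
    return "\n".join([
        f"dn: {dn}",
        "changetype: modify",
        "replace: nsslapd-db-deadlock-policy",
        f"nsslapd-db-deadlock-policy: {policy}",
        "",  # trailing newline ends the record
    ])
```

print(deadlock_policy_ldif(6)) gives text that can be piped to
ldapmodify, bound as Directory Manager. The server may need a restart
before the new value takes effect.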

DB_LOCK_DEADLOCK is a normal event: the server just retries. After too
many retries, the operation itself fails, and replication simply sends
the failing operation again. At the moment your topology looks healthy,
but you may try updating the deadlock policy.
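
The retry behaviour described above can be pictured with a small
sketch. This is an illustration only, not the server's code: the
exception classes and the function are hypothetical, and the limit of
50 attempts is a guess based on the "retry (49)" message in the log.

```python
class DeadlockError(Exception):
    """Stands in for a DB_LOCK_DEADLOCK result from the database."""


class OperationFailed(Exception):
    """Raised when every attempt hit a deadlock."""


def run_with_retries(txn, max_retries=50):
    """Call txn() until it succeeds or max_retries attempts are used."""
    for _attempt in range(max_retries):
        try:
            return txn()
        except DeadlockError:
            # This locker was chosen as the victim; just try again.
            continue
    # Too many deadlocks: the operation itself fails. In replication,
    # the supplier would then send the same update again later.
    raise OperationFailed(f"gave up after {max_retries} attempts")
```

With this model, an occasional deadlock stays invisible to the client.
Only a run of consecutive deadlocks surfaces as an operations error,
which the supplier answers by replaying the update in a later session.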

Regards
thierry

On 7/28/21 2:10 PM, Kees Bakker wrote:
Hi,

This is in an IPA deployment. We have three masters/replicas in a
triangular topology, A-B, B-C, C-A.
The systems are called: rotte, linge and iparep4.

rotte is CentOS 7, with 389-ds-base-1.3.9.1-13.el7_7.x86_64
linge and iparep4 are CentOS 8 Stream, with
389-ds-base-1.4.3.23-2.module_el8.5.0+835+5d54734c.x86_64

Yesterday I removed some members from a user group on rotte. This
caused the following errors on linge (and on iparep4).

Jul 26 11:44:37 linge.example.com ns-slapd[282944]:
[26/Jul/2021:11:44:37.947738548 +0200] - ERR - NSMMReplicationPlugin -
changelog program - _cl5WriteOperationTxn - retry (49) the transaction
(csn=60fe8535001000030000) failed (rc=-30993 (BDB0068
DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
[26/Jul/2021:11:44:38.000964611 +0200] - ERR - NSMMReplicationPlugin -
changelog program - _cl5WriteOperationTxn - Failed to write entry with
csn (60fe8535001000030000); db error - -30993 BDB0068
DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
[26/Jul/2021:11:44:38.025996273 +0200] - ERR - NSMMReplicationPlugin -
write_changelog_and_ruv - Can't add a change for
cn=vpn_users,cn=groups,cn=accounts,dc=example,dc=com (uniqid:
31283c01-a16511e9-93cf90e8-ab7c8ee8, optype: 8) to changelog csn
60fe8535001000030000
Jul 26 11:44:38 linge.example.com ns-slapd[282944]:
[26/Jul/2021:11:44:38.062640602 +0200] - ERR - NSMMReplicationPlugin -
process_postop - Failed to apply update (60fe8535001000030000) error
(1).  Aborting replication session(conn=53596 op=65)

On rotte

jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
[26/Jul/2021:11:44:39.055890736 +0200] - WARN - NSMMReplicationPlugin
- repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
(linge:389): Consumer failed to replay change (uniqueid
31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535001000030000):
Operations error (1). Will retry later.
jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
[26/Jul/2021:11:44:39.058198988 +0200] - WARN - NSMMReplicationPlugin
- repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com"
(linge:389): Consumer failed to replay change (uniqueid
31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535003300030000):
Operations error(1). Will retry later.
jul 26 11:44:39 rotte.example.com ns-slapd[2705]:
[26/Jul/2021:11:44:39.069825407 +0200] - ERR - NSMMReplicationPlugin -
release_replica - agmt="cn=meTolinge.example.com" (linge:389): Unable
to send endReplication extended operation (Operations error)
jul 26 11:44:46 rotte.example.com ns-slapd[2705]:
[26/Jul/2021:11:44:46.561562313 +0200] - INFO - NSMMReplicationPlugin
- bind_and_check_pwp - agmt="cn=meTolinge.example.com" (linge:389):
Replication bind with GSSAPI auth resumed

As far as I can see the user group is correctly modified on all
replicas. But it doesn't look healthy to me.

Is there anything I can do to see what went wrong? Is there something
to improve in the configuration?
_______________________________________________
389-users mailing list -- 389-users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to 389-users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-users@xxxxxxxxxxxxxxxxxxxxxxx
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure