Re: DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock

Hi Kees,

Rotte successfully processed the problematic update (60fe8535001000030000), updating the database and recording the update in the changelog.

Later, Rotte tried to replicate the update to linge, but the update failed on linge:

[26/Jul/2021:11:44:37.947738548 +0200] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn - retry (49) the transaction (csn=60fe8535001000030000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))

Rotte noticed this failure:

[26/Jul/2021:11:44:39.055890736 +0200] - WARN - NSMMReplicationPlugin - repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com" (linge:389): Consumer failed to replay change (uniqueid 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535001000030000): Operations error (1). Will retry later

As mentioned in the log, it retried later to replicate the update, and this time it succeeded. You said the value was correct on all replicas. You can confirm that by running 'grep 60fe8535001000030000 /var/log/dirsrv/<instance>/access*' on rotte, linge and iparep4; the failed attempt should show err=1.
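If you prefer a summary over raw grep output, the same check can be sketched in Python. This is only an illustration: it assumes the access log lines that mention the CSN also carry an err=<code> field, and the function names are made up for this example.

```python
import glob
import re
import sys
from collections import Counter

# Matches the numeric result code in fields such as "err=1" or "err=0".
ERR_RE = re.compile(r"\berr=(-?\d+)")


def err_codes_for_csn(lines, csn):
    """Count the err=<code> values on lines that mention the given CSN."""
    counts = Counter()
    for line in lines:
        if csn in line:
            match = ERR_RE.search(line)
            if match:
                counts[int(match.group(1))] += 1
    return counts


def scan_access_logs(pattern, csn):
    """Apply err_codes_for_csn to every file matching the glob pattern."""
    counts = Counter()
    for path in sorted(glob.glob(pattern)):
        with open(path, errors="replace") as handle:
            counts.update(err_codes_for_csn(handle, csn))
    return counts


if __name__ == "__main__" and len(sys.argv) == 3:
    # Example: script.py '/var/log/dirsrv/<instance>/access*' 60fe8535001000030000
    pattern, csn = sys.argv[1], sys.argv[2]
    for code, n in sorted(scan_access_logs(pattern, csn).items()):
        print(f"err={code}: {n} line(s)")
```

Run it on each replica. A replica that first failed and later applied the update would be expected to report both err=1 and err=0 for that CSN.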

The reason for the original replication failure (on linge) is possibly related to the deadlock policy. By default, in case of a DB deadlock, DS gives priority to the youngest transaction and aborts the other transactions to resolve the deadlock. This default works fine in general, but in the case of IPA, where updates are very often nested (because of the many plugin calls), it is not optimal. You may try nsslapd-db-deadlock-policy: 6 (priority to writers).

DB_LOCK_DEADLOCK is a normal event. The server just retries. If there are too many retries, the operation itself fails, and replication simply sends the failed operation again. At the moment your topology looks healthy; you may still try updating the deadlock policy.
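To make the retry behaviour concrete, here is a minimal Python sketch of the pattern. It is not the server's actual code: DeadlockError and run_with_retries are hypothetical names, and the limit of 50 attempts is only an assumption inferred from the "retry (49)" in the log.

```python
class DeadlockError(Exception):
    """Hypothetical stand-in for a DB_LOCK_DEADLOCK result from the database."""


def run_with_retries(txn, max_retries=50):
    """Call txn() until it succeeds or max_retries attempts have deadlocked."""
    for _attempt in range(max_retries):
        try:
            return txn()
        except DeadlockError:
            # This locker was chosen as the deadlock victim: start the
            # transaction over from the beginning.
            continue
    # Too many retries: the operation itself fails. In the replication case
    # the supplier is then expected to send the same update again later.
    raise RuntimeError(f"transaction still deadlocked after {max_retries} attempts")
```

A deadlock that clears after a few attempts is invisible to the client. Only when every attempt loses does the failure surface, which matches the "Will retry later" message seen on rotte.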

Regards
thierry


On 7/28/21 2:10 PM, Kees Bakker wrote:
Hi,

This is in an IPA deployment. We have three masters/replicas in a triangular topology, A-B, B-C, C-A.
The systems are called: rotte, linge and iparep4.

rotte is CentOS 7, with 389-ds-base-1.3.9.1-13.el7_7.x86_64
linge and iparep4 are CentOS 8 Stream, with 389-ds-base-1.4.3.23-2.module_el8.5.0+835+5d54734c.x86_64

Yesterday I removed some members from a user group on rotte. This caused the following errors
on linge (and on iparep4).

Jul 26 11:44:37 linge.example.com ns-slapd[282944]: [26/Jul/2021:11:44:37.947738548 +0200] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn - retry (49) the transaction (csn=60fe8535001000030000) failed (rc=-30993 (BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock))
Jul 26 11:44:38 linge.example.com ns-slapd[282944]: [26/Jul/2021:11:44:38.000964611 +0200] - ERR - NSMMReplicationPlugin - changelog program - _cl5WriteOperationTxn - Failed to write entry with csn (60fe8535001000030000); db error - -30993 BDB0068 DB_LOCK_DEADLOCK: Locker killed to resolve a deadlock
Jul 26 11:44:38 linge.example.com ns-slapd[282944]: [26/Jul/2021:11:44:38.025996273 +0200] - ERR - NSMMReplicationPlugin - write_changelog_and_ruv - Can't add a change for cn=vpn_users,cn=groups,cn=accounts,dc=example,dc=com (uniqid: 31283c01-a16511e9-93cf90e8-ab7c8ee8, optype: 8) to changelog csn 60fe8535001000030000
Jul 26 11:44:38 linge.example.com ns-slapd[282944]: [26/Jul/2021:11:44:38.062640602 +0200] - ERR - NSMMReplicationPlugin - process_postop - Failed to apply update (60fe8535001000030000) error (1).  Aborting replication session(conn=53596 op=65)

On rotte:

jul 26 11:44:39 rotte.example.com ns-slapd[2705]: [26/Jul/2021:11:44:39.055890736 +0200] - WARN - NSMMReplicationPlugin - repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com" (linge:389): Consumer failed to replay change (uniqueid 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535001000030000): Operations error (1). Will retry later.
jul 26 11:44:39 rotte.example.com ns-slapd[2705]: [26/Jul/2021:11:44:39.058198988 +0200] - WARN - NSMMReplicationPlugin - repl5_inc_update_from_op_result - agmt="cn=meTolinge.example.com" (linge:389): Consumer failed to replay change (uniqueid 31283c01-a16511e9-93cf90e8-ab7c8ee8, CSN 60fe8535003300030000): Operations error(1). Will retry later.
jul 26 11:44:39 rotte.example.com ns-slapd[2705]: [26/Jul/2021:11:44:39.069825407 +0200] - ERR - NSMMReplicationPlugin - release_replica - agmt="cn=meTolinge.example.com" (linge:389): Unable to send endReplication extended operation (Operations error)
jul 26 11:44:46 rotte.example.com ns-slapd[2705]: [26/Jul/2021:11:44:46.561562313 +0200] - INFO - NSMMReplicationPlugin - bind_and_check_pwp - agmt="cn=meTolinge.example.com" (linge:389): Replication bind with GSSAPI auth resumed

As far as I can see the user group is correctly modified on all replicas. But it doesn't
look healthy to me.

Is there anything I can do to see what went wrong? Is there something to improve
in the configuration?
_______________________________________________
389-users mailing list -- 389-users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to 389-users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-users@xxxxxxxxxxxxxxxxxxxxxxx
Do not reply to spam on the list, report it: https://pagure.io/fedora-infrastructure



