On 9/13/23 09:57, Julian Kippels wrote:
Hi Thierry,
> First, you may install debuginfo; it would help to get a better
> understanding of what happens.
I will try to do that the next time it breaks. Unfortunately this is a
production machine and I can't always take the time to do forensics.
Sometimes I just have to quickly get it up and running again and
restart the service completely. I have not yet found a way to trigger
this in my lab environment.
> Do you know if it recovers after that high CPU peak ?
So far it has never recovered. I have seen the high CPU peak 7 or 8
times now and it is always like this:
1. CPU usage peaks on 2 threads
At the moment I assume it is a MOD eating CPU while writing back to
the changelog, plus trickling. You may get pstacks and 'top -H' to
confirm this (see the sketch after this list).
2. Admin from external server tells me that his system cannot do LDAP
operations anymore.
The stacktrace shows many updates waiting for the above MOD to
complete. In the extreme case, the pending MODs may exhaust the
workers and make the server unresponsive.
3. I try to do some ldapmodify operations, which succeed and get
replicated correctly.
This is surprising: I would expect a MOD, on the server where another
MOD is busy on the changelog, to hang as well. Are you doing your
update on the same backend?
4. At this point there are 2 options:
a. Both the admin from the external server and I restart our
services which temporarily fixes the issue
b. I don't restart my system and after a few hours (where the CPU
peak does not go away) dirsrv completely freezes up and does not
accept any connections anymore.
You may look at the started MODs in the access log and check which
one was hanging, then compare its etime (using the csn) on the other
servers. Does it always occur on the same server?
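For instance, something along these lines (the instance name in the
log path and the csn value are placeholders, not from your setup):

    # Spot the two hot threads (the LWP column gives the thread id)
    top -H -b -n 1 -p "$(pidof ns-slapd)"

    # A MOD that has logged no RESULT yet is the hanging one
    grep " MOD dn=" /var/log/dirsrv/slapd-INST/access
    grep "csn=64f1c2a3000000010000" /var/log/dirsrv/slapd-INST/access

    # On the other replicas, check the etime of the RESULT carrying
    # that csn
    grep "csn=64f1c2a3000000010000" /var/log/dirsrv/slapd-INST/access \
        | grep etime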
regards
thierry
> Regarding the unindexed search, you may check if 'changeNumber' is
> indexed (equality). It looks related to a sync_repl search with no
> cookie or an old cookie. The search is on a different backend than
> Thread 62, so there is no conflict between the sync_repl unindexed
> search and the update on Thread 62.
The equality index is set for changeNumber. I will assume that this is
a different "problem" that has nothing to do with the high CPU load
and freezes, and not look further into it for the time being.
Kind regards
Julian
On 12.09.23 at 14:21, Thierry Bordaz wrote:
Hi Julian,
Difficult to say. I do not recall a specific issue, but I know we
fixed several bugs in sync_repl.
First, you may install debuginfo; it would help to get a better
understanding of what happens.
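On Debian (mentioned further down the thread as Debian 12) that would
look roughly like this; the dbgsym package name is an assumption,
check with 'apt search 389 | grep dbgsym' first:

    # Debug symbols come from the debian-debug archive on bookworm
    echo 'deb http://deb.debian.org/debian-debug bookworm-debug main' \
        | sudo tee /etc/apt/sources.list.d/debug.list
    sudo apt update
    sudo apt install 389-ds-base-dbgsym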
The two threads are likely Thread 62 and the trickle thread (2 to 6),
because of intensive db page updates.
Do you know if it recovers after that high CPU peak ?
A possibility would be a large update being written back to the
changelog. You may retrieve the problematic csn in the access log
(during the high CPU) and dump the update from the changelog with
dbscan (-k).
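Roughly like this; the changelog file path and the csn are
placeholders (in recent versions the replication changelog sits in
the backend's database directory):

    # Dump a single changelog record by key (the csn from the access log)
    dbscan -f /var/lib/dirsrv/slapd-INST/db/userRoot/replication_changelog.db \
           -k 64f1c2a3000000010000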
Regarding the unindexed search, you may check if 'changeNumber' is
indexed (equality). It looks related to a sync_repl search with no
cookie or an old cookie. The search is on a different backend than
Thread 62, so there is no conflict between the sync_repl unindexed
search and the update on Thread 62.
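To check the index, something like this should work; the backend name
'changelog' is an assumption based on the default retro changelog
setup:

    # List the indexes of the changelog backend
    dsconf -D "cn=Directory Manager" ldap://localhost \
        backend index list changelog

    # Or read the index entry directly
    ldapsearch -x -H ldap://localhost -D "cn=Directory Manager" -W \
        -b "cn=changenumber,cn=index,cn=changelog,cn=ldbm database,cn=plugins,cn=config" \
        -s base '(objectClass=*)' nsIndexType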
best regards
thierry
On 9/12/23 13:52, Julian Kippels wrote:
Hi,
there are two threads that are at 100% CPU utilisation. I did not
start any admin task myself; maybe it is some built-in task that is
doing this? Or could an unindexed search on the changelog be causing
this?
I have noticed this message:
NOTICE - ldbm_back_search - Unindexed search: search
base="cn=changelog" scope=1 filter="(changeNumber>=1)" conn=35871 op=1
There is an external server that is reading the changelog and syncing
some things based on it. I don't know why they start at
changeNumber>=1; they should probably start much higher. If it is
possible that this is the cause, I will kick them to stop that ;)
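Something like this on their side would avoid the full scan (host,
bind DN and the state file are made up):

    # Resume from the last processed changeNumber instead of 1
    LAST=$(cat /var/lib/consumer/last-changenumber)
    ldapsearch -x -H ldap://ldap.example.com \
        -D "cn=reader,dc=example,dc=com" -W \
        -b "cn=changelog" -s one "(changeNumber>=$((LAST + 1)))" \
        changeNumber targetDn changeType changes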
I am running version 2.3.1 on Debian 12, installed from the Debian
repositories.
Kind regards
Julian
On 08.09.23 at 13:23, Thierry Bordaz wrote:
Hi Julian,
It looks like an update (Thread 62) is either eating CPU or is
blocked while updating the changelog.
When it occurs, could you run 'top -H -p <pid>' to see if some
threads are eating CPU?
Otherwise (no CPU consumption), you may take a pstack and dump the DB
lock info (db_stat -N -C A -h /var/lib/dirsrv/<inst>/db).
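For example (the instance path is a placeholder):

    PID=$(pidof ns-slapd)

    # gdb-based pstack: stacks of all threads
    gdb -p "$PID" -batch -ex 'thread apply all bt' > /tmp/pstack.$PID.txt

    # Dump the DB lock tables without acquiring the region mutexes
    db_stat -N -C A -h /var/lib/dirsrv/slapd-INST/db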
Did you run an admin task (import/export/index...) before it occurred?
What version are you running ?
best regards
Thierry
On 9/8/23 09:28, Julian Kippels wrote:
Hi,
it happened again, and this time I ran the gdb command Mark
suggested. The stacktrace is attached. Again I got this error
message:
[07/Sep/2023:15:22:43.410333038 +0200] - ERR - ldbm_back_seq -
deadlock retry BAD 1601, err=0 Unexpected dbimpl error code
and the remote program making the calls also stopped working at that
time.
Thanks
Julian Kippels
On 28.08.23 at 14:28, Thierry Bordaz wrote:
Hi Julian,
I agree with Mark's suggestion. If new connections are failing, a
pstack plus the logged error messages would be helpful.
Regarding the logged error: the LDAP server relies on a database
that, under pressure from multiple threads, may end up in a db_lock
deadlock. In such a situation the DB selects one deadlocking thread
and returns a DB_DEADLOCK error to that thread, while the other
threads continue to proceed. This is a normal error that is caught by
the server, which simply retries the DB access. If the same thread
fails too many times, it stops retrying and returns a fatal error to
the request.
In your case it reports code 1601, which is a transient deadlock with
retry, so the impacted request just retried and likely succeeded.
best regards
thierry
On 8/24/23 14:46, Mark Reynolds wrote:
Hi Julian,
It would be helpful to get a pstack/stacktrace so we can see
where DS is stuck:
https://www.port389.org/docs/389ds/FAQ/faq.html#sts=Debugging%C2%A0Hangs
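The recipe there boils down to a gdb batch run against the live
process, roughly:

    gdb -ex 'set confirm off' -ex 'set pagination off' \
        -ex 'thread apply all bt full' -ex 'quit' \
        /usr/sbin/ns-slapd "$(pidof ns-slapd)" \
        > stacktrace.$(date +%s).txt 2>&1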
Thanks,
Mark
On 8/24/23 4:13 AM, Julian Kippels wrote:
Hi,
I am using 389-ds version 2.3.1 and have encountered the same error
twice in three days now. There are some MOD operations, and then I
get a line like this in the errors log:
[23/Aug/2023:13:27:17.971884067 +0200] - ERR - ldbm_back_seq -
deadlock retry BAD 1601, err=0 Unexpected dbimpl error code
After this the server keeps running and systemctl status says
everything is fine, but new incoming connections fail with timeouts.
Any advice would be welcome.
Thanks in advance
Julian Kippels
_______________________________________________
389-users mailing list -- 389-users@xxxxxxxxxxxxxxxxxxxxxxx
To unsubscribe send an email to 389-users-leave@xxxxxxxxxxxxxxxxxxxxxxx
Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
List Archives: https://lists.fedoraproject.org/archives/list/389-users@xxxxxxxxxxxxxxxxxxxxxxx
Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue