> Are there any routers, middleware, firewalls, IdPs, etc. between the client and the LDAP server? A load balancer?

When this first started happening, the client (a cluster of containers) just spoke to the LDAP server directly over a peering connection. Since the error was "unable to connect to LDAP", I thought perhaps the single LDAP server could not handle it, so I added a load balancer (an AWS NLB) and a second LDAP server. It didn't help. Since this was happening before the load balancer, I don't think it's that. There is an ALB in front of the cluster.
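One way to narrow this down is to open a batch of plain TCP connections both directly to one LDAP server and through the NLB, and compare failure counts. A minimal sketch follows; the hostnames are hypothetical placeholders, not the real endpoints. If only the load-balanced path fails, the NLB or peering layer is suspect; if both fail equally, the problem is closer to the server or the network path itself.

    import socket
    from concurrent.futures import ThreadPoolExecutor

    # Hypothetical endpoints: one LDAP server reached directly, and the NLB.
    ENDPOINTS = {
        "direct": ("ldap1.internal.example.com", 389),
        "via-nlb": ("ldap-nlb.example.com", 389),
    }

    def try_connect(addr, timeout=5.0):
        # Plain TCP connect only; the client error happens before LDAP even starts.
        try:
            with socket.create_connection(addr, timeout=timeout):
                return True
        except OSError:
            return False

    def hammer(name, addr, attempts=500, workers=50):
        # Open many connections in parallel, roughly like the ramp-up burst.
        with ThreadPoolExecutor(max_workers=workers) as pool:
            results = list(pool.map(lambda _: try_connect(addr), range(attempts)))
        print(f"{name}: {results.count(False)} failures out of {attempts}")

    for name, addr in ENDPOINTS.items():
        hammer(name, addr)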
-Gary
On 10/15/24 17:26, William Brown wrote:
> These errors are only shown on the client, yes? Is there any evidence of a failed connection in the access log?

Correct, those two different "unable to connect to LDAP" errors appear only on the client. I have searched the logs for various things, but I haven't read them line by line. I don't see "err=1", any fd errors, or "Not listening for new connections - too many fds open".
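As a rough cross-check, a short script can count how many connections 389-ds actually accepted and closed during the test window, to compare against the client-side attempt count. This sketch assumes the default access-log format ("connection from" on accept, "op=-1 fd=N closed" on close); the instance path is hypothetical.

    import re

    ACCESS_LOG = "/var/log/dirsrv/slapd-example/access"  # hypothetical instance name

    opened = closed = 0
    with open(ACCESS_LOG) as f:
        for line in f:
            if "connection from" in line:
                opened += 1
            elif re.search(r"op=-1 fd=\d+ closed", line):
                closed += 1
    print(f"accepted={opened} closed={closed}")

If "accepted" matches the client's attempted connection count, the failures are happening after accept; if it is short, they never reached the server.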
So, that means the error is happening *before* 389-ds gets a chance to accept on the connection.
> We still don't know what the cause *is*, so just tweaking values won't help. We need to know what layer is triggering the error before we make changes.
>
> We encountered a similar issue recently with another load test, where the load tester wasn't averaging its connections; it would launch 10,000 connections at once and hope they all worked. With your load test, is it actually spreading its connections out, or is it bursting?

It's a ramp-up of 500 users logging in and starting their searches. The initial ramp-up is 60 seconds, but the searches and login/logouts run over 6 minutes (a minimal pacing sketch follows the numbers below). I sliced up the logs to see what that first minute was like:

Peak Concurrent Connections: 689
Total Operations: 18770
Total Results: 18769
Overall Performance: 100.0%
Total Connections: 2603 (21.66/sec) (1299.40/min)
- LDAP Connections: 2603 (21.66/sec) (1299.40/min)
- LDAPI Connections: 0 (0.00/sec) (0.00/min)
- LDAPS Connections: 0 (0.00/sec) (0.00/min)
- StartTLS Extended Ops: 2571 (21.39/sec) (1283.42/min)
Searches: 13596 (113.12/sec) (6787.01/min)
Modifications: 0 (0.00/sec) (0.00/min)
Adds: 0 (0.00/sec) (0.00/min)
Deletes: 0 (0.00/sec) (0.00/min)
Mod RDNs: 0 (0.00/sec) (0.00/min)
Compares: 0 (0.00/sec) (0.00/min)
Binds: 2603 (21.66/sec) (1299.40/min)
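For reference, the bursting-versus-spreading distinction William describes, as a minimal sketch with a hypothetical endpoint: a ramped client paces its connection opens over the ramp window, while a bursting client would open them all at once and hope they all succeed.

    import socket
    import time

    def ramped_connections(addr, total=500, ramp_seconds=60):
        # 500 users over 60 seconds = one new connection every 120 ms.
        interval = ramp_seconds / total
        socks = []
        for _ in range(total):
            socks.append(socket.create_connection(addr, timeout=5.0))
            time.sleep(interval)  # pace the opens instead of bursting them
        return socks

    conns = ramped_connections(("ldap-nlb.example.com", 389))  # hypothetical host
    print(f"opened {len(conns)} connections over the ramp window")
    for s in conns:
        s.close()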
With the settings below in place, the test results are in: they still get one LDAP error per test.
net.ipv4.tcp_max_syn_backlog = 8192
net.core.somaxconn = 8192
Suggestions? Should I bump these up more?
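Before bumping them further, it might be worth checking whether the kernel on the LDAP hosts is actually overflowing the listen backlog. A minimal Linux-only sketch that reads the TcpExt counters from /proc/net/netstat; if ListenOverflows/ListenDrops aren't climbing during a test run, larger backlogs won't change anything.

    def listen_drop_counters(path="/proc/net/netstat"):
        with open(path) as f:
            lines = f.read().splitlines()
        counters = {}
        # /proc/net/netstat comes in header/value line pairs per protocol group
        for header, values in zip(lines[::2], lines[1::2]):
            if header.startswith("TcpExt:"):
                names = header.split()[1:]
                nums = [int(v) for v in values.split()[1:]]
                counters.update(zip(names, nums))
        return {k: counters.get(k, 0) for k in ("ListenOverflows", "ListenDrops")}

    print(listen_drop_counters())

Running it before and after a test and diffing the two readings shows whether the backlog was ever the bottleneck.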
Reading these numbers, this doesn't look like the server should be under any stress at all. I have tested with 2 CPU / 4 GB RAM and can easily get 10,000 simultaneous connections launched and accepted by 389-ds.
My thinking at this point is that there is something in between the client and 389 that is not coping.
--
Sincerely,
William Brown
Senior Software Engineer,
Identity and Access Management
SUSE Labs, Australia