> Are there any routers, middlewares, firewalls, IdPs, etc. between the
> client/LDAP server? Load balancer?
When this first started happening, the client (a cluster of containers)
just spoke to the LDAP server directly over a peering connection. Since
the error was "unable to connect to LDAP", I thought perhaps the single
LDAP server could not handle the load, so I added a load balancer (an
AWS NLB) and a second LDAP server. It didn't help. Since this was
happening before the load balancer, I don't think it's that. There is
an ALB in front of the cluster.
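
To help narrow down which layer is refusing connections, I can run
something like the sketch below from one of the client containers. It
opens a batch of plain TCP connections to port 389, first directly to
one 389-ds host and then through the NLB, and counts failures (the
hostnames are placeholders for our internal names):

#!/usr/bin/env python3
"""Count TCP connection failures to the LDAP port, direct vs. via the NLB."""
import socket

# Placeholder hostnames -- substitute the real internal names.
DIRECT = ("ldap1.internal.example", 389)      # one 389-ds host, bypassing the NLB
VIA_NLB = ("ldap-nlb.internal.example", 389)  # the NLB listener in front of both hosts

def count_failures(target, attempts=200, timeout=3.0):
    """Open and immediately close `attempts` TCP connections; return how many fail."""
    failures = 0
    for _ in range(attempts):
        try:
            sock = socket.create_connection(target, timeout=timeout)
            sock.close()
        except OSError:
            failures += 1
    return failures

for label, target in (("direct", DIRECT), ("via NLB", VIA_NLB)):
    print(f"{label}: {count_failures(target)} failures out of 200")

If the direct path never fails but the NLB path does, the problem is in
front of 389-ds rather than in it.
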
-Gary
> On 10/15/24 17:26, William Brown wrote:
>>>
>>>> These errors are only shown on the client, yes? Is there any
>>>> evidence of a failed connection in the access log?
>>> Correct, those are the two different "contacting LDAP" errors. I
>>> have searched for various things in the logs, but I haven't read
>>> them line by line. I don't see "err=1", any fd errors, or "Not
>>> listening for new connections - too many fds open".
>>
>> So, that means the error is happening *before* 389-ds gets a chance
>> to accept on the connection.
>>
>>>> We encountered a similar issue recently with another load test,
>>>> where the load tester wasn't averaging its connections; it would
>>>> launch 10,000 connections at once and hope they all worked. With
>>>> your load test, is it actually spreading its connections out, or
>>>> is it bursting?
>>> It's a ramp-up of 500 users logging in and starting their searches;
>>> the initial ramp-up is 60 seconds, but the searches and
>>> logins/logouts run over 6 minutes. I just sliced up the logs to see
>>> what that first minute was like:
>>>
>>> Peak Concurrent Connections: 689
>>> Total Operations: 18770
>>> Total Results: 18769
>>> Overall Performance: 100.0%
>>>
>>> Total Connections: 2603 (21.66/sec) (1299.40/min)
>>> - LDAP Connections: 2603 (21.66/sec) (1299.40/min)
>>> - LDAPI Connections: 0 (0.00/sec) (0.00/min)
>>> - LDAPS Connections: 0 (0.00/sec) (0.00/min)
>>> - StartTLS Extended Ops: 2571 (21.39/sec) (1283.42/min)
>>>
>>> Searches: 13596 (113.12/sec) (6787.01/min)
>>> Modifications: 0 (0.00/sec) (0.00/min)
>>> Adds: 0 (0.00/sec) (0.00/min)
>>> Deletes: 0 (0.00/sec) (0.00/min)
>>> Mod RDNs: 0 (0.00/sec) (0.00/min)
>>> Compares: 0 (0.00/sec) (0.00/min)
>>> Binds: 2603 (21.66/sec) (1299.40/min)
>>>
>>> With the settings below in place, the test results are in: they
>>> still get one LDAP error per test.
>>>
>>> net.ipv4.tcp_max_syn_backlog = 8192
>>>
>>> net.core.somaxconn = 8192
>>>
>>> Suggestions? Should I bump these up more?
>>>
>> We still don't know what the cause *is* so just tweaking values won't
>> help. We need to know what layer is triggering the error before we
>> make changes.
>>
>> Reading these numbers, this doesn't look like the server should be
>> under any stress at all - I have tested with 2 CPU / 4 GB RAM and can
>> easily get 10,000 simultaneous connections launched and accepted by
>> 389-ds.
>>
>> My thinking at this point is there is something in between the client
>> and 389 that is not coping.
>>
>>
>>
>> --
>> Sincerely,
>>
>> William Brown
>>
>> Senior Software Engineer,
>> Identity and Access Management
>> SUSE Labs, Australia
>>
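
P.S. To check whether the kernel listen queue on the LDAP hosts is
actually overflowing (which is what net.core.somaxconn and
net.ipv4.tcp_max_syn_backlog are meant to address), I can watch the
ListenOverflows / ListenDrops counters before and after a test run.
A minimal sketch that reads the standard /proc/net/netstat counters:

#!/usr/bin/env python3
"""Print the kernel's listen-queue drop counters from /proc/net/netstat."""

def listen_drop_counters(path="/proc/net/netstat"):
    """Return the TcpExt ListenOverflows/ListenDrops counters as a dict."""
    with open(path) as f:
        lines = f.readlines()
    counters = {}
    # /proc/net/netstat pairs a line of field names with a line of values.
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        _, vals = values.split(":", 1)
        fields = dict(zip(names.split(), (int(v) for v in vals.split())))
        if proto == "TcpExt":
            counters["ListenOverflows"] = fields.get("ListenOverflows", 0)
            counters["ListenDrops"] = fields.get("ListenDrops", 0)
    return counters

if __name__ == "__main__":
    print(listen_drop_counters())

If those counters stay at zero across a failing run, the backlog
settings aren't the bottleneck and the drop is happening somewhere
else (the NLB, security groups, or the client side).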