Hi Gary,
I meant that the access logs covered 5 hours. It would be helpful to capture, or focus on, the logs from the few minutes before and after the time the problem occurred, then check those limited logs for a pattern or unexpected events (a long operation, no operations, an abandon, ...)
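For example, logconv.pl can be pointed at just such a window with its -S/-E options (a minimal sketch; the instance name and timestamps are placeholders for your own):

    # Summarize only the minutes around a failure seen at ~09:56
    logconv.pl -S "[08/Oct/2024:09:55:00]" -E "[08/Oct/2024:09:58:00]" \
        /var/log/dirsrv/slapd-example/access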
best regards thierry
On 10/7/24 7:37 PM, Gary Waters wrote:
Hi Thierry,
Ok, I'll decrease the timeout to 15 seconds then.
Reducing the size of the logs will help.
Which log, and how do I do this?
Thanks Marc and Thierry!
-Gary
On 10/7/24 00:26, Thierry Bordaz wrote:
Hi,
Those slapd_poll errors mean that the server was unable to send a PDU back to the client. This can occur if the client sends a request but does not read the results fast enough. The timeout is already high at 30 s (30000 ms); could the problem be on the client (app) side?
I suggest that you focus on the timestamp at which the application reports a failure, then look in the access/error logs from 1-3 minutes before and after that time. A logconv report over that limited scope will be more helpful than a global one.
The pattern looks to be: an app opens a connection, switches to a secure connection (StartTLS), issues 6-8 SRCH operations, then closes. etime/wtime/optime look fine, but since they are averages (over 1M ops) they are not helpful; reducing the size of the logs will help. I found the abandoned operation interesting, as it is possibly related to a performance issue.
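To chase that abandon, one sketch (the instance path and connection number are placeholders) is to find the ABANDON line in the access log and then pull the full history of that connection:

    # Locate the abandoned operation
    grep -n 'ABANDON' /var/log/dirsrv/slapd-example/access*
    # Then dump everything for that connection, e.g. conn=12345 (hypothetical)
    grep 'conn=12345 ' /var/log/dirsrv/slapd-example/access*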
best regards thierry
On 10/4/24 11:54 PM, Gary Waters via 389-users wrote:
Hi Marc,
I have set nsslapd-listen-backlog-size to 512.
For the ioblocktimeout, I increased it because of errors I was seeing:
[30/Sep/2024:16:26:55.987681019 -0700] - ERR - slapd_poll - (743) - Timed out
[30/Sep/2024:16:34:49.646922635 -0700] - ERR - slapd_poll - (568) - Timed out
Searching online suggested that I should increase the ioblocktimeout, so I bumped it up from 20000 to 30000.
Since then, those slapd_poll timed-out errors have not occurred. Should I have changed something else?
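If the timeout is later lowered again (for example to the 15 seconds, i.e. 15000 ms, discussed above), a sketch with dsconf, assuming an instance named slapd-example:

    # Show the current value, then set a 15000 ms I/O block timeout
    dsconf slapd-example config get nsslapd-ioblocktimeout
    dsconf slapd-example config replace nsslapd-ioblocktimeout=15000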
What should I increase these to?
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 4096
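For reference, a quick sketch of checking these and persisting a change (the sysctl.d file name is arbitrary):

    # Check the current values
    sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog
    # Persist across reboots
    printf 'net.core.somaxconn = 4096\nnet.ipv4.tcp_max_syn_backlog = 4096\n' \
        > /etc/sysctl.d/90-ldap-tuning.conf
    sysctl -p /etc/sysctl.d/90-ldap-tuning.conf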
Thanks so much for your help!
-Gary
On 10/4/24 11:55, Marc Sauton wrote:
Tune up nsslapd-listen-backlog-size, and verify that net.core.somaxconn and net.ipv4.tcp_max_syn_backlog are high enough (sysctl -a). Possibly tune down the nsslapd-ioblocktimeout value. Thanks, M.
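A sketch of the server-side piece, assuming an instance named slapd-example (the listen backlog is applied when the listening socket is set up, so a restart may be needed for it to take effect):

    dsconf slapd-example config replace nsslapd-listen-backlog-size=512
    systemctl restart dirsrv@slapd-example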
On Fri, Oct 4, 2024 at 11:06 AM gwaters-web--- via 389-users 389-users@lists.fedoraproject.org wrote:
Hello,

We are experiencing a new issue since we upgraded 389-ds-base from 1.4~ish to 2.0.15 on RHEL 8. I couldn't figure out how to fix it, so I switched to RHEL 9 and am now on 2.4.5-9. The issue occurs during a performance load test of a web application. The app logs into a website, does some things that search against LDAP, and does some transactions. This app has been performing fine for years. The app has changed, so it could be something there, but I am not sure about that because of the percentage of the traffic that is successful.

The errors for the web app are "Can't contact LDAP server" and sometimes "Can't contact LDAP server. Start TLS request accepted. Server willing to negotiate SSL. (0xFFFF [-1])". Out of the 128k connections below, these errors happen maybe 5 or 6 times, so it's wildly inconsistent and random.

I did a logconv analysis of about 6 hours of a day of testing; see below. One thing that really stood out to me was the peak concurrent connections = 22. That peak is so low, I don't know how these errors are happening. I don't see any errors in the access log (grepping for err=1). I looked for cache warnings/errors in the access/error logs, but didn't find any. I don't see things like unavailable connections in the access logs.

Suggestions on what to change or look for in the logs?

Thanks,
Gary

Information:
Machine size: 16G of RAM, 4-core AMD (it's an EC2 m5.large, gp3 disk type)
Kernel: Linux 5.14.0-427.35.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC
Packages: 389-ds-base-libs-2.4.5-9.el9_4.x86_64, 389-ds-base-2.4.5-9.el9_4.x86_64
Single instance of dirsrv running.

dirsrv modifications from default:
nsslapd-logging-backend: dirsrv-log,syslog
nsslapd-maxdescriptors: 8192
nsslapd-listen-backlog-size: 256
nsslapd-allow-hashed-passwords: on
nsslapd-idletimeout: 30
nsslapd-ioblocktimeout: 30000
nsslapd-sizelimit: -1
nsslapd-auditlog-logging-enabled: off
nsslapd-lookthroughlimit: -1
dirsrv.systemd: LimitNOFILE=8192

Total Log Lines Analyzed: 2694287

---------- Access Log Output ------------

Start of Logs: 26/Sep/2024:10:07:32.089983378
End of Logs: 26/Sep/2024:15:54:29.895403688

Processed Log Time: 5 Hours, 46 Minutes, 57.805426688 Seconds

Restarts: 0
Secure Protocol Versions:
- TLS1.2 128-bit AES-GCM (123117 connections)

Peak Concurrent Connections: 22
Total Operations: 1097043
Total Results: 1097044
Overall Performance: 100.0%

Total Connections: 128646 (6.18/sec) (370.78/min)
- LDAP Connections: 128646 (6.18/sec) (370.78/min)
- LDAPI Connections: 0 (0.00/sec) (0.00/min)
- LDAPS Connections: 0 (0.00/sec) (0.00/min)
- StartTLS Extended Ops: 123116 (5.91/sec) (354.84/min)

Searches: 845279 (40.60/sec) (2436.22/min)
Modifications: 0 (0.00/sec) (0.00/min)
Adds: 0 (0.00/sec) (0.00/min)
Deletes: 0 (0.00/sec) (0.00/min)
Mod RDNs: 0 (0.00/sec) (0.00/min)
Compares: 0 (0.00/sec) (0.00/min)
Binds: 128647 (6.18/sec) (370.78/min)

Average wtime (wait time): 0.001560856
Average optime (op time): 0.003310453
Average etime (elapsed time): 0.004868040

Multi-factor Authentications: 0
Proxied Auth Operations: 0
Persistent Searches: 0
Internal Operations: 0
Entry Operations: 0
Extended Operations: 123116
Abandoned Requests: 1
Smart Referrals Received: 0

VLV Operations: 0
VLV Unindexed Searches: 0
VLV Unindexed Components: 0
SORT Operations: 0

Entire Search Base Queries: 0
Paged Searches: 0
Unindexed Searches: 0
Unindexed Components: 0
Invalid Attribute Filters: 0
FDs Taken: 128646
FDs Returned: 129318
Highest FD Taken: 968

Broken Pipes: 0
Connections Reset By Peer: 0
Resource Unavailable: 0
Max BER Size Exceeded: 0

Binds: 128647
Unbinds: 119206
-------------------------------------
- LDAP v2 Binds: 0
- LDAP v3 Binds: 128647
- AUTOBINDs(LDAPI): 0
- SSL Client Binds: 0
- Failed SSL Client Binds: 0
- SASL Binds: 0
- Dir ...
Hi Thierry and Marc,
Ah yes, of course. Here is 1 run of their web app load test; it is 6 minutes long, and it should mostly be only the test itself. I will start looking for the patterns and unexpected events you mentioned. We encountered 2 "Can't contact LDAP server" errors during this run (the run is below).
After the run I bumped these up from 4096:
net.ipv4.tcp_max_syn_backlog = 6144
net.core.somaxconn = 6144
Yet we still get the LDAP errors (this one and the StartTLS request error previously mentioned).
Should I bump up nsslapd-listen-backlog-size, net.ipv4.tcp_max_syn_backlog, and net.core.somaxconn more?
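Before raising them further, it may be worth confirming the queues are actually overflowing; a sketch, assuming plain LDAP on port 389 (counter names vary a little between distros):

    # Non-zero counters here point at real accept-queue pressure
    netstat -s | grep -i -E 'listen queue|SYNs to LISTEN'
    # Recv-Q vs Send-Q on the listener: Send-Q shows the configured backlog
    ss -lnt 'sport = :389'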
Thanks,
Gary
----------- Access Log Output ------------

Start of Logs: 08/Oct/2024:09:53:35.810833927
End of Logs: 08/Oct/2024:09:59:52.361830449

Processed Log Time: 0 Hours, 6 Minutes, 16.550998016 Seconds

Restarts: 1
Secure Protocol Versions:
- TLS1.2 128-bit AES-GCM (9833 connections)

Peak Concurrent Connections: 689
Total Operations: 86412
Total Results: 86412
Overall Performance: 100.0%

Total Connections: 9933 (26.38/sec) (1582.73/min)
- LDAP Connections: 9933 (26.38/sec) (1582.73/min)
- LDAPI Connections: 0 (0.00/sec) (0.00/min)
- LDAPS Connections: 0 (0.00/sec) (0.00/min)
- StartTLS Extended Ops: 9833 (26.11/sec) (1566.80/min)

Searches: 66647 (176.99/sec) (10619.60/min)
Modifications: 0 (0.00/sec) (0.00/min)
Adds: 0 (0.00/sec) (0.00/min)
Deletes: 0 (0.00/sec) (0.00/min)
Mod RDNs: 0 (0.00/sec) (0.00/min)
Compares: 0 (0.00/sec) (0.00/min)
Binds: 9932 (26.38/sec) (1582.57/min)

Average wtime (wait time): 0.001407368
Average optime (op time): 0.003186859
Average etime (elapsed time): 0.004591048

Multi-factor Authentications: 0
Proxied Auth Operations: 0
Persistent Searches: 0
Internal Operations: 0
Entry Operations: 0
Extended Operations: 9833
Abandoned Requests: 0
Smart Referrals Received: 0

VLV Operations: 0
VLV Unindexed Searches: 0
VLV Unindexed Components: 0
SORT Operations: 0

Entire Search Base Queries: 0
Paged Searches: 0
Unindexed Searches: 0
Unindexed Components: 0
Invalid Attribute Filters: 0
FDs Taken: 9933
FDs Returned: 9932
Highest FD Taken: 961

Broken Pipes: 0
Connections Reset By Peer: 0
Resource Unavailable: 0
Max BER Size Exceeded: 0

Binds: 9932
Unbinds: 9225
-----------------------------------
- LDAP v2 Binds: 0
- LDAP v3 Binds: 9932
- AUTOBINDs(LDAPI): 0
- SSL Client Binds: 0
- Failed SSL Client Binds: 0
- SASL Binds: 0
- Directory Manager Binds: 0
- Anonymous Binds: 99
Ah yes, of course. Here is 1 run of their web app load test; it is 6 minutes long, and it should mostly be only the test itself. We encountered 2 "Can't contact LDAP server" errors during this run.
These errors are only shown on the client, yes? Is there any evidence of a failed connection in the access log?
After the run I bumped these up from 4096:
net.ipv4.tcp_max_syn_backlog = 6144
net.core.somaxconn = 6144
Yet we still get the LDAP errors (this one and the StartTLS request error previously mentioned).
Should I bump up nsslapd-listen-backlog-size, net.ipv4.tcp_max_syn_backlog, and net.core.somaxconn more?
We encountered a similar issue recently with another load test, where the load tester wasn't averaging its connections; it would launch 10,000 connections at once and hope they all worked. With your load test, is it actually spreading its connections out, or is it bursting?
Hi William,
These errors are only shown on the client, yes? Is there any evidence of a failed connection in the access log?
Correct, those are the 2 different "can't contact LDAP server" error cases. I have searched for various things in the logs, but I haven't read them line by line. I don't see "err=1", any fd errors, or "Not listening for new connections - too many fds open".
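A slightly broader sweep than err=1 may be worth trying (paths are placeholders): any non-zero result code, plus coded disconnects:

    # Any non-zero LDAP result code, not just err=1
    grep -E ' err=[1-9]' /var/log/dirsrv/slapd-example/access* | head
    # Abnormal closures carry a code after "closed -" (e.g. B1, T1, T2)
    grep 'closed - ' /var/log/dirsrv/slapd-example/access* | head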
We encountered a similar issue recently with another load test, where the load tester wasn't averaging its connections; it would launch 10,000 connections at once and hope they all worked. With your load test, is it actually spreading its connections out, or is it bursting?
It's a ramp-up of 500 users logging in and starting their searches; the initial ramp-up is 60 seconds, but the searches and login/logouts run over 6 minutes. I just split up the logs to see what that first minute was like:
Peak Concurrent Connections: 689
Total Operations: 18770
Total Results: 18769
Overall Performance: 100.0%

Total Connections: 2603 (21.66/sec) (1299.40/min)
- LDAP Connections: 2603 (21.66/sec) (1299.40/min)
- LDAPI Connections: 0 (0.00/sec) (0.00/min)
- LDAPS Connections: 0 (0.00/sec) (0.00/min)
- StartTLS Extended Ops: 2571 (21.39/sec) (1283.42/min)

Searches: 13596 (113.12/sec) (6787.01/min)
Modifications: 0 (0.00/sec) (0.00/min)
Adds: 0 (0.00/sec) (0.00/min)
Deletes: 0 (0.00/sec) (0.00/min)
Mod RDNs: 0 (0.00/sec) (0.00/min)
Compares: 0 (0.00/sec) (0.00/min)
Binds: 2603 (21.66/sec) (1299.40/min)
With the settings below in place, the test results are in: they still get 1 LDAP error per test.
net.ipv4.tcp_max_syn_backlog = 8192
net.core.somaxconn = 8192
Suggestions? Should I bump these up more?
Thanks,
Gary
These errors are only shown on the client, yes? Is there any evidence of a failed connection in the access log?
Correct, those are the 2 different "can't contact LDAP server" error cases. I have searched for various things in the logs, but I haven't read them line by line. I don't see "err=1", any fd errors, or "Not listening for new connections - too many fds open".
So, that means the error is happening *before* 389-ds gets a chance to accept on the connection.
Are there any routers, middleware, firewalls, IdPs, etc. between the client and the LDAP server? A load balancer?
We encountered a similar issue recently with another load test, where the load tester wasn't averaging its connections; it would launch 10,000 connections at once and hope they all worked. With your load test, is it actually spreading its connections out, or is it bursting?
It's a ramp-up of 500 users logging in and starting their searches; the initial ramp-up is 60 seconds, but the searches and login/logouts run over 6 minutes. I just split up the logs to see what that first minute was like:

Peak Concurrent Connections: 689
Total Operations: 18770
Total Results: 18769
Overall Performance: 100.0%
Total Connections: 2603 (21.66/sec) (1299.40/min)
- LDAP Connections: 2603 (21.66/sec) (1299.40/min)
- LDAPI Connections: 0 (0.00/sec) (0.00/min)
- LDAPS Connections: 0 (0.00/sec) (0.00/min)
- StartTLS Extended Ops: 2571 (21.39/sec) (1283.42/min)
Searches: 13596 (113.12/sec) (6787.01/min)
Modifications: 0 (0.00/sec) (0.00/min)
Adds: 0 (0.00/sec) (0.00/min)
Deletes: 0 (0.00/sec) (0.00/min)
Mod RDNs: 0 (0.00/sec) (0.00/min)
Compares: 0 (0.00/sec) (0.00/min)
Binds: 2603 (21.66/sec) (1299.40/min)
With the settings below in place, the test results are in: they still get 1 LDAP error per test.
net.ipv4.tcp_max_syn_backlog = 8192
net.core.somaxconn = 8192
Suggestions? Should I bump these up more?
We still don't know what the cause *is*, so just tweaking values won't help. We need to know what layer is triggering the error before we make changes.
Reading these numbers, this doesn't look like the server should be under any stress at all - I have tested with 2 CPU / 4 GB RAM and can easily get 10,000 simultaneous connections launched and accepted by 389-ds.
My thinking at this point is there is something in between the client and 389 that is not coping.
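One way to test that theory (a sketch; the host and count are placeholders) is to drive the server directly from a machine with nothing in between, mimicking the app's connect/StartTLS/search/close pattern, and see whether the error ever reproduces:

    # 500 sequential anonymous StartTLS connections straight at the server
    for i in $(seq 1 500); do
        ldapsearch -ZZ -H ldap://ldap.example.com -x -b '' -s base '(objectClass=*)' vendorVersion \
            > /dev/null || echo "connect failed on attempt $i"
    done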
On 10/16/24 2:26 AM, William Brown via 389-users wrote:
With the settings below in place, the test results are in: they still get 1 LDAP error per test.
Any chance that you can get a tcpdump over the 6 minutes and try to find the SYNs without ACKs around the time of the failure?
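A sketch of that capture and follow-up analysis (interface, port, and paths are placeholders; assumes plain LDAP on 389 and that tshark is available):

    # On the LDAP server, capture the whole run
    tcpdump -i any -s 0 -w /tmp/ldap-run.pcap 'tcp port 389'
    # Afterwards: initial SYNs that were retransmitted never got a SYN-ACK
    tshark -r /tmp/ldap-run.pcap \
        -Y 'tcp.flags.syn == 1 && tcp.flags.ack == 0 && tcp.analysis.retransmission'

The same display filter works in the Wireshark GUI, where the default "Bad TCP" coloring rule also highlights these packets.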
Yes, I can during the next round of testing. I'll see if I can spot anything obvious in Wireshark. I'll look for mis-colored connections, right? (I have not looked for missing SYN-ACKs before and wanted to check.)
-Gary