This gave us enough insight to track down the culprit.

BTW, it seems that the RHEL7 sssd_amer.corp.com.log does not give enough detail to tell us that the request came from the sssd.nss service, but the RHEL8 and RHEL9 versions of sssd do give us this detail.

What we find, consistently, is that the client is sudo.  Example:

(2023-10-04  9:00:01): [nss] [get_client_cred] (0x4000): Client [0x559189778c70][23] creds: euid[0] egid[0] pid[3587183] cmd_line['sudo'].
(2023-10-04  9:00:01): [nss] [setup_client_idle_timer] (0x4000): Idle timer re-set for client [0x559189778c70][23]
(2023-10-04  9:00:01): [nss] [accept_fd_handler] (0x0400): [CID#1754] Client [cmd sudo][uid 0][0x559189778c70][23] connected!
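
To tally which commands connect to the NSS responder across a whole log, rather than eyeballing individual entries, something like the following could work.  This is only a sketch: the function name tally_nss_clients is made up for illustration, and it assumes the get_client_cred line format shown above (as logged by the RHEL8/RHEL9 sssd at a high debug level).

```shell
# Tally NSS responder clients per minute from sssd_nss.log-style input.
# Reads log lines on stdin; prints "count date HH:MM command".
# tally_nss_clients is a hypothetical helper name, not an sssd tool.
tally_nss_clients() {
    # Keep only get_client_cred lines; extract date, HH:MM and cmd_line.
    sed -n "s/^(\([0-9-]*\) *\([0-9]*:[0-9]*\):[0-9]*).*\[get_client_cred].*cmd_line\['\([^']*\)'].*/\1 \2 \3/p" |
        sort | uniq -c |
        awk '{ $1 = $1; print }'   # strip the padding uniq -c adds
}

# Typical use (log path is the usual default, adjust as needed):
#   tally_nss_clients < /var/log/sssd/sssd_nss.log
```

A command that shows a large count at :00 and :30 and little in between is a good candidate for the periodic caller.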

Suspecting a cron job somewhere, we were able to find this:

# cat /etc/cron.d/ma_healthcheck
0,30 * * * * root /opt/McAfee/agent/scripts/ma checkhealth >/dev/null 2>/dev/null

This is a cron job that comes as part of the McAfee agent (now called Trellix agent).  Sure enough, inside that script they are doing this:

 sudo -u mfe /bin/sh -c "LD_LIBRARY_PATH=/opt/McAfee/agent/lib/lib64:/opt/McAfee/agent/lib/lib64/tools:/opt/McAfee/agent/lib/lib64/rsdk:/opt/McAfee/agent/lib:/opt/McAfee/agent/lib/tools:/opt/McAfee/agent/lib/rsdk:$LD_LIBRARY_PATH ${PROGROOT}/bin/macmnsvc status" > /dev/null 2>&1

Here 'mfe' is a local account that is created as part of the Trellix agent install.

Apparently, what sudo does is enumerate every group it finds in /etc/sudoers.d/* to see if that particular user is a member of that group and so has the associated sudo privilege.  On a particular server we might have 7 - 10 privileged groups: for monitoring support, backup support, DBAs, etc.

For example, /etc/sudoers.d/puppet_client file might look like:

# Allows puppet users to do a puppet agent run.
Cmnd_Alias      PUPPET_APP_TEAMS =      /usr/local/bin/puppet agent -t,        \
                                        /usr/local/bin/puppet agent -t --noop

%pptuserpac                          ALL=NOPASSWD: PUPPET_APP_TEAMS

Apparently, sudo enumerates this pptuserpac AD group to see if mfe is a member, to see if that account has sudo privs to perform this operation.  (Of course, local account mfe is not a member of any AD group -- but sudo doesn't know that.)
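
To see which groups a given server would ask sssd about in this situation, one could list the %group entries in the sudoers drop-in files.  The following is a rough sketch under stated assumptions: sudoers_groups is a made-up helper name, and it only handles plain "%group" entries at the start of a line (no User_Alias expansion, no %#gid or %:group forms, no group names containing escaped spaces).

```shell
# Print the unique group names that appear as "%group" user specs.
# Reads sudoers-style text on stdin; comments and aliases are ignored
# because they do not start with "%".
# sudoers_groups is a hypothetical helper name.
sudoers_groups() {
    sed -n 's/^%\([^[:space:]]*\).*/\1/p' | sort -u
}

# Typical use on a host:
#   cat /etc/sudoers.d/* | sudoers_groups
# Each name printed could then be checked by hand with:
#   getent group <name>
```

The number of names this prints gives a rough idea of how many group lookups a single sudo invocation may generate once the cached entries have expired.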

Because the sssd cache entries expire and because McAfee has set up all its installs to run on the hour and half-hour, every server refreshes the same groups at the same moment.  That's a thundering herd problem.  We could increase the sssd cache timeout, but that just delays the thundering herd.  Say we set the cache expiration to 2 1/2 hrs.  Then on the 3rd hr, every server would still refresh at once, so it'd still be a thundering herd.

We see in other places in this McAfee script that they run this command using 'su' instead of 'sudo'.  

su -s /bin/sh -c "LD_LIBRARY_PATH=...  ${PROGROOT}/bin/macmnsvc status" mfe

Running this command via 'su' instead of 'sudo' would not trigger this thundering herd.  (We have verified that.)  Alternatively, randomizing their healthcheck execution times would avoid this thundering herd problem.
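
As an illustration of the randomizing idea, here is a minimal sketch of a per-host offset.  It is not how the McAfee/Trellix script works; host_jitter is a made-up name, and the 1800-second window is simply the half-hour interval from the cron entry above.  It hashes the hostname with cksum so that each host gets a stable delay, and different hosts are spread across the window.

```shell
# Derive a stable per-host delay in [0, window) seconds from a hostname,
# so that hosts spread their runs across the window instead of all
# firing at :00 and :30.  host_jitter is a hypothetical helper name.
host_jitter() {
    host=$1
    window=$2
    # cksum prints "<crc> <byte count>"; keep only the CRC.
    sum=$(printf '%s' "$host" | cksum | cut -d' ' -f1)
    echo $(( sum % window ))
}

# Hypothetical use from a wrapper script called by cron:
#   sleep "$(host_jitter "$(hostname)" 1800)"
#   /opt/McAfee/agent/scripts/ma checkhealth >/dev/null 2>/dev/null
```

A hash of the hostname is used rather than a fresh random number so that a given host always runs at the same offset, which keeps its behaviour predictable when reading logs.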

Anyway, it's McAfee's problem to fix now.  We'll report it and I'm sure they'll figure out a solution.

Spike White






On Wed, Oct 4, 2023 at 4:45 AM Alexey Tikhonov <atikhono@redhat.com> wrote:


On Wed, Oct 4, 2023 at 11:40 AM Alexey Tikhonov <atikhono@redhat.com> wrote:


On Tue, Oct 3, 2023 at 11:22 PM Spike White <spikewhitetx@gmail.com> wrote:
Alexey,

Yes, I see that now: every time it starts a new LDAP connection, it begins by querying the rootDSE.  So I have to look further in the logs.

I think I have discerned a pattern.  It appears that on each hour and half-hour, it's querying the members of the simple_allow_groups line.

A cron job run under a corresponding user?

In the domain log, the next line after "Got request for" should look like
```
DP Request [Account #2]: REQ_TRACE: New request. [sssd.nss CID #1]
```
  --  this tells you which service ('sssd_nss' in this case) requested this lookup and what the request ID is in its logs ('#1' in this case).

Next you can grep sssd_nss.log to figure out which app triggered this:
```
(2023-04-14 14:02:08): [nss] [get_client_cred] (0x4000): Client [0x55eb3ac27760][27] creds: euid[0] egid[0] pid[181089] cmd_line['id'].
(2023-04-14 14:02:08): [nss] [setup_client_idle_timer] (0x4000): Idle timer re-set for client [0x55eb3ac27760][27]
(2023-04-14 14:02:08): [nss] [accept_fd_handler] (0x0400): [CID#1] Client [cmd id][uid 0][0x55eb3ac27760][27] connected!
```
  --  'id' in this case.

You could also try `sssctl analyze request list` and `sssctl analyze --logdir . request show ...`

 

 
I have examined this on 5 different servers in different geographical locations; it holds true for each server.

For example, in the /var/log/sssd directory:

# grep -A 4 'sbus_dispatch.*Dispatching.' sssd_amer.corp.com.log | grep 'name=' | grep BE_REQ_GROUP 

Here's the output.  Each ellipsis stands for 10 - 20 omitted lines that occur in the same second.

(2023-10-03 10:07:50): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=oracle@amer.corp.com]
(2023-10-03 10:07:50): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=oracle@emea.corp.com]
...
(2023-10-03 10:30:02): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=zabbix-support@amer.corp.com]
(2023-10-03 10:30:03): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=emeastorageadmins@amer.corp.com]
(2023-10-03 10:30:03): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=apacstorageadmins@amer.corp.com]
(2023-10-03 10:30:03): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=apacstorageadmins@emea.corp.com]
(2023-10-03 10:30:03): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=amerlnxsvcdelauttfs@amer.corp.com]
(2023-10-03 10:30:03): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=amerlinuxengtfssupt@amer.corp.com]
(2023-10-03 10:30:03): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=amerlinuxeng@amer.corp.com]
...
(2023-10-03 10:30:03): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=oracle@apac.corp.com]
(2023-10-03 10:30:03): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=oracle@japn.corp.com]
(2023-10-03 10:30:03): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=oracle@corp.com]
(2023-10-03 11:00:01): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=pptadminspac@amer.corp.com]
(2023-10-03 11:00:01): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=unv_svcgrid_login@amer.corp.com]
(2023-10-03 11:00:01): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=gbl_storage_admins@amer.corp.com]
...
(2023-10-03 11:30:02): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=zabbix-support@amer.corp.com]
(2023-10-03 11:30:02): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=emeastorageadmins@amer.corp.com]
(2023-10-03 11:30:06): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=apacstorageadmins@amer.corp.com]
(2023-10-03 11:30:06): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=apacstorageadmins@emea.corp.com]
...
(2023-10-03 12:00:01): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=pptadminspac@amer.corp.com]
(2023-10-03 12:00:01): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=unv_svcgrid_login@amer.corp.com]
(2023-10-03 12:00:01): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=gbl_storage_admins@amer.corp.com]
(2023-10-03 12:00:01): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=emeastorageadmins@amer.corp.com]
(2023-10-03 12:00:06): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=emeastorageadmins@emea.corp.com]
...
(2023-10-03 12:30:02): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=zabbix-support@amer.corp.com]
(2023-10-03 12:30:02): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=emeastorageadmins@amer.corp.com]
(2023-10-03 12:30:04): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=apacstorageadmins@amer.corp.com]
...
(2023-10-03 12:41:16): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=emeastorageadmins@amer.corp.com]
(2023-10-03 12:41:16): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=apacstorageadmins@amer.corp.com]
(2023-10-03 12:41:16): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=apacstorageadmins@emea.corp.com]
...
(2023-10-03 12:49:01): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=oracle@amer.corp.com]
(2023-10-03 12:49:01): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=oracle@emea.corp.com]
(2023-10-03 12:49:01): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=oracle@apac.corp.com]
(2023-10-03 12:49:01): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=oracle@japn.corp.com]
(2023-10-03 12:49:01): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=oracle@corp.com]
...
(2023-10-03 13:00:01): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=pptadminspac@amer.corp.com]
(2023-10-03 13:00:01): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=unv_svcgrid_login@amer.corp.com]
(2023-10-03 13:00:01): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=gbl_storage_admins@amer.corp.com]
...
(2023-10-03 13:26:18): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=zabbix-support@amer.corp.com]
(2023-10-03 13:26:18): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=emeastorageadmins@amer.corp.com]
...
(2023-10-03 13:30:02): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=apaclinuxsup@amer.corp.com]
(2023-10-03 13:30:02): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=apaclinuxsup@emea.corp.com]
(2023-10-03 13:30:02): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=apaclinuxeng@amer.corp.com]
(2023-10-03 13:30:02): [be[amer.corp.com]] [dp_get_account_info_send] (0x0200): Got request for [0x2][BE_REQ_GROUP][name=apaclinuxeng@emea.corp.com]
...
and it continues on, each and every half-hour.

So it appears that something is waking up every half-hour and validating memberships in the simple_allow_groups.  I don't claim that's all that's being performed on this half-hour wake-up, but it is clear this is occurring.

Different servers have different simple_allow_groups settings;  it's always the groups for that specific server that are being queried.
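
To make the half-hour pattern easier to see than in the raw grep output, the requests can be counted per minute.  This is only a sketch: group_requests_per_minute is a made-up name, it assumes the "Got request for ... [BE_REQ_GROUP]" line format shown above, and it drops the date, so it is best run on a single day's log.

```shell
# Count BE_REQ_GROUP requests per minute from a domain log on stdin.
# Prints "HH:MM count", so bursts at :00 and :30 stand out.
# group_requests_per_minute is a hypothetical helper name.
group_requests_per_minute() {
    grep 'BE_REQ_GROUP' |
        sed -n 's/^([0-9-]* *\([0-9]*:[0-9]*\):[0-9]*).*/\1/p' |
        sort | uniq -c |
        awk '{ print $2, $1 }'
}

# Typical use:
#   group_requests_per_minute < sssd_amer.corp.com.log
```

Minutes with unusually high counts that recur at the same offset every hour point to a scheduled job rather than interactive logins.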

Spike


On Mon, Oct 2, 2023 at 12:23 PM Alexey Tikhonov <atikhono@redhat.com> wrote:
On Mon, Oct 2, 2023 at 7:01 PM Spike White <spikewhitetx@gmail.com> wrote:
>
> So the idea to turn on debug_level = 9 on the client and view the logs was inspired.  We turned on debug level 9 on 4 clients;
>
> 2 in the list (that we got from the AD team) of servers in that AMERAustin site hitting the non-AMER Austin AD DCs.
> 2 not in their list.  (1 in another AMER site).
>
> Consistently, we see them querying the rootDSE for all these domains on the hour and the half-hour.  Querying the local AMER rootDSE in Austin is not a problem;  they have beaucoup AD DCs in Austin.  Querying the other domains' rootDSEs in Austin is a problem;  they typically only have 2 AD DCs from each region.
>
> Here’s an example from the client logs.  First client:
>
>
>
> (2023-10-03  0:30:06): [be[amer.corp.com]] [sdap_ldap_connect_callback_add] (0x1000): New LDAP connection to [ldap://ausdc16corp05.corp.com:389/??base] with fd [30].
>
> (2023-10-03  0:30:06): [be[amer.corp.com]] [sdap_get_rootdse_send] (0x4000): Getting rootdse
>
> …
>
> (2023-10-03  0:41:18): [be[amer.corp.com]] [sdap_ldap_connect_callback_add] (0x1000): New LDAP connection to [ldap://AUSDC16ROAMER01.amer.corp.com:389/??base] with fd [20].
>
> (2023-10-03  0:41:18): [be[amer.corp.com]] [sdap_get_rootdse_send] (0x4000): Getting rootdse
>
>
>
> Another client:
>
>
>
> (2023-10-02 11:30:02): [be[amer.corp.com]] [sdap_ldap_connect_callback_add] (0x4000): [RID#48] New connection to [ldap://ausdc16amer33.amer.corp.com:3268/??base] with fd [25]
>
> (2023-10-02 11:30:02): [be[amer.corp.com]] [sdap_get_rootdse_send] (0x4000): [RID#48] Getting rootdse
>

What goes next for this 'RID#48', after 'rootdse' is read?

SSSD doesn't query 'rootdse' on its own. It is being read (as a first
operation) when a new connection is established.
You need to see what happens next to figure out *why* this new
connection is established.

As a practical side note, you can also increase
`ldap_connection_expire_timeout` to keep connections longer.
But I would figure out the reason first.


> --
>
> (2023-10-02 11:30:04): [be[amer.corp.com]] [sdap_ldap_connect_callback_add] (0x4000): [RID#49] New connection to [ldap://ausdc16emea05.emea.corp.com:389/??base] with fd [26]
>
> (2023-10-02 11:30:04): [be[amer.corp.com]] [sdap_get_rootdse_send] (0x4000): [RID#49] Getting rootdse
>
> --
>
> (2023-10-02 11:30:05): [be[amer.corp.com]] [sdap_ldap_connect_callback_add] (0x4000): [RID#50] New connection to [ldap://ausdc16apac06.apac.corp.com:389/??base] with fd [27]
>
> (2023-10-02 11:30:05): [be[amer.corp.com]] [sdap_get_rootdse_send] (0x4000): [RID#50] Getting rootdse
>
> --
>
> (2023-10-02 11:30:05): [be[amer.corp.com]] [sdap_ldap_connect_callback_add] (0x4000): [RID#51] New connection to [ldap://AUSDC16JAPN02.japn.corp.com:389/??base] with fd [28]
>
> (2023-10-02 11:30:05): [be[amer.corp.com]] [sdap_get_rootdse_send] (0x4000): [RID#51] Getting rootdse
>
> --
>
> (2023-10-02 11:32:52): [be[amer.corp.com]] [sdap_ldap_connect_callback_add] (0x4000): [RID#84] New connection to [ldap://ausdc16emea05.emea.corp.com:389/??base] with fd [26]
>
> (2023-10-02 11:32:52): [be[amer.corp.com]] [sdap_get_rootdse_send] (0x4000): [RID#84] Getting rootdse
>
> --
>
>
>
> BTW, this seems to occur on both RHEL7 and RHEL8.  (Haven't looked at our RHEL9 builds yet.)   It's occurring on all servers to all rootDSEs, but it's only a problem for AMERAustin, since Austin is such a heavily-populated site.
>
>
> These rootDSEs almost never change.  Is there any way to have it query less frequently, or to randomize when servers query these rootDSEs?
>
>
> Spike
>
> On Mon, Oct 2, 2023 at 2:37 AM Alexey Tikhonov <atikhono@redhat.com> wrote:
>>
>> Hi,
>>
>> On Mon, Oct 2, 2023 at 6:20 AM Spike White <spikewhitetx@gmail.com> wrote:
>>>
>>> All,
>>>
>>> Is there anything in sssd's settings on RHEL and RHEL-like Linux server OSes that performs LDAP binds or connections to AD every 30 minutes?
>>>
>>> What our AD team is seeing is all of the DCs in our biggest AMER AD site peak with LDAP sessions for about 10 minutes at the top of the hour, then again at the bottom of the hour.  No other AD site in the world appears to see this behavior, not even other AD sites in this metro area.
>>>
>>> The reason they noticed is that our non-AMER DCs in this biggest AD site hit their 5k LDAP client session limit during those 10 minutes every 30 minutes, meaning any clients attempting to establish an LDAP session past 5000 are dropped by the DC.  After digging through some LDAP log samples that they pulled from these DCs, they see thousands of LDAP binds by RHEL Linux servers against two specific non-AMER AD DCs in a short period of time.
>>
>>
>> Can they also say what operations are being performed by those connections?
>> Or can you check SSSD logs on the client side?
>>
>> I wonder if this could be `ldap_sudo_smart_refresh_interval`...
>>
>>
>>>
>>>
>>> In this major AD site, we have dozens and dozens of AMER AD DCs.  So there are enough preferred AD DCs to spread the load.  But typically for the non-AMER regions, the AD team puts 2 of each region's DCs in a site.  For instance, for APAC they would put two APAC DCs in this AMER major site.  Thus all AMER RHEL servers in this site would randomly hit dozens of AMER DCs, but concentrate on these two preferred APAC DCs.  (Preferred because they're in this location.)
>>>
>>> I know our older AD integration product used to hit AD every 30 mins to check GPOs, but we're not implementing GPOs with sssd.
>>>
>>> Spike
>>> _______________________________________________
>>> sssd-users mailing list -- sssd-users@lists.fedorahosted.org
>>> To unsubscribe send an email to sssd-users-leave@lists.fedorahosted.org
>>> Fedora Code of Conduct: https://docs.fedoraproject.org/en-US/project/code-of-conduct/
>>> List Guidelines: https://fedoraproject.org/wiki/Mailing_list_guidelines
>>> List Archives: https://lists.fedorahosted.org/archives/list/sssd-users@lists.fedorahosted.org
>>> Do not reply to spam, report it: https://pagure.io/fedora-infrastructure/new_issue
>>