Hi all,
Let me introduce the symptoms that triggered this investigation.
First, versions:
- OS: Ubuntu 20.04
- SSSD: 2.2.3-3ubuntu0.7
- Platform: x86_64
Some weeks after starting using a new region/provider we noticed some intermittent issue
impacting SSH logins (ie: clients got `Connection closed by UNKNOWN port 65535` on `ssh
...`). When looking into it, I was able to reproduce it somewhat reliably. It happens
after some time of inactivity for that user, but if I constantly issue commands (ie:
`watch sssctl user-checks <user>` or `watch ssh <host> whoami` from a client)
the issue wont trigger. If I stopped trying out those commands and tried again ~30 minutes
later, it would fail.
With these symptoms I had these possibilities in mind: some caching issue, some
authentication expiring, connections pools issues...
I enabled detailed logging and compared logs from a successful request and a failed one
and I found what seems to be the source of the error from sssd logs perspective, but still
can't make much sense of it.
Log extracted from sssd_<domain>.log:
```
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [ipa_hbac_rule_info_next] (0x0400):
Sending request for next search base: [cn=hbac,dc=ipa,dc=corp,<dc
<domain>>][2][(&(objectclass=ipaHBACRule)(ipaenabledflag=TRUE)(accessRuleType=allow)(|(hostCategory=all)(...)))]
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_print_server] (0x2000): Searching
<ip10.1.10.102:389
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_get_generic_ext_step] (0x0400):
calling ldap_search_ext with
[(&(objectclass=ipaHBACRule)(ipaenabledflag=TRUE)(accessRuleType=allow)(|(hostCategory=all)(...)][cn=hbac,dc=ipa,dc=corp,<dc
<domain>>].
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_get_generic_ext_step] (0x1000):
Requesting attrs: [objectclass]
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_get_generic_ext_step] (0x1000):
Requesting attrs: [cn]
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_get_generic_ext_step] (0x1000):
Requesting attrs: [ipauniqueid]
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_get_generic_ext_step] (0x1000):
Requesting attrs: [ipaenabledflag]
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_get_generic_ext_step] (0x1000):
Requesting attrs: [accessRuleType]
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_get_generic_ext_step] (0x1000):
Requesting attrs: [memberUser]
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_get_generic_ext_step] (0x1000):
Requesting attrs: [userCategory]
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_get_generic_ext_step] (0x1000):
Requesting attrs: [memberService]
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_get_generic_ext_step] (0x1000):
Requesting attrs: [serviceCategory]
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_get_generic_ext_step] (0x1000):
Requesting attrs: [sourceHost]
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_get_generic_ext_step] (0x1000):
Requesting attrs: [sourceHostCategory]
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_get_generic_ext_step] (0x1000):
Requesting attrs: [externalHost]
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_get_generic_ext_step] (0x1000):
Requesting attrs: [memberHost]
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_get_generic_ext_step] (0x1000):
Requesting attrs: [hostCategory]
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_get_generic_ext_step] (0x2000):
ldap_search_ext called, msgid = 18
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_op_add] (0x2000): New operation 18
timeout 60
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_process_result] (0x2000): Trace:
sh[0x55d349f96d00], connected[1], ops[0x55d349ff16e0], ldap[0x55d34a10cba0]
(Wed Oct 13 19:40:52 2021) [be[<domain>]] [sdap_process_result] (0x2000): Trace: end
of ldap_result list
(Wed Oct 13 19:41:52 2021) [be[<domain>]] [sdap_op_timeout] (0x1000): Issuing
timeout for 18
(Wed Oct 13 19:41:52 2021) [be[<domain>]] [sdap_op_destructor] (0x1000): Abandoning
operation 18
(Wed Oct 13 19:41:52 2021) [be[<domain>]] [generic_ext_search_handler] (0x0040):
sdap_get_generic_ext_recv failed [110]: Connection timed out
(Wed Oct 13 19:41:52 2021) [be[<domain>]] [ipa_hbac_rule_info_done] (0x0080): Could
not retrieve HBAC rules
(Wed Oct 13 19:41:52 2021) [be[<domain>]] [ipa_pam_access_handler_done] (0x0020):
Unable to fetch HBAC rules [110]: Connection timed out
(Wed Oct 13 19:41:52 2021) [be[<domain>]] [dp_req_done] (0x0400): DP Request [PAM
Account #21]: Request handler finished [0]: Success
```
I tried using `ldap_enumeration_search_timeout = 180` which updated the timeout in
(`[sdap_op_add] (0x2000): New operation 18 timeout 60 `) but same result (`enumerate` and
`subdomain_enumerate` config options are left to their defaults, `false` and `none`
respectively).
I looked at the IPA servers LDAP logs and saw now related requests regarding
`ipaHBACRules` so I assume the request didn't even left the server.
I tried tweaking some other caching options like `cached_auth_timeout` and
`pam_id_timeout` without any impact. Also tried clearing up the caches with `sss_cache -E`
or `systemctl stop sssd && rm /var/lib/sss/db/* && systemctl start sssd`
and nothing changed.
This is the error returned by `sssctl user-checks` when it fails:
```
$ sudo sssctl user-checks <user>
user: <user>
action: acct
service: system-auth
SSSD nss user lookup result:
- user name: <user>
- user id: <uid>
- group id: <gid>
- gecos: <name>
- home directory: /home/<user>
- shell: /bin/bash
InfoPipe operation failed. Check that SSSD is running and the InfoPipe responder is
enabled. Make sure 'ifp' is listed in the 'services' option in
sssd.conf.InfoPipe User lookup with [<user>] failed.
testing pam_acct_mgmt
pam_acct_mgmt: System error
PAM Environment:
- no env -
```
Any pointer here would be very useful.
Thank you!
Aitor
Show replies by date