Hi,
after running some time with debugging enabled, I came across a "Too many open
files" error in the logs.
Shouldn't this be fixed? (
https://fedorahosted.org/sssd/ticket/2792)
We have 6 AD servers in our environment which get returned when using DNS SRV records. It
seems like one can't be discoverd in a timely manner:
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [sdap_id_op_connect_step] (0x4000):
beginning to connect
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [fo_resolve_service_send] (0x0100):
Trying to resolve service 'AD_GC'
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [get_port_status] (0x1000): Port
status of port 0 for server '(no name)' is 'not working'
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [get_port_status] (0x0100):
Reseting the status of port 0 for server '(no name)'
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]]
[fo_resolve_service_activate_timeout] (0x2000): Resolve timeout set to 6 seconds
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [get_srv_data_status] (0x0400):
Changing state of SRV lookup from 'SRV_RESOLVE_ERROR' to 'SRV_NEUTRAL'.
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [resolve_srv_send] (0x0200): The
status of SRV lookup is neutral
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [ad_srv_plugin_send] (0x0400):
About to find domain controllers
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [ad_get_dc_servers_send] (0x0400):
Looking up domain controllers in domain
some.domain.com
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [resolv_discover_srv_next_domain]
(0x0400): SRV resolution of service 'ldap'. Will use DNS discovery domain
'some.domain.com'
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [resolv_getsrv_send] (0x0100):
Trying to resolve SRV record of '_ldap._tcp.some.domain.com'
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [schedule_request_timeout]
(0x2000): Scheduling a timeout of 6 seconds
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [schedule_timeout_watcher]
(0x2000): Scheduling DNS timeout watcher
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [unschedule_timeout_watcher]
(0x4000): Unscheduling DNS timeout watcher
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [resolv_getsrv_done] (0x1000):
Using TTL [600]
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [request_watch_destructor]
(0x0400): Deleting request watch
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [fo_discover_srv_done] (0x0400):
Got answer. Processing...
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [fo_discover_srv_done] (0x0400):
Got 6 servers
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [ad_get_dc_servers_done] (0x0400):
Found 6 domain controllers in domain
some.domain.com
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [ad_srv_plugin_dcs_done] (0x0400):
About to locate suitable site
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [sdap_connect_host_send] (0x0400):
Resolving host
DC01.SOME.Domain.com
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [resolv_is_address] (0x4000):
[
DC01.SOME.Domain.com] does not look like an IP address
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [resolv_gethostbyname_step]
(0x2000): Querying files
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [resolv_gethostbyname_files_send]
(0x0100): Trying to resolve A record of 'DC01.SOME.Domain.com' in files
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [resolv_gethostbyname_step]
(0x2000): Querying files
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [resolv_gethostbyname_files_send]
(0x0100): Trying to resolve AAAA record of 'DC01.SOME.Domain.com' in files
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [resolv_gethostbyname_next]
(0x0200): No more address families to retry
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [resolv_gethostbyname_step]
(0x2000): Querying DNS
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [resolv_gethostbyname_dns_query]
(0x0100): Trying to resolve A record of 'DC01.SOME.Domain.com' in DNS
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [schedule_request_timeout]
(0x2000): Scheduling a timeout of 6 seconds
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [schedule_timeout_watcher]
(0x2000): Scheduling DNS timeout watcher
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [unschedule_timeout_watcher]
(0x4000): Unscheduling DNS timeout watcher
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [resolv_gethostbyname_dns_parse]
(0x1000): Parsing an A reply
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [request_watch_destructor]
(0x0400): Deleting request watch
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [sdap_connect_host_resolv_done]
(0x0400): Connecting to ldap://DC01.SOME.Domain.com:389
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [sss_ldap_init_send] (0x4000):
Using file descriptor [1015] for LDAP connection.
(Sat Mar 26 10:17:01 2016) [sssd[be[some.domain.com]]] [sss_ldap_init_send] (0x0400):
Setting 6 seconds timeout for connecting
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [sbus_dispatch] (0x4000): dbus
conn: 0x1741990
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [sbus_dispatch] (0x4000):
Dispatching.
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [sbus_message_handler] (0x4000):
Received SBUS method [ping]
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [sbus_get_sender_id_send] (0x2000):
Not a sysbus message, quit
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [sbus_handler_got_caller_id]
(0x4000): Received SBUS method [ping]
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [fo_resolve_service_timeout]
(0x0080): Service resolving timeout reached
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [sdap_handle_release] (0x2000):
Trace: sh[0x3742310], connected[0], ops[(nil)], ldap[(nil)], destructor_lock[0],
release_memory[0]
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [be_resolve_server_done] (0x1000):
Server resolution failed: 14
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [sdap_id_op_connect_done] (0x0400):
Failed to connect to server, but ignore mark offline is enabled.
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [sdap_id_op_connect_done] (0x4000):
notify offline to op #1
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [sdap_id_op_connect_step] (0x4000):
beginning to connect
This occurs repeatedly until sssd runs out file descriptors:
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [sdap_kinit_kdc_resolved] (0x1000):
KDC resolved, attempting to get TGT...
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [create_tgt_req_send_buffer]
(0x0400): buffer size: 54
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [sdap_fork_child] (0x0020): pipe
failed [24][Too many open files].
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [sdap_get_tgt_send] (0x0020):
sdap_fork_child failed.
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [sdap_kinit_done] (0x0020): child
failed (24 [Too many open files])
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [sdap_cli_kinit_done] (0x0400):
Cannot get a TGT: ret [24](Too many open files)
(Sat Mar 26 10:17:07 2016) [sssd[be[some.domain.com]]] [fo_set_port_status] (0x0100):
Marking port 389 of server 'DC02.SOME.Domain.com' as 'not working'
which then results in:
Mar 29 08:25:51 <HOSTNAME> sshd[6975]: pam_sss(sshd:auth): authentication failure;
logname= uid=0 euid=0 tty=ssh ruser= rhost=<rhost> user=<user>
Mar 29 08:25:51 <HOSTNAME> sshd[6975]: pam_sss(sshd:auth): received for user
<user>: 4 (System error)
Mar 29 08:25:53 <HOSTNAME> sshd[6975]: Failed password for <user> from
<ip> port 54001 ssh2
I've now tried setting dns_resolver_timeout to 10 seconds as someone mentioned at
https://fedorahosted.org/sssd/ticket/2792 as a possible workaround but it would be much
appreciated if someone could provide some feedback on this issue.
Many Thanks,
Christoph