On Wed, Jun 24, 2015 at 10:18:26AM -0700, Janelle wrote:
> On 6/24/15 12:38 AM, Jakub Hrozek wrote:
>> On Tue, Jun 23, 2015 at 07:52:46AM -0700, Janelle wrote:
>>> On 6/23/15 7:33 AM, John Hodrien wrote:
>>>> On Tue, 23 Jun 2015, Janelle wrote:
>>>>
>>>>> Servers are behind a load-balancer. Address never changes.
>>>> But one problem with that is that SSSD will see multiple servers as one
>>>> server, and so will mark the server as failed if the load balancer
>>>> presents it with a broken back end server.
>>>>
>>>> Works much better in my experience when you tell SSSD about all the
>>>> servers.
>>>>
>>>> jh
>>> Sadly that is not possible. If SSSD did load balancing when given multiple
>>> servers, then yes, but it does not. When you are running 30,000 servers with
>>> 3000 users, you have to load balance or SSSD simply dies and an ssh login
>>> takes 5 minutes to complete.
>> What is the configuration you were running here? I'm interested in
>> seeing how we can make SSSD not die :-)
>>
>>> The only way to make SSSD happy and not kill
>>> the single server it would point to is to have multiple servers behind a
>>> VIP.
>> Hmm, did you consider SRV records as John pointed out elsewhere? Then
>> you could load-balance using weight fields of SRV records..
>>
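For readers unfamiliar with how SRV weights spread load, here is a minimal
Python sketch of the single-pick selection described in RFC 2782: take the
records with the lowest priority, then choose among them at random in
proportion to their weight. The record values and host names in the usage
note are made up for illustration, and this makes no claim about how SSSD's
own resolver orders SRV records internally.

```python
import random
from typing import Callable, List, NamedTuple


class SrvRecord(NamedTuple):
    priority: int
    weight: int
    port: int
    target: str


def pick_srv_target(records: List[SrvRecord],
                    randint: Callable[[int, int], int] = random.randint
                    ) -> SrvRecord:
    """Pick one SRV record: lowest priority first, then weighted random."""
    if not records:
        raise ValueError("no SRV records")
    lowest = min(r.priority for r in records)
    # Zero-weight records are placed first so they are only rarely selected.
    candidates = sorted((r for r in records if r.priority == lowest),
                        key=lambda r: r.weight != 0)
    total = sum(r.weight for r in candidates)
    threshold = randint(0, total)  # inclusive on both ends
    running = 0
    for rec in candidates:
        running += rec.weight
        if running >= threshold:
            return rec
    return candidates[-1]
```

With hypothetical records such as SrvRecord(0, 3, 389, "ldap1.example.com")
and SrvRecord(0, 1, 389, "ldap2.example.com"), the first target is picked
roughly three times as often as the second, and a record with a higher
priority number is only a fallback.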
>>> Am I completely off base to think this is the way to go? Can SSSD be
>>> taught to actually load balance?
>> I'm not exactly sure how you would like SSSD to behave. Would this
>> ticket help - https://fedorahosted.org/sssd/ticket/2499 ?
>> _______________________________________________
>> sssd-users mailing list
>> sssd-users(a)lists.fedorahosted.org
>> https://lists.fedorahosted.org/mailman/listinfo/sssd-users
> What I found was that when the VIP servers are updated, even though most of
> the systems continue to run, a large population seems to say the LDAP server
> has lost connection. And then SSSD stops trying unless you restart it:
Have you tried whether cycling the offline/online status with USR1 and USR2
helps?
>
> [sssd[be[default]]] [fo_resolve_service_send] (0x0020): No available
> servers for service 'LDAP'
> [sssd[be[default]]] [sss_ldap_init_sys_connect_done] (0x0020):
> ldap_install_tls failed: Connect error
> [sssd[be[default]]] [sdap_sys_connect_done] (0x0020):
> sdap_async_connect_call request failed.
>
> (ignore cert error - it is set to ALLOW)
>
> A simple "service sssd restart" solves it, but you can see the server is
> still up. A telnet connect to either of 389 or 636 works fine. It seems to
> me like SSSD just gives up and stops trying?
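The telnet check on 389 and 636 described above can be scripted so it is easy
to repeat from many clients. This is only a sketch: the function name and the
3 second timeout are arbitrary choices, not anything taken from SSSD.

```python
import socket
from typing import Dict, Iterable


def check_ports(host: str, ports: Iterable[int] = (389, 636),
                timeout: float = 3.0) -> Dict[int, bool]:
    """Return {port: True/False}: could a TCP connection be opened?"""
    results = {}
    for port in ports:
        try:
            # Open and immediately close a plain TCP connection.
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Covers refused connections, timeouts and name lookup failures.
            results[port] = False
    return results
```

Called as check_ports("ldap-vip.example.com") (a placeholder name), it reports
each port as reachable or not. Like telnet, it only proves that the TCP
handshake succeeds. It does not exercise TLS setup or an LDAP bind, which is
the step the ldap_install_tls message in the log points at.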
At that point sssd goes offline, right?
Could you try experimenting with a short offline_timeout? (see man
sssd.conf for more details on that option)
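In case it helps with testing, the USR1/USR2 cycling mentioned earlier in
this message could be scripted roughly like this. It is a sketch under two
assumptions: that the main sssd process writes its PID to /var/run/sssd.pid
on your distribution, and that a 5 second pause between the signals is long
enough to be useful. Per the sssd(8) man page, SIGUSR1 makes SSSD act as
offline and SIGUSR2 tells it to go back online. It needs to run as root.

```python
import os
import signal
import time
from typing import Callable


def read_pid(pid_file: str = "/var/run/sssd.pid") -> int:
    """Read the sssd PID. The default path is an assumption; adjust it."""
    with open(pid_file) as fh:
        return int(fh.read().strip())


def cycle_sssd_offline_online(pid: int, pause: float = 5.0,
                              send: Callable[[int, int], None] = os.kill
                              ) -> None:
    """Send SIGUSR1 (go offline), wait, then SIGUSR2 (go online)."""
    send(pid, signal.SIGUSR1)
    time.sleep(pause)
    send(pid, signal.SIGUSR2)
```

Running cycle_sssd_offline_online(read_pid()) on a client that is stuck in
the "No available servers" state would show whether forcing it back online
is enough, or whether only a full restart clears it.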
> As a side note - nslcd works flawlessly and the server might disconnect for
> a second, then it comes back and nslcd restores the connection. It does not
> seem to give up as SSSD does :-(
I think it's because nslcd is not as stateful as sssd, so it would try
to connect every time. But I'm not totally sure without seeing the issue
myself..

What version was offline_timeout added? I would expect that with a default
of 60 it would recover, but it does not seem to. But maybe there is a
version issue here?
~J