Hi Eugene, I decided to start a new thread to discuss so that we can close the previous parenthesis and concentrate on the problem at hand.
On Mon, 19 Apr 2010 15:19:22 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
So now we are ready to continue with fixing failover reconnect and GSSAPI authentication in the LDAP and IPA providers. From my point of view, at least the following problems need to be addressed by the final solution:
- When two (or more) BE requests are executed in parallel and there is no cached connection, only one LDAP connection should be established. In the current implementation two connections are established and the first one is killed, failing the operation that connected first.
ACK (within the boundaries of the ID provider)
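One way to read the requirement above is that requests arriving while a connect is already in flight should queue behind it instead of starting their own. The following is a minimal callback-style sketch of that idea, not SSSD code: `ConnectionCache`, `acquire` and `fake_start_connect` are hypothetical names invented for the illustration.

```python
class ConnectionCache:
    """Hypothetical cache that starts at most one connect at a time."""

    def __init__(self, start_connect):
        # start_connect(done) begins an asynchronous connect and later
        # calls done(connection) exactly once.
        self._start_connect = start_connect
        self._conn = None
        self._waiters = []

    def acquire(self, callback):
        # A cached connection is handed out immediately.
        if self._conn is not None:
            callback(self._conn)
            return
        # Otherwise queue the caller; only the first waiter starts a connect.
        self._waiters.append(callback)
        if len(self._waiters) == 1:
            self._start_connect(self._on_connected)

    def _on_connected(self, conn):
        self._conn = conn
        waiters, self._waiters = self._waiters, []
        for callback in waiters:
            callback(conn)


# Two requests arrive before the connect finishes.
pending = []                      # connect attempts started so far


def fake_start_connect(done):
    pending.append(done)


cache = ConnectionCache(fake_start_connect)
results = []
cache.acquire(results.append)     # request 1 starts the connect
cache.acquire(results.append)     # request 2 waits for the same connect
attempts_started = len(pending)
pending[0]("conn-1")              # the single connect completes
```

Both requests end up with the same connection, and neither one is failed because the other connected first.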
- When the OFFLINE state is detected during request execution (there was a cached connection, but all failover servers failed to connect during request execution), the backend must return DP_ERR_OFFLINE. It currently returns DP_ERR_FATAL with an EIO error, while the next request completes with DP_ERR_OFFLINE. So there is a big inconsistency in behaviour.
I think this makes sense.
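The consistency asked for here can be sketched as follows: the request that discovers the offline condition reports it the same way as every later request. This is only an illustration under assumptions. The string constants stand in for the real DP_ERR_* codes, and `AllServersFailed`, `run_request` and `be_state` are hypothetical names.

```python
# Placeholder values; only the names DP_ERR_OFFLINE and DP_ERR_FATAL
# come from the thread. DP_ERR_OK is assumed for the success case.
DP_ERR_OK = "DP_ERR_OK"
DP_ERR_OFFLINE = "DP_ERR_OFFLINE"
DP_ERR_FATAL = "DP_ERR_FATAL"


class AllServersFailed(Exception):
    """Hypothetical: no failover server accepted a connection."""


def run_request(request, be_state):
    # Already offline: answer immediately, as the backend does today.
    if be_state["offline"]:
        return DP_ERR_OFFLINE
    try:
        request()
    except AllServersFailed:
        # Going offline mid-request is reported as OFFLINE too,
        # not as a fatal I/O error.
        be_state["offline"] = True
        return DP_ERR_OFFLINE
    except Exception:
        return DP_ERR_FATAL
    return DP_ERR_OK


def failing_request():
    raise AllServersFailed("no server reachable")


state = {"offline": False}
first = run_request(failing_request, state)    # detects the offline state
second = run_request(failing_request, state)   # sees the recorded state
```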
- It is essential to close the LDAP connection before the GSSAPI ticket expires, as closing a connection with an already expired ticket still writes a message to the message log.
Premise: I have started a discussion upstream wrt killing GSSAPI connections when credentials expire. Heimdal doesn't do that. MIT does, but things may change.
Until the issue is resolved upstream I think it makes sense to avoid bad messages in the logs, but only as long as avoiding them doesn't require complex and convoluted code.
- The about-to-expire connection should be closed gracefully: all requests already in progress and using the connection should be completed, while new requests should establish and use a new connection.
Hopefully we can avoid "expiring" connections (see premise above), but I think we need to be even more aggressive, and close connections when they go idle. This way we can free server resources, and in most cases we will close well before we even get close to the expiration time.
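The graceful-close requirement quoted above (requests in progress finish on the old connection, new requests get a fresh one) amounts to reference counting plus a "retiring" flag. A minimal sketch follows, assuming hypothetical `Connection` and `Pool` classes that are not part of SSSD.

```python
class Connection:
    """Hypothetical stand-in for one LDAP connection instance."""

    def __init__(self, name):
        self.name = name
        self.users = 0          # requests currently using the connection
        self.retiring = False   # no new requests may attach
        self.closed = False


class Pool:
    def __init__(self, open_connection):
        self._open = open_connection
        self._current = None

    def acquire(self):
        # New requests never attach to a retiring connection.
        if self._current is None or self._current.retiring:
            self._current = self._open()
        self._current.users += 1
        return self._current

    def release(self, conn):
        conn.users -= 1
        # The last user of a retiring connection closes it.
        if conn.retiring and conn.users == 0:
            conn.closed = True

    def retire_current(self):
        # Called shortly before expiry (or on idle); closes at once if unused.
        conn = self._current
        if conn is None:
            return
        conn.retiring = True
        if conn.users == 0:
            conn.closed = True


counter = iter(range(1, 100))
pool = Pool(lambda: Connection(f"conn-{next(counter)}"))
old = pool.acquire()                 # request in progress on conn-1
pool.retire_current()                # conn-1 is about to expire
new = pool.acquire()                 # a new request gets conn-2
still_open_while_busy = not old.closed
pool.release(old)                    # last user leaves, conn-1 closes
```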
- ipa_access backend should also use failover retries.
ACK
- I think it is essential to reduce the amount of copy-paste code handling LDAP connect/reconnect. My strong opinion is that a special mechanism for handling LDAP connect/retry logic is required.
If we need it then we need it at a deep level, down close to the openldap library boundary, so that we do not have to restart functions at a higher level. As close as possible to the wire.
Simo.
On 04/29/2010 10:07 PM, Simo Sorce wrote:
Hi Eugene, I decided to start a new thread to discuss so that we can close the previous parenthesis and concentrate on the problem at hand.
On Mon, 19 Apr 2010 15:19:22 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
So now we are ready to continue with fixing failover reconnect and GSSAPI authentication in the LDAP and IPA providers. From my point of view, at least the following problems need to be addressed by the final solution:
- When two (or more) BE requests are executed in parallel and there is no cached connection, only one LDAP connection should be established. In the current implementation two connections are established and the first one is killed, failing the operation that connected first.
ACK (within the boundaries of the ID provider)
- When the OFFLINE state is detected during request execution (there was a cached connection, but all failover servers failed to connect during request execution), the backend must return DP_ERR_OFFLINE. It currently returns DP_ERR_FATAL with an EIO error, while the next request completes with DP_ERR_OFFLINE. So there is a big inconsistency in behaviour.
I think this makes sense.
- It is essential to close the LDAP connection before the GSSAPI ticket expires, as closing a connection with an already expired ticket still writes a message to the message log.
Premise: I have started a discussion upstream wrt killing GSSAPI connections when credentials expire. Heimdal doesn't do that. MIT does, but things may change.
Until the issue is resolved upstream I think it makes sense to avoid bad messages in the logs, but only as long as avoiding them doesn't require complex and convoluted code.
- The about-to-expire connection should be closed gracefully: all requests already in progress and using the connection should be completed, while new requests should establish and use a new connection.
Hopefully we can avoid "expiring" connections (see premise above), but I think we need to be even more aggressive, and close connections when they go idle. This way we can free server resources, and in most cases we will close well before we even get close to the expiration time.
That really does not sound like a good idea to me:
- A full connection cycle takes up to 1 sec, causing huge delays on the first response.
- It might make sense to drop a cached connection only after at least an hour of idle time.
- If we do drop the cached connection, we should optimize Kerberos auth so that ldap_child is not run until the cached ticket expires. This would reduce initial latency tenfold.
- I do not see a big load on the LDAP server if every workstation keeps a single LDAP connection. The default configuration of 389-ds would handle 8000 workstations. That looks like quite enough.
- ipa_access backend should also use failover retries.
ACK
- I think it is essential to reduce the amount of copy-paste code handling LDAP connect/reconnect. My strong opinion is that a special mechanism for handling LDAP connect/retry logic is required.
If we need it then we need it at a deep level, down close to the openldap library boundary, so that we do not have to restart functions at a higher level. As close as possible to the wire.
That would be really hard to achieve, as:
1. The backend offline state is not known to sdap_handle, but it takes part in the reconnect logic.
2. The sdap_handle tear is already overcomplicated, and adding reconnect logic there would make it unmanageable.
3. It is still important to have an object representing an instance of a connection to the LDAP server; you just cannot put all your eggs in one basket.
Eugene
PS I still do not understand what is wrong with my patches. Why is it not possible just to use them and not to redo the job?
On Fri, 30 Apr 2010 11:49:57 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
PS I still do not understand what is wrong with my patches. Why is it not possible just to use them and not to redo the job?
To be honest, I think it is too complex.
Simo.
On Fri, 30 Apr 2010 11:49:57 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
That would be really hard to achieve, as:
1. The backend offline state is not known to sdap_handle, but it takes part in the reconnect logic.
2. The sdap_handle tear is already overcomplicated, and adding reconnect logic there would make it unmanageable.
3. It is still important to have an object representing an instance of a connection to the LDAP server; you just cannot put all your eggs in one basket.
The alternative is to put the layer at the very top, in data_provider_be. That requires everything underneath to properly return "retry-able errors".
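To make the top-layer alternative concrete, here is a rough sketch of a single retry loop that sits above all provider code and relies on the lower layers marking failures as retry-able. It is an illustration only: `RetryableError`, `ServersExhausted`, `run_with_failover` and the server names are hypothetical, and the real data_provider_be code is asynchronous C.

```python
class RetryableError(Exception):
    """Hypothetical: a lower layer says the request may be retried elsewhere."""


class ServersExhausted(Exception):
    """Hypothetical: every failover server was tried without success."""


def run_with_failover(operation, servers):
    # One retry loop at the top; the code below only classifies its errors.
    last_error = None
    for server in servers:
        try:
            return operation(server)
        except RetryableError as err:
            last_error = err      # try the next failover server
    # Anything that is not RetryableError propagates unchanged above.
    raise ServersExhausted(f"all servers failed: {last_error}")


def lookup(server):
    if server == "ldap1.example.com":
        raise RetryableError("connection refused")
    return f"answer from {server}"


result = run_with_failover(lookup, ["ldap1.example.com", "ldap2.example.com"])
```

The cost of this placement is the one noted above: every function underneath has to distinguish retry-able failures from fatal ones, otherwise the loop cannot tell when restarting the operation is safe.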
Simo.
On Fri, 30 Apr 2010 11:49:57 +0400 Eugene Indenbom eindenbom@gmail.com wrote:
Hopefully we can avoid "expiring" connections (see premise above), but I think we need to be even more aggressive, and close connections when they go idle. This way we can free server resources, and in most cases we will close well before we even get close to the expiration time.
That really does not sound like a good idea to me:
- A full connection cycle takes up to 1 sec, causing huge delays on the first response.
Well, this happens only after a long idle period, so I don't think it is such a bad thing. Also remember that in many cases we will run operations as side refreshes (when midway cache refreshes are enabled), so the client will not actually hang, as it will get back cached information.
- It might make sense to drop a cached connection only after at least an hour of idle time.
I was thinking more around 10/15 min., but this should be a tunable option with a good default and the ability to set it to 0, meaning no idle drops.
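The tunable described here reduces to a small predicate. In this sketch the option name `idle_timeout`, its unit (seconds) and the 900-second default are assumptions made for the example; only the "0 means no idle drops" rule and the 10/15 minute range come from the message above.

```python
DEFAULT_IDLE_TIMEOUT = 15 * 60   # assumed default, in seconds


def should_drop_idle(last_used, now, idle_timeout=DEFAULT_IDLE_TIMEOUT):
    """Return True when an idle cached connection should be closed."""
    # 0 disables idle drops entirely.
    if idle_timeout == 0:
        return False
    return now - last_used >= idle_timeout


drop_after_15_min = should_drop_idle(last_used=0, now=900)
keep_after_14_min = not should_drop_idle(last_used=0, now=840)
never_drop_when_zero = not should_drop_idle(last_used=0, now=10**6, idle_timeout=0)
```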
- If we do drop the cached connection, we should optimize Kerberos auth so that ldap_child is not run until the cached ticket expires. This would reduce initial latency tenfold.
ACK.
- I do not see a big load on the LDAP server if every workstation keeps a single LDAP connection. The default configuration of 389-ds would handle 8000 workstations. That looks like quite enough.
389-ds does, but not all ldap servers may be so happy :)
Simo.
sssd-devel@lists.fedorahosted.org