Hi,
we might be able to do some refactoring in the next couple of weeks prior to working on the next release.
I wrote up a proposal here: https://fedorahosted.org/sssd/wiki/DesignDocs/OneFifteenCodeRefactoring for your convenience, I copied the contents inline as well.
But the page is really meant to kick off the discussion, so feel free to add suggestions or push back against those that are included.
The main points to keep in mind are that the work should be doable in a couple of weeks along with our regular maintenance work and should benefit SSSD, not just 'this code is ugly, let's change it' kind of work.
I'm looking forward to the discussion, especially I would like to hear Michal's or Lukas' opinion about the memory cache changes..
The text of the page follows: = Code refactoring for the 1.15 release =
Related ticket(s): * please see inline
== Problem statement == SSSD is being very actively developed, adding several major features in each release. We need to make sure the code stays maintainable and adding new features in the upcoming release won't increase the cost of maintaining SSSD long-term.
Since SSSD releases are primarily being driven by Fedora and RHEL releases, the Red Hat employed developers have a fixed amount of time for code refactoring. Of course, community members and developers are free to submit their patches on their schedule -- although discussion on the list would be needed prior to merging any refactoring to not disrupt SSSD release quality for everyone.
== Use cases == A typical use-case would be: "A feature X depends on module Y that either is missing some functionality that is missing or a module that has outlived its initial design. Changing Y in that module would allow us to implement X more easily or with less maintenance effort in the future".
The goal is to prepare the code for upcoming features without regressing, so testing after the refactoring is done is mandatory. We should consider also doing an upstream (pre)release to make it easier to test the changes.
== Proposed items to be refactored == This section lists the proposed tickets along with justifications, scope and test impact.
Given the fixed amount of time, each refactoring has a scope, expressed in just three high-level buckets - large (a couple of weeks, might take most of the time of the refactoring "sprint"), medium (a week to two weeks) or small (a couple of days, up to a week). Each item also lists the affected modules or functionality, so that we know where we need to improve tests.
=== Use the shared `cache_req` module for responder look-ups === Currently each responder (except the !InfoPipe responder and several parts of the NSS responder) copy the logic that checks the cache and contacts the Data Provider if needed. In 1.15, we should add the missing functionality into the cache_req module and convert the existing responders (especially those that look up users and groups, not necessarily other objects like autofs maps or hosts..) to cache_req.
==== Benefit to SSSD ==== In 1.15, we should look at allowing lookups from trusted domain with a shortname. But we need to take performance into account and avoid cycling over all domains including their LDAP server. Then we could switch to checking the caches of all domains first before checking each domain's cache and then its server.
This goal is tracked by https://fedorahosted.org/sssd/ticket/843 (Login time increases strongly if more than one domain is configured) and ultimately by https://fedorahosted.org/sssd/ticket/3001 ([RFE] Short name input format with SSSD for users from all domains when domain autodiscovery is used).
==== Tracking tickets ==== * https://fedorahosted.org/sssd/ticket/3151 - cache_req: complete the needs of NSS responders * https://fedorahosted.org/sssd/ticket/1126 - Reuse cache_req() in responder code
==== Testing ==== We already have NSS and PAM responder tests. We need to extend them further to make sure all codepaths we change are tested.
==== Scope ==== Large
=== Refactor group lookups for better performance === The `sdap_async_groups.c` module grew organically over time. At the moment, the module is quite hard to read and repeats some potentially expensive operations (like looping over all attributes or all members) several times.
In order to improve performance, we should refactor this module and test it extensively.
==== Benefit to SSSD ==== The `sdap_async_groups.c` module would be better maintainable and we would remove some performance bottlenecks from the code.
==== Tracking tickets ==== * https://fedorahosted.org/sssd/ticket/3211 - Refactor the sdap_async_groups.c module
==== Testing ==== LDAP group lookups can be tested using integration tests, "just" all cases we change must have corresponding test cases.
==== Scope ==== Medium
=== Refactor the sdap_id_ops.c module === The `sdap_id_ops.c` module was written in time where SSSD only supported a single domain. One of the things that are repeatedly biting us is that the module can set the fail over status of the whole domain to offline. Moreover, the module has no tests and is not easy to read.
At this time, it's not clear whether the refactoring would just result in documenting and testing the module or if it would be worth for example making the module return error codes for connection errors and let the caller handle the errors. Alternatively, we might decide to do even more work and let the fail over code work per-domain, not per-back end, which probably wouldn't be doable in the given scope. More research is needed.
==== Benefit to SSSD ==== The module would be better maintainable (currently there are some codepaths where we even don't know why they were added anymore..), have tests and we would work towards removing issues with trusted domains setting SSSD to the offline mode.
==== Tracking tickets ==== * https://fedorahosted.org/sssd/ticket/1507 - Investigate terminating connections in sdap_ops.c and comment the code some more Other related tickets include: * https://fedorahosted.org/sssd/ticket/2767 - The sdap_op code always ends request with EAGAIN * https://fedorahosted.org/sssd/ticket/2976 - sdap code can mark the whole sssd_be offline
==== Testing ==== Currently upstream has only basic tests with the integration tests. Downstream has tests for fail over as well.
==== Scope ==== Medium to large, depending on what changes we decide to do.
=== Provide a way to pass intermediate data between requests === As long as a request is confined within a single domain, we can pass around `sysdb_attrs` or a similar data structure between different requests and avoid a costly cache writes. However, when a request must include processing in two different domain types, for example an IPA domain that includes overrides, the only way to pass intermediate data is to call a sysdb transaction and save the data to cache so that another request can read them.
==== Benefit to SSSD ==== Performance benefit in case SSSD must call identity lookup requests from different domains.
==== Tracking tickets ==== * https://fedorahosted.org/sssd/ticket/2943 - Split out the requests for IPA users and groups that include overrides into reusable requests
==== Testing ==== Unfortunately, there are no upstream tests for requests that include overrides. Testing would be provided by downstream tests.
==== Scope ==== Medium to large
=== Upstream the PoC tests that utilize Samba AD for AD provider testing === At the moment, we don't have any upstream tests for the AD provider and we rely on downstream and manual testing completely. Nikolai Kondrashov wrote a proof-of-concept patches that provisions an AD DC server provided by the Samba project using the cwrap wrapper libraries. The scope of this effort would be to upstream this work.
==== Benefit to SSSD ==== SSSD integration tests would allow us to write tests for the AD provider.
==== Tracking tickets ==== * https://fedorahosted.org/sssd/ticket/2818 - Investigate if Samba4 in Fedora can be used for SSSD CI
==== Testing ==== Some basic tests like looking up a user or a group would be part of this effort.
==== Scope ==== Medium
=== Decrease the functionality of the monitor process === SSSD is gradually moving to socket-activated services and in general more self-contained services rather than implementing a service manager in the monitor process. The scope of this work would be to further decrease the dependence of services on the monitor process, such as moving the register functionality to the services themselves. Ultimately, the monitor process would perform one-time tasks such as converting the configuration from INI to confdb and exit.
Other work might include a fallback configuration or starting the services and domains even without having them explicitly enumerated in the services section.
==== Benefit to SSSD ==== Socket-activatable services would be better manageable by SSSD.
==== Tracking tickets ==== * https://fedorahosted.org/sssd/ticket/2231 - RFE: Drop the monitor process
==== Testing ==== There are no upstream test for this functionality at the moment. Some service restart tests exist in downstream, though.
==== Scope ==== Medium to large, but hopefully this task could be split into several smaller tasks.
=== Memory cache changes === There are several improvements to the memory cache that we have been discussing lately, including a memory cache for by-SID lookups or having the memory cache respect case insensitive domains. The goal of this task would be to investigate what needs to be changed in the memory cache in order to implement these improvements.
==== Benefit to SSSD ==== Better performance through leveraging memory cache for SID lookups and lookups in case insensitive domains.
==== Tracking tickets ==== * https://fedorahosted.org/sssd/ticket/3193 - [RFE] Support aliases in the memory cache * https://fedorahosted.org/sssd/ticket/2727 - Add a memcache for SID-by-id lookups
==== Testing ==== We already have tests for memory cache which could be extended. Tests for by-SID lookups would probably require us to add the Samba-based tests first.
==== Scope ==== Probably large, but more investigation is needed.
== How To Test == Run all available upstream and downstream tests, if possible, extend the upstream tests.
On 10/03/2016 01:27 PM, Jakub Hrozek wrote:
we might be able to do some refactoring in the next couple of weeks prior to working on the next release.
I wrote up a proposal here: https://fedorahosted.org/sssd/wiki/DesignDocs/OneFifteenCodeRefactoring for your convenience, I copied the contents inline as well.
But the page is really meant to kick off the discussion, so feel free to add suggestions or push back against those that are included.
The main points to keep in mind are that the work should be doable in a couple of weeks along with our regular maintenance work and should benefit SSSD, not just 'this code is ugly, let's change it' kind of work.
I'm looking forward to the discussion, especially I would like to hear Michal's or Lukas' opinion about the memory cache changes..
=== Use the shared `cache_req` module for responder look-ups === Currently each responder (except the !InfoPipe responder and several parts of the NSS responder) copy the logic that checks the cache and contacts the Data Provider if needed. In 1.15, we should add the missing functionality into the cache_req module and convert the existing responders (especially those that look up users and groups, not necessarily other objects like autofs maps or hosts..) to cache_req.
==== Benefit to SSSD ==== In 1.15, we should look at allowing lookups from trusted domain with a shortname. But we need to take performance into account and avoid cycling over all domains including their LDAP server. Then we could switch to checking the caches of all domains first before checking each domain's cache and then its server.
This goal is tracked by https://fedorahosted.org/sssd/ticket/843 (Login time increases strongly if more than one domain is configured) and ultimately by https://fedorahosted.org/sssd/ticket/3001 ([RFE] Short name input format with SSSD for users from all domains when domain autodiscovery is used).
This will also help with implementing session recording (tlog integration). I couldn't find the corresponding ticket, sorry.
Nick
On 10/03/2016 01:46 PM, Nikolai Kondrashov wrote:
On 10/03/2016 01:27 PM, Jakub Hrozek wrote:
we might be able to do some refactoring in the next couple of weeks prior to working on the next release.
I wrote up a proposal here: https://fedorahosted.org/sssd/wiki/DesignDocs/OneFifteenCodeRefactoring for your convenience, I copied the contents inline as well.
But the page is really meant to kick off the discussion, so feel free to add suggestions or push back against those that are included.
The main points to keep in mind are that the work should be doable in a couple of weeks along with our regular maintenance work and should benefit SSSD, not just 'this code is ugly, let's change it' kind of work.
I'm looking forward to the discussion, especially I would like to hear Michal's or Lukas' opinion about the memory cache changes..
=== Use the shared `cache_req` module for responder look-ups === Currently each responder (except the !InfoPipe responder and several parts of the NSS responder) copy the logic that checks the cache and contacts the Data Provider if needed. In 1.15, we should add the missing functionality into the cache_req module and convert the existing responders (especially those that look up users and groups, not necessarily other objects like autofs maps or hosts..) to cache_req.
==== Benefit to SSSD ==== In 1.15, we should look at allowing lookups from trusted domain with a shortname. But we need to take performance into account and avoid cycling over all domains including their LDAP server. Then we could switch to checking the caches of all domains first before checking each domain's cache and then its server.
This goal is tracked by https://fedorahosted.org/sssd/ticket/843 (Login time increases strongly if more than one domain is configured) and ultimately by https://fedorahosted.org/sssd/ticket/3001 ([RFE] Short name input format with SSSD for users from all domains when domain autodiscovery is used).
This will also help with implementing session recording (tlog integration). I couldn't find the corresponding ticket, sorry.
Ah, here it is: https://fedorahosted.org/sssd/ticket/2893
Nick
On 10/03/2016 12:27 PM, Jakub Hrozek wrote:
Hi,
we might be able to do some refactoring in the next couple of weeks prior to working on the next release.
I wrote up a proposal here: https://fedorahosted.org/sssd/wiki/DesignDocs/OneFifteenCodeRefactoring for your convenience, I copied the contents inline as well.
=== Use the shared `cache_req` module for responder look-ups ===
Already in progress, some responders are done. I'm currently working on NSS. This is a must have and doable in the scope.
=== Refactor group lookups for better performance ===
Ack.
=== Refactor the sdap_id_ops.c module ===
The sdap_id_ops calls should at least be moved into low level ldap code and turned to one-return-code-only api. We often forget to use it or we use it incorrectly.
=== Provide a way to pass intermediate data between requests ===
This can be probably done nice and clean with our new DP interface. Just fire a new request with new data.
=== Upstream the PoC tests that utilize Samba AD for AD provider testing ===
For me, higher priority is to make CI test more useful when debugging.
I added two new sections: SBUS Improvements and Failover refactoring.
On Mon, Oct 3, 2016 at 2:07 PM, Pavel Březina pbrezina@redhat.com wrote:
On 10/03/2016 12:27 PM, Jakub Hrozek wrote:
Hi,
we might be able to do some refactoring in the next couple of weeks prior to working on the next release.
I wrote up a proposal here:
https://fedorahosted.org/sssd/wiki/DesignDocs/OneFifteenCodeRefactoring for your convenience, I copied the contents inline as well.
=== Use the shared `cache_req` module for responder look-ups ===
Already in progress, some responders are done. I'm currently working on NSS. This is a must have and doable in the scope.
=== Refactor group lookups for better performance ===
Ack.
=== Refactor the sdap_id_ops.c module ===
The sdap_id_ops calls should at least be moved into low level ldap code and turned to one-return-code-only api. We often forget to use it or we use it incorrectly.
=== Provide a way to pass intermediate data between requests ===
This can be probably done nice and clean with our new DP interface. Just fire a new request with new data.
=== Upstream the PoC tests that utilize Samba AD for AD provider testing
For me, higher priority is to make CI test more useful when debugging.
I added two new sections: SBUS Improvements and Failover refactoring.
How the SBUS Improvements are linked with the (possible) idea to make SBUS a library and then we can link easier against it? Asking because I had some issues trying to link against SBUS when playing with IFP responder and then that's what you suggested me.
Best Regards, -- Fabiano Fidêncio
On Mon, Oct 03, 2016 at 02:07:25PM +0200, Pavel Březina wrote:
On 10/03/2016 12:27 PM, Jakub Hrozek wrote:
Hi,
we might be able to do some refactoring in the next couple of weeks prior to working on the next release.
I wrote up a proposal here: https://fedorahosted.org/sssd/wiki/DesignDocs/OneFifteenCodeRefactoring for your convenience, I copied the contents inline as well.
=== Use the shared `cache_req` module for responder look-ups ===
Already in progress, some responders are done. I'm currently working on NSS. This is a must have and doable in the scope.
=== Refactor group lookups for better performance ===
Ack.
=== Refactor the sdap_id_ops.c module ===
The sdap_id_ops calls should at least be moved into low level ldap code and turned to one-return-code-only api. We often forget to use it or we use it incorrectly.
Yes, this part I think is doable in the four-weeks time. The other thing we talked about (wrapping the searches in a request that takes care of everything and optionally just returns a single error), probably not..
=== Provide a way to pass intermediate data between requests ===
This can be probably done nice and clean with our new DP interface. Just fire a new request with new data.
So the DP interface would return sysdb_attrs* or a similar data type?
=== Upstream the PoC tests that utilize Samba AD for AD provider testing ===
For me, higher priority is to make CI test more useful when debugging.
Fabiano did a lot of work there, what is missing in your opinion?
I added two new sections: SBUS Improvements
Can you give me (or add to the page :-)) an example of what is wrong with the current sbus code? I don't disagree per se, I guess I don't see the issue, though..
and Failover refactoring.
Yes, I was expecting you add this :) but I also agree this is probably not in the scope..but then the sdap_id_ops refactoring would be even more useful.
On 10/03/2016 03:15 PM, Jakub Hrozek wrote:
On Mon, Oct 03, 2016 at 02:07:25PM +0200, Pavel Březina wrote:
On 10/03/2016 12:27 PM, Jakub Hrozek wrote:
Hi,
we might be able to do some refactoring in the next couple of weeks prior to working on the next release.
I wrote up a proposal here: https://fedorahosted.org/sssd/wiki/DesignDocs/OneFifteenCodeRefactoring for your convenience, I copied the contents inline as well.
=== Use the shared `cache_req` module for responder look-ups ===
Already in progress, some responders are done. I'm currently working on NSS. This is a must have and doable in the scope.
=== Refactor group lookups for better performance ===
Ack.
=== Refactor the sdap_id_ops.c module ===
The sdap_id_ops calls should at least be moved into low level ldap code and turned to one-return-code-only api. We often forget to use it or we use it incorrectly.
Yes, this part I think is doable in the four-weeks time. The other thing we talked about (wrapping the searches in a request that takes care of everything and optionally just returns a single error), probably not..
=== Provide a way to pass intermediate data between requests ===
This can be probably done nice and clean with our new DP interface. Just fire a new request with new data.
So the DP interface would return sysdb_attrs* or a similar data type?
=== Upstream the PoC tests that utilize Samba AD for AD provider testing ===
For me, higher priority is to make CI test more useful when debugging.
Fabiano did a lot of work there, what is missing in your opinion?
I added two new sections: SBUS Improvements
Can you give me (or add to the page :-)) an example of what is wrong with the current sbus code? I don't disagree per se, I guess I don't see the issue, though..
From the top of my mind, there are two main issues:
1) handler return EOK if reply to the request was send or is in asynchronous processing, so there are several places in the code where the handler can be finished. This is the same thing what we dislike in data providers where be_req_terminate was called on many places and it made the code difficult and error prone.
E.g. this crash: e8474ac0be7e81c0ca54eb09e2fef42595602945
I'd like to do something similar as I did in data providers, but maybe not in such extend.
2) Error replies -- there is just too much code around when you want to reply with error.
I've already done something when working on sssctl but it needs to be unified across the code. It is not much work.
E.g. a06e23c0bcf0c8669a29b801876aca8aac422931
3) Making it a separate static library as part of this effort would be great. It is a perfect candidate.
and Failover refactoring.
Yes, I was expecting you add this :) but I also agree this is probably not in the scope..but then the sdap_id_ops refactoring would be even more useful.
sssd-devel@lists.fedorahosted.org