On Mon, Jun 10, 2013 at 07:04:51PM +0200, Jakub Hrozek wrote:
On Mon, Jun 10, 2013 at 12:02:10PM +0200, Pavel Březina wrote:
> On 06/10/2013 11:25 AM, Jakub Hrozek wrote:
> >On Mon, Jun 10, 2013 at 10:03:09AM +0200, Pavel Březina wrote:
> >>On 06/09/2013 07:53 PM, Dmitri Pal wrote:
> >>>On 06/09/2013 08:05 AM, Jakub Hrozek wrote:
> >>>>On Mon, Jun 03, 2013 at 10:15:49PM +0200, Pavel Březina wrote:
> >>>>>On 06/03/2013 12:17 AM, Jakub Hrozek wrote:
> >>>>>>On Thu, 2013-05-23 at 11:46 +0200, Pavel Březina wrote:
> >>>>>>>On 05/21/2013 12:06 PM, Jakub Hrozek wrote:
> >>>>>>>>In general it's not clear to me from the code who is responsible
> >>>>>>>>for freeing the task -- is it the caller, but only on success?
> >>>>>>>>I think this should always behave the same.
> >>>>>>>After discussion with Jakub we decided that the caller is
> >>>>>>>responsible for freeing the task.
> >>>>>>>
> >>>>>>>When an internal error occurs (unable to create tevent_timer),
> >>>>>>>the task is now disabled.
> >>>>>>>
> >>>>>>>ERR_STOP_PERIODIC_TASK was removed. The request can disable the
> >>>>>>>task instead.
> >>>>>>>
> >>>>>>>>be_ptask_schedule(): The DEBUG message uses %d but passes in a
> >>>>>>>>string.
> >>>>>>>Fixed.
> >>>>>>>
> >>>>>>>>[PATCH 2/4] back end: periodical refresh of expired records API
> >>>>>>>>The "enum be_refresh" name is too generic. Maybe "enum
> >>>>>>>>be_refresh_type"?
> >>>>>>>Renamed.
> >>>>>>>
> >>>>>>>>In be_refresh_get_names(), can you use sysdb_attrs_to_list() to
> >>>>>>>>gather the names?
> >>>>>>>Done.
> >>>>>>>
> >>>>>>>>Is it wise to assume that all objects have names and name the
> >>>>>>>>getter get_names()? Some objects might not have names at all,
> >>>>>>>>but for instance UUIDs. Also in be_refresh_step() you call
> >>>>>>>>get_names() but get back a "dn".
> >>>>>>>Fixed.
> >>>>>>>
> >>>>>>>>Please file tickets to add unit tests for these two modules and
> >>>>>>>>make the tickets block upstream #1923.
> >>>>>>>https://fedorahosted.org/sssd/ticket/1939 be_ptask
> >>>>>>>https://fedorahosted.org/sssd/ticket/1940 be_refresh
> >>>>>>>
> >>>>>>>>Nitpick:
> >>>>>>>>>+        filter = talloc_asprintf(tmp_ctx, "(&(%s<=%lld))",
> >>>>>>>>>+                SYSDB_CACHE_EXPIRE, (long long)now);
> >>>>>>>>            ^^^
> >>>>>>>>missing space
> >>>>>>>Fixed.
> >>>>>>>
> >>>>>>>>[PATCH 3/4] back end: add refresh expired records periodic task
> >>>>>>>>>--- a/src/config/SSSDConfig/__init__.py.in
> >>>>>>>>>+++ b/src/config/SSSDConfig/__init__.py.in
> >>>>>>>>>@@ -125,6 +125,7 @@ option_strings = {
> >>>>>>>>>     'entry_cache_service_timeout' : _('Entry cache timeout length (seconds)'),
> >>>>>>>>>     'entry_cache_autofs_timeout' : _('Entry cache timeout length (seconds)'),
> >>>>>>>>>     'entry_cache_sudo_timeout' : _('Entry cache timeout length (seconds)'),
> >>>>>>>>>+    'refresh_expired_interval' : _('How often should expired rules be refreshed in background'),
> >>>>>>>>We should use "objects" or "entries" instead of rules.
> >>>>>>>Fixed.
> >>>>>>>
> >>>>>>>>Can you explain the "circular dependency" here? I think you
> >>>>>>>>already told me in person, but I forgot..
> >>>>>>>dp_backend.h defines struct be_ctx.
> >>>>>>>
> >>>>>>>dp_ptask.h uses struct be_ctx.
> >>>>>>>dp_refresh.h uses struct be_ctx.
> >>>>>>>
> >>>>>>>
> >>>>>>> dp_backend.h <---- dp_ptask.h
> >>>>>>> | ^ ^
> >>>>>>> | | |
> >>>>>>> v | |
> >>>>>>> dp_refresh.h -----------
> >>>>>>>
> >>>>>>>data_provider_be.c includes all three of them (it needs to create
> >>>>>>>a ptask for be_refresh and remember it in be_ctx).
> >>>>>>>
> >>>>>>>>>--- a/src/providers/dp_ptask.h
> >>>>>>>>>+++ b/src/providers/dp_ptask.h
> >>>>>>>>>@@ -27,6 +27,9 @@
> >>>>>>>>>
> >>>>>>>>> #include "providers/dp_backend.h"
> >>>>>>>>>
> >>>>>>>>>+/* solve circular dependency */
> >>>>>>>>>+struct be_ctx;
> >>>>>>>>>+
> >>>>>>>>> struct be_ptask;
> >>>>>>>>>
> >>>>>>>>> /**
> >>>>>>>>>diff --git a/src/providers/dp_refresh.h b/src/providers/dp_refresh.h
> >>>>>>>>>index 54c6aac99866415d4f33f61ac05d487f71b8c49c..d93a932637804dfbf743c395af3f8e640f79c12f 100644
> >>>>>>>>>--- a/src/providers/dp_refresh.h
> >>>>>>>>>+++ b/src/providers/dp_refresh.h
> >>>>>>>>>@@ -27,6 +27,9 @@
> >>>>>>>>> #include "providers/dp_backend.h"
> >>>>>>>>> #include "providers/dp_ptask.h"
> >>>>>>>>>
> >>>>>>>>>+/* solve circular dependency */
> >>>>>>>>>+struct be_ctx;
> >>>>>>>>>+
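[Editor's note: the forward-declaration pattern in the diff above can be illustrated with a self-contained sketch. Only the type names (be_ctx, be_ptask) come from the thread; the member layout and the init helper are hypothetical.]

```c
/* Sketch of the forward-declaration pattern from the diff above.
 * As long as a header only handles "struct be_ctx *" pointers, an
 * incomplete type is enough and the header does not need to include
 * dp_backend.h, which is what breaks the include cycle. */

struct be_ctx;                     /* forward declaration: incomplete type */

struct be_ptask {
    struct be_ctx *be_ctx;         /* pointers to incomplete types are OK */
    int period;
};

/* The full definition lives in exactly one header (dp_backend.h). */
struct be_ctx {
    struct be_ptask *refresh_task; /* the back reference that caused the cycle */
};

/* Hypothetical helper, just to show both types being used together. */
void be_ptask_sketch_init(struct be_ptask *task, struct be_ctx *ctx, int period)
{
    task->be_ctx = ctx;
    task->period = period;
    ctx->refresh_task = task;
}
```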
> >>>>>>>>>>[PATCH 4/4] providers: refresh expired netgroups
> >>>>>>>>>>I don't see one important part implemented -- what if all
> >>>>>>>>>>netgroups expire at once, then they are all refreshed at once,
> >>>>>>>>>>right? Can we add some throttling and refresh them in batches
> >>>>>>>>>>with some delay to avoid starving the back end?
> >>>>>>>>>I thought we agreed in person that we will first push a basic
> >>>>>>>>>stupid solution and then fine-tune it if necessary - hopefully
> >>>>>>>>>based on a real environment.
> >>>>>>>>OK, I thought there was going to be some proof of concept that
> >>>>>>>>we would just amend based on testing. Please file a ticket about
> >>>>>>>>this, we'll get the throttling implemented post-beta.
> >>>>>>>
> >>>>>>>https://fedorahosted.org/sssd/ticket/1942 convert enumeration
> >>>>>>>https://fedorahosted.org/sssd/ticket/1943 convert sudo
> >>>>>>>https://fedorahosted.org/sssd/ticket/1944 convert dyndns
> >>>>>>>
> >>>>>>>Also fix make check problems.
> >>>>>>Thank you, I like the new code. I re-read the patches again and
> >>>>>>actually tested them quite a lot this time.
> >>>>>>
> >>>>>>Here are some more comments:
> >>>>>>
> >>>>>>[PATCH 1/4] back end: periodic task API
> >>>>>>Here I only have minor comments. I think it would be defensive to
> >>>>>>add a NULL check for "task" to be_ptask_online_cb() and maybe even
> >>>>>>be_ptask_destructor(). The reason is that the task is now
> >>>>>>(correctly) owned by the caller, but especially the online callback
> >>>>>>might be called based on an external event.
> >>>>>Done.
> >>>>>
> >>>>>>In be_ptask_create() I think assigning to "_task" should be moved
> >>>>>>to when "EOK" is assigned to "ret", to only modify the external
> >>>>>>variable when the function succeeds.
> >>>>>Of course.
> >>>>>Done.
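[Editor's note: the "assign the out parameter only on success" convention discussed above can be sketched as follows. Every name here is made up; this is not the actual be_ptask_create() code.]

```c
#include <stdlib.h>
#include <errno.h>

#define EOK 0

/* Sketch of the convention: the caller's variable is written exactly
 * once, on the success path, so any failure leaves it untouched. */
struct task {
    int period;
};

int task_create(int period, struct task **_task)
{
    struct task *task;

    if (period <= 0) {
        return EINVAL;            /* _task is deliberately not modified */
    }

    task = malloc(sizeof(*task));
    if (task == NULL) {
        return ENOMEM;
    }
    task->period = period;

    *_task = task;                /* assigned only together with EOK */
    return EOK;
}
```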
> >>>>>
> >>>>>>[PATCH 2/4] back end: periodical refresh of expired records API
> >>>>>>In be_refresh_add_cb() I agree with checking the parameters as it's
> >>>>>>quite a generic API, but you could also check "enum be_refresh_type
> >>>>>>type" for being less than BE_REFRESH_TYPE_SENTINEL and return
> >>>>>>EINVAL (or a new error code) otherwise.
> >>>>>Done.
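[Editor's note: the sentinel check suggested above can look roughly like this. BE_REFRESH_TYPE_SENTINEL mirrors the thread; the other enum value and the helper are invented for the sketch.]

```c
#include <errno.h>

#define EOK 0

/* Keeping the sentinel as the last enumerator lets the range check
 * survive newly added types without modification. */
enum be_refresh_type {
    BE_REFRESH_TYPE_NETGROUPS,
    BE_REFRESH_TYPE_SENTINEL      /* must stay last */
};

int be_refresh_check_type(enum be_refresh_type type)
{
    if ((int)type < 0 || type >= BE_REFRESH_TYPE_SENTINEL) {
        return EINVAL;
    }
    return EOK;
}
```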
> >>>>>
> >>>>>>I wonder if it would be cleaner to move the "talloc_steal(subreq,
> >>>>>>values);" after creating the subreq and before setting the callback
> >>>>>>to make sure the values are always owned by subreq? It probably
> >>>>>>doesn't matter that much from a functional point of view, as values
> >>>>>>is otherwise owned by state, but it would make the code more
> >>>>>>consistent to me.
> >>>>>Done.
> >>>>>
> >>>>>>I think that EAGAIN is mishandled in be_refresh_done(). The done
> >>>>>>label errors out in any case, but EAGAIN would be returned in case
> >>>>>>there were more entities to return:
> >>>>>Thanks. Done.
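[Editor's note: the EAGAIN convention at issue can be sketched outside of tevent like this. The real be_refresh code runs inside tevent_req callbacks; the names below are invented plain-C stand-ins.]

```c
#include <errno.h>

#define EOK 0

/* A step function returns EAGAIN while more entities remain and EOK
 * once everything is processed, so the caller must treat EAGAIN as
 * "continue", never as an error. */
static int refresh_step(int *next, int total, int batch)
{
    int end = *next + batch;

    if (end > total) {
        end = total;
    }
    /* ... refresh entities in [*next, end) here ... */
    *next = end;

    return (*next < total) ? EAGAIN : EOK;
}

int refresh_all(int total, int batch)
{
    int next = 0;
    int ret;

    do {
        ret = refresh_step(&next, total, batch);
        if (ret != EOK && ret != EAGAIN) {
            return ret;           /* only real errors take the error path */
        }
    } while (ret == EAGAIN);      /* EAGAIN means another step is pending */

    return EOK;
}
```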
> >>>>>
> >>>>>>[PATCH 3/4] back end: add refresh expired records periodic task
> >>>>>>The patch looks good to me, but I'm wondering about the defaults
> >>>>>>of be_ptask_create. I would suggest making them shorter, maybe
> >>>>>>even half of the current values.
> >>>>>OK. First delay is now 30, enable delay is 5.
> >>>>>
> >>>>>>[PATCH 4/4] providers: refresh expired netgroups
> >>>>>>EAGAIN is mishandled in this patch, too. I did the following change
> >>>>>>to make the refresh work correctly:
> >>>>>Fixed.
> >>>>>
> >>>>>>I also need to take another look at the responder caching, because
> >>>>>>in my tests with about 1000 netgroups, I was still getting the data
> >>>>>>provider contacted even if I set a relatively long netgroup timeout
> >>>>>>and a short background refresh. But with the above changes, the
> >>>>>>background refresh seems to have been working fine.
> >>>>>Did the background refresh finish before you tried to log in (or
> >>>>>whatever test you performed)?
> >>>>>
> >>>>I created a very complex netgroups structure and ran getent netgroup
> >>>><top-netgroup>. This is exactly what the customer was doing.
> >>>>
> >>>>Sometimes the entries were returned from cache, sometimes I hit the
> >>>>midpoint refresh, so after a couple of tries, the expiry timestamps
> >>>>were different for different netgroups. Then after a couple of tries,
> >>>>I hit a window where the entry was already expired but the task
> >>>>didn't pick it up yet.
> >>>>
> >>>>So here is a proposal -- instead of refreshing expired entries of a
> >>>>certain objectclass, what about refreshing all entries of that
> >>>>objectclass periodically, regardless of their expiration time? The
> >>>>timeouts could be configured using the expiry timeouts anyway.
> >>>>
> >>>>Alternatively we could make the task configurable (maybe with
> >>>>another option or with a sane default) so that it picks entries N
> >>>>percent into their expiration timeout, so that we can be sure no one
> >>>>ever hits an expired entry. But this sounds like v 2.0 of this
> >>>>feature to me.
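[Editor's note: the "N percent into the expiration timeout" idea above amounts to a simple cutoff computation. All names here are hypothetical.]

```c
#include <time.h>

/* An entry that expires at `expire` with lifetime `timeout` was cached
 * at (expire - timeout). It becomes a refresh candidate once `percent`
 * percent of its lifetime has passed. */
time_t refresh_candidate_after(time_t expire, time_t timeout, int percent)
{
    time_t cached_at = expire - timeout;

    return cached_at + (timeout * percent) / 100;
}
```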
> >>>>
> >>>>What do you think?
> >>>
> >>>Maybe have a flag netgroups_refresh_mode that would force SSSD to
> >>>always return whatever is available and force the refresh in the
> >>>background. So here is how it would work:
> >>>netgroups_refresh_mode = fetch | nowait | background
> >>>
> >>>fetch - if you come across an expired entry, fetch an update before
> >>>returning data (current behavior)
> >>>nowait - return whatever is there and rely on the refresh job to
> >>>refresh the data - this can return stale data
> >>>background - if you come across the expired data, return it
> >>>immediately but expedite the refresh - this can return stale data too.
> >>
> >>Hi,
> >>I don't think creating more options is necessary. How about refreshing
> >>those records that are expired or are about to expire before the next
> >>refresh is triggered?
> >>
> >>This is only a simple filter change and should guarantee that netgroups
> >>will never expire (if the period is lower than the expiration time),
> >>but it also keeps the traffic distributed over time.
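[Editor's note: the filter change Pavel describes can be sketched like this. Instead of matching only entries that are already expired (attr <= now), the filter matches entries that will expire before the next task run (attr <= now + period). The attribute string is assumed to match SSSD's SYSDB_CACHE_EXPIRE; the helper itself is illustrative, not the actual patch.]

```c
#include <stdio.h>
#include <time.h>

/* Assumption: SYSDB_CACHE_EXPIRE expands to this attribute name in SSSD. */
#define SYSDB_CACHE_EXPIRE "dataExpireTimestamp"

/* Build a filter matching entries expired now or expiring within the
 * next refresh period. */
int build_refresh_filter(char *buf, size_t len, time_t now, time_t period)
{
    return snprintf(buf, len, "(&(%s<=%lld))",
                    SYSDB_CACHE_EXPIRE, (long long)(now + period));
}
```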
> >
> >I tested this locally as a modification of Pavel's patches and I confirm
> >this modification had the desired effect in my testing. After I switched
> >off midpoint refresh completely to rely on the background refresh task
> >only, I saw no more delay in requesting the netgroups.
> >
> >Can you send the modified version, Pavel? I think it will be good to go.
>
> Here is rebased + modified version.
>
> To make the review easier I'm also attaching the diff patch. The changes
> were squashed into the first and the fourth patch.
Thank you, these patches work for me.
ACK.
Pushed to master.
Can you also prepare a 1.9 version of these patches?