On 08/17/2015 02:42 PM, Lukas Slebodnik wrote:
On (13/08/15 13:31), Michal Židek wrote:
> On 08/13/2015 07:05 AM, Lukas Slebodnik wrote:
>> On (12/08/15 14:35), Michal Židek wrote:
>>>
>>> Seriously? Is that your proposed solution?
>>>
>> Yes, it's better to remove a test rather than
>> have intermittent failures.
>
> You are twisting the plot here. Removing the test
> would only be an option if we had no idea how to
> fix them. Which was definitely not the case.
>
the other alternative is that we cannot agree on a solution.
BTW I agree with Nikolai,
On Thu, Aug 13, 2015 at 01:40:04PM +0300, Nikolai Kondrashov wrote:
> I suggest we disable the tests that fail, even if intermittently, and file
> corresponding bugs (if not filed yet). Otherwise CI will soon become useless.
So it's better to immediately disable problematic tests.
>>> It is better to go with the patches you sent before
>>> (not the one that removes the tests, that would be
>>> very wrong) rather than trying to continue this
>>> discussion.
>>>
>> There's nothing to continue discussing.
>>
>> Q: Why?
>> You proposed adding unnecessary code complication to
>> work around the race condition. It will not fix the race condition,
>> because the same issue can happen next time.
>
> It was not unnecessary and it can barely be called a
> complication. As you said it is very unlikely to happen
> but it definitely can happen.
>
It can happen in very x 107 rare cases, but your proposal did not solve it
completely. So it is just unnecessary complication.
>> Q: How is it possible?
>> Because the very x 107 unlikely race condition is a typical TOCTOU[1].
>> In theory, you can hit the same situation when somebody restarts sssd.
>> My patches added a 1 second delay and you proposed to try one more time.
>> That makes it even less likely to happen, but sssd can be restarted
>> after 1 second as well and you will hit the same race condition next time.
>> So we would add unnecessarily complicated code which does not solve anything.
>> The most important reason to not use your proposal is that
>> patches for ticket #2748 did not introduce the race condition.
>>
>>
>> Q: So does it mean that the race condition is already in master?
>> Yes.
>
> I do know it is in master. What is the point of this?
>
Yes, it is in master and I also provided a hash :-)
So the race condition is caused by two main factors:
a) Detection of a running sssd_nss is not reliable; it is merely sufficient.
If there is a process which owns the file locks, it needn't be sssd_nss.
In theory, it can be another sss_cache utility.
b) "Sending a message" to sssd_nss is not reliable either; it is merely
sufficient. Creating the file clear_mc_flag does not guarantee that
sssd_nss will receive the message.
So here is how the race condition can happen:
1. sss_cache cannot obtain the file locks on the memory caches
(the assumption is that sssd_nss is running)
2. sssd is stopped
3. sss_cache "sends the message" to sssd_nss (creates the file clear_mc_flag)
However, sssd_nss cannot process such a message, because it is not running.
BTW it's the same situation as with TCP and UDP on a local network.
UDP on a local network is sufficient for sending simple messages,
but nobody guarantees that the message is received. If you want
something reliable, you need to use TCP.
My patch didn't touch anything about the file locks and didn't touch the
code that "sends the message" to sssd_nss. So it could not have created
this race condition.
Moreover, such race condition cannot happen in integration tests;
stopping sssd and calling sss_cache will never happen in parallel.
>> It was introduced by commit 33cbb789ff71be5dccbb4a0acd68814b0d53da34
>> and will not solve the ticket #2748
>>
>> commit 33cbb789ff71be5dccbb4a0acd68814b0d53da34
>> Author: Michal Zidek <mzidek(a)redhat.com>
>> AuthorDate: Fri Oct 26 17:36:51 2012 +0200
>> Commit: Jakub Hrozek <jhrozek(a)redhat.com>
>> CommitDate: Tue Nov 6 12:29:28 2012 +0100
>>
>> sss_cache: Remove fastcache even if sssd is not running.
>>
>>
https://fedorahosted.org/sssd/ticket/1584
>>
>>
>> Q: But it's a race condition and we should fix it; or no?
>> As I already wrote, it's a typical TOCTOU. The problematic
>> code is neither in the library nor in the daemon part of sssd.
>> The proper solution without an atomic function would be very complicated.
>> We do not have an atomic function to obtain the locks or create the file.
>> The full solution would require a complicated analysis of all corner
>> cases. The ideal would be to use finite state automata (FSA) or another
>> formal method to prove the race condition is fixed. It's not a solution to
>> retry one more time after a second; that will not fix the race condition and
>> the code will become complicated. The proper solution would be even more
>> complicated, and FSA converted to a programming language are not very
>> readable. Such a solution is really not worth it in a command line utility.
>
> The change I proposed is simple. It basically goes like this:
> 1. You hit the race condition -> try again a few times.
> 2. If you try at least one more time and you are not constantly
> restarting sssd, the problem is solved and the caches will be
> removed with the second attempt.
> 3. Even if you are constantly restarting sssd, hitting the
> "bad condition" several times in a row (I had 3 tries
> in the patch) is much (very much) less likely to happen. In
> this case you can call it hardening.
>
The main problem is that it will not solve it completely.
You just decrease a very x 107 low probability to an even lower probability.
It is not worth it. If you really want to fix it, it's better to solve
it completely or not complicate the code at all.
It's the same analogy as with UDP -> TCP: either UDP is sufficient,
or you need to use something reliable (TCP).
>> The nss responder did not have time to invalidate the cache,
>> and a one second delay should be enough time for cache invalidation in the
>> nss responder. Otherwise there is an issue with the nss responder and it
>> should be fixed. That's the reason for the error message logged by the
>> sss_cache utility.
>>
>>
>>> If someone wants to review the small patch I
>>> sent as separate patch feel free to do so. I
>>> still think it should be fixed.
>>>
>> The patch was NACKed.
>> Part of the explanation was provided offline two days ago
>> and part of it was explained above.
>>
>> Short summary:
>> * missing unit test for race condition.
>
> So if I find an issue in your patches during review,
> I first have to create a unit test that will give
> me the special right to raise my concerns. Great.
>
As I already explained a few times, it was not an issue in my patches.
The race condition is already in master and you want me
to fix unrelated issues as part of an unrelated ticket.
So you blocked pushing the patches.
>> * unnecessary complicated code in command line utility.
>
> I do not think so.
>
>> * proposed patch did not solve the race condition.
>
> It does. Of course if you constantly restart sssd and
> call sss_cache with it, you may "theoretically" hit
> the same condition with each restart. But since the
> issue is unlikely to happen, I think the proposal I
> made is reasonable and sufficient hardening.
>
One more time:
it does not solve the race condition;
it just decreases a very x 107 low probability.
>> * proposed patch did not solve #2748. It's unrelated to #2748
>
> It is related to the review of the solution of #2748. I simply
> wanted you to change the behavior in the part of the code you added as
> the solution, because the current behavior is not
> ideal in this "unlikely to happen" situation.
>
One more time:
the current behaviour was not caused by my patches;
it was already in master.
>> As I already wrote, feel free to propose a better error message,
>> because it's part of the translated strings.
>
> Feel free to do so yourself. I am no longer reviewing these
> patches. If you had added the changes I wanted, the error message
> would be almost irrelevant.
>
The error message is a sufficient solution
for a very x 107 unlikely race condition.
It's still better than the current situation in sssd < 1.13.0,
where the race condition can happen but nobody will know about it.
The code has been there for a long time and nobody complains.
I think it is better to be defensive no matter whether someone
complains or not.
Neither downstream QA nor customers complained. So it really means
it is a very x 107 unlikely race condition which is not worth solving.
I agree that it is not a very significant issue, and I know we
do not have tests that would trigger it so far, because we
do not run sss_cache in parallel with stopping/restarting
SSSD. And even then it would be difficult to hit, because
the timeframe for this race condition is very narrow.
But if you really insist on fixing the race condition,
I can file a ticket to get rid of it.
Decreasing the probability is not sufficient for such a very x 107 unlikely
race condition.
As I mentioned already, there are two scenarios I had in mind:
1. SSSD stops once and sss_cache is called - in this scenario
the additional hardening I proposed would *completely* remove
the problem.
2. SSSD restarts in a loop - in this scenario the additional
hardening would decrease the probability of the sss_cache
tool being unable to remove the caches.
I do not want to complicate the code either but
I think it is worth adding one simple loop to the
code to get this additional hardening.
Michal
LS