On 08/15/2016 09:59 AM, Jakub Hrozek wrote:
> On Mon, Aug 15, 2016 at 09:47:27AM +0200, Petr Cech wrote:
>> On 08/12/2016 04:13 PM, Jakub Hrozek wrote:
>>> On Fri, Aug 12, 2016 at 03:41:26PM +0200, Petr Cech wrote:
>>>> On 08/12/2016 03:07 PM, Jakub Hrozek wrote:
>>>>>> Logs now look like:
>>>>>>>
>>>>>>> [root@albireo sssd]# grep 'child'
sssd_ipa.cygnus.dev.log
>>>>>>> [child_handler_setup] (0x2000): Setting up signal handler up
for
>>>>>>> pid [18835]
>>>>>>> [child_handler_setup] (0x2000): Signal handler set up for
pid
>>>>>>> [18835]
>>>>>>> [set_tgt_child_timeout] (0x0400): Setting 6 seconds timeout
for
>>>>>>> tgt child
>>>>>>>
>>>>>>> (6 seconds later)
>>>>>>>
>>>>>>> [get_tgt_timeout_handler] (0x4000): timeout for sending
SIGTERM
>>>>>>> to tgt child
>>>>>>> [18835] reached.
>>>>>>> [get_tgt_timeout_handler] (0x0400): Setting 2 seconds
timeout
>>>>>>> for sending
>>>>>>> SIGKILL to tgt child
>>>>>>> [sdap_get_tgt_recv] (0x0020): Cannot parse child response:
>>>>>>> [22][Invalid
>>>>>>> argument]
>>>>>>> [sdap_kinit_done] (0x0020): child failed (22 [Invalid
argument])
>>>>>>> [child_sig_handler] (0x1000): Waiting for child [18835].
>>>>>>> [child_sig_handler] (0x0020): child [18835] failed with
status [7].
>>>>>>> [child_callback] (0x0020): LDAP child was terminated due to
timeout
>>>>> I'm sorry, but these patches still don't fix the issue I was
seeing.
>>>>> Before the patches, when I timed out the child process, I saw:
>>>>> (Fri Aug 12 14:54:24 2016) [[sssd[ldap_child[31721]]]]
>>>>> [ldap_child_get_tgt_sync] (0x0100): Principal name is:
>>>>> [host/client.ipa.test(a)IPA.TEST]
>>>>> (Fri Aug 12 14:54:24 2016) [[sssd[ldap_child[31721]]]]
>>>>> [ldap_child_get_tgt_sync] (0x0100): Using keytab
>>>>> [MEMORY:/etc/krb5.keytab]
>>>>> (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
>>>>> [get_tgt_timeout_handler] (0x4000): timeout for tgt child [31721]
>>>>> reached.
>>>>> (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [sdap_kinit_done]
>>>>> (0x0080): Communication with KDC timed out, trying the next one
>>>>> (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
>>>>> [_be_fo_set_port_status] (0x8000): Setting status:
>>>>> PORT_NOT_WORKING. Called from:
>>>>> /sssd/src/providers/ldap/sdap_async_connection.c: sdap_kinit_done:
>>>>> 1207
>>>>> (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
>>>>> [fo_set_port_status] (0x0100): Marking port 0 of server
>>>>> 'unidirect.ipa.test' as 'not working'
>>>>> (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
>>>>> [fo_set_port_status] (0x0400): Marking port 0 of duplicate server
>>>>> 'unidirect.ipa.test' as 'not working'
>>>>> (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
>>>>> [sdap_kinit_next_kdc] (0x1000): Resolving next KDC for service
>>>>> KERBEROS
>>>>> (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
>>>>> [fo_resolve_service_send] (0x0100): Trying to resolve service
>>>>> 'KERBEROS'
>>>>> (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [get_server_status]
>>>>> (0x1000): Status of server 'unidirect.ipa.test' is 'name
resolved'
>>>>> (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]] [get_port_status]
>>>>> (0x1000): Port status of port 0 for server
'unidirect.ipa.test' is
>>>>> 'not working'
>>>>> (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
>>>>> [fo_resolve_service_send] (0x0020): No available servers for
>>>>> service 'KERBEROS'
>>>>> (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
>>>>> [be_resolve_server_done] (0x1000): Server resolution failed: [5]:
>>>>> Input/output error
>>>>> (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
>>>>> [sdap_cli_kinit_done] (0x0400): Cannot get a TGT: ret
>>>>> [1432158228](Network I/O Error)
>>>>> (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
>>>>> [sdap_cli_connect_recv] (0x0040): Unable to establish connection
>>>>> [13]: Permission denied
>>>>> (Fri Aug 12 14:54:30 2016) [sssd[be[ipaldap]]]
>>>>> [_be_fo_set_port_status] (0x8000): Setting status:
>>>>> PORT_NOT_WORKING. Called from:
>>>>> /sssd/src/providers/ldap/sdap_async_connection.c:
>>>>> sdap_cli_connect_recv: 2048
>>>>>
>>>>> After the patch, I see:
>>>>> (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]]
>>>>> [get_tgt_timeout_handler] (0x4000): timeout for sending SIGTERM to
>>>>> tgt child [17291] reached.
>>>>> (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]]
>>>>> [get_tgt_timeout_handler] (0x0400): Setting 2 seconds timeout for
>>>>> sending SIGKILL to tgt child
>>>>> (Fri Aug 12 15:01:05 2016) [[sssd[ldap_child[17291]]]]
>>>>> [sig_term_handler] (0x0010): Received signal [Terminated] [15],
>>>>> shutting down
>>>>> (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [read_pipe_handler]
>>>>> (0x0400): EOF received, client finished
>>>>> (Fri Aug 12 15:01:05 2016) [sssd[be[ipaldap]]] [sdap_get_tgt_recv]
>>>>> (0x0020): Cannot parse child response: [22][Invalid argument]
>>>>> here ---^
>>>>>
>>>>> This is the part I don't like, we try to read the response from
>>>>> the child's
>>>>> pipe and fail with a bad error message. I thought this was because
>>>>> with
>>>>> the preivous patch, we exit the child with zero, but I guess this
>>>>> was not
>>>>> the case.. Anyway, we still should fix this, the message would be
>>>>> really
>>>>> confusing to admins.
>>>>
>>>> Hi Jakub,
>>>>
>>>> I know I had the same
>>>> 'Cannot parse child response: [22][Invalid argument]'
>>>> in my logs too, see my last mail above.
>>>>
>>>> Some lines after it I have new message
>>>> '[child_callback] (0x0020): LDAP child was terminated due to
timeout'
>>>>
>>>> I understand that it is not satisfying. The main purpose of this
>>>> patch was
>>>> deleting temporary files if timeout reached. And this is IMO solved
>>>> I think.
>>>> It is true that we have big effort to have understandable logs.
>>>>
>>>> Yesterday I looked deeper at mechanism how signals, child handler
>>>> and all
>>>> such stuff works in ldap_child case. I wasn't able to tell to
>>>> sdap_get_tgt_recv() that the state is changed and we shouldn't
>>>> parse the
>>>> answer.
>>>>
>>>> If I understand handler stuff right I need introduce in ldap_child
>>>> handler
>>>> whole mechanism like in src/providers/be_dyndns.c
>>>>
>>>> The mechanism is calling the child_handler_setup() with all
>>>> parameters which
>>>> means rewrite sdap_child_helpers in appropriate manner.
>>>
>>> I think it would be enough to pass the tevent_req to the
>>> child_callback() and cancel it with tevent_req_error(req, ETIMEDOUT)
>>>
>>> The problem is that with the current patches, the whole request is
>>> marked as 'succeeded', that's why the _recv() function tries to
parse
>>> the response. If you look at sdap_get_tgt_recv(), if the request is
>>> marked as failure, the processing aborts in
>>> TEVENT_REQ_RETURN_ON_ERROR..
>>
>> Hi Jakub,
>>
>> I tried resolve it by your advice. I have still the same problem --
>> request
>> is unreachable in child_callback() (Bad magic value).
>>
>> I found one little hack. If we sent SIGTERM signal to child, child
>> can not
>> send its answer so we can say child is already time-outed. And after
>> this we
>> are able to write debug message in child_callback().
>
> Why does it need to send any answer? The parent would see the special
> exit code and use that to know the child timed out. The point is the
> child should not send any answer in the failure cases, only report the
> reason why it finished with an exit code.
>
>>
>> Please see last 5 commits here:
>>
https://github.com/celestian/sssd/commits/t3106_v1
>
> Interesting, is the correct pointer address being passed as pvt to:
> child_ctx->cb(child_ctx->child_status, child_ctx->sige,
> child_ctx->pvt);
> ?
>
> You can also try running sssd_be with valgrind for more clues to see if
> the request is maybe terminated before.
>
>>
>> 627200c WIP: LDAP: Handling of SIGCHLD <-- There are both solution
>> commented
>> in code
>> 888df5b DEBUG: Sleep <-- Adds sleep loop for timeouting the child
>>
>> 9235712 LDAP: Adding SIGCHLD callback
>> 625191d LDAP: Adding SIGTERM signal before SIGKILL
>> 5331972 LDAP: Adding support for SIGTERM signal
Hi all,
I discussed this issue with Jakub and we found a solution (patch #3).
Please see the attached patch set.
And because Jakub is interested in it I would like to ask somebody else
for code review. Thanks.
Regards
Hi, child_callback() should be static. Besides those patches LGTM.