https://bugzilla.redhat.com/show_bug.cgi?id=1104843
Bug ID: 1104843 Summary: rabbitmqctl doesn't work Product: Fedora Version: 20 Component: rabbitmq-server Severity: high Priority: urgent Assignee: hubert.plociniczak@gmail.com Reporter: jeckersb@redhat.com QA Contact: extras-qa@fedoraproject.org CC: apevec@redhat.com, erlang@lists.fedoraproject.org, fdinitto@redhat.com, hubert.plociniczak@gmail.com, jeckersb@redhat.com, lemenkov@gmail.com, lhh@redhat.com, rjones@redhat.com, s@shk.io Depends On: 1104193 Blocks: 1083890
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1104193 [Bug 1104193] rabbitmqctl doesn't work
https://bugzilla.redhat.com/show_bug.cgi?id=1104843
John Eckersberg jeckersb@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Comment #0 is|1 |0 private| |
https://bugzilla.redhat.com/show_bug.cgi?id=1104843
Peter Lemenkov lemenkov@gmail.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Assignee|hubert.plociniczak@gmail.co |lemenkov@gmail.com |m |
https://bugzilla.redhat.com/show_bug.cgi?id=1104843
--- Comment #1 from John Eckersberg jeckersb@redhat.com --- Some more details on this.
I used systemtap to prove that systemd is explicitly killing the epmd process:
Wed Jun 4 20:18:09 2014 : sh (16202) is exec'ing "/usr/lib64/erlang/erts-5.10.4/bin/epmd" Wed Jun 4 20:18:09 2014 : epmd (16202) created 16204 Wed Jun 4 20:18:09 2014 : epmd (16202) is exiting Wed Jun 4 20:18:09 2014 : epmd (16204) created 16205 Wed Jun 4 20:18:09 2014 : epmd (16204) is exiting Wed Jun 4 20:18:09 2014 : sh (16203) is exec'ing "/usr/lib64/erlang/erts-5.10.4/bin/epmd" Wed Jun 4 20:18:09 2014 : epmd (16203) created 16206 Wed Jun 4 20:18:09 2014 : epmd (16203) is exiting Wed Jun 4 20:18:09 2014 : epmd (16206) created 16209 Wed Jun 4 20:18:09 2014 : epmd (16206) is exiting Wed Jun 4 20:18:09 2014 : epmd (16209) is exiting Wed Jun 4 20:18:09 2014 : sh (16247) is exec'ing "/usr/lib64/erlang/erts-5.10.4/bin/epmd" Wed Jun 4 20:18:09 2014 : epmd (16247) created 16248 Wed Jun 4 20:18:09 2014 : epmd (16247) is exiting Wed Jun 4 20:18:09 2014 : epmd (16248) created 16250 Wed Jun 4 20:18:09 2014 : epmd (16248) is exiting Wed Jun 4 20:18:09 2014 : epmd (16250) is exiting SIGKILL was sent to epmd (pid:16205) by systemd (pid:1) uid:0 Wed Jun 4 20:18:12 2014 : epmd (16205) is exiting
I think this is resulting from a combination of:
- rabbitmq-server is of Type=simple in the systemd unit file - It double forks off epmd, so it's no longer parented to the main process - The unit file has an ExecStartPost line
Systemd does not allow long-running processes to be started from ExecStartPre/ExecStartPost so it will purposefully try to kill off anything that is hanging around. It's kinda hard to see in the above log, but the epmd daemon gets started by rabbitmq-server[1], then the ExecStartPost runs to wait on the pidfile, and then finally systemd sends SIGKILL to the epmd process. I believe systemd thinks the epmd process is from the ExecStartPost. To further this theory, if I comment out the ExecStartPost line, systemd does *not* send SIGKILL and everything works as expected.
Fortunately, I think the correct fix is to make sure bug 1103524 gets fixed as soon as possible. If the service type is changed to notify, then both the ExecStartPre and ExecStartPost lines can be removed, thus avoiding the undesirable kill behavior.
[1] I've explicitly commented out the ExecStartPre cookie race hack in my testing
https://bugzilla.redhat.com/show_bug.cgi?id=1104843
Fabio Massimo Di Nitto fdinitto@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Blocks|1083890 |
https://bugzilla.redhat.com/show_bug.cgi?id=1104843
--- Comment #2 from Peter Lemenkov lemenkov@gmail.com --- Hello All. A small update on this - both F-20/Rawhide and EPEL7 Erlang builds are now containing epmd.socket file for socket-activated epmd. It's still not enabled by default though (just because I didn't dig into details regarding systemd presets).
I believe we should add the following lines to rabbitmq-server.service
After=epmd.socket Requires=epmd.socket
Also I believe we can drop ExecStartPre now.
https://bugzilla.redhat.com/show_bug.cgi?id=1104843
--- Comment #3 from John Eckersberg jeckersb@redhat.com --- (In reply to Peter Lemenkov from comment #2)
Hello All. A small update on this - both F-20/Rawhide and EPEL7 Erlang builds are now containing epmd.socket file for socket-activated epmd. It's still not enabled by default though (just because I didn't dig into details regarding systemd presets).
I believe we should add the following lines to rabbitmq-server.service
After=epmd.socket Requires=epmd.socket
Also I believe we can drop ExecStartPre now.
This does work, after making some tweaks.
As is, rabbitmq-server fails to start because epmd is set to bind only to localhost, and rabbit is trying to connect on the "public" interface. Localhost is explicitly set in epmd.socket:
ListenStream=127.0.0.1:4369
And fails with the error:
Jun 13 21:03:08 jeckersb-f20 systemd[1]: Starting RabbitMQ broker... Jun 13 21:03:09 jeckersb-f20 rabbitmqctl[1186]: Waiting for 'rabbit@jeckersb-f20' ... Jun 13 21:03:09 jeckersb-f20 rabbitmqctl[1186]: pid is 1185 ... Jun 13 21:03:09 jeckersb-f20 rabbitmq-server[1185]: ERROR: epmd error for host "jeckersb-f20": address (cannot connect to host/port) Jun 13 21:03:09 jeckersb-f20 systemd[1]: rabbitmq-server.service: main process exited, code=exited, status=1/FAILURE Jun 13 21:03:09 jeckersb-f20 rabbitmqctl[1186]: Error: process_not_running Jun 13 21:03:09 jeckersb-f20 systemd[1]: rabbitmq-server.service: control process exited, code=exited status=2 Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: Stopping and halting node 'rabbit@jeckersb-f20' ... Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: Error: unable to connect to node 'rabbit@jeckersb-f20': nodedown Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: DIAGNOSTICS Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: =========== Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: nodes in question: ['rabbit@jeckersb-f20'] Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: hosts, their running nodes and ports: Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: - unable to connect to epmd on jeckersb-f20: address (cannot connect to host/port) Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: current node details: Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: - node name: 'rabbitmqctl1253@jeckersb-f20' Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: - home dir: /var/lib/rabbitmq Jun 13 21:03:10 jeckersb-f20 rabbitmqctl[1253]: - cookie hash: pjjhwhUNJ+O/cAmSUbP89w== Jun 13 21:03:10 jeckersb-f20 systemd[1]: rabbitmq-server.service: control process exited, code=exited status=2 Jun 13 21:03:10 jeckersb-f20 systemd[1]: Failed to start RabbitMQ broker. Jun 13 21:03:10 jeckersb-f20 systemd[1]: Unit rabbitmq-server.service entered failed state.
I tried changing this to be:
ListenStream=4369
Since that sounds like a sane dual-stack option, from systemd.socket(5):
"If the address string is a single number, it is read as port number to listen on via IPv6. Depending on the value of BindIPv6Only= (see below) this might result in the service being available via both IPv6 and IPv4 (default) or just via IPv6."
However this causes rabbitmq-server to crash like so:
Jun 13 21:10:20 jeckersb-f20 systemd[1]: Starting RabbitMQ broker... Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},"Protocol: ~tp: register/listen error: ~tp~n",["inet_tcp",epmd_close]} Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.21.0>},{registered_name,[]},{error_info,{exit,{error,b Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,un Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,{shutdown,{failed_to_start_child,net_ker Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},crash_report,[[{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0.9.0> Jun 13 21:10:21 jeckersb-f20 rabbitmqctl[1587]: {error_logger,{{2014,6,13},{21,10,21}},std_info,[{application,kernel},{exited,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{ Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},"Protocol: ~tp: register/listen error: ~tp~n",["inet_tcp",epmd_close]} Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.21.0>},{registered_name,[]},{error_info,{exit,{err Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pi Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,{shutdown,{failed_to_start_child,net Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},crash_report,[[{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0. Jun 13 21:10:21 jeckersb-f20 rabbitmq-server[1586]: {error_logger,{{2014,6,13},{21,10,21}},std_info,[{application,kernel},{exited,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kern Jun 13 21:10:22 jeckersb-f20 rabbitmqctl[1587]: {"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_ker Jun 13 21:10:22 jeckersb-f20 rabbitmq-server[1586]: {"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1587]: Crash dump was written to: erl_crash.dump Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1587]: Kernel pid terminated (application_controller) ({application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kern Jun 13 21:10:23 jeckersb-f20 systemd[1]: rabbitmq-server.service: control process exited, code=exited status=1 Jun 13 21:10:23 jeckersb-f20 rabbitmq-server[1586]: Crash dump was written to: erl_crash.dump Jun 13 21:10:23 jeckersb-f20 rabbitmq-server[1586]: Kernel pid terminated (application_controller) ({application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_ Jun 13 21:10:23 jeckersb-f20 systemd[1]: rabbitmq-server.service: main process exited, code=exited, status=1/FAILURE Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},"Protocol: ~tp: register/listen error: ~tp~n",["inet_tcp",epmd_close]} Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},crash_report,[[{initial_call,{net_kernel,init,['Argument__1']}},{pid,<0.21.0>},{registered_name,[]},{error_info,{exit,{error,b Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},supervisor_report,[{supervisor,{local,net_sup}},{errorContext,start_error},{reason,{'EXIT',nodistribution}},{offender,[{pid,un Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},supervisor_report,[{supervisor,{local,kernel_sup}},{errorContext,start_error},{reason,{shutdown,{failed_to_start_child,net_ker Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},crash_report,[[{initial_call,{application_master,init,['Argument__1','Argument__2','Argument__3','Argument__4']}},{pid,<0.9.0> Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {error_logger,{{2014,6,13},{21,10,23}},std_info,[{application,kernel},{exited,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kernel,{ Jun 13 21:10:23 jeckersb-f20 rabbitmqctl[1658]: {"Kernel pid terminated",application_controller,"{application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_ker Jun 13 21:10:24 jeckersb-f20 rabbitmqctl[1658]: Crash dump was written to: erl_crash.dump Jun 13 21:10:24 jeckersb-f20 rabbitmqctl[1658]: Kernel pid terminated (application_controller) ({application_start_failure,kernel,{{shutdown,{failed_to_start_child,net_sup,{shutdown,{failed_to_start_child,net_kern Jun 13 21:10:24 jeckersb-f20 systemd[1]: rabbitmq-server.service: control process exited, code=exited status=1 Jun 13 21:10:24 jeckersb-f20 systemd[1]: Failed to start RabbitMQ broker. Jun 13 21:10:24 jeckersb-f20 systemd[1]: Unit rabbitmq-server.service entered failed state.
Finally, I forced it to listen IPv4 only, on all interfaces with:
ListenStream=0.0.0.0:4369
And this works as expected. So I agree that we should use the systemd-managed epmd instance with all of its socket activation goodness, but only after the fix above.
https://bugzilla.redhat.com/show_bug.cgi?id=1104843
--- Comment #4 from John Eckersberg jeckersb@redhat.com --- One other note on the standalone epmd. I used the templated socket by adding:
Requires: epmd@0.0.0.0.socket
to the rabbitmq-server unit file, which works perfectly :)
https://bugzilla.redhat.com/show_bug.cgi?id=1104843 Bug 1104843 depends on bug 1104193, which changed state.
Bug 1104193 Summary: rabbitmqctl doesn't work https://bugzilla.redhat.com/show_bug.cgi?id=1104193
What |Removed |Added ---------------------------------------------------------------------------- Status|RELEASE_PENDING |CLOSED Resolution|--- |ERRATA
https://bugzilla.redhat.com/show_bug.cgi?id=1104843
--- Comment #5 from Peter Lemenkov lemenkov@gmail.com --- (In reply to John Eckersberg from comment #4)
One other note on the standalone epmd. I used the templated socket by adding:
Requires: epmd@0.0.0.0.socket
to the rabbitmq-server unit file, which works perfectly :)
I think we should finally add this to the service-file. John, are you aware of any issues with this additional Reqires so far?
My only concern is that if the administrator stops rabbitmq-server.service then epmd@0.0.0.0.socket still remains active. So this socket (tcp:0.0.0.0:4369) will remain opened.
I suspect we could spice it up with some systemd magic (StopWhenUnneeded + BindsTo + RefuseManualStart) but so far I cant figure out how to do it properly. See my question in devel@ list:
* http://thread.gmane.org/gmane.linux.redhat.fedora.devel/201512
I suppose this issue is described here:
* https://bugzilla.redhat.com/1104199
Citing my own private email:
==================== Although this might be related to pre-systemd version of RabbitMQ it raises a valid question. It seems that people don't like the idea of epmd remaining online after the RabbitMQ shutdown. Frankly speaking it isn't a bug - what to do with another systemd-service started as a dependency is up to systemd administrator. However I believe this should be adjusted by adding some combination of StopWhenUnneeded + BindsTo + RefuseManualStart to epmd@.socket / epmd@.service. I've tried quickly but failed to implement a required functionality.
So far I have the following considerations:
* epmd@.service must refuse manual activation. It can be activated only by the corresponding socket unit. So maybe we have to add RefuseManualStart to the service. Quite the contrary the socket unit counterpart can be manually activated.
* Every Erlang service (RabbitMQ) must require socket-file. Something like Requires=epmd@0.0.0.0.socket. Otherwise they will try to start their own epmd instance.
* epmd@.service and epmd@.socket must be deactivated right after the dependent services stopped.
As I said earlier I failed to glue all of these together and would love to hear any comments / suggestions and maybe receive a bit of systemd-related help :) ====================
https://bugzilla.redhat.com/show_bug.cgi?id=1104843
Peter Lemenkov lemenkov@gmail.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Version|20 |rawhide
https://bugzilla.redhat.com/show_bug.cgi?id=1104843
--- Comment #6 from John Eckersberg jeckersb@redhat.com --- I think Lennart described in the fedora-devel thread why we can't really shut down epmd after it has been started via socket activation.
The problem here is that epmd, however it gets launched, is going to be shared among all of the erlang VMs running on a server. So you have one of two cases:
(1) is what we have now, where each service as part of booting its VM will check for epmd, and start it if it is not present. That's a problem though. Consider you have two erlang services, A and B. You start A, it starts epmd as part of its startup. Now the epmd process is running under the cgroup of service A. Then you start service B, it detects epmd is already running and starts using it. Then service A is stopped and all of the processes under the cgroup, including epmd. Now process B doesn't have an epmd to talk to, and you will have to restart it in order for a new one to get spawned.
(2) is this proposed case, where epmd is its own standalone service, running in its own cgroup, under socket activation. This avoids the mess from (1), with the only downside that I see is that it does not automatically stop (it can't).
IMO, the downside to (1) is much worse than the downside to (2). If the user/administrator does not want epmd to be running after rabbitmq-server is stopped, I don't think it's unreasonable to require it to be explicitly stopped.
https://bugzilla.redhat.com/show_bug.cgi?id=1104843
Peter Lemenkov plemenko@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Keywords| |FutureFeature CC| |plemenko@redhat.com Version|22 |rawhide Blocks| |1104199 Summary|rabbitmqctl doesn't work |RabbitMQ server shouldn't | |leave epmd running after | |being stopped
Red Hat Bugzilla bugzilla@redhat.com changed:
What |Removed |Added ---------------------------------------------------------------------------- Doc Type|Bug Fix |Enhancement
--- Comment #8 from Peter Lemenkov plemenko@redhat.com --- Ok, adjusting title of the ticket according to the remainign issue and adding a dependent RHOS-related ticket.
Referenced Bugs:
https://bugzilla.redhat.com/show_bug.cgi?id=1104199 [Bug 1104199] systemd unit file doesn't kill all rabbitmq processes
erlang@lists.fedoraproject.org