https://bugzilla.redhat.com/show_bug.cgi?id=1138213
Bug ID: 1138213
Summary: Rabbitmq cluster remains partitioned after short network partition incident
Product: Fedora
Version: 20
Component: rabbitmq-server
Assignee: lemenkov@gmail.com
Reporter: jprovazn@redhat.com
QA Contact: extras-qa@fedoraproject.org
CC: erlang@lists.fedoraproject.org, hubert.plociniczak@gmail.com, jeckersb@redhat.com, lemenkov@gmail.com, rjones@redhat.com, s@shk.io
Description of problem: In a 3-node RabbitMQ cluster using the "pause_minority" partition-handling policy, the cluster remains partitioned after a network partition ends, even once all 3 nodes can communicate again.
This happens when the network outage is short enough (~60 seconds); for longer outages (>3 minutes) the cluster was reconstructed properly. A very similar issue was reported here: http://lists.rabbitmq.com/pipermail/rabbitmq-discuss/2014-March/034639.html
This issue seems to be fixed upstream already (not sure in which version exactly, but 3.3.5 works fine).
Version-Release number of selected component (if applicable): rabbitmq-server-3.1.5-9.fc20.noarch
Steps to Reproduce:
1. Create a 3-node cluster, using the "pause_minority" no-quorum policy.
2. Stop networking on one of the nodes and wait until that node is seen as shut down on the other nodes.
3. Start networking on the node again.
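For step 1, the "pause_minority" policy is selected via the cluster_partition_handling option. A minimal sketch of the relevant rabbitmq.config fragment (the /etc/rabbitmq/rabbitmq.config path is an assumption based on the Fedora package layout):

```erlang
%% /etc/rabbitmq/rabbitmq.config
%% pause_minority: a node that finds itself in the minority side of a
%% partition pauses itself until it can see a majority again.
[
  {rabbit, [
    {cluster_partition_handling, pause_minority}
  ]}
].
```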
Actual results: Cluster remains partitioned:
[root@overcloud-controller1-csugetg5pjql ~]# rabbitmqctl cluster_status
Cluster status of node 'rabbit@overcloud-controller1-csugetg5pjql' ...
[{nodes,[{disc,['rabbit@overcloud-controller0-o6yt2gtaxk6g',
                'rabbit@overcloud-controller1-csugetg5pjql',
                'rabbit@overcloud-controller2-z3tswnamdzhq']}]},
 {running_nodes,['rabbit@overcloud-controller0-o6yt2gtaxk6g',
                 'rabbit@overcloud-controller1-csugetg5pjql']},
 {partitions,[{'rabbit@overcloud-controller0-o6yt2gtaxk6g',
               ['rabbit@overcloud-controller2-z3tswnamdzhq']},
              {'rabbit@overcloud-controller1-csugetg5pjql',
               ['rabbit@overcloud-controller2-z3tswnamdzhq']}]}]
...done.
Expected results: no partitions after the node is up again
Additional info: the only related message in the log:
Mnesia('rabbit@overcloud-controller0-o6yt2gtaxk6g'): ** ERROR ** mnesia_event got {inconsistent_database, running_partitioned_network, 'rabbit@overcloud-controller2-z3tswnamdzhq'}
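Until a fixed rabbitmq-server build is available, a partition like the one above can typically be cleared by hand by restarting the RabbitMQ application on the node left out of the majority (controller2, per the cluster_status output). This is a hedged operational sketch of standard rabbitmqctl usage, not a fix confirmed for this bug:

```shell
# Run on the partitioned node, rabbit@overcloud-controller2-z3tswnamdzhq:
rabbitmqctl stop_app        # stop the RabbitMQ application (Erlang VM stays up)
rabbitmqctl start_app       # application rejoins the cluster on startup
rabbitmqctl cluster_status  # verify the partitions list is now empty
```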