Fwd: can't get ha failover to work

Friday, 15 October 2010

[I'm sending this to the rhq-users list - it is more appropriate there - 
the rhq-devel list is for developers of the RHQ codebase]

...
 First why is it trying to fail over to localhost instead of server B
 and second why the connection refused error?  There is no rhq server
 on this agent box to refuse a connection.

I suspect that this is because your server B might not have its public 
endpoint declared correctly. Go to Administration>Servers and look at 
Server B - what is its public endpoint address? Make sure it is correct.
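
If you want to double-check what a given endpoint value actually points at, here is a rough Python sketch (not part of RHQ) that reports whether a host name or IP resolves to a loopback address. The function name is made up for illustration, and the result only reflects name resolution on the machine where you run it, so it is worth running on both the server B box and an agent box.

```python
import ipaddress
import socket


def resolves_to_loopback(host):
    """Return True if any address that `host` resolves to is a loopback address."""
    # getaddrinfo accepts both host names and IP literals.
    for info in socket.getaddrinfo(host, None):
        addr = info[4][0]
        # Drop an IPv6 zone id such as "%eth0" before parsing.
        addr = addr.split("%", 1)[0]
        if ipaddress.ip_address(addr).is_loopback:
            return True
    return False


if __name__ == "__main__":
    # "127.0.0.1" is the address seen in the agent log; substitute the
    # public endpoint address shown for Server B in the GUI.
    print(resolves_to_loopback("127.0.0.1"))
```

If this returns True for the value shown as Server B's public endpoint, agents on other machines will end up trying to connect to themselves.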

On your agents, what do the failover lists look like? I suspect you will 
see "127.0.0.1" in the failover list, as opposed to the server B host/IP 
you expect.
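
You can also confirm this from the agent log itself. The following is a minimal Python sketch (again, not an RHQ tool) that scans log text for the "changing endpoint" messages quoted in the original message below and reports any failover target that is a loopback address. The function names are hypothetical, and the pattern is based only on the log lines in this thread, so other agent versions may format the message differently.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Pattern taken from the "changing endpoint" log message quoted in this thread.
ENDPOINT_CHANGE = re.compile(
    r"changing endpoint from \[InvokerLocator \[(?P<old>[^\]]+)\]\]\s*"
    r"to \[InvokerLocator \[(?P<new>[^\]]+)\]\]"
)


def is_loopback_host(host):
    """True for 'localhost' or a loopback IP literal; host names are not resolved."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # an ordinary host name such as "mmc-int"


def loopback_failover_targets(log_text):
    """Return (old, new) endpoint pairs where the new endpoint is a loopback address."""
    # Collapse whitespace so messages wrapped across lines still match.
    flat = " ".join(log_text.split())
    hits = []
    for match in ENDPOINT_CHANGE.finditer(flat):
        host = urlsplit(match.group("new")).hostname
        if host and is_loopback_host(host):
            hits.append((match.group("old"), match.group("new")))
    return hits
```

Any pair it returns means the agent was handed a loopback address as a failover target, which points back at the server's public endpoint setting rather than at the agent configuration.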

Read this as background:

http://rhq-project.org/display/JOPR2/High+Availability#HighAvailability-S...

http://rhq-project.org/display/JOPR2/High+Availability#HighAvailability-F...

-------- Original Message --------
Subject: can't get ha failover to work
Date: Fri, 15 Oct 2010 12:39:25 -0400
From: Bala Nair <bnairtm(a)comcast.net>
Reply-To: rhq-devel(a)lists.fedorahosted.org
To: rhq-devel(a)lists.fedorahosted.org

   We're trying to set up an RHQ HA cloud with 2 servers and 4 agents, and
we're having a problem getting the agents to fail over to the second
server.  When we first start everything up, all the agents are connected
to one server (call it server A), and the other server (server B) is not
connected to any agents.  The failover list on the agent side shows 2
entries (server B and server A, in that order).  We go to the HA servers
page in the GUI and see that both servers are in NORMAL mode, with server A
having an agent count of 4 and server B a count of 0.  There are no
affinity groups.  We then set server A to MAINTENANCE mode and wait.  I
expect the 4 agents connected to server A to fail over to server B and to
see that in the servers list, but nothing changes.  Checking the agent
logs, I find the following errors:

2010-10-15 11:17:46,653 INFO  [RHQ Server Polling Thread]
(enterprise.communications.command.client.JBossRemotingRemoteCommunicator)-
{JBossRemotingRemoteCommunicator.changing-endpoint}Communicator is
changing endpoint from [InvokerLocator
[servlet://mmc-int:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]]
to [InvokerLocator
[servlet://127.0.0.1/jboss-remoting-servlet-invoker/ServerInvokerServlet]]

2010-10-15 11:17:46,654 WARN  [RHQ Server Polling Thread]
(org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-failed}Failed
to failover to another server. Cause:
org.jboss.remoting.CannotConnectException: Can not connect http client
invoker. Connection refused.

2010-10-15 11:17:46,658 INFO  [RHQ Server Polling Thread]
(enterprise.communications.command.client.JBossRemotingRemoteCommunicator)-
{JBossRemotingRemoteCommunicator.changing-endpoint}Communicator is
changing endpoint from [InvokerLocator
[servlet://127.0.0.1/jboss-remoting-servlet-invoker/ServerInvokerServlet]]
to [InvokerLocator
[servlet://mmc-int:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]]

2010-10-15 11:17:46,661 WARN  [RHQ Server Polling Thread]
(org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-failed}Failed
to failover to another server. Cause:
org.rhq.enterprise.communications.util.NotProcessedException

2010-10-15 11:17:46,663 WARN  [RHQ Server Polling Thread]
(org.rhq.enterprise.agent.FailoverFailureCallback)-
{AgentMain.too-many-failover-attempts}Too many failover attempts have
been made [2]. Exception that triggered the failover:
[org.rhq.enterprise.communications.util.NotProcessedException]

2010-10-15 11:17:46,663 ERROR [RHQ Server Polling Thread]
(enterprise.communications.command.client.JBossRemotingRemoteCommunicator)-
{JBossRemotingRemoteCommunicator.init-callback-failed}The initialize
callback has failed. It will be tried again. Cause:
org.rhq.enterprise.communications.util.NotProcessedException:null.
Cause: org.rhq.enterprise.communications.util.NotProcessedException

In this case mmc-int is server A.  I can understand the second series of
errors where it tries to fail back to mmc-int and fails because mmc-int
is in maintenance mode.  I don't understand the initial failure though.
First why is it trying to fail over to localhost instead of server B and
second why the connection refused error?  There is no rhq server on this
agent box to refuse a connection.

I have looked through all the agent and server configuration properties
and I just don't see how the localhost address is getting set in this
case.  Any help would be appreciated.  Thanks.

Bala Nair
SeaChange International

_______________________________________________
rhq-devel mailing list
rhq-devel(a)lists.fedorahosted.org
https://fedorahosted.org/mailman/listinfo/rhq-devel
