http://en.wikipedia.org/wiki/Hostname#Restrictions_on_valid_host_names
"While a hostname may not contain other characters, such as the
underscore character (_)..."
"A notable example of non-compliance with this specification, Microsoft
Windows systems often use underscores in hostnames. Since some systems
will reject invalid hostnames while others will not, the use of invalid
hostname characters may cause subtle problems in systems that connect to
standards-based services. For example, RFC-compliant mail servers will
refuse to deliver mail for MS Windows computers with names containing
underscores."
"The Internet standards (Request for Comments) for protocols mandate
that component hostname labels may contain only the ASCII letters 'a'
through 'z' (in a case-insensitive manner), the digits '0' through '9',
and the hyphen ('-'). The original specification of hostnames in RFC
952, mandated that labels could not start with a digit or with a hyphen,
and must not end with a hyphen. However, a subsequent specification (RFC
1123) permitted hostname labels to start with digits. *No other symbols,
punctuation characters, or white space are permitted.*" (emphasis added)
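
As a rough illustration of the rules quoted above, here is a minimal Python sketch that checks a name label by label. The function name is_valid_hostname is made up for this example. The 63-character label limit and 253-character total limit are not in the quoted text; they are assumed from the DNS RFCs. This is not what RHQ or JBoss Remoting actually runs, only a way to see why mmc-int2 passes and mmc_int2 does not.

```python
import re

# One label: 1-63 characters drawn from letters, digits and hyphen,
# with no leading or trailing hyphen (RFC 952 as relaxed by RFC 1123,
# which permits a leading digit). The 63 limit is an assumption taken
# from the DNS RFCs, not from the text quoted above.
_LABEL = re.compile(r"(?!-)[A-Za-z0-9-]{1,63}(?<!-)")


def is_valid_hostname(name: str) -> bool:
    """Return True if every dot-separated label of `name` follows the rules."""
    if name.endswith("."):
        name = name[:-1]  # tolerate a single trailing dot
    if not name or len(name) > 253:
        return False
    return all(_LABEL.fullmatch(label) for label in name.split("."))


for candidate in ("mmc-int2", "mmc_int2", "-mmc", "2mmc"):
    print(candidate, is_valid_hostname(candidate))
```

Running a check like this over the public endpoint names before entering them in Administration>Servers would flag the underscore early.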
On 10/15/2010 02:13 PM, Bala Nair wrote:
> So, I think we found the problem - server B public endpoint address
> was mmc_int2 and the agent machine's /etc/hosts file mapped that name to
> the correct ip address. We changed the endpoint to the ip address of
> the server and restarted the agent. Failover started working
> correctly. We then changed the endpoint to mmc-int2, changed the hosts
> file on the agent box and restarted the agent. Once again everything
> worked fine. So I switched everything back to mmc_int2 just to verify
> that it was a name issue and failover stopped working again. So it
> looks like an underscore in an endpoint name causes connection failure.
> Is there something about the naming rules that I'm just not aware of or
> is this really a bug?
>
> Bala Nair
> SeaChange International
>
>
> On 10/15/10 12:45 PM, John Mazzitelli wrote:
>> [I'm sending this to the rhq-users list - it is more appropriate there
>> - the rhq-devel list is for developers of the RHQ codebase]
>>
>>> First why is it trying to fail over to localhost instead of server B
>>> and second why the connection refused error? There is no rhq server
>>> on this agent box to refuse a connection.
>> I suspect that this is because your server B might not have its public
>> endpoint declared correctly. Go to Administration>Servers and look at
>> Server B - what is its public endpoint address? Make sure it is correct.
>>
>> On your agents, what do the failover lists look like? I suspect you
>> will see "127.0.0.1" in the failover list, as opposed to the server B
>> host/IP you expect.
>>
>> Read this as background:
>>
>> http://rhq-project.org/display/JOPR2/High+Availability#HighAvailability-S...
>> http://rhq-project.org/display/JOPR2/High+Availability#HighAvailability-F...
>>
>> -------- Original Message --------
>> Subject: can't get ha failover to work
>> Date: Fri, 15 Oct 2010 12:39:25 -0400
>> From: Bala Nair<bnairtm(a)comcast.net>
>> Reply-To: rhq-devel(a)lists.fedorahosted.org
>> To: rhq-devel(a)lists.fedorahosted.org
>>
>> We're trying to set up an rhq HA cloud with 2 servers and 4 agents and
>> we're having a problem getting the agents to failover to the second
>> server. When we first start everything up, all the agents are connected
>> to one server (call it server A), with the other server (server B) not
>> connected to any agents. The failover list on the agent side shows 2
>> entries (server B and server A in that order). We go to the HA servers
>> page in the gui and see both servers are in NORMAL mode with server A
>> having an agent count of 4 and server B a count of 0. There are no
>> affinity groups. We then set server A to MAINTENANCE mode and wait. I
>> expect the 4 agents connected to server A to failover to server B and to
>> see that in the servers list, but nothing changes. Checking the agent
>> logs I find the following errors:
>>
>> 2010-10-15 11:17:46,653 INFO [RHQ Server Polling Thread]
>> (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)-
>> {JBossRemotingRemoteCommunicator.changing-endpoint}Communicator is
>> changing endpoint from [InvokerLocator
>> [servlet://mmc-int:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]]
>> to [InvokerLocator
>> [servlet://127.0.0.1/jboss-remoting-servlet-invoker/ServerInvokerServlet]]
>>
>> 2010-10-15 11:17:46,654 WARN [RHQ Server Polling Thread]
>> (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-failed}Failed
>> to failover to another server. Cause:
>> org.jboss.remoting.CannotConnectException: Can not connect http client
>> invoker. Connection refused.
>>
>> 2010-10-15 11:17:46,658 INFO [RHQ Server Polling Thread]
>> (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)-
>> {JBossRemotingRemoteCommunicator.changing-endpoint}Communicator is
>> changing endpoint from [InvokerLocator
>> [servlet://127.0.0.1/jboss-remoting-servlet-invoker/ServerInvokerServlet]]
>> to [InvokerLocator
>> [servlet://mmc-int:7080/jboss-remoting-servlet-invoker/ServerInvokerServlet]]
>>
>> 2010-10-15 11:17:46,661 WARN [RHQ Server Polling Thread]
>> (org.rhq.enterprise.agent.AgentMain)- {AgentMain.failover-failed}Failed
>> to failover to another server. Cause:
>> org.rhq.enterprise.communications.util.NotProcessedException
>>
>> 2010-10-15 11:17:46,663 WARN [RHQ Server Polling Thread]
>> (org.rhq.enterprise.agent.FailoverFailureCallback)-
>> {AgentMain.too-many-failover-attempts}Too many failover attempts have
>> been made [2]. Exception that triggered the failover:
>> [org.rhq.enterprise.communications.util.NotProcessedException]
>>
>> 2010-10-15 11:17:46,663 ERROR [RHQ Server Polling Thread]
>> (enterprise.communications.command.client.JBossRemotingRemoteCommunicator)-
>> {JBossRemotingRemoteCommunicator.init-callback-failed}The initialize
>> callback has failed. It will be tried again. Cause:
>> org.rhq.enterprise.communications.util.NotProcessedException:null.
>> Cause: org.rhq.enterprise.communications.util.NotProcessedException
>>
>>
>> In this case mmc-int is server A. I can understand the second series of
>> errors where it tries to fail back to mmc-int and fails because mmc-int
>> is in maintenance mode. I don't understand the initial failure though.
>> First why is it trying to fail over to localhost instead of server B and
>> second why the connection refused error? There is no rhq server on this
>> agent box to refuse a connection.
>>
>> I have looked through all the agent and server configuration properties
>> and I just don't see how the localhost address is getting set in this
>> case. Any help would be appreciated. Thanks.
>>
>> Bala Nair
>> SeaChange International
>>
>> _______________________________________________
>> rhq-devel mailing list
>> rhq-devel(a)lists.fedorahosted.org
>> https://fedorahosted.org/mailman/listinfo/rhq-devel
>>
> _______________________________________________
> rhq-users mailing list
> rhq-users(a)lists.fedorahosted.org
> https://fedorahosted.org/mailman/listinfo/rhq-users