serial port issues on IBM xseries with FC4 and High Availability heartbeat

John Wendel john.wendel at metnet.navy.mil
Fri May 26 21:02:26 UTC 2006


Rick Stevens wrote:
> On Fri, 2006-05-26 at 13:56 -0400, Bob Chiodini wrote:
>> On Fri, 2006-05-26 at 10:22 -0700, Rick Stevens wrote:
>>> On Fri, 2006-05-26 at 09:27 -0400, Randy Grimshaw wrote:
>>>> I am trying to run a Linux high availability cluster (failover pair)
>>>> using serial as one of the heartbeats.
>>>>
>>>> Due to numerous serial over-runs the systems are actually crashing
>>>> periodically.
>>>>
>>>> This is a very frustrating development for a system intended to provide
>>>> HA. (certainly not ha ha ha).
>>>>
>>>> I have updated to the latest BIOS.
>>>> I have checked RTS DTS XON XOFF etc.
>>>> This is happening with the stock and custom kernels.
>>>> This is happening on three pairs of servers.
>>>> The serial ports are detected as:
>>>>        Serial: 8250/16550 driver $Revision: 1.90 $ 32 ports, IRQ sharing enabled
>>>>        serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
>>>>
>>>>
>>>> Any advice would be greatly appreciated.
>>> The most common problem with overruns is running too high a baud rate.
>>> Remember, 16550s only have a 16-byte buffer in them.  At 38,400 baud,
>>> you'll fill that buffer in about 260 microseconds.  9600 baud will fill
>>> the buffer in a tiny bit over 1 millisecond.  Flow control tries to
>>> prevent overflows.
>>>
>>> Without flow control and if the machine is busy, the interrupt from the
>>> chip may not be serviced in time and you'll miss data because you've
>>> filled the buffer.  Dropping the baud rate down should help, and make
>>> sure you use hardware (RTS/CTS) flow control.  Remember that software
>>> (XON/XOFF) flow control requires the CPU to watch the buffer and send an
>>> XOFF when it gets full.  You're already overrunning the buffer...
>>> software flow control won't help.
>>>
>>> Heartbeat stuff between nodes in a cluster is NOT a place to try to
>>> scrimp and save money!  NICs are relatively cheap after all, they have
>>> much bigger buffers in them and they use DMA to transfer data to the
>>> processor instead of one-byte-at-a-time over the I/O ports.  Frankly,
>>> NICs are far more reliable--especially for something this critical.
>>>
>> At 8N1 and 38400 bits/second, that would be 3840 bytes/second or 240
>> "FIFO fills" per second, or 4.17 ms to fill the entire FIFO.
> 
> Oops!  Yup, 260 microseconds/character.  Forgot to multiply by 16.
> Doh! :-(  Hey!  It's Friday and it's been a l-o-n-g week.
> 
>> It sounds like something more is broken here.  My old 486 running Linux
>> seemed to do better than that.
> 
> I don't think so.  Think how choppy things can get on a terminal
> emulation when the machine gets busy.  Besides, the OP mentioned
> overruns and I think that's just what he's seeing--the FIFO is getting
> swamped before the CPU processes the interrupt.
> 
>> The HA serial data rate is pretty low and should not be a problem.
> 
> Not if it's bursty.  If it's over 16 bytes and the machine doesn't
> service the interrupt...bad things will happen.
> 
>> Can minicom transfer a file between the two servers via the serial
>> ports?
> 
> Better, "can it do it while you're really flogging the CPU somehow?"

Shouldn't RTS/CTS flow control take care of this problem? Are the 
correct pins wired in your serial cable?

And just how many bytes do you need to implement a heartbeat? Seems 
like one byte per second would get the job done.
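
For anyone who wants to check the timing figures quoted above, here is a
rough Python sketch of the arithmetic. It assumes 8N1 framing (10 bits on
the wire per byte), a 16-byte receive FIFO, and data arriving back to back
with no interrupt serviced in the meantime. The function names are mine,
purely for illustration, and are not part of heartbeat or the serial driver.

```python
FIFO_BYTES = 16  # receive FIFO depth of a 16550A, as discussed in the thread


def bytes_per_second(baud, bits_per_frame=10):
    """Throughput in bytes/s; 8N1 is 1 start + 8 data + 1 stop = 10 bits."""
    return baud / bits_per_frame


def char_time_us(baud, bits_per_frame=10):
    """Microseconds taken by a single character on the wire."""
    return 1_000_000 / bytes_per_second(baud, bits_per_frame)


def fifo_fill_ms(baud, fifo_bytes=FIFO_BYTES, bits_per_frame=10):
    """Milliseconds of continuous input needed to fill the whole FIFO."""
    return fifo_bytes / bytes_per_second(baud, bits_per_frame) * 1000.0


if __name__ == "__main__":
    for baud in (9600, 19200, 38400, 115200):
        print(f"{baud:>6} baud: {bytes_per_second(baud):6.0f} bytes/s, "
              f"{char_time_us(baud):7.1f} us/char, "
              f"FIFO full in {fifo_fill_ms(baud):5.2f} ms")
```

At 38400 baud this agrees with the corrected numbers in the thread: roughly
260 microseconds per character and a little over 4 ms to fill the FIFO.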

Regards,

John






