serial port issues on IBM xseries with FC4 and High Availability heartbeat
John Wendel
john.wendel at metnet.navy.mil
Fri May 26 21:02:26 UTC 2006
Rick Stevens wrote:
> On Fri, 2006-05-26 at 13:56 -0400, Bob Chiodini wrote:
>> On Fri, 2006-05-26 at 10:22 -0700, Rick Stevens wrote:
>>> On Fri, 2006-05-26 at 09:27 -0400, Randy Grimshaw wrote:
>>>> I am trying to run a linux high availability cluster (failover pair)
>>>> using serial as one of the heartbeats.
>>>>
>>>> Due to numerous serial over-runs the systems are actually crashing
>>>> periodically.
>>>>
>>>> This is a very frustrating development for a system intended to provide
>>>> HA. (certainly not ha ha ha).
>>>>
>>>> I have updated to the latest bios.
>>>> I have checked RTS DTS XON XOFF etc.
>>>> This is happening with the stock and custom kernels.
>>>> This is happening on three pairs of servers.
>>>> The serial ports are detected as:
>>>> Serial: 8250/16550 driver $Revision: 1.90 $ 32 ports, IRQ
>>>> sharing enabled
>>>> serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
>>>>
>>>>
>>>> Any advice would be greatly appreciated.
>>> The most common problem with overruns is running too high a baud rate.
>>> Remember, 16550s only have a 16-byte buffer in them. At 38,400 baud,
>>> you'll fill that buffer in about 260 microseconds. 9600 baud will fill
>>> the buffer in a tiny bit over 1 millisecond. Flow control tries to
>>> prevent overflows.
>>>
>>> Without flow control and if the machine is busy, the interrupt from the
>>> chip may not be serviced in time and you'll miss data because you've
>>> filled the buffer. Dropping the baud rate down should help, and make
>>> sure you use hardware (RTS/CTS) flow control. Remember that software
>>> (XON/XOFF) flow control requires the CPU to watch the buffer and send an
>>> XOFF when it gets full. You're already overrunning the buffer...
>>> software flow control won't help.
>>>
>>> Heartbeat stuff between nodes in a cluster is NOT a place to try to
>>> scrimp and save money! NICs are relatively cheap after all, they have
>>> much bigger buffers in them and they use DMA to transfer data to the
>>> processor instead of one-byte-at-a-time over the I/O ports. Frankly,
>>> NICs are far more reliable--especially for something this critical.
>>>
>> At 8N1 and 38400 bits/second, that would be 3840 bytes/second or 240
>> "FIFO fills" per second or 4.17 ms to fill the entire FIFO.
>
> Oops! Yup, 260 microseconds/character. Forgot to multiply by 16.
> Doh! :-( Hey! It's Friday and it's been a l-o-n-g week.
>
>> It sounds like something more is broken here. My old 486 running Linux
>> seemed to do better than that.
>
> I don't think so. Think how choppy things can get on a terminal
> emulation when the machine gets busy. Besides, the OP mentioned
> overruns and I think that's just what he's seeing--the FIFO is getting
> swamped before the CPU processes the interrupt.
>
>> The HA serial data rate is pretty low and should not be a problem.
>
> Not if it's bursty. If it's over 16 bytes and the machine doesn't
> service the interrupt...bad things will happen.
>
>> Can minicom transfer a file between the two servers via the serial
>> ports?
>
> Better, "can it do it while you're really flogging the CPU somehow?"
Shouldn't RTS/CTS flow control take care of this problem? Are the
correct pins wired in your serial cable?
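
For what it's worth, here is a minimal sketch (Python on Linux, using the
standard termios module) of what turning on RTS/CTS amounts to at the port
level. The device path /dev/ttyS0 is taken from the boot log quoted above;
the function names are made up for illustration, and the heartbeat software
presumably configures the port itself, so treat this only as a way to see
which flags matter. It does nothing useful unless the cable actually carries
the RTS and CTS lines.

```python
import os
import termios


def with_hardware_flow_control(attrs):
    """Return a copy of a tcgetattr() list with RTS/CTS on and XON/XOFF off.

    Hypothetical helper: it only edits the flag words and touches no device.
    """
    iflag, oflag, cflag, lflag, ispeed, ospeed, cc = attrs
    iflag &= ~(termios.IXON | termios.IXOFF)  # disable software flow control
    cflag |= termios.CRTSCTS                  # enable hardware flow control
    return [iflag, oflag, cflag, lflag, ispeed, ospeed, cc]


def enable_rtscts(path="/dev/ttyS0"):
    """Apply the setting to a real port (path is an assumption)."""
    fd = os.open(path, os.O_RDWR | os.O_NOCTTY | os.O_NONBLOCK)
    try:
        attrs = termios.tcgetattr(fd)
        termios.tcsetattr(fd, termios.TCSANOW, with_hardware_flow_control(attrs))
    finally:
        os.close(fd)
```
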
And just how many bytes do you need to implement a heartbeat? Seems
like one byte a second would get the job done.
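
The timing figures argued over earlier in the thread are easy to check. A
small sketch, assuming 8N1 framing (10 bits on the wire per character) and
the 16-byte FIFO mentioned above; the function names are mine:

```python
def char_time_us(baud, bits_per_char=10):
    """Time to receive one character, in microseconds (8N1 = 10 bits)."""
    return bits_per_char / baud * 1e6


def fifo_fill_time_ms(baud, fifo_bytes=16, bits_per_char=10):
    """Time for a continuous stream to fill the receive FIFO, in milliseconds."""
    return fifo_bytes * char_time_us(baud, bits_per_char) / 1000.0


for baud in (9600, 19200, 38400, 115200):
    print(f"{baud:>6} baud: {char_time_us(baud):8.1f} us/char, "
          f"{fifo_fill_time_ms(baud):6.2f} ms to fill 16 bytes")
```

At 38400 this gives about 260 microseconds per character and about 4.17 ms
to fill the FIFO, which matches the corrected numbers above.
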
Regards,
John