Kernel bug or disk failure

Todd Denniston Todd.Denniston at ssa.crane.navy.mil
Mon Jul 14 13:44:55 UTC 2008


Sam Varshavchik wrote, On 07/13/2008 10:51 AM:
> Chris Snook writes:
> 
>> Sam Varshavchik wrote:
>>> Every other week or so, I get a disk kicked out of my RAID, with this:
>>>
>>> Jul  6 04:05:38 commodore kernel: (scsi1:A:0:0): scsi1: device 
>>> overrun (status 10) on 0:0:0
>>> Jul  6 04:05:38 commodore kernel: Unexpected busfree in DT Data-in 
>>> phase, 1 SCBs aborted, PRGMCNT == 0x22f
>>> Jul  6 04:05:38 commodore kernel: >>>>>>>>>>>>>>>>>> Dump Card State 
>>> Begins <<<<<<<<<<<<<<<<<
>>> Jul  6 04:05:38 commodore kernel: scsi1: Dumping Card State at 
>>> program address 0x22d Mode 0x22
>>> Jul  6 04:05:38 commodore kernel: Card was paused
>>>
>>> … followed by a rather dry dump of the HBA's registers. This is 
>>> aic79xxx.
>>>
>>> This does not look like a disk error to me. I re-add the drive into 
>>> the array, and rebuild with no downtime. SMART shows 0 in the defect 
>>> list on this drive, and over the disk's lifetime 0 uncorrectable 
>>> reads and 1 uncorrectable write -- but this kernel barf already 
>>> happened 4-5 times now, and it's getting rather annoying.
>>>
>>
>> Looks more like a controller problem than a drive problem.  Do you 
>> have a spare HBA to test?
> 
> No, but I have one on order, now. I reseated the cable, that didn't help 
> -- the card dumped again about 12 hours later, but it was, apparently, 
> non-fatal because RAID did not degrade.
> 

May I suggest that, when it is convenient to do so, you:
1) Reboot.
2) Enter the SCSI card's setup utility (Ctrl-A) when the aic79xxx boot text
shows up during BIOS initialization.
3) Set the speed of the SCSI bus for that drive a little slower.
4) If you still get the fault, or the drive is not recognized, repeat with
another speed until you get the desired result. (Some drives do not work at
every speed below the one they are rated for; a Promise array rated U160
communicated only at 160, 80, 66, 16 & 6.)

I had to work with a Promise array which:
a) was a bit flaky even compared to its twin in the other bay (it was too late
to make a warranty claim on either one when I arrived).
b) had Promise's problem of not knowing how to do domain validation, so I had
to turn that off (domain validation only made the arrays flake out sooner).
c) could not work reliably above ~20MB/sec (write or read).
d) was dropping errors similar to what you have above within ~4-8 hours of
operation.
Slowing it down using the card's settings made it work reliably enough to get
the job done.
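
To judge whether a slower bus setting actually helped, it may be useful to
count how often the card dumps its state in the kernel log before and after
the change. Below is a minimal sketch in Python, not a tested tool. It assumes
the traditional syslog line format seen in the quoted messages above
("Mon DD HH:MM:SS host kernel: ...") and that each dump starts with a line
containing "Dump Card State Begins". The function name count_card_dumps is
made up for illustration, and where the log lives on your system
(/var/log/messages is only a guess) is also an assumption.

```python
import re
from collections import Counter

# Start-of-dump marker as it appears in the quoted kernel messages.
# Assumes the classic syslog prefix: "Jul  6 04:05:38 host kernel: ...".
DUMP_RE = re.compile(
    r"^([A-Z][a-z]{2})\s+(\d{1,2})\s+\d{2}:\d{2}:\d{2}\s+\S+\s+kernel:"
    r".*Dump Card State Begins"
)


def count_card_dumps(lines):
    """Return a Counter mapping 'Mon D' to the number of card dumps that day.

    `lines` is any iterable of log lines (for example an open file).
    Hypothetical helper, written for illustration only.
    """
    per_day = Counter()
    for line in lines:
        match = DUMP_RE.match(line)
        if match:
            month, day = match.groups()
            per_day[f"{month} {int(day)}"] += 1
    return per_day
```

Called as count_card_dumps(open(path)) on the log covering a week before and a
week after the speed change, a drop in the per-day counts would suggest the
slower setting is helping. It says nothing about why the dumps happen.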

-- 
Todd Denniston
Crane Division, Naval Surface Warfare Center (NSWC Crane)
Harnessing the Power of Technology for the Warfighter



