I have the 3Ware 9500S-4LP (running RAID5) installed in a Dell Poweredge 1800 (dual EM64T Xeon, 7520 chipset). With Fedora Core 4 installed, everything seems fine at first glance. However, after a period (many minutes) of heavy disk I/O operations, the machine begins to slow and then becomes completely unresponsive. I get "APIC error on CPUn: 00(40)" in dmesg which may or may not be related. Nothing is logged to syslog when the failures occur. I have tried booting with noapic and using the uP kernel. Some of these configurations make the APIC error go away, but the hangs still occur. This machine will run fine for days or more at high CPU loads as long as the I/O load to the RAID card is low.
Any thoughts? Anyone running this configuration?
On Thu, 30.06.2005 at 1:38, Mark Miksis wrote:
I have the 3Ware 9500S-4LP (running RAID5) installed in a Dell Poweredge 1800 (dual EM64T Xeon, 7520 chipset). With Fedora Core 4 installed, everything seems fine at first glance. However, after a period (many minutes) of heavy disk I/O operations, the machine begins to slow and then becomes completely unresponsive. I get "APIC error on CPUn: 00(40)" in dmesg which may or may not be related. Nothing is logged to syslog when the failures occur. I have tried booting with noapic and using the uP kernel. Some of these configurations make the APIC error go away, but the hangs still occur. This machine will run fine for days or more at high CPU loads as long as the I/O load to the RAID card is low.
This is a known issue. It was discussed on the CentOS list some weeks ago:
http://lists.centos.org/pipermail/centos/2005-May/005824.html
I don't know why the Bugzilla ticket is restricted:
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=121434
Alexander
Alexander Dalloz wrote:
A known issue. Some weeks ago discussed on the CentOS list:
http://lists.centos.org/pipermail/centos/2005-May/005824.html
Don't know why the bugzilla ticket is restricted
Very interesting reading - thanks for the link. Some of those posts suggest that the newer 3Ware firmware may "improve" the issue. Otherwise, I guess I'm in the market for a different 4 channel RAID 5 SATA card. Any recommendations?
On Wednesday 29 June 2005 21:58, Mark Miksis wrote:
Very interesting reading - thanks for the link. Some of those posts suggest that the newer 3Ware firmware may "improve" the issue. Otherwise, I guess I'm in the market for a different 4 channel RAID 5 SATA card. Any recommendations?
Actually, I have some 9500s that are finally getting stable. Each new driver revision and each new firmware is better than the last, and by now I have no issues (also running x86_64, but on AMD hardware).
Anyone who asks me usually gets an 8500 controller as a recommendation because, unlike the 9500s, those are actually rock solid. You should try the firmware updates and maybe the driver (if there is one newer than the one in the kernel you're running) and see if you can get your box stable that way. As I said before, it worked for me.
Also, you might want to get an AMD system next time if you have heavy I/O and more than 4 GB of RAM: the AMD IOMMU requires no bounce buffers...
Peter.
Peter Arremann wrote:
Actually, I have some 9500s that are finally getting stable. Each new driver revision and each new firmware is better than the last and by now, I have no issues (also running the x86_64 but on amd hardware).
Thanks for the encouraging feedback. It appears that the driver in FC4 is a bit old (and doesn't seem to correspond to a 3Ware release). Are you running SMP?
Anyone who asks me usually gets an 8500 controller as a recommendation because, unlike the 9500s, those are actually rock solid. You should try the firmware updates and maybe the driver (if there is one newer than the one in the kernel you're running) and see if you can get your box stable that way. As I said before, it worked for me.
I plan to compile the latest driver tomorrow...
Also, you might want to get an AMD system next time if you have heavy IO and more than 4GB ram - the AMD iommu requires no bounce buffers...
Interesting. Actually, my IO needs are not *that* heavy, but I first encountered the problem when trying to restore a bunch of backups from my previous machine.
On Wednesday 29 June 2005 23:59, Mark Miksis wrote:
Peter Arremann wrote:
Actually, I have some 9500s that are finally getting stable. Each new driver revision and each new firmware is better than the last and by now, I have no issues (also running the x86_64 but on amd hardware).
Thanks for the encouraging feedback. It appears that the driver in FC4 is a bit old (and doesn't seem to correspond to a 3Ware release). Are you running SMP?
Both: one box is an Athlon64 with a 32-bit PCI slot and 2 mirrored drives, the other is a quad Opteron with PCI-X and 4 RAID 0 drives.
Anyone who asks me usually gets an 8500 controller as a recommendation because, unlike the 9500s, those are actually rock solid. You should try the firmware updates and maybe the driver (if there is one newer than the one in the kernel you're running) and see if you can get your box stable that way. As I said before, it worked for me.
I plan to compile the latest driver tomorrow...
The driver in the FC4 kernel is 2.26.02.002 (grep -i version /lib/modules/2.6.11-1.1369_FC4/kernel/drivers/scsi/3w-9xxx.ko), while the driver that AMCC has is 2.26.03.015fw (grep TW_DRIVER_VERSION 3w-9xxx.c). It's not that great a difference, but it might be worth it for you. Don't forget to check the firmware upgrades as well.
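If you want to compare the two version strings mechanically rather than by eye, a minimal sketch follows. It assumes the dotted fields are plain numbers and that a trailing suffix such as "fw" can be ignored for ordering; that reading of the version format is my assumption, not something AMCC documents here, and the function names are made up for illustration.

```python
import re


def parse_driver_version(version):
    """Turn a dotted version such as '2.26.03.015fw' into a tuple of ints.

    Assumption: any non-digit suffix on a field (e.g. 'fw') is ignored.
    """
    parts = []
    for field in version.strip().split("."):
        match = re.match(r"\d+", field)
        if match is None:
            raise ValueError("no numeric component in %r" % field)
        parts.append(int(match.group()))
    return tuple(parts)


def is_newer(candidate, installed):
    """Return True if candidate sorts strictly after installed."""
    return parse_driver_version(candidate) > parse_driver_version(installed)


if __name__ == "__main__":
    # Version strings quoted in the message above.
    print(is_newer("2.26.03.015fw", "2.26.02.002"))
```

Tuples compare field by field, so the third field (03 against 02) decides the result for these two strings.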
Also, you might want to get an AMD system next time if you have heavy IO and more than 4GB ram - the AMD iommu requires no bounce buffers...
Interesting. Actually, my IO needs are not *that* heavy, but I first encountered the problem when trying to restore a bunch of backups from my previous machine.
High is relative... Sometimes a simple script can create a ton of IO for just a few seconds and that can be enough. :-)
Peter.
Peter Arremann wrote:
On Wednesday 29 June 2005 23:59, Mark Miksis wrote:
I plan to compile the latest driver tomorrow...
The driver in the FC4 kernel is 2.26.02.002 (grep -i version /lib/modules/2.6.11-1.1369_FC4/kernel/drivers/scsi/3w-9xxx.ko), while the driver that AMCC has is 2.26.03.015fw (grep TW_DRIVER_VERSION 3w-9xxx.c). It's not that great a difference, but it might be worth it for you. Don't forget to check the firmware upgrades as well.
I thought that when you update the driver, it automatically updates the firmware?
Mark Miksis wrote:
Very interesting reading - thanks for the link. Some of those posts suggest that the newer 3Ware firmware may "improve" the issue. Otherwise, I guess I'm in the market for a different 4 channel RAID 5 SATA card. Any recommendations?
I am running an Adaptec 2410SA 4-port SATA RAID card on x86_64 in FC4 and have had no problems since the third or fourth kernel update of FC3. I recently copied my entire array (about 200GB) to another network server, upgraded the firmware on the card, rebuilt the array, and copied everything back. My only trouble is that I can't seem to get the RAID tools working, but I haven't really been working on that since about mid FC3.
On Thursday 30 June 2005 00:45, Mark Miksis wrote:
I thought that when you update the driver, it automatically updates the firmware?
Hmmm... Honestly, I don't really know. The 7000 series manuals somewhere list the installation of driver and firmware separately, and I've done it that way ever since... Check your firmware version, put in the new driver and check again :-) Let me know what you find out...
Peter.
Peter Arremann wrote:
On Wednesday 29 June 2005 21:58, Mark Miksis wrote:
Very interesting reading - thanks for the link. Some of those posts suggest that the newer 3Ware firmware may "improve" the issue. Otherwise, I guess I'm in the market for a different 4 channel RAID 5 SATA card. Any recommendations?
Actually, I have some 9500s that are finally getting stable. Each new driver revision and each new firmware is better than the last, and by now I have no issues (also running x86_64, but on AMD hardware).
Anyone who asks me usually gets an 8500 controller as a recommendation because, unlike the 9500s, those are actually rock solid. You should try the firmware updates and maybe the driver (if there is one newer than the one in the kernel you're running) and see if you can get your box stable that way. As I said before, it worked for me.
Also, you might want to get an AMD system next time if you have heavy I/O and more than 4 GB of RAM: the AMD IOMMU requires no bounce buffers...
Peter.
Well, I upgraded to the latest driver today and the problem persists. (By the way, upgrading via the driver source does also upgrade the firmware and BIOS.) I spoke to 3Ware support. They seemed familiar with the problem but blamed it on Red Hat's kernels. They then referred me to some performance tuning appnotes, which I tried with no success.
I guess I'm off to buy a new card this afternoon - probably LSI or Adaptec.
Mark Miksis wrote:
Well, I upgraded to the latest driver today and the problem persists. (By the way, upgrading via the driver source does also upgrade the firmware and BIOS.) I spoke to 3Ware support. They seemed familiar with the problem but blamed it on Red Hat's kernels. They then referred me to some performance tuning appnotes, which I tried with no success.
I guess I'm off to buy a new card this afternoon - probably LSI or Adaptec.
I've been running the 3Ware card with the write cache turned off because I was nervous about not having any battery backup on the card itself. Turning the cache on seems to have completely solved my problem! I've been running heavy traffic (both reads and writes) for the last few hours and it's been rock solid and very fast. I guess I learned something today. I'm still not sure I'm comfortable running in this configuration, but I do have a UPS on the box.
BTW, this means my 3Ware card is not available;)
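For anyone who wants to repeat the kind of sustained read and write traffic described above, here is a rough sketch of a load generator. It is not the workload actually used in this thread (that was a backup restore); the function name, sizes and defaults are all hypothetical. Note that with small files the reads will mostly come from the page cache rather than the controller, so the file size would need to exceed RAM to exercise the card on reads.

```python
import os
import tempfile


def io_burst(directory, file_size=4 * 1024 * 1024,
             block_size=64 * 1024, passes=2):
    """Write and re-read a scratch file `passes` times.

    Returns (bytes_written, bytes_read). All defaults are illustrative;
    raise file_size and passes to generate a heavier, longer load.
    """
    block = os.urandom(block_size)
    written = 0
    read = 0
    for _ in range(passes):
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as f:
                while f.tell() < file_size:
                    f.write(block)
                f.flush()
                os.fsync(f.fileno())  # push the data down to the device
                written += f.tell()
            with open(path, "rb") as f:
                while True:
                    chunk = f.read(block_size)
                    if not chunk:
                        break
                    read += len(chunk)
        finally:
            os.remove(path)  # always clean up the scratch file
    return written, read


if __name__ == "__main__":
    # Point this at a directory on the array under test.
    print(io_burst(tempfile.gettempdir()))
```

Running it in a loop against a directory on the RAID volume, once with the write cache off and once with it on, would be one way to check whether the cache setting is really what makes the difference.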
On Thursday 30 June 2005 21:35, Mark Miksis wrote:
I've been running the 3Ware card with the write cache turned off because I was nervous about not having any battery backup on the card itself. Turning the cache on seems to have completely solved my problem! I've been running heavy traffic (both reads and writes) for the last few hours and it's been rock solid and very fast. I guess I learned something today. I'm still not sure I'm comfortable running in this configuration, but I do have a UPS on the box.
Very interesting - what SODIMM are you using? Maybe it's simply a defective cache module...?
BTW, this means my 3Ware card is not available;)
Oh darn :-) Was wanting one for home to play around with but not willing to spend so much money just for yet another toy... :-D
Peter.
Peter Arremann wrote:
On Thursday 30 June 2005 21:35, Mark Miksis wrote:
I've been running the 3Ware card with the write cache turned off because I was nervous about not having any battery backup on the card itself. Turning the cache on seems to have completely solved my problem! I've been running heavy traffic (both reads and writes) for the last few hours and it's been rock solid and very fast. I guess I learned something today. I'm still not sure I'm comfortable running in this configuration, but I do have a UPS on the box.
Very interesting - what SODIMM are you using? Maybe it's simply a defective cache module...?
I never looked at the SODIMM - it's whatever the card came with. If it were bad, I'd expect the card to fail with the cache enabled, not the other way around.
On Thursday 30 June 2005 23:04, Mark Miksis wrote:
I never looked at the SODIMM - it's whatever the card came with. If it were bad, I'd expect the card to fail with the cache enabled, not the other way around.
Oops - sorry, I misread your post. But at least that gives me something to try out - I'll disable the cache on one of the boxes we've got and see if it fails as well...
Peter.