My desktop machine started having random crashes last summer when I installed Fedora 13 (x86_64) on it. Now I upgraded to Fedora 14, and this has only got worse: before, uptimes could be anything between 3 minutes and a month, now it's a few hours at most.
I never found anything interesting in the logs.
When running without X, during a crash I saw a message on the console mentioning mcelog, so I installed and used it as a daemon, and also via /sys/devices/system/machinecheck/machinecheck0/trigger as described here: http://www.kernel.org/pub/linux/utils/cpu/mce/README But it didn't report anything.
I checked temperatures with sensors: nothing. Ran memtest86+ for hours: the same.
Is there something else that I should check?
Andras
On Sat, Feb 26, 2011 at 18:21:05 +0100, Andras Simon szajmi@gmail.com wrote:
I never found anything interesting in the logs.
Did you notice any other patterns?
I have seen hard-to-track-down bugs that were correlated with system activity. For example, I am pretty sure there is a bug in 2.6.36 with resyncing software RAID 1 arrays that causes system hangs and crashes.
In the past I have seen crashes that appear to be correlated with high network traffic.
On Sat, 2011-02-26 at 18:21 +0100, Andras Simon wrote:
My desktop machine started having random crashes last summer when I installed Fedora 13 (x86_64) on it.
(snip)
A bit more information would help us make some useful suggestions --
What kind of hardware? (lshw can be useful here)
What kind of crashes?
- system turns off abruptly
- system locks up from GUI but is still pingable
- system locks up and is unresponsive both locally and on the network
- gui shuts down and error messages spray all over the console
- ...?
What's your system configuration, in particular the video configuration (in terms of both hardware and software)?
Any health alerts from your hardware? You mention checking temperature and seeing nothing unusual, anything from the disk via SMART? If your BIOS reports voltage levels, are they normal?
Any correlation between activity and crashes? Does it happen when you're using the machine, or away from it for a few minutes? Does it matter what you're doing?
-Chris
On 02/26/2011 05:33 PM, Chris Tyler wrote:
(snip)
In addition, is the kernel tainted?
I've found (starting around the same time, last summer) that I cannot get the nvidia drivers (from nvidia's website) to work on my FC13/14 machine without random crashes; however, if I compile the stock kernel, it's rock solid.
Albert.
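For anyone following along who wants to answer the "is the kernel tainted?" question directly: the kernel exposes its taint state as a bitmask in /proc/sys/kernel/tainted. The sketch below decodes the lower bits of that mask. The function names are made up for illustration, and the table covers only the long-standing low bits, so anything else is reported as unknown rather than guessed.

```python
# Sketch: decode the kernel taint bitmask from /proc/sys/kernel/tainted.
# decode_taint() and read_taint() are illustrative names, not an existing API.
# Only the well-established low bits are listed; others are reported as unknown.
from pathlib import Path

TAINT_FLAGS = {
    0: "P: proprietary module was loaded",
    1: "F: module was force-loaded",
    3: "R: module was force-unloaded",
    4: "M: machine check exception occurred",
    5: "B: bad page referenced",
    7: "D: kernel died recently (oops or BUG)",
    9: "W: kernel issued a warning",
}


def decode_taint(value):
    """Return a description for every bit that is set in the taint mask."""
    return [
        TAINT_FLAGS.get(bit, "bit %d: not in this table" % bit)
        for bit in range(value.bit_length())
        if value & (1 << bit)
    ]


def read_taint(path="/proc/sys/kernel/tainted"):
    """Read the current taint mask; 0 means the kernel is not tainted."""
    return int(Path(path).read_text().strip())
```

Calling decode_taint(read_taint()) on a running system lists the active flags. Note that the mask is reset at boot, so a flag set by a crash will not survive the reboot that follows it.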
On 2/26/11, agraham agraham@g-b.net wrote:
On 02/26/2011 05:33 PM, Chris Tyler wrote:
(snip)
In addition, is the kernel tainted?
No, it's vanilla F14. I did a minimal install, and got the first crash after the first reboot.
Andras
On 2/26/11, Chris Tyler chris@tylers.info wrote:
On Sat, 2011-02-26 at 18:21 +0100, Andras Simon wrote:
My desktop machine started having random crashes last summer when I installed Fedora 13 (x86_64) on it.
(snip)
A bit more information would help us make some useful suggestions --
Sure!
What kind of hardware? (lshw can be useful here)
I copied lshw's output to the end of this mail.
What kind of crashes?
- system turns off abruptly
- system locks up from GUI but is still pingable
- system locks up and is unresponsive both locally and on the network
- gui shuts down and error messages spray all over the console
- ...?
If X runs, then it dies and only a blank console with a cursor at the upper left corner can be seen for half a minute; then reboot starts. Otherwise, there's a longish message on the console about a machine check exception, saying a lot of things, such as 'panic occurred, switching back to text console'. This is the message that mentions mcelog. It also warns about a reboot in 30 seconds, and indeed reboots in 30 secs. I'd love to capture the whole message, but by this time the machine is unresponsive, and 30 seconds is way too short for me to copy it all.
What's your system configuration, in particular the video configuration (in terms of both hardware and software)?
See the output of lshw below. The video card is Nvidia Geforce 7300GT, with the nouveau driver. (Back when it all started with F13, I tried the nvidia driver, but it didn't help.)
Any health alerts from your hardware? You mention checking temperature and seeing nothing unusual, anything from the disk via SMART? If your BIOS reports voltage levels, are they normal?
SMART was silent with F13. In F14 I've only just installed smartmontools.
As for voltage levels: this is what sensors says about them (if this is what you mean):
it8718-isa-0290
Adapter: ISA adapter
in0:  +1.15 V  (min = +0.00 V, max = +4.08 V)
in1:  +1.90 V  (min = +0.00 V, max = +4.08 V)
in2:  +3.34 V  (min = +0.00 V, max = +4.08 V)
in3:  +2.93 V  (min = +0.00 V, max = +4.08 V)
in4:  +0.22 V  (min = +0.00 V, max = +4.08 V)
in5:  +0.00 V  (min = +0.00 V, max = +4.08 V)  ALARM
in6:  +1.26 V  (min = +0.00 V, max = +4.08 V)
in7:  +3.09 V  (min = +0.00 V, max = +4.08 V)
Vbat: +4.08 V
I have no idea what these numbers mean; that ALARM looks alarming, but it is (and as far as I can remember, has been) there all the time.
The BIOS (and sensors) also say that the fan in the power supply doesn't work, but it does.
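Since the voltage listing is hard to eyeball, here is a rough sketch of how it could be scanned automatically: it pulls each reading out of `sensors`-style text and reports the inputs that carry an ALARM flag or fall outside their stated limits. The regular expression is written against the output quoted above and is an assumption about the format, not a general lm-sensors parser; the function names are invented for the example.

```python
# Sketch: pick voltage readings out of `sensors`-style text and flag the odd ones.
# The pattern is fitted to the output quoted in this thread (an assumption);
# parse_voltages() and flagged() are illustrative names.
import re

READING_RE = re.compile(
    r"(?P<label>\w+):\s*(?P<value>[+-]?\d+\.\d+) V"
    r"(?:\s*\(min = (?P<min>[+-]?\d+\.\d+) V, max = (?P<max>[+-]?\d+\.\d+) V\))?"
    r"(?P<alarm>\s+ALARM)?"
)


def parse_voltages(text):
    """Return one dict per voltage reading found in the text."""
    readings = []
    for m in READING_RE.finditer(text):
        readings.append({
            "label": m.group("label"),
            "value": float(m.group("value")),
            "min": float(m.group("min")) if m.group("min") else None,
            "max": float(m.group("max")) if m.group("max") else None,
            "alarm": m.group("alarm") is not None,
        })
    return readings


def flagged(readings):
    """Labels with an ALARM marker or a value strictly outside [min, max]."""
    result = []
    for r in readings:
        below = r["min"] is not None and r["value"] < r["min"]
        above = r["max"] is not None and r["value"] > r["max"]
        if r["alarm"] or below or above:
            result.append(r["label"])
    return result
```

Run on the listing above, this would single out in5 only. Whether that input is actually wired to anything on this board is not something the output can tell us, so a permanent ALARM there may or may not be meaningful.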
Any correlation between activity and crashes? Does it happen when you're using the machine, or away from it for a few minutes? Does it matter what you're doing?
The crashes seem to happen more frequently when I'm doing something interactively (even if it's just browsing, running yum). This morning it was running for hours under heavy load with no problems, but crashed a few minutes after I started interacting with it.
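One way to put the "does it correlate with activity?" question on firmer ground is to leave a small heartbeat logger running, so that after each crash the last line on disk shows when the machine died and how loaded it was. The following is only a sketch: the log path and interval are arbitrary choices, and the function names are invented here.

```python
# Sketch: append a timestamp and the load averages to a file at a fixed interval,
# syncing each line to disk so it survives an abrupt reboot.
# The path and interval are arbitrary assumptions; adjust to taste.
import os
import time


def heartbeat_line(now=None, load=None):
    """Format one log line; arguments default to the current time and load."""
    now = time.time() if now is None else now
    load = os.getloadavg() if load is None else load
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return "%s load=%.2f %.2f %.2f\n" % ((stamp,) + tuple(load))


def append_heartbeat(path, line):
    """Append a line and force it to disk before returning."""
    with open(path, "a") as f:
        f.write(line)
        f.flush()
        os.fsync(f.fileno())


def run(path="/var/tmp/heartbeat.log", interval=10):
    """Log forever; stop with Ctrl-C."""
    while True:
        append_heartbeat(path, heartbeat_line())
        time.sleep(interval)
```

After a few crashes, comparing the final lines of the log against what was being done at the time should show whether load or interactive use really matters, or whether the crashes are independent of both.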
And now the output of lshw:
description: Desktop Computer product: 965P-DS3 () vendor: Gigabyte Technology Co., Ltd. width: 64 bits capabilities: smbios-2.4 dmi-2.4 vsyscall64 vsyscall32 configuration: boot=normal chassis=desktop uuid=00000000-0000-0000-0000-001A4D670455
*-core description: Motherboard product: 965P-DS3 vendor: Gigabyte Technology Co., Ltd. physical id: 0
*-firmware description: BIOS vendor: Award Software International, Inc. physical id: 0 version: F10 date: 01/12/2007 size: 128KiB capacity: 960KiB capabilities: pci pnp apm upgrade shadowing cdboot bootselect edd int13floppy360 int13floppy1200 int13floppy720 int13floppy2880 int5printscreen int9keyboard int14serial int17printer int10video acpi usb ls120boot zipboot biosbootspecification
*-cpu description: CPU product: Intel(R) Core(TM)2 CPU 6420 @ 2.13GHz vendor: Intel Corp. physical id: 4 bus info: cpu@0 version: Intel(R) Core(TM)2 CPU 642 slot: Socket 775 size: 1600MHz capacity: 4GHz width: 64 bits clock: 266MHz capabilities: fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx x86-64 constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm lahf_lm tpr_shadow
*-cache:0 description: L1 cache physical id: a slot: Internal Cache size: 64KiB capacity: 64KiB capabilities: synchronous internal write-back
*-cache:1 description: L2 cache physical id: b slot: External Cache size: 4MiB capabilities: synchronous internal write-back
*-memory description: System Memory physical id: 1b slot: System board or motherboard size: 2GiB
*-bank:0 description: DIMM 800 MHz (1.2 ns) physical id: 0 slot: A0 size: 1GiB width: 64 bits clock: 800MHz (1.2ns)
*-bank:1 description: DIMM [empty] physical id: 1 slot: A1
*-bank:2 description: DIMM 800 MHz (1.2 ns) physical id: 2 slot: A2 size: 1GiB width: 64 bits clock: 800MHz (1.2ns)
*-bank:3 description: DIMM [empty] physical id: 3 slot: A3
*-pci description: Host bridge product: 82P965/G965 Memory Controller Hub vendor: Intel Corporation physical id: 100 bus info: pci@0000:00:00.0 version: 02 width: 32 bits clock: 33MHz
*-pci:0 description: PCI bridge product: 82P965/G965 PCI Express Root Port vendor: Intel Corporation physical id: 1 bus info: pci@0000:00:01.0 version: 02 width: 32 bits clock: 33MHz capabilities: pci pm msi pciexpress normal_decode bus_master cap_list configuration: driver=pcieport resources: irq:40 ioport:8000(size=4096) memory:f4000000-f6ffffff ioport:e0000000(size=268435456)
*-display description: VGA compatible controller product: G73 [GeForce 7300 GT] vendor: nVidia Corporation physical id: 0 bus info: pci@0000:01:00.0 version: a1 width: 64 bits clock: 33MHz capabilities: pm msi pciexpress vga_controller bus_master cap_list rom configuration: driver=nouveau latency=0 resources: irq:16 memory:f4000000-f4ffffff memory:e0000000-efffffff memory:f5000000-f5ffffff ioport:8000(size=128) memory:f6000000-f601ffff
*-usb:0 description: USB Controller product: 82801H (ICH8 Family) USB UHCI Controller #4 vendor: Intel Corporation physical id: 1a bus info: pci@0000:00:1a.0 version: 02 width: 32 bits clock: 33MHz capabilities: uhci bus_master configuration: driver=uhci_hcd latency=0 resources: irq:16 ioport:c000(size=32)
*-usbhost product: UHCI Host Controller vendor: Linux 2.6.35.11-83.fc14.x86_64 uhci_hcd physical id: 1 bus info: usb@3 logical name: usb3 version: 2.06 capabilities: usb-1.10 configuration: driver=hub slots=2 speed=12Mbit/s
*-usb:1 description: USB Controller product: 82801H (ICH8 Family) USB UHCI Controller #5 vendor: Intel Corporation physical id: 1a.1 bus info: pci@0000:00:1a.1 version: 02 width: 32 bits clock: 33MHz capabilities: uhci bus_master configuration: driver=uhci_hcd latency=0 resources: irq:21 ioport:c400(size=32)
*-usbhost product: UHCI Host Controller vendor: Linux 2.6.35.11-83.fc14.x86_64 uhci_hcd physical id: 1 bus info: usb@4 logical name: usb4 version: 2.06 capabilities: usb-1.10 configuration: driver=hub slots=2 speed=12Mbit/s
*-usb:2 description: USB Controller product: 82801H (ICH8 Family) USB2 EHCI Controller #2 vendor: Intel Corporation physical id: 1a.7 bus info: pci@0000:00:1a.7 version: 02 width: 32 bits clock: 33MHz capabilities: pm ehci bus_master cap_list configuration: driver=ehci_hcd latency=0 resources: irq:18 memory:f9205000-f92053ff
*-usbhost product: EHCI Host Controller vendor: Linux 2.6.35.11-83.fc14.x86_64 ehci_hcd physical id: 1 bus info: usb@1 logical name: usb1 version: 2.06 capabilities: usb-2.00 configuration: driver=hub slots=4 speed=480Mbit/s
*-multimedia description: Audio device product: 82801H (ICH8 Family) HD Audio Controller vendor: Intel Corporation physical id: 1b bus info: pci@0000:00:1b.0 version: 02 width: 64 bits clock: 33MHz capabilities: pm msi pciexpress bus_master cap_list configuration: driver=HDA Intel latency=0 resources: irq:45 memory:f9200000-f9203fff
*-pci:1 description: PCI bridge product: 82801H (ICH8 Family) PCI Express Port 1 vendor: Intel Corporation physical id: 1c bus info: pci@0000:00:1c.0 version: 02 width: 32 bits clock: 33MHz capabilities: pci pciexpress msi pm normal_decode bus_master cap_list configuration: driver=pcieport resources: irq:41 ioport:7000(size=4096) memory:80000000-801fffff ioport:80200000(size=2097152)
*-pci:2 description: PCI bridge product: 82801H (ICH8 Family) PCI Express Port 4 vendor: Intel Corporation physical id: 1c.3 bus info: pci@0000:00:1c.3 version: 02 width: 32 bits clock: 33MHz capabilities: pci pciexpress msi pm normal_decode bus_master cap_list configuration: driver=pcieport resources: irq:42 ioport:9000(size=8192) memory:f9000000-f90fffff ioport:80400000(size=2097152)
*-storage description: SATA controller product: JMB362/JMB363 Serial ATA Controller vendor: JMicron Technology Corp. physical id: 0 bus info: pci@0000:03:00.0 version: 02 width: 32 bits clock: 33MHz capabilities: storage pm pciexpress ahci_1.0 bus_master cap_list configuration: driver=ahci latency=0 resources: irq:19 memory:f9000000-f9001fff
*-ide description: IDE interface product: JMB362/JMB363 Serial ATA Controller vendor: JMicron Technology Corp. physical id: 0.1 bus info: pci@0000:03:00.1 logical name: scsi6 version: 02 width: 32 bits clock: 33MHz capabilities: ide pm bus_master cap_list emulated configuration: driver=pata_jmicron latency=0 resources: irq:16 ioport:9000(size=8) ioport:9400(size=4) ioport:9800(size=8) ioport:9c00(size=4) ioport:a000(size=16)
*-cdrom description: DVD writer product: DVD-RW DVR-112D vendor: PIONEER physical id: 0.0.0 bus info: scsi@6:0.0.0 logical name: /dev/cdrom logical name: /dev/cdrw logical name: /dev/dvd logical name: /dev/dvdrw logical name: /dev/scd0 logical name: /dev/sr0 version: 1.15 capabilities: removable audio cd-r cd-rw dvd dvd-r configuration: ansiversion=5 status=ready
*-medium physical id: 0 logical name: /dev/cdrom
*-pci:3 description: PCI bridge product: 82801H (ICH8 Family) PCI Express Port 5 vendor: Intel Corporation physical id: 1c.4 bus info: pci@0000:00:1c.4 version: 02 width: 32 bits clock: 33MHz capabilities: pci pciexpress msi pm normal_decode bus_master cap_list configuration: driver=pcieport resources: irq:43 ioport:b000(size=4096) memory:f7000000-f8ffffff ioport:80600000(size=2097152)
*-network description: Ethernet interface product: 88E8056 PCI-E Gigabit Ethernet Controller vendor: Marvell Technology Group Ltd. physical id: 0 bus info: pci@0000:04:00.0 logical name: eth0 version: 12 serial: 00:1a:4d:67:04:55 size: 100Mbit/s capacity: 1Gbit/s width: 64 bits clock: 33MHz capabilities: pm vpd msi pciexpress bus_master cap_list rom ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd 1000bt 1000bt-fd autonegotiation configuration: autonegotiation=on broadcast=yes driver=sky2 driverversion=1.28 duplex=full firmware=N/A ip=192.168.1.97 latency=0 link=yes multicast=yes port=twisted pair speed=100Mbit/s resources: irq:44 memory:f8000000-f8003fff ioport:b000(size=256) memory:80600000-8061ffff
*-usb:3 description: USB Controller product: 82801H (ICH8 Family) USB UHCI Controller #1 vendor: Intel Corporation physical id: 1d bus info: pci@0000:00:1d.0 version: 02 width: 32 bits clock: 33MHz capabilities: uhci bus_master configuration: driver=uhci_hcd latency=0 resources: irq:23 ioport:c800(size=32)
*-usbhost product: UHCI Host Controller vendor: Linux 2.6.35.11-83.fc14.x86_64 uhci_hcd physical id: 1 bus info: usb@5 logical name: usb5 version: 2.06 capabilities: usb-1.10 configuration: driver=hub slots=2 speed=12Mbit/s
*-usb:4 description: USB Controller product: 82801H (ICH8 Family) USB UHCI Controller #2 vendor: Intel Corporation physical id: 1d.1 bus info: pci@0000:00:1d.1 version: 02 width: 32 bits clock: 33MHz capabilities: uhci bus_master configuration: driver=uhci_hcd latency=0 resources: irq:19 ioport:cc00(size=32)
*-usbhost product: UHCI Host Controller vendor: Linux 2.6.35.11-83.fc14.x86_64 uhci_hcd physical id: 1 bus info: usb@6 logical name: usb6 version: 2.06 capabilities: usb-1.10 configuration: driver=hub slots=2 speed=12Mbit/s
*-usb:5 description: USB Controller product: 82801H (ICH8 Family) USB UHCI Controller #3 vendor: Intel Corporation physical id: 1d.2 bus info: pci@0000:00:1d.2 version: 02 width: 32 bits clock: 33MHz capabilities: uhci bus_master configuration: driver=uhci_hcd latency=0 resources: irq:18 ioport:d000(size=32)
*-usbhost product: UHCI Host Controller vendor: Linux 2.6.35.11-83.fc14.x86_64 uhci_hcd physical id: 1 bus info: usb@7 logical name: usb7 version: 2.06 capabilities: usb-1.10 configuration: driver=hub slots=2 speed=12Mbit/s
*-usb:6 description: USB Controller product: 82801H (ICH8 Family) USB2 EHCI Controller #1 vendor: Intel Corporation physical id: 1d.7 bus info: pci@0000:00:1d.7 version: 02 width: 32 bits clock: 33MHz capabilities: pm ehci bus_master cap_list configuration: driver=ehci_hcd latency=0 resources: irq:23 memory:f9204000-f92043ff
*-usbhost product: EHCI Host Controller vendor: Linux 2.6.35.11-83.fc14.x86_64 ehci_hcd physical id: 1 bus info: usb@2 logical name: usb2 version: 2.06 capabilities: usb-2.00 configuration: driver=hub slots=6 speed=480Mbit/s
*-usb description: Mass storage device product: FCR-HS219/1 Mobile Reader vendor: Kingston physical id: 2 bus info: usb@2:2 logical name: scsi8 version: 97.15 serial: 000004213 capabilities: usb-2.00 scsi emulated scsi-host configuration: driver=usb-storage maxpower=500mA speed=480Mbit/s
*-disk:0 description: SCSI Disk product: FCR-HS219/1 vendor: Kingston physical id: 0.0.0 bus info: scsi@8:0.0.0 logical name: /dev/sdb version: 9715 capabilities: removable
*-medium physical id: 0 logical name: /dev/sdb
*-disk:1 description: SCSI Disk product: FCR-HS219/1 vendor: Kingston physical id: 0.0.1 bus info: scsi@8:0.0.1 logical name: /dev/sdc version: 9715 capabilities: removable
*-medium physical id: 0 logical name: /dev/sdc
*-disk:2 description: SCSI Disk product: FCR-HS219/1 vendor: Kingston physical id: 0.0.2 bus info: scsi@8:0.0.2 logical name: /dev/sdd version: 9715 capabilities: removable
*-medium physical id: 0 logical name: /dev/sdd
*-disk:3 description: SCSI Disk product: FCR-HS219/1 vendor: Kingston physical id: 0.0.3 bus info: scsi@8:0.0.3 logical name: /dev/sde version: 9715 capabilities: removable
*-medium physical id: 0 logical name: /dev/sde
*-pci:4 description: PCI bridge product: 82801 PCI Bridge vendor: Intel Corporation physical id: 1e bus info: pci@0000:00:1e.0 version: f2 width: 32 bits clock: 33MHz capabilities: pci subtractive_decode bus_master cap_list resources: memory:f9100000-f91fffff
*-multimedia description: Multimedia controller product: SAA7131/SAA7133/SAA7135 Video Broadcast Decoder vendor: Philips Semiconductors physical id: 0 bus info: pci@0000:05:00.0 version: d1 width: 32 bits clock: 33MHz capabilities: pm bus_master cap_list configuration: driver=saa7134 latency=32 maxlatency=32 mingnt=84 resources: irq:20 memory:f9100000-f91007ff
*-isa description: ISA bridge product: 82801HB/HR (ICH8/R) LPC Interface Controller vendor: Intel Corporation physical id: 1f bus info: pci@0000:00:1f.0 version: 02 width: 32 bits clock: 33MHz capabilities: isa bus_master cap_list configuration: latency=0
*-ide:0 description: IDE interface product: 82801H (ICH8 Family) 4 port SATA IDE Controller vendor: Intel Corporation physical id: 1f.2 bus info: pci@0000:00:1f.2 logical name: scsi2 version: 02 width: 32 bits clock: 66MHz capabilities: ide pm bus_master cap_list emulated configuration: driver=ata_piix latency=0 resources: irq:19 ioport:1f0(size=8) ioport:3f6 ioport:170(size=8) ioport:376 ioport:f000(size=16) ioport:fc00(size=16)
*-disk description: ATA Disk product: ST3320620AS vendor: Seagate physical id: 0.0.0 bus info: scsi@2:0.0.0 logical name: /dev/sda version: 3.AA serial: 6QF07NPQ size: 298GiB (320GB) capabilities: partitioned partitioned:dos configuration: ansiversion=5 signature=0005269c
*-volume:0 description: Windows NTFS volume physical id: 1 bus info: scsi@2:0.0.0,1 logical name: /dev/sda1 version: 3.1 serial: d6d2500d-c252-d046-9bd7-8153b1e39865 size: 10001MiB capacity: 10001MiB capabilities: primary bootable ntfs initialized configuration: clustersize=4096 created=2007-06-03 23:20:02 filesystem=ntfs state=clean
*-volume:1 description: EXT4 volume vendor: Linux physical id: 2 bus info: scsi@2:0.0.0,2 logical name: /dev/sda2 logical name: /boot version: 1.0 serial: dd9c40f0-7fe7-4182-8425-3429b3b92244 size: 101MiB capacity: 101MiB capabilities: primary journaled extended_attributes huge_files dir_nlink recover extents ext4 ext2 initialized configuration: created=2011-02-25 21:52:53 filesystem=ext4 lastmountpoint=/boot modified=2011-02-26 18:58:56 mount.fstype=ext4 mount.options=rw,seclabel,relatime,barrier=1,data=ordered mounted=2011-02-26 18:58:56 state=mounted
*-volume:2 description: EXT3 volume vendor: Linux physical id: 3 bus info: scsi@2:0.0.0,3 logical name: /dev/sda3 version: 1.0 serial: 86c8dd4f-c4c7-453d-b5c3-5c82dd9ea2ef size: 101MiB capacity: 101MiB capabilities: primary journaled extended_attributes ext3 ext2 initialized configuration: created=2009-01-18 19:10:35 filesystem=ext3 label=boot2 modified=2010-09-30 01:29:24 mounted=2010-09-30 01:28:19 state=clean
*-volume:3 description: Extended partition physical id: 4 bus info: scsi@2:0.0.0,4 logical name: /dev/sda4 size: 288GiB capacity: 288GiB capabilities: primary extended partitioned partitioned:extended
*-logicalvolume:0 description: Linux swap / Solaris partition physical id: 5 logical name: /dev/sda5 capacity: 8001MiB capabilities: nofs
*-logicalvolume:1 description: Linux filesystem partition physical id: 6 logical name: /dev/sda6 logical name: / capacity: 10001MiB configuration: mount.fstype=ext4 mount.options=rw,seclabel,relatime,barrier=1,data=ordered state=mounted
*-logicalvolume:2 description: Linux filesystem partition physical id: 7 logical name: /dev/sda7 capacity: 10001MiB
*-logicalvolume:3 description: Linux filesystem partition physical id: 8 logical name: /dev/sda8 logical name: /tmp capacity: 10001MiB configuration: mount.fstype=ext3 mount.options=rw,seclabel,relatime,errors=continue,user_xattr,acl,barrier=0,data=ordered state=mounted
*-logicalvolume:4 description: Linux filesystem partition physical id: 9 logical name: /dev/sda9 logical name: /usr capacity: 14GiB configuration: mount.fstype=ext4 mount.options=rw,seclabel,relatime,barrier=1,data=ordered state=mounted
*-logicalvolume:5 description: Linux filesystem partition physical id: a logical name: /dev/sda10 capacity: 14GiB
*-logicalvolume:6 description: Linux filesystem partition physical id: b logical name: /dev/sda11 logical name: /home capacity: 4996MiB configuration: mount.fstype=ext3 mount.options=rw,seclabel,relatime,errors=continue,barrier=0,data=ordered state=mounted
*-logicalvolume:7 description: Linux filesystem partition physical id: c logical name: /dev/sda12 logical name: /usr/local capacity: 136GiB configuration: mount.fstype=ext3 mount.options=rw,seclabel,relatime,errors=continue,barrier=0,data=ordered state=mounted
*-logicalvolume:8 description: Linux filesystem partition physical id: d logical name: /dev/sda13 logical name: /opt/home capacity: 80GiB configuration: mount.fstype=ext3 mount.options=rw,seclabel,relatime,errors=continue,barrier=0,data=ordered state=mounted
*-serial description: SMBus product: 82801H (ICH8 Family) SMBus Controller vendor: Intel Corporation physical id: 1f.3 bus info: pci@0000:00:1f.3 version: 02 width: 32 bits clock: 33MHz configuration: driver=i801_smbus latency=0 resources: irq:18 memory:f9206000-f92060ff ioport:500(size=32)
*-ide:1 description: IDE interface product: 82801H (ICH8 Family) 2 port SATA IDE Controller vendor: Intel Corporation physical id: 1f.5 bus info: pci@0000:00:1f.5 version: 02 width: 32 bits clock: 66MHz capabilities: ide pm bus_master cap_list configuration: driver=ata_piix latency=0 resources: irq:19 ioport:d800(size=8) ioport:dc00(size=4) ioport:e000(size=8) ioport:e400(size=4) ioport:e800(size=16) ioport:ec00(size=16)
Thanks!
Andras
On Sat, Feb 26, 2011 at 20:02:26 +0100, Andras Simon szajmi@gmail.com wrote:
The crashes seem to happen more frequently when I'm doing something interactively (even if it's just browsing, running yum). This morning it was running for hours under heavy load with no problems, but crashed a few minutes after I started interacting with it.
That sounds potentially like a video driver issue. If you are using compiz, you could try metacity instead as that does less 3D stuff.
Nouveau seems better in F15, but upower triggers an issue that causes annoying screen flashes every 30 seconds, and for at least some nVidia cards (including an nv28 I have on one machine) the blank period is about 1/2 second.
Once the alpha release is out, you might try a live image of it to see if that resolves your crashing problem.
On 2/26/11, Bruno Wolff III bruno@wolff.to wrote:
On Sat, Feb 26, 2011 at 20:02:26 +0100, Andras Simon szajmi@gmail.com wrote:
The crashes seem to happen more frequently when I'm doing something interactively (even if it's just browsing, running yum). This morning it was running for hours under heavy load with no problems, but crashed a few minutes after I started interacting with it.
It seems I was too eager to find a clue: just now it crashed while I was away (in the middle of installing a lot of packages).
That sounds potentially like a video driver issue. If you are using compiz, you could try metacity instead as that does less 3D stuff.
No compiz here. And crashes happen without X running.
Nouveau seems better in F15, but upower triggers an issue that causes annoying screen flashes every 30 seconds, and for at least some nVidia cards (including an nv28 I have on one machine) the blank period is about 1/2 second.
Back in the summer, I was also suspicious about nouveau, until I tried the nvidia driver...
Once the alpha release is out, you might try a live image of it to see if that resolves your crashing problem.
In fact I'm planning to try a 32 bit F14 live image first, if I can keep this thing alive for long enough to write out a cd (this MB doesn't boot from a usb stick).
Andras
--- On Sat, 2/26/11, Andras Simon szajmi@gmail.com wrote:
On 2/26/11, Bruno Wolff III bruno@wolff.to wrote:
On Sat, Feb 26, 2011 at 20:02:26 +0100, Andras Simon szajmi@gmail.com wrote:
The crashes seem to happen more frequently when I'm doing something interactively (even if it's just browsing, running yum). This morning it was running for hours under heavy load with no problems, but crashed a few minutes after I started interacting with it.
[snip]
Once the alpha release is out, you might try a live image of it to see if that resolves your crashing problem.
In fact I'm planning to try a 32 bit F14 live image first, if I can keep this thing alive for long enough to write out a cd (this MB doesn't boot from a usb stick).
TO: OP
If you think it's specifically a Fedora problem (I don't), I would install (a real dual boot install--not a LiveCD) a distro other than Fedora that has the same versions of kernel, GNOME, etc. to see if the problem persists. Then try one that has a different kernel, etc.
Since this problem has continued over several versions of Fedora, I think it's hardware related. I had a similar problem with a system years ago: Periodically, during heavy CPU use, the system "stopped" as if someone had turned it off; however, the MB was still powered as the fans were running and power lights on. The system was only about 2 years old at the time. Overheating CPU was my first thought. Took the system apart and cleaned everything. It ran for several weeks with no problems, then it happened again.
This behavior continued but more and more frequently for over a year through two versions of Fedora Core (5 and 6). Then reboots began to stall, even if the system was left off for some time, even overnight. (I was still thinking overheating, but all temps seemed nominal. Bad thermistor, maybe? Or bad MB.) I had already replaced the power supply and the graphics card. And done many memtests. The problem continued with increasing frequency. I never had anything in the logs indicative of the problem.
Fortunately, the MB, an MSI, IIRC, had LEDs that would indicate the stage of POST during start up. (I had forgotten about this feature.) I began checking those when booting. When a reboot failed, POST was always at the CPU check. I replaced the CPU and the problem never occurred again. And at 8 years old when I gave it away, the system was still working perfectly. Food for thought.
B
On 2/27/11, Patrick Bartek bartek047@yahoo.com wrote:
TO: OP
If you think it's specifically a Fedora problem (I don't), I would install
Me neither. It's just that I still think it's possible that it's not a HW problem.
(a real dual boot install--not a LiveCD) distro other than Fedora that has the same versions of kernel, GNOME, etc. to see if the problem persists. Then try one that has different kernel, etc.
Is there a reason why a real install is better in this case? The disk can be exercised with a live cd, too. Of course, one is constrained to the applications that are on the cd, but still... And finding a distro with the same components is probably impossible.
BTW, I'm running a 32 bit live cd now and it's been chugging along under heavyish load (load avg above 4) with no problems. Doesn't mean a thing, of course... But if it goes on like this, then trying the 64 bit live cd or installing the 32 bit version of F14 will probably make sense.
Andras
On Sun, 27 Feb 2011 20:08:58 +0100 Andras Simon szajmi@gmail.com wrote:
On 2/27/11, Patrick Bartek bartek047@yahoo.com wrote:
TO: OP
If you think it's specifically a Fedora problem (I don't), I would install
Me neither. It's just that I still think it's possible that it's not a HW problem.
I'm going to offer some support for this opinion. Anecdotal only, though. Since around F11 or F12, I've had problems with lockups when running the stock Fedora kernel. So in each case, I've compiled a custom kernel from the src.rpm of the kernels that Fedora distributes. I tune it to eliminate any hardware I don't have on my system in order to cut down on the compile time (from an hour to as low as 10 minutes) and also to tune for performance and eliminate features that I don't use. The lockups then go away. I never see one again.
For me, this seems to happen when I eliminate SMP. I think this is because I have a single core CPU, and the scheduler doesn't compensate for this properly. No proof, not even evidence other than when I do this I no longer have lockups. And it could be an interaction with something else I've removed. The kernel is a very complicated beast.
Starting with the 2.6.35 series used in F14, I have to patch the kernel in order to compile the kernel with no SMP. This is because the code hasn't been properly fenced in with ifdefs. I imagine that by this point there is no one developing the kernel who is actually using a single core machine, so it is understandable that they aren't testing whether single core works or not. I did open a bugzilla, but it is unlikely to see any action for the same reason.
Here is the link for building a custom kernel. http://fedoraproject.org/wiki/Building_a_custom_kernel
Here is the patch if you are going to compile with single core set.
--- kernel-2.6.35.noarch/kernel/sched.c 2010-10-16 09:27:21.017080819 -0700
+++ kernel-2.6.35.noarch/kernel/sched.c 2010-10-16 09:31:09.299373307 -0700
@@ -5273,7 +5273,9 @@ void __cpuinit init_idle(struct task_str
 	unsigned long flags;
 
 	local_irq_save(flags);
+#if defined(CONFIG_SMP)
 	double_rq_lock(oldrq, rq);
+#endif
 
 	__sched_fork(idle);
 	idle->state = TASK_RUNNING;
@@ -5298,7 +5300,9 @@ void __cpuinit init_idle(struct task_str
 #if defined(CONFIG_SMP) && defined(__ARCH_WANT_UNLOCKED_CTXSW)
 	idle->oncpu = 1;
 #endif
+#if defined(CONFIG_SMP)
 	double_rq_unlock(oldrq, rq);
+#endif
 	local_irq_restore(flags);
 
 	/* Set the preempt count _outside_ the spinlocks! */
You could also try compiling a stock kernel as I've seen reports on this list by people who use the latest and greatest from kernel.org without any problems. That does remove any fixes that Fedora / RH have made that haven't made it into the mainline kernel yet though.
--- On Sun, 2/27/11, Andras Simon szajmi@gmail.com wrote:
On 2/27/11, Patrick Bartek bartek047@yahoo.com wrote:
TO: OP
If you think it's specifically a Fedora problem (I
don't), I would install
Me neither. It's just that I still think it's possible that it's not a HW problem.
(a real dual boot install--not a LiveCD) distro other
than Fedora that has
the same versions of kernel, GNOME, etc. to see if the
problem persists.
Then try one that has different kernel, etc.
Is there a reason why a real install is better in this case? The disk can be exercised with a live cd, too. Of course, one is constrained to the applications that are on the cd, but still... And finding a distro with the same components is probably impossible.
Installing the "test" distro in the same "environment" as the purported problem distro is the better way to evaluate problems. I've discovered by experience that live distros, Fedora's particularly, don't behave the same as if you install them to the hard drive. For example, I had one Fedora LiveCD that was slow to boot (5 or 6 minutes), erred on some of the hardware, got the video wrong, too--I corrected it manually--etc., but when I installed it, it got everything correct. Go figure. BTW, the iso file and CD burn of that LiveCD checksummed good.
I use LiveCDs only for fixing broken installs or initially testing hardware compatibility. For installation, I prefer the "Install" version of a distro, if it has one, rather than installing the Live version.
BTW, I'm running a 32 bit live cd now and it's been chugging along under heavyish load (load avg above 4) with no problems. Doesn't mean a thing, of course... But if it goes on like this, then trying the 64 bit live cd or installing the 32 bit version of F14 will probably make sense.
Hope everything turns out well.
B
Andras Simon <szajmi <at> gmail.com> writes:
... If X runs, then it dies and only a blank console with a cursor at the upper left corner can be seen for half a minute; then reboot starts. Otherwise, there's a longish message on the console about a machine check exception, saying a lot of things, such as 'panic occurred, switching back to text console'. This is the message that mentions mcelog. ...
Get familiar with this: http://www.cyberciti.biz/tips/linux-server-predicting-hardware-failure.html
JB
On 2/26/11, JB jb.1234abcd@gmail.com wrote:
Get familiar with this: http://www.cyberciti.biz/tips/linux-server-predicting-hardware-failure.html
Thanks, but as I said earlier, mcelog reported nothing.
Andras
JB <jb.1234abcd <at> gmail.com> writes:
...
# yum install mcelog
# cat /etc/cron.hourly/mcelog.cron so, of interest is /var/log/mcelog .
# cat /usr/share/doc/mcelog-.../README so, make sure /dev/mcelog is created .
# man mcelog DESCRIPTION ....
NOTE: ... On newer kernels it can also be triggered directly using the /sys/devices/system/machinecheck/machinecheck0/trigger trigger. In addition mcelog can be used on the command line to decode an existing machine check record ...
JB
On 2/26/11, JB jb.1234abcd@gmail.com wrote:
JB <jb.1234abcd <at> gmail.com> writes:
...
# yum install mcelog
# cat /etc/cron.hourly/mcelog.cron so, of interest is /var/log/mcelog .
# cat /usr/share/doc/mcelog-.../README so, make sure /dev/mcelog is created .
# man mcelog DESCRIPTION ....
NOTE: ... On newer kernels it can also be triggered directly using the /sys/devices/system/machinecheck/machinecheck0/trigger trigger. In addition mcelog can be used on the command line to decode an existing machine check record ...
Yes, as I said before, I tried all these ways of using mcelog, and it reported nothing.
Thanks anyway, Andras
On 27 February 2011 06:02, Andras Simon szajmi@gmail.com wrote:
I'd love to capture the whole message, but by this time the machine is unresponsive, and 30 seconds is way too short for me to copy it all.
It would be useful to know what the _exact_ error message is, especially if it is consistent every time the crash occurs. Would it be possible to check this by taking a picture of the screen with a digital camera?
Andras Simon <szajmi <at> gmail.com> writes:
... *-firmware description: BIOS vendor: Award Software International, Inc. physical id: 0 version: F10 date: 01/12/2007
You should update the BIOS (and run in default settings for a longer period of time), in particular in this situation.
JB
On 26.02.2011, Andras Simon wrote:
My desktop machine started having random crashes last summer when I installed Fedora 13 (x86_64) on it. Now I upgraded to Fedora 14, and this has only got worse: before, uptimes could be anything between 3 minutes and a month, now it's a few hours at most.
I never found anything interesting in the logs.
Most likely, this is some kind of hardware failure. I have fixed many machines which passed memtest over a period of more than 3 days by changing the memory modules. A BIOS bug, an overheated chipset and/or CPU, faulty memory or simply a chip on your mainboard which got damaged, it can be anything..
On 2/26/11, Heinz Diehl htd@fritha.org wrote:
Most likely, this is some kind of hardware failure. I have fixed many machines which passed memtest over a period of more than 3 days by changing the memory modules. A BIOS bug, an overheated chipset and/or CPU, faulty memory or simply a chip on your mainboard which got damaged, it can be anything..
Well, yes, it can be a hardware failure, but then it's a rather strange coincidence that the crashes started happening immediately after upgrading to F13 (from F10). Besides, even if it's a hardware failure, it'd be nice to know which one is the faulty part...
RE overheated chipset: the north bridge (I _think_ it is the north bridge: an Intel P965) does feel very hot.
Andras
Hello,
I have a similar problem on an FC13 and on an FC14 machine. It seems to be due to the screensaver. I reported the bugs, one today. For the other one, I never got any feedback!
Regards.
On 2/26/11, Patrick Dupre patrick.dupre@york.ac.uk wrote:
Hello,
I have a similar problem on an FC13 and on an FC14 machine. It seems to be due to the screensaver.
I see nothing here that would point in that direction. Of course, you never know...
Andras
On 02/26/2011 12:01 PM, Andras Simon wrote:
On 2/26/11, Patrick Dupre patrick.dupre@york.ac.uk wrote:
Hello,
I have a similar problem on an FC13 and on an FC14 machine. It seems to be due to the screensaver.
I see nothing here that would point in that direction. Of course, you never know...
If the OP has an nVidia card and is using their drivers (either directly or through kmod-nvidia) and has the screensaver set to random, it's a possibility. Several years ago I had the same issue and found, by looking at nVidia's own support forum, that their driver had problems with "line drawing" screensavers that would hang it. Not sure what they meant, or if it's ever been fixed, but I do know that I've not had the slightest trouble with it since I set mine to a specific screensaver, changing it whenever I get tired of looking at it.
On 2/26/11, Joe Zeff joe@zeff.us wrote:
If the OP has an nVidia card and is using their drivers (either directly or through kmod-nvidia) and has the screensaver set to random, it's a
I'm using the nouveau driver and no screensaver.
possibility. Several years ago I had the same issue and found, by looking at nVidia's own support forum, that their driver had problems with "line drawing" screensavers that would hang it. Not sure what they meant, or if it's ever been fixed, but I do know that I've not had the slightest trouble with it since I set mine to a specific screensaver, changing it whenever I get tired of looking at it.
I too have had my share of nvidia troubles back when Fedora was called RH :-)
Andras
On 02/26/2011 01:35 PM, Andras Simon wrote:
On 2/26/11, Heinz Diehl htd@fritha.org wrote:
Most likely, this is some kind of hardware failure. I have fixed many machines which passed memtest over a period of more than 3 days by changing the memory modules. A BIOS bug, an overheated chipset and/or CPU, faulty memory or simply a chip on your mainboard which got damaged, it can be anything..
Well, yes, it can be a hardware failure, but then it's a rather strange coincidence that the crashes started happening immediately after upgrading to F13 (from F10). Besides, even if it's a hardware failure, it'd be nice to know which one is the faulty part...
Well, maybe you can try running the system with only one DIMM at a time in slot 0. If the system is stable with one and crashes in the next test with the other, well, that DIMM is damaged; if it fails both tests, the problem is elsewhere.
RE overheated chipset: the north bridge (I _think_ it is the north bridge: an Intel P965) does feel very hot.
Maybe you can hack in an internal fan pointed directly at it.
Gabriel
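The one-DIMM-at-a-time test suggested above boils down to a small decision table. Here is a rough sketch of it in Python. The function name `diagnose_dimms` and the DIMM labels are hypothetical, and "stable" has to mean "ran longer than the crashes usually take to show up", which for truly random crashes can be a long wait.

```python
def diagnose_dimms(stable_by_dimm):
    """Interpret single-DIMM test runs.

    stable_by_dimm maps a DIMM label to True if the machine stayed up
    while running with only that DIMM installed, False if it crashed.
    """
    failing = sorted(d for d, ok in stable_by_dimm.items() if not ok)
    if not failing:
        # No run crashed, so the test has not reproduced the problem yet.
        return "no crash reproduced; keep testing longer"
    if len(failing) == len(stable_by_dimm):
        # Every DIMM crashed on its own, so the memory is probably innocent.
        return "problem is likely elsewhere"
    return "suspect DIMM(s): " + ", ".join(failing)


if __name__ == "__main__":
    print(diagnose_dimms({"DIMM-A": True, "DIMM-B": False}))
```

With one stable and one crashing run this prints `suspect DIMM(s): DIMM-B`.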
On 2/26/11, Gabriel Ramirez gabriello.ramirez@gmail.com wrote:
On 02/26/2011 01:35 PM, Andras Simon wrote:
Well, yes, it can be a hardware failure, but then it's a rather strange coincidence that the crashes started happening immediately after upgrading to F13 (from F10). Besides, even if it's a hardware failure, it'd be nice to know which one is the faulty part...
Well, maybe you can try running the system with only one DIMM at a time in slot 0. If the system is stable with one and crashes in the next test with the other, well, that DIMM is damaged; if it fails both tests, the problem is elsewhere.
Good idea! Thanks!
RE overheated chipset: the north bridge (I _think_ it is the north bridge: an Intel P965) does feel very hot.
Maybe you can hack in an internal fan pointed directly at it.
So is this! I'm not sure I can do it, but I'll try to come up with something.
Thanks, Andras
Well, yes, it can be a hardware failure, but then it's a rather strange coincidence that the crashes started happening immediately
I do IT for small businesses, and I fix just these sorts of problems. And I can tell you from experience - these kinds of coincidences happen fairly often.
In any case, it always comes down to software or hardware, or both. The hardware is easy - just follow these steps:
1) Open the case and look for stopped or slowly turning fans. Including: cpu fans, case fans, and the fans inside the PSU.
2) Clean dust from the cpu & video card heat sinks if necessary.
3) While looking in the case, look for bad capacitors on the motherboard and inside the PSU. Click on the pictures to see close up: http://en.wikipedia.org/wiki/Capacitor_plague
4) If you're using two or more sticks of ram and the sticks are mismatched, try buying matched pairs. Also, look at the sticker on the ram and see if they require that their voltage be set higher than normal.
5) Inside the PSU, look closely at the mid-sized capacitors where the external wires are soldered to the circuit board. (external wires meaning those that connect to the mainboard and drives) Also, look at the PSU circuit board and its components for indication of high heat. For example, brown areas on the circuit board indicating that something is getting too hot. Components like resistors and rectifiers can get too hot and burn the area around them. Sometimes you'll see black charring where a component has caught on fire. There are no dangerous voltages inside the PSU as long as the power cord is unplugged from the wall. But be sure to unplug from the wall, or you will learn quickly never to do it a second time.
6) If everything looks good so far, put it all back together and run memtest86+ for a few passes. The more the better.
7) Boot a live cd and read the smart data for the drive(s): use the gnome disk utility (palimpsest), or use smartctl -a /dev/sda, etc. (substitute your actual device) If Reallocated Sectors is more than 0 (zero) the drive is failing. This counts confirmed sector read/write errors. The sectors are disabled by reallocating them. As new bad sectors occur, the computer can freeze or reboot, or programs can crash.
Of course, even if your system passes all those tests, there can still be a bad mainboard or other part.
But if it does pass all that, look seriously at your OS - it could be software problem.
On 2/27/11, Tom Horsley horsley1953@gmail.com wrote:
Well, yes, it can be a hardware failure, but then it's a rather strange coincidence that the crashes started happening immediately
Don't you know that hardware always waits till new software is installed before it breaks? :-).
No, I thought it always waited for the most inopportune moment. An important deadline, to mention just one possibility... :-)
Andras
On 02/26/2011 05:29 PM, Andras Simon wrote:
No, I thought it always waited for the most inopportune moment. An important deadline, to mention just one possibility... :-)
<keyboard alert> I love the whooshing sound deadlines make as they go past, don't you? </alert>
Seriously, though, there's one even more awkward moment for a hardware failure: just as you're booting after a software upgrade, making it as hard as possible to work out what's really going on.
On Sat, 2011-02-26 at 17:48 -0800, Joe Zeff wrote:
I love the whooshing sound deadlines make as they go past, don't you?
;-) There's that feeling of relief as you give in and acknowledge that you've missed it, and decide to work at your own pace, as it's already too late. Or give up, and ditch a bad project.
On Sat, 2011-02-26 at 16:12 -0700, compdoc wrote:
There are no dangerous voltages inside the PSU as long as the power cord is unplugged from the wall.
Wrong! In a country with 240 volt mains, the big capacitor in a switchmode power supply may have around 400 volts across it. For 110 volt countries, it'll have a lesser, but still significant voltage.
It should drain away when unplugged. But that doesn't happen instantly. And, if the power supply has a fault, it may hold a charge for longer than you might expect.
It should drain away when unplugged. But that doesn't happen instantly. And, if the power supply has a fault, it may hold a charge for longer than you might expect.
You can unplug the power cord and hit the power button on the PC, and it will be gone instantly. Also, since the PSU circuit board is bolted down to the chassis, you aren't likely to come in contact with any leads from any of the capacitors.
PSUs are not like the old CRT TV sets that stored 50,000 volts or so. It's OK to open them and look inside, as long as the power cord is unplugged.
The only reason not to open one is if it's under warranty.
Tim:
It should drain away when unplugged. But that doesn't happen instantly. And, if the power supply has a fault, it may hold a charge for longer than you might expect.
compdoc:
You can unplug the power cord and hit the power button on the PC, and it will be gone instantly.
As I said, if there's a fault in the power supply, you cannot make any such assumptions.
Also, since the PSU circuit board is bolted down to the chassis, you aren't likely to come in contact with any leads from any of the capacitors.
You don't have to touch a lead, all you have to do is touch a heatsink, or other metal part, that's connected to something with high voltage.
Though, why any ordinary person would be fooling around *INSIDE* their power supply unit, I don't know... Outside, it's low voltage, since the AT days of power supplies (older power supplies may have external power switches for the mains, where it's dead easy to touch the wiring).
PSUs are not like the old CRT TV sets that stored 50,000 volts or so. It's OK to open them and look inside, as long as the power cord is unplugged.
Have you had a 400 volt DC shock? I have, it's not nice. Could have been fatal, under different circumstances. You don't need a drawn out shock to cause harm to people with underlying, or unknown, medical conditions. Or cause a fall and contact with something else that causes injury.
Unless you know electronics, unless you're familiar with servicing equipment that may be faulty (and, therefore not behave in any of the expected ways), stay out of areas which connect directly to the mains, even when the power is disconnected, but especially when it is not.
Servicing is a very different kettle of fish than dabbling in electronics.
- Inside the PSU, look closely at the mid-sized capacitors where the external wires are soldered to the circuit board. (external wires meaning those that connect to the mainboard and drives) Also, look at the PSU circuit board and its components for indication of high heat. For example, brown areas on the circuit board indicating that something is getting too hot. Components like resistors and rectifiers can get too hot and burn the area around them. Sometimes you'll see black charring where a component has caught on fire. There are no dangerous voltages inside the PSU as long as the power cord is unplugged from the wall.
Electrolytic capacitors can keep a lot of charge for a good hour afterwards. By all means peer in, but it's a very bad idea to go poking inside a PSU, failed or otherwise. Even if it has failed you don't want to try fixing it unless you are a qualified electrician and have appropriate tools for things like earth testing.
PSUs also fail for another common office reason in some configurations, that is the intake of paperclips and staples. Thankfully modern systems seem to be designed to keep intakes away from such terrors.
Removing the staples and paperclips is also a standard office keyboard service.
- Boot a live cd and read the smart data for the drive(s): use the gnome disk utility (palimpsest), or use smartctl -a /dev/sda, etc. (substitute your actual device) If Reallocated Sectors is more than 0 (zero) the drive is failing. This counts confirmed sector read/write errors. The sectors are
That is somewhat dubious. Look at the SMART data health check from the drive, that knows far more. There are lots of cases where reallocated sectors is not a problem (and palimpsest seems to get this wrong too). In general a modern drive is a storage appliance pretending to be a disk, and it's unwise to treat it otherwise. This particularly applies to things like secure erase which most application level software gets wrong.
disabled by reallocating them. As new bad sectors occur, the computer can freeze or reboot, or programs can crash.
With a reallocated sector the drive has decided a block is problematic and not to use it. That won't cause a problem to an OS except maybe an observed pause as the drive tries to sort the blocks out.
On a bad sector Linux will continue as best it can and you'll rarely see the machine go splat. What can be a problem is that sometimes instead of getting an I/O error you'll get the drive decide to kick the bucket. On SATA a drive should not be able to bring the box down, on PATA in some cases a transaction can lock the machine solid if you are really unlucky.
(Use RAID 1, RAID is good)
Alan
Even if it has failed you don't want to try fixing it unless you are a qualified electrician and have appropriate tools for things like earth testing.
You don't repair failed/damaged PSUs and motherboards - you replace them.
PSUs also fail for another common office reason in some configurations, that is the intake of paperclips and staples. Thankfully modern systems seem to be designed to keep intakes away from such terrors. Removing the staples and paperclips is also a standard office keyboard service.
Heh. In over 20 years of doing IT for many small and large businesses, I can't remember seeing a paperclip or staple inside a PSU. Not saying it can't happen - one could bounce in through the back vent.
But even if that should be the case, you don't usually get away with simply removing the foreign metal objects - you're going to be replacing the PSU.
And if you want to know why it failed, you're going to look inside the PSU. (if you're the kind of a tech that cares to know)
-----Original Message----- From: Alan Cox [mailto:alan@lxorguk.ukuu.org.uk] Sent: Sunday, February 27, 2011 10:12 AM To: Community support for Fedora users
That is somewhat dubious. Look at the SMART data health check from the drive, that knows far more.
My suggestion was to look at the smart data. A drive health check warning can be due solely to the drive having reallocated sectors.
In general a modern drive is a storage appliance pretending to be a disk, and it's unwise to treat it otherwise.
Whatever that means. Drives are amazing: self-analyzing, self-adjusting, self-repairing. But they are still just a component that you diagnose. They still fail in the same ways that drives have failed since the beginning. Only now, they can tell you how they're feeling. But only if you look. I suppose someday they'll send emails without need of some monitoring program...
There are lots of cases where reallocated sectors is not a problem
Please explain a situation in which a hard drive developing bad sectors is not a problem.
With a reallocated sector the drive has decided a block is problematic and not to use it. That won't cause a problem to an OS except maybe an observed pause as the drive tries to sort the blocks out. On a bad sector Linux will continue as best it can and you'll rarely see the machine go splat.
You've been lucky if your systems have simply burped during this process. Although, it doesn't sound to me as though you have actually witnessed this.
Things don't always go quite so nicely for linux or windows based systems that are using commodity parts like Seagate, WD, or Samsung sata drives - the kind you find in most desktops and budget servers. I think it could be handled a lot better by the OS than it is. I blame the drivers.
In any case, any tech worth his salt is going to find out what the problem actually is. Going thru a system in the way that I suggested is not only going to help in solving the problem, it should also be part of anyone's maintenance plan.
If you're just guessing at the problem, you're paying too much for IT.
There are lots of cases where reallocated sectors is not a problem
Please explain a situation in which a hard drive developing bad sectors is not a problem.
The drive has lots of spare sectors. Some drives will even indicate they have reallocated sectors at purchase time. What matters is if the count is increasing. It's also not unknown to get a couple over years and nothing else.
The big problem is sectors that cannot be read, not sectors where the drive has noted a spot problem and moved the data. That and trends. The actual SMART health check the drive provides looks at these and should give the best answers as it uses drive internal knowledge.
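To watch the trend rather than the absolute number, one option is to save the attribute table from `smartctl -A /dev/sda` (substitute your actual device) to a file every so often and compare snapshots. The Python below is only a sketch: it assumes the usual ten-column attribute table with RAW_VALUE last, and the helper names `parse_smart_attributes` and `reallocation_trend` are made up for illustration. The drive's own verdict (`smartctl -H`) should still take precedence.

```python
import re
import sys


def parse_smart_attributes(text):
    """Return {attribute_name: raw_value} from `smartctl -A` table text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row starts with "ID#" and is skipped by this check.
        if len(fields) >= 10 and fields[0].isdigit():
            # RAW_VALUE can carry extra annotations; keep the leading integer.
            match = re.match(r"\d+", fields[9])
            if match:
                attrs[fields[1]] = int(match.group())
    return attrs


def reallocation_trend(old_text, new_text, name="Reallocated_Sector_Ct"):
    """Compare one attribute's raw value between two snapshots."""
    old = parse_smart_attributes(old_text).get(name)
    new = parse_smart_attributes(new_text).get(name)
    if old is None or new is None:
        return "unknown"
    return "increasing" if new > old else "stable"


if __name__ == "__main__":
    # Usage: python smart_trend.py older_snapshot.txt newer_snapshot.txt
    with open(sys.argv[1]) as older, open(sys.argv[2]) as newer:
        print(reallocation_trend(older.read(), newer.read()))
```

A drive that reports "stable" at a non-zero count for months is the harmless case described above; "increasing" is the one worth acting on.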
Google's studies show none of these methods are that great so RAID and/or backups are important [backups are anyway as you can have a PSU fail badly and blow all the attached disks, been there seen that]. Also for RAID pairs use different drives or drives from different sources otherwise you may get two with the same systemic flaw as they came off the production line together, run together exactly as long on your RAID and duly fail close together.
On a bad sector Linux will continue as best it can and you'll rarely see the machine go splat.
You've been lucky if your systems have simply burped during this process. Although, it doesn't sound to me as though you have actually witnessed this.
Chuckle. I was the Linux IDE/ATA disk maintainer for some years. I've seen most of it, including some really bad periods for drive reliability (IBM deathstars and other such fun)
the kind you find in most desktops and budget servers. I think it could be handled a lot better by the OS than it is. I blame the drivers.
Ah good, send patches. Unfortunately it's very rarely the drivers.
On a fault we run through a series of things including retrying the command, lowering link speeds and then resetting the device. In the PATA case a device can get stuck with IORDY asserted on the bus which hangs the PC and there is nothing most cards will then do (SIL680 is almost the only exception). Some controllers thoughtfully emulate this idiocy when SATA devices fail, so it's a good idea to get an AHCI capable controller in AHCI mode.
The big failure cases we normally see are the drive dropping offline entirely and refusing to come back until physically power cycled. As the power between the PC and the drive is directly wired the OS can't fix this one.
The biggest causes of apparently random failures seem to be people putting too many disks on one PSU output, and overheating.
In any case, any tech worth his salt is going to find out what the problem actually is. Going thru a system in the way that I suggested is not only going to help in solving the problem, it should also be part of anyone's maintenance plan.
If you're just guessing at the problem, you're paying too much for IT.
There is a school of thought that if your IT costs more than just restoring a new box from backup you don't need IT 8)
Alan
On Sat, 2011-02-26 at 16:12 -0700, compdoc wrote:
- While looking in the case, look for bad capacitors on the motherboard and inside the PSU. Click on the pictures to see close up: http://en.wikipedia.org/wiki/Capacitor_plague
- If you're using two or more sticks of ram and the sticks are mismatched, try buying matched pairs. Also, look at the sticker on the ram and see if they require that their voltage be set higher than normal.
And while poking around inside the case, do not zap the innards with static shocks. If you don't know about how that happens, and how to avoid it, you should read up about it first.
Taking no anti-static precautions is one of those causes of mysterious failures, usually a long time after you've done your poking around.
On 02/27/2011 01:11 PM, Tim wrote:
On Sat, 2011-02-26 at 16:12 -0700, compdoc wrote:
- While looking in the case, look for bad capacitors on the motherboard and inside the PSU. Click on the pictures to see close up: http://en.wikipedia.org/wiki/Capacitor_plague
- If you're using two or more sticks of ram and the sticks are mismatched, try buying matched pairs. Also, look at the sticker on the ram and see if they require that their voltage be set higher than normal.
And while poking around inside the case, do not zap the innards with static shocks. If you don't know about how that happens, and how to avoid it, you should read up about it first.
Taking no anti-static precautions is one of those causes of mysterious failures, usually a long time after you've done your poking around.
I once killed a motherboard by vacuuming out the dust. Static discharge from the vac brush, I think. Don't make the same mistake, use a blower or canned air instead.
John
On Sun, 2011-02-27 at 13:57 -0800, john wendel wrote:
I once killed a motherboard by vacuuming out the dust. Static discharge from the vac brush, I think. Don't make the same mistake, use a blower or canned air instead.
Hmm, I would have thought blowing air in to be more of an issue than sucking air out. Blasting air in generally has a lot of force, vacuums only have much air force when very close to the nozzle. But with a bit of distance involved, about all it does is draw moving air towards it.
I tend to just use a small paintbrush to dislodge dust so it floats into the air, with a vacuum nozzle near enough to stop the dust settling back down, or me breathing it in. Pipe cleaners are good for getting the fluff out of heatsinks and around fan blades.
On Mon, Feb 28, 2011 at 10:27 AM, Tim ignored_mailbox@yahoo.com.au wrote:
Hmm, I would have thought blowing air in to be more of an issue than sucking air out. Blasting air in generally has a lot of force, vacuums only have much air force when very close to the nozzle.
Vacuums usually kill electronics because they generate static electricity.
On Mon, 2011-02-28 at 10:33 -0500, Ted Roche wrote:
Vacuums usually kill electronics because they generate static electricity.
Well, it's always been the suggestion that it's the *moving air* that generates static electricity, with the velocity being the major factor. The direction shouldn't really matter, though I was pointing out that you generally have more force with an airstream that blows rather than sucks. One only has to have a cleaner with both inlets and outlets attachable to the same hose to be able to notice the difference (expelled air having significant force a few feet away, yet a sucking hose having almost no power a few inches away).
Service shops used to use air compressors (as used for power tools, and inflating vehicle tires) to blast muck out of electronic devices. And, in doing so, would often cause static electricity damage. No vacuum cleaner there, and a similar plastic hose, metal nozzle, issue.
With vacuum cleaners, I'd be more concerned about getting the nozzle too close, having it suddenly suck itself into direct contact and smack into delicate parts, than about any static charge they might generate.
On Sat, 2011-02-26 at 20:35 +0100, Andras Simon wrote:
Well, yes, it can be a hardware failure, but then it's a rather strange coincidence that the crashes started happening immediately after upgrading to F13 (from F10).
Not really so strange. If you have dodgy hardware, something that stresses it for a prolonged period (such as an install/upgrade, or any other CPU intensive task that's more than the usual workload), can be enough to trigger the fault to worsen.
And in the case of desktop PCs, sometimes you've moved the box around during an upgrade, when it's usually left undisturbed, and you've mechanically stressed the system.