Need help with Reboot cause
m
maximilianbianco at gmail.com
Tue Apr 7 18:18:42 UTC 2009
On Tue, Apr 07, 2009 at 10:41:43AM -0700, Peter J. Stieber wrote:
> PS = Pete Stieber
> PS>> I have a dual opteron system that has been acting as
> PS>> the worldly node for a small cluster of computers
> PS>> since September, 2004. The machine is running the
> PS>> latest x86_64 Fedora 10 kernel that I recently loaded
> PS>> (April 2). The machine reboots without warning. I
> PS>> can't find the cause in log files (maybe I'm not
> PS>> looking in the correct log).
> PS>>
> PS>> I'm currently running memtest. If all of the tests
> PS>> pass, could the community suggest other diagnostic
> PS>> tasks or information I could post to help diagnose the
> PS>> problem?
>
> m> Have you tried going back to the previous kernel?
>
> The machine is still running memtest (no errors so far), but I already
> removed the prior kernel. I did notice reboots with the prior kernel.
> BTW my current kernel is 2.6.27.21-170.2.56.fc10.x86_64.
>
If it rebooted with the prior kernel too, then I would do a thorough check of the hardware first. You may also want to search for bugs reported against your particular hardware setup, since this could be a known issue.
> Reboots indicated by information in /var/log/messages...
>
> Sunday March 29 4:08
> Tuesday March 31 7:02
> Thursday April 2 18:27 Intentional reboot due to new kernel
> Friday April 3 1:36
> Sunday April 5 1:37
> Sunday April 5 2:48
> Sunday April 5 9:43
> Sunday April 5 13:20 as I was typing this email
>
> m> Did you check dmesg and /var/log/messages?
>
> Yes. I can see reboots, but not the cause.
>
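Since the logs show the reboots but not the cause, it may help to pull out the last few lines logged before each boot. Below is a rough Python sketch, not a tested tool. It assumes each boot writes a kernel line containing "Linux version" to /var/log/messages; the function names and the five-line context are just my choices for illustration.

```python
import re
from collections import deque

# Assumption: every boot logs a kernel line containing "Linux version".
BOOT_MARKER = re.compile(r"kernel: .*Linux version")

def lines_before_boots(lines, context=5):
    """Return a list of (boot_line, preceding_lines), one entry per boot marker."""
    recent = deque(maxlen=context)   # rolling window of the last few lines
    found = []
    for line in lines:
        line = line.rstrip("\n")
        if BOOT_MARKER.search(line):
            found.append((line, list(recent)))
            recent.clear()
        else:
            recent.append(line)
    return found

def report(path="/var/log/messages", context=5):
    """Print each boot marker in the file with the lines that preceded it."""
    with open(path, errors="replace") as fh:
        for boot, before in lines_before_boots(fh, context):
            print("== boot:", boot)
            for entry in before:
                print("   before:", entry)
```

Calling report() as root prints one group per boot. If the lines before each boot are ordinary activity with no oops or panic, that would be consistent with a hard reset where the kernel never got the chance to log anything.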
> m> Does it boot normally and then just fail at some random
> m> interval or is it consistently failing at the same point?
>
> I have had top running during a few of the reboots. I have forced a
> couple of them by starting my nightly build process. The linker/loader
> has been running during some of the reboots...
>
> top - 13:19:53 up 3:36, 6 users, load average: 1.27, 2.70, 2.32
> Tasks: 138 total, 6 running, 132 sleeping, 0 stopped, 0 zombie
> Cpu(s): 40.8%us, 13.8%sy, 0.0%ni, 42.5%id, 2.7%wa, 0.0%hi, 0.3%si, 0.0%st
> Mem: 2060232k total, 1683996k used, 376236k free, 164484k buffers
> Swap: 2031608k total, 56k used, 2031552k free, 1230796k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 8878 pstieber 20 0 34552 25m 1096 R 7.6 1.3 0:00.23 ld
> 8884 pstieber 20 0 48284 27m 1080 R 5.0 1.4 0:00.15 ld
> 7 root 15 -5 0 0 0 S 0.3 0.0 0:00.17 ksoftirqd/1
> 22427 pstieber 20 0 14880 1208 872 R 0.3 0.1 0:03.49 top
> 1 root 20 0 4096 876 616 S 0.0 0.0 0:00.71 init
>
> Another instance
>
> top - 06:55:13 up 17:34, 2 users, load average: 2.83, 2.59, 1.86
> Tasks: 127 total, 2 running, 125 sleeping, 0 stopped, 0 zombie
> Cpu(s): 45.1%us, 4.7%sy, 0.0%ni, 49.8%id, 0.5%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 2060232k total, 1763404k used, 296828k free, 177052k buffers
> Swap: 2031608k total, 56k used, 2031552k free, 1271964k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 5757 pstieber 20 0 79788 69m 1080 R 12.3 3.5 0:00.37 ld
> 1 root 20 0 4096 876 616 S 0.0 0.0 0:00.68 init
> 2 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kthreadd
>
> I'm not sure this is always the case.
>
Might be worth finding out...
> m> Other things you may consider:
> m> CPU type?
>
> Motherboard: Tyan Thunder K8W (S2885ANRF)
> CPUs: Dual Opteron 244 (1.8 GHz) processors
> Memory: 2 GB (4 x 512 MB CT6472Y40B DDR PC3200 from Crucial)
>
> m> temperature?
>
> Is there a command to monitor this while running the OS?
There is (or at least there was) a GNOME applet for this, and it required some configuration. From the CLI I am not sure how to go about it, but the BIOS usually shows the temperature, and that will be good enough to start with.
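One thing that might work on the running system: the kernel exposes sensor readings under /sys/class/hwmon once a sensor driver is loaded (the lm_sensors package and its sensors-detect script normally take care of that, and its "sensors" command prints the same values). The Python sketch below just reads those files. It assumes a driver for your board is actually loaded, and the function name is made up for illustration.

```python
import glob
import os

def read_temps(root="/sys/class/hwmon"):
    """Return {sensor_file: degrees_celsius} for every temp*_input file found."""
    temps = {}
    # Newer kernels keep the files in hwmonN/, older ones in hwmonN/device/.
    patterns = [
        os.path.join(root, "hwmon*", "temp*_input"),
        os.path.join(root, "hwmon*", "device", "temp*_input"),
    ]
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path) as fh:
                    # The hwmon sysfs interface reports millidegrees Celsius.
                    temps[path] = int(fh.read().strip()) / 1000.0
            except (OSError, ValueError):
                continue  # unreadable or malformed sensor file, skip it
    return temps
```

An empty result means no sensor driver is exposing temperatures. Otherwise, printing read_temps() every few seconds while the nightly build runs would show whether the CPUs heat up before a reboot.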
>
> m> potential hard drive issue?
>
> I have 3 SATA drives running. It's been so long since I have done this,
> but how does one manually do a disk check?
>
I think you would do better with a dedicated hard drive test like the one Hitachi makes available, but I was forgetting about smartctl! Still, two sets of independent results are better than one, so maybe do both if you have the time. I usually start with the Hitachi tool (it works with non-Hitachi drives), and if that passes I move on to try other things, but run fsck first.
man fsck
man smartctl
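For the smartctl part, a quick health check can be scripted. The Python sketch below runs "smartctl -H" on one drive and looks for the overall-health verdict line that ATA/SATA drives report. It needs root and the smartmontools package. The function names are only for illustration, and device names such as /dev/sda are a guess at how your three drives show up.

```python
import subprocess

def parse_health(output):
    """Return True for PASSED, False for any other verdict, None if no verdict line."""
    for line in output.splitlines():
        if "overall-health self-assessment" in line:
            return line.rstrip().endswith("PASSED")
    return None

def check_drive(device):
    """Run `smartctl -H` on one device (needs root) and return the parsed verdict."""
    proc = subprocess.run(["smartctl", "-H", device],
                          capture_output=True, text=True)
    return parse_health(proc.stdout)
```

A passing verdict only means the drive has not flagged itself as failing. For a deeper check, "smartctl -t long" on the device starts the drive's extended self-test, and "smartctl -l selftest" shows the result once it finishes.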
> m> any new hardware attached or installed recently?
>
> No
>
> m> Notice any power surges or brownouts?
>
> The machine is on a UPS that deals with this.
>
> m> any other nodes having issues?
>
> No and they are not on UPSs. They also do not have as large of a work load.
>
> The machine in question is used for nightly builds and regression tests.
> I use distcc with the compute nodes to perform the builds.
>
> The machine also runs samba to provide a network share to Windows users
> and provides authentication using Windows domain accounts.
>
> m> Recent power surge zapped a board, DSL modem,
> m> and the surge protector.
>
> I doubt this is the problem.
>
> Memtest made it through the first pass of all tests successfully.
Be sure to let it run for as long as you can; 12 to 24 hours would be ideal. Some errors don't show up right away, or only appear with continuous use.
>
> Thanks for the suggestions, especially considering my vague information.
--
"Any fool can know. The point is to understand" --Albert Einstein
Bored??
http://fiction.wikia.com/wiki/Fuqwit1.0
http://fiction.wikia.com/wiki/Coding_the_Magic_into_the_Eight_Ball