On 11/29/2011 01:48 PM, Gordan Bobic wrote:
On 11/29/2011 01:45 PM, Peter Robinson wrote:
> On Tue, Nov 29, 2011 at 1:30 PM, Gordan Bobic<gordan(a)bobich.net> wrote:
>> Guys,
>>
>> After chasing my tail for ages thinking I had a hardware issue on an
>> AC100, it looks like the random segfaults and "glibc detected a
>> corrupted doubly linked list" errors might actually be SMP and/or ARMv7
>> related.
>>
>> Errors:
>> - random segfaults
>> - glibc detected a corrupted doubly linked list
>>
>> Distro: Fedora 13
>>
>> Platforms that work flawlessly (24/7 compiling for weeks):
>> - Marvell Kirkwood (1x SheevaPlug, 1x DreamPlug).
>>
>> Platforms that cause repeatable segfaults (same rootfs, same operation):
>> - Tegra2 (tested using Toshiba AC100 and Compulab TrimSlice)
>> - OMAP 4xxx (tested on a PandaBoard)
>>
>> I'm going to dig into this deeper (boot the machine with nosmp or
>> tasksetting everything to run on the same core), but in the meantime I
>> would like to ask if there is a bug in any of the following:
>>
>> - glibc
>> - gcc
>> - binutils
>>
>> that might cause them to misbehave either on:
>> - ARMv7 (armv5tel packages on armv7l kernel)
>> or
>> - SMP ARM systems
>> (or both)
>>
>> I'm going to compile up a clean kernel (without all the hacks I tried on
>> the AC100 to try to troubleshoot the issue) and try building the
>> packages in a clean F13 mock just to do a definitive confirmation pass,
>> but if anyone is aware of any such issues (e.g. due to locking
>> primitives being different on ARMv7) that have been fixed in
>> glibc/gcc/binutils recently, I would appreciate any info you may have on
>> the subject.
>>
>> Ubuntu doesn't appear to suffer from this issue, but they use a much
>> newer gcc and a different glibc than what is in F13.

One other thing - one of the manifestations of this bug appears to be
random memory corruption (strange, I know - unless I am dealing with two
totally unrelated problems). Specifically, I have seen the bug manifest
during compile jobs where, for example, linking would segfault, and
re-making would segfault again. But doing:

    echo 3 > /proc/sys/vm/drop_caches

would fix the problem.
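
If dropping the caches really does clear the fault, one rough way to test the idea is to hash the files involved in a failing link, drop the caches, and hash them again. The sketch below assumes Linux, Python 3 and root access; the function names (sha256_of, drop_caches, changed_after_drop) are made up for illustration and are not part of any existing tool.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file by reading it through the normal (page-cached) path."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def drop_caches(control="/proc/sys/vm/drop_caches"):
    """Same effect as: echo 3 > /proc/sys/vm/drop_caches (needs root)."""
    os.sync()  # write out dirty pages first so they become droppable
    with open(control, "w") as handle:
        handle.write("3\n")


def changed_after_drop(paths, drop=drop_caches):
    """Return the paths whose contents differ before and after a cache drop.

    `drop` is injectable (an assumption of this sketch) so the comparison
    logic can be exercised without root.
    """
    before = {path: sha256_of(path) for path in paths}
    drop()
    return [path for path in paths if sha256_of(path) != before[path]]
```

A non-empty result from changed_after_drop() on the object files and libraries of a failing link would suggest that the cached copy differed from what is on disk. An empty result does not rule anything out, since the corruption may sit in anonymous memory rather than in the page cache.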

My first suspicion was duff hardware/RAM on my AC100. So I got another
one, and it behaves in the exact same way.

Then I thought that maybe they are all pre-overclocked past stable
points, so I started hacking at the kernel to drop clock speeds and
memory timings (they are bootloader and kernel settable on Tegra2), and
none of that made any difference (apart from making the machine slower -
the instability remained).

Then I started looking at possible Tegra2 specific bugs, like the TLS
register bug. I couldn't reach any conclusive results on that,
unfortunately, but nobody running Ubuntu seems to have seen any similar
issues on the same hardware.

A couple of days ago somebody on #AC100 offered to re-run my test
(building hsqldb src.rpm in mock) on their TrimSlice and on their
PandaBoard to try to establish whether the problem might be SMP and/or
ARMv7 specific (since I get no stability issues at all on my single-core
Kirkwood devices). And sure enough - they saw the same random segfaults
arise on BOTH the TrimSlice (Tegra2 A9 SMP) _AND_ the PandaBoard (OMAP
4xxx A9 SMP).

This implies that the problem is related to either SMP or running on
ARMv7 CPUs, which would point to an issue in either glibc or the
toolchain, but that is just a guess at the moment.
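
To separate the SMP hypothesis from the ARMv7 one without rebooting with nosmp, the same command can be run repeatedly with and without being pinned to a single core, and the failure counts compared. This is only a sketch, assuming Linux and Python 3; count_failures is a hypothetical helper, and the command to run (for example the mock build) is left to the caller.

```python
import os
import subprocess
import sys


def count_failures(command, runs, cpus=None):
    """Run `command` `runs` times and count the runs that exit non-zero.

    If `cpus` is given (a set of core numbers), the child is restricted to
    those cores before it starts. Processes it spawns inherit the affinity,
    so a whole build tree stays on the chosen cores (Linux only).
    """
    def pin():
        os.sched_setaffinity(0, cpus)

    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            command,
            preexec_fn=pin if cpus else None,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failures += 1
    return failures


if __name__ == "__main__" and len(sys.argv) > 1:
    build = sys.argv[1:]  # the command to repeat, supplied by the caller
    print("all cores:", count_failures(build, 10))
    print("one core: ", count_failures(build, 10, cpus={0}))
```

If the failures disappear when pinned to one core but persist across all cores, that would point towards SMP; if they persist either way, ARMv7 itself (or something else entirely) becomes the more likely suspect. Ten runs is an arbitrary figure and may be too few for a rare fault.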

Any suggestions welcome at this point.

Gordan