On 11/29/2011 01:48 PM, Gordan Bobic wrote:
> On 11/29/2011 01:45 PM, Peter Robinson wrote:
>> On Tue, Nov 29, 2011 at 1:30 PM, Gordan Bobic<gordan@xxxxxxxxxx> wrote:
>>> Guys,
>>>
>>> After chasing my tail for ages thinking I had a hardware issue on an
>>> AC100, it looks like the random segfaults and "glibc detected a
>>> corrupted doubly linked list" errors might actually be SMP and/or ARMv7
>>> related.
>>>
>>> Errors:
>>> - random segfaults
>>> - glibc detected a corrupted doubly linked list
>>>
>>> Distro: Fedora 13
>>>
>>> Platforms that work flawlessly (24/7 compiling for weeks):
>>> - Marvell Kirkwood (1x SheevaPlug, 1x DreamPlug).
>>>
>>> Platforms that cause repeatable segfaults (same rootfs, same operation):
>>> - Tegra2 (tested using Toshiba AC100 and Compulab TrimSlice)
>>> - OMAP 4xxx (tested on a PandaBoard)
>>>
>>> I'm going to dig into this deeper (boot the machine with nosmp or
>>> tasksetting everything to run on the same core), but in the meantime I
>>> would like to ask if there is a bug in any of the following:
>>>
>>> - glibc
>>> - gcc
>>> - binutils
>>>
>>> that might cause them to misbehave either on:
>>> - ARMv7 (armv5tel packages on armv7l kernel)
>>> or
>>> - SMP ARM systems
>>> (or both)
>>>
>>> I'm going to compile up a clean kernel (without all the hacks I tried on
>>> the AC100 to try to troubleshoot the issue) and try building the
>>> packages in a clean F13 mock just to do a definitive confirmation pass,
>>> but if anyone is aware of any such issues (e.g. due to locking
>>> primitives being different on ARMv7) that have been fixed in
>>> glibc/gcc/binutils recently, I would appreciate any info you may have on
>>> the subject.
>>>
>>> Ubuntu doesn't appear to suffer from this issue, but they use a much
>>> newer gcc and a different glibc than what is in F13.

One other thing - one of the manifestations of this bug appears to be
random memory corruption (strange, I know - unless I am dealing with two
totally unrelated problems).
Specifically, I have seen the bug manifest during compile jobs where,
for example, linking would segfault, and re-running make would segfault
again. But doing:

    echo 3 > /proc/sys/vm/drop_caches

would fix the problem.

My first suspicion was duff hardware/RAM on my AC100. So I got another
one, and it behaves in exactly the same way. Then I thought that maybe
they are all pre-overclocked past stable points, so I started hacking at
the kernel to drop clock speeds and memory timings (they are bootloader
and kernel settable on Tegra2), and none of that made any difference
(apart from making the machine slower - the instability remained).

Then I started looking at possible Tegra2-specific bugs, like the TLS
register bug. I couldn't get any conclusive results on that,
unfortunately, but nobody running Ubuntu seems to have seen any similar
issues on the same hardware.

A couple of days ago somebody on #AC100 offered to re-run my test
(building the hsqldb src.rpm in mock) on their TrimSlice and on their
PandaBoard to try to establish whether the problem might be SMP and/or
ARMv7 specific (since I get no stability issues at all on my
single-core Kirkwood devices). And sure enough - they saw the same
random segfaults arise on BOTH the TrimSlice (Tegra2 A9 SMP) _AND_ the
PandaBoard (OMAP 4xxx A9 SMP).

Which implies that the problem is to do with either SMP or running on
ARMv7 CPUs, which would indicate an issue with either glibc or the
toolchain, but that is just guessing at the moment.

Any suggestions welcome at this point.

Gordan
_______________________________________________
arm mailing list
arm@xxxxxxxxxxxxxxxxxxxxxxx
https://admin.fedoraproject.org/mailman/listinfo/arm