On Sat, 27 May 2006, Neil Brown wrote:

> On Friday May 26, dean@xxxxxxxxxx wrote:
> > On Tue, 23 May 2006, Neil Brown wrote:
> >
> > i applied them against 2.6.16.18 and two days later i got my first hang...
> > below is the stripe_cache foo.
> >
> > thanks
> > -dean
> >
> > neemlark:~# cd /sys/block/md4/md/
> > neemlark:/sys/block/md4/md# cat stripe_cache_active
> > 255
> > 0 preread
> > bitlist=0 delaylist=255
> > neemlark:/sys/block/md4/md# cat stripe_cache_active
> > 255
> > 0 preread
> > bitlist=0 delaylist=255
> > neemlark:/sys/block/md4/md# cat stripe_cache_active
> > 255
> > 0 preread
> > bitlist=0 delaylist=255
>
> Thanks. This narrows it down quite a bit... too much infact: I can
> now say for sure that this cannot possible happen :-)

heheh. fwiw the box has traditionally been rock solid.. it's ancient though... dual p3 750 w/440bx chipset and pc100 ecc memory... 3ware 7508 w/seagate 400GB disks... i really don't suspect the hardware all that much because the freeze seems to be rather consistent as to time of day (overnight while i've got 3x rdiff-backup, plus bittorrent, plus updatedb going). unfortunately it doesn't happen every time... but every time i've unstuck the box i've noticed those processes going.

other tidbits... md4 is a lvm2 PV ... there are two LVs, one with ext3 and one with xfs.

> Two things that might be helpful:
> 1/ Do you have any other patches on 2.6.16.18 other than the 3 I
> sent you? If you do I'd like to see them, just in case.

it was just 2.6.16.18 plus the 3 you sent... i attached the .config (it's rather full -- based off debian kernel .config).

maybe there's a compiler bug:

gcc version 4.0.4 20060507 (prerelease) (Debian 4.0.3-3)

> 2/ The message.gz you sent earlier with the
> echo t > /proc/sysrq-trigger
> trace in it didn't contain information about md4_raid5 - the
> controlling thread for that array. It must have missed out
> due to a buffer overflowing. Next time it happens, could you
> to get this trace again and see if you can find out what
> what md4_raid5 is going. Maybe do the 'echo t' several times.
> I think that you need a kernel recompile to make the dmesg
> buffer larger.

ok i'll set CONFIG_LOG_BUF_SHIFT=18 and rebuild ...

note that i'm going to include two more patches in this next kernel:

http://lkml.org/lkml/2006/5/23/42
http://arctic.org/~dean/patches/linux-2.6.16.5-no-treason.patch

the first was the Jens Axboe patch you mentioned here recently (for accounting with i/o barriers)... and the second gets rid of the tcp treason uncloaked messages.

> Thanks for your patience - this must be very frustrating for you.

fortunately i'm the primary user of this box... and the bug doesn't corrupt anything... and i can unstick it easily :) so it's not all that frustrating actually.

-dean
Attachment:
config.gz
Description: Binary data