Ok. I tried using alt-sysrq-T to produce a log over netconsole after the hang, but could not: the system just didn't respond. During normal operation it worked fine (see the attached file).

One new interesting observation, though. After the RAID hang and server reboot, the reconstruction process started as usual. However, I could not reproduce the crash/hang while the array was reconstructing! I even put extra pressure on the array (I ran an extra process writing to it).

At first I could not understand that. Then I realized that the reconstruction process was using so much bandwidth that the write load could not build up enough to trigger the crash/hang. I had used the following commands to force quicker reconstruction:

  echo 100000 > /proc/sys/dev/raid/speed_limit_max
  echo 50000 > /proc/sys/dev/raid/speed_limit_min
  echo "idle" > /sys/block/md127/md/sync_action

Thus the reconstruction ran at about 100 MB/s.

Then I decided to check this assumption. While intensively writing to the array and reconstructing simultaneously, I issued the following commands:

  echo 10000 > /proc/sys/dev/raid/speed_limit_max
  echo 10000 > /proc/sys/dev/raid/speed_limit_min
  echo "idle" > /sys/block/md127/md/sync_action

Guess what? Just several seconds after those executed (see the last lines in the attached log), the crash happened.

So I think there is something wrong in the filesystem/RAID co-operation. You can see that while reconstructing at full speed, the pressure on the filesystem is not enough to reproduce the crash.

Could you suggest something to move further in debugging the issue?

BR,
Denis

2014-02-25 4:58 GMT+02:00 NeilBrown <neilb@xxxxxxx>:
> On Tue, 25 Feb 2014 00:01:42 +0200 Denis Golovan <denis.golovan@xxxxxxxxx>
> wrote:
>
>> Hi all
>>
>> I am struggling to diagnose a strange freeze of software RAID5 array.
>> My RAID5 consists of 4 Toshiba SATA drives and has ext4 filesystem on top of it.
>>
>> It works fine unless I start several process writing intensively to it.
>> At first, it looks like the system is under high pressure, then the
>> system starts lagging a lot and a hard freeze always follows after
>> several minutes.
>>
>> No errors in system log, nothing is emitted to console. Just hard
>> freeze with HDD light always on. I tried enabling kernel network
>> logging to another machine and again no information when hanging.
>> After reboot, my array starts reconstruction and finishes without
>> errors.
>>
>> I tried disabling quotas and barriers for ext4.
>> After disabling barriers, it almost seemed to work, but after some
>> time the same hard freeze happens.
>>
>> I tested the same hardware configuration under Linux v3.10, 3.11, 3.12
>> and now 3.13.5 (all x86 arch) behaves the same way. The same issue can
>> be reproduced easily.
>>
>> So now I tested everything Google suggests on the matter.
>> Could you give a hint on how to debug this issue?
>>
>
> The most useful thing for debugging a hard freeze is the alt-sysrq-T output
> when it is frozen. typing that magic sequence should always produce some
> output unless it is hard-frozen with interrupts disabled.
>
> So make sure you can produce the output when the system is working properly
> (to a log file file the network console would be ideal), then when it hangs,
> produce the output again.
> To probably need to have a text console rather than a graphic console for it
> to work.
>
>
> If it is hard-hanging with interrupts disabled, then it gets tricky. I
> thought there was some NMI-based lockup detector which would warn if that
> happened, but I cannot find it just now.
>
> NeilBrown
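
For anyone trying to repeat the experiment, the two resync throttle settings described at the top of this message, plus a software equivalent of alt-sysrq-T, could be wrapped roughly as below. This is only a sketch under assumptions: the function names and the RAID_PROC, MD_SYSFS and SYSRQ_TRIGGER override variables are made up for illustration, the md127 path is the one from this thread, and the real paths need root.

```shell
#!/bin/sh
# Sketch only. Paths default to the ones used in this thread (md127).
# RAID_PROC, MD_SYSFS and SYSRQ_TRIGGER are hypothetical override
# variables so the helpers can be dry-run against scratch files.
RAID_PROC="${RAID_PROC:-/proc/sys/dev/raid}"
MD_SYSFS="${MD_SYSFS:-/sys/block/md127/md}"
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"

# Set the md resync speed limits (KiB/s per device), then write "idle"
# to sync_action, mirroring the command sequence in the message above.
set_resync_speed() {
    min_kbps="$1"
    max_kbps="$2"
    if [ "$min_kbps" -gt "$max_kbps" ]; then
        echo "min ($min_kbps) must not exceed max ($max_kbps)" >&2
        return 1
    fi
    echo "$max_kbps" > "$RAID_PROC/speed_limit_max" || return 1
    echo "$min_kbps" > "$RAID_PROC/speed_limit_min" || return 1
    echo idle > "$MD_SYSFS/sync_action"
}

# Software equivalent of alt-sysrq-T: ask the kernel to dump task states
# to the kernel log (and so to netconsole, if that is configured).
dump_task_states() {
    echo t > "$SYSRQ_TRIGGER"
}
```

Run as root, `set_resync_speed 50000 100000` corresponds to the fast reconstruction case and `set_resync_speed 10000 10000` to the throttled case that preceded the crash. `dump_task_states` is only useful for checking that the netconsole path works while the machine is still responsive; it cannot run once the system is hard-frozen.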
Attachment:
netconsole.txt.gz
Description: GNU Zip compressed data