Hi again,

I was contacted by a person who suggested that I double-check the
vfs_cache_pressure setting. It turned out that I had it set to 10000,
a leftover from previously debugging an OOM-killer case.

After I removed this setting from sysctl.conf, the time to crash/freeze
increased greatly: my server withstood about a day of a continuous
write test. Nevertheless, it froze after that.

So it still looks like something is wrong with RAID/filesystem
co-operation, and I would still like to debug the problem. Please help.

BR,
Denis

2014-02-26 22:52 GMT+02:00 Denis Golovan <denis.golovan@xxxxxxxxx>:
> Ok.
>
> I've tried using alt-sysrq-T to produce a log in netconsole after
> hang, but could not.
> It just didn't respond.
> When operating normally it worked fine (see attached file).
>
> One new interesting observation though.
>
> After RAID hang and server reboot, the reconstruction process started.
> Everything was as usual. However one interesting thing happened - I
> could not reproduce the crash/hang while the array was reconstructing! I
> even created extra pressure for the array (I ran an extra process
> writing to it).
>
> At first I could not understand that. But then I realized that my
> reconstruction process uses too much bandwidth to trigger the
> crash/hang. I used following commands to force quicker reconstruction:
>
> echo 100000 > /proc/sys/dev/raid/speed_limit_max
> echo 50000 > /proc/sys/dev/raid/speed_limit_min
> echo "idle" > /sys/block/md127/md/sync_action
>
> Thus the reconstruction worked at about 100 MB/s.
>
> Then I decided to check this assumption and while intensively writing
> to the array and reconstructing simultaneously, I tried issuing
> following commands:
>
> echo 10000 > /proc/sys/dev/raid/speed_limit_max
> echo 10000 > /proc/sys/dev/raid/speed_limit_min
> echo "idle" > /sys/block/md127/md/sync_action
>
>
> Guess what?
> After those executed (see the last lines in attached log),
> just several seconds later - the crash happened.
>
> So, I think there is something wrong with filesystem/RAID co-operation.
> You can see that while reconstructing, the pressure for filesystem is
> not enough to reproduce the crash.
>
> Could you provide something to move further into debugging the issue?
>
> BR,
> Denis
>
> 2014-02-25 4:58 GMT+02:00 NeilBrown <neilb@xxxxxxx>:
>> On Tue, 25 Feb 2014 00:01:42 +0200 Denis Golovan <denis.golovan@xxxxxxxxx>
>> wrote:
>>
>>> Hi all
>>>
>>> I am struggling to diagnose a strange freeze of software RAID5 array.
>>> My RAID5 consists of 4 Toshiba SATA drives and has ext4 filesystem on top of it.
>>>
>>> It works fine unless I start several processes writing intensively to it.
>>> At first, it looks like the system is under high pressure, then the
>>> system starts lagging a lot and a hard freeze always follows after
>>> several minutes.
>>>
>>> No errors in system log, nothing is emitted to console. Just hard
>>> freeze with HDD light always on. I tried enabling kernel network
>>> logging to another machine and again no information when hanging.
>>> After reboot, my array starts reconstruction and finishes without
>>> errors.
>>>
>>> I tried disabling quotas and barriers for ext4.
>>> After disabling barriers, it almost seemed to work, but after some
>>> time the same hard freeze happens.
>>>
>>> I tested the same hardware configuration under Linux v3.10, 3.11, 3.12
>>> and now 3.13.5 (all x86 arch) behaves the same way. The same issue can
>>> be reproduced easily.
>>>
>>> So now I have tested everything Google suggests on the matter.
>>> Could you give a hint on how to debug this issue?
>>>
>>
>> The most useful thing for debugging a hard freeze is the alt-sysrq-T output
>> when it is frozen. Typing that magic sequence should always produce some
>> output unless it is hard-frozen with interrupts disabled.
>>
>> So make sure you can produce the output when the system is working properly
>> (to a log file or the network console would be ideal), then when it hangs,
>> produce the output again.
>> You probably need to have a text console rather than a graphical console for it
>> to work.
>>
>>
>> If it is hard-hanging with interrupts disabled, then it gets tricky. I
>> thought there was some NMI-based lockup detector which would warn if that
>> happened, but I cannot find it just now.
>>
>> NeilBrown
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html