On Fri, May 04, 2012 at 05:32:37PM +0100, Brian Candler wrote:
> On Thu, May 03, 2012 at 05:19:41PM -0500, Stan Hoeppner wrote:
> > Glad to hear you've got one running somewhat stable. Could be a driver
> > problem, but it's pretty rare for a SCSI driver to hard lock a box,
> > isn't it?

No. The hardware does something bad to the PCI bus, or DMAs something
over kernel memory, or won't de-assert an interrupt line, or .... and
the system will hard hang. Hell, if it just stops and you run out of
memory because IO is needed to clean and free memory, then the system
can hang there as well....

> > Keep us posted.
>
> Last night I fired up two more instances of bonnie++ on that box, so there
> were four at once. Going back to the box now, I find that they have all
> hung :-(
>
> They are stuck at:
>
>   Delete files in random order...
>   Stat files in random order...
>   Stat files in random order...
>   Stat files in sequential order...
>
> respectively.
>
> iostat 5 shows no activity. There are 9 hung processes:
>
> $ uptime
>  17:23:35 up 1 day, 20:39, 1 user, load average: 9.04, 9.08, 8.91
> $ ps auxwww | grep " D" | grep -v grep
> root     35    1.5 0.0      0    0 ?     D   May02 42:10 [kswapd0]
> root     1179  0.0 0.0      0    0 ?     D   May02  1:50 [xfsaild/md126]
> root     3127  0.0 0.0  25096  312 ?     D   16:55  0:00 /usr/lib/postfix/master
> tomi     29138 1.1 0.0 378860 3708 pts/1 D+  12:43  3:06 bonnie++ -d /disk/scratch/test -s 16384k -n 98:800k:500k:1000
> tomi     29390 1.0 0.0 378860 3560 pts/3 D+  12:52  2:53 bonnie++ -d /disk/scratch/test -s 16384k -n 98:800k:500k:1000
> tomi     30356 1.1 0.0 378860 3512 pts/2 D+  13:32  2:36 bonnie++ -d /disk/scratch/testb -s 16384k -n 98:800k:500k:1000
> root     31075 0.0 0.0      0    0 ?     D   14:00  0:04 [kworker/0:0]
> tomi     31796 0.6 0.0 378860 3864 pts/4 D+  14:30  1:05 bonnie++ -d /disk/scratch/testb -s 16384k -n 98:800k:500k:1000
> root     31922 0.0 0.0      0    0 ?     D   14:35  0:00 [kworker/1:0]
>
> dmesg shows hung tasks and backtraces, starting with:
>
> [150927.599920] INFO: task kswapd0:35 blocked for more than 120 seconds.
> [150927.600263] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> [150927.600698] kswapd0         D ffffffff81806240     0    35      2 0x00000000
> [150927.600704]  ffff880212389330 0000000000000046 ffff880212389320 ffffffff81082df5
> [150927.600710]  ffff880212389fd8 ffff880212389fd8 ffff880212389fd8 0000000000013780
> [150927.600715]  ffff8802121816f0 ffff88020e538000 ffff880212389320 ffff88020e538000
> [150927.600719] Call Trace:
> [150927.600728]  [<ffffffff81082df5>] ? __queue_work+0xe5/0x320
> [150927.600733]  [<ffffffff8165a55f>] schedule+0x3f/0x60
> [150927.600739]  [<ffffffff814e82c6>] md_flush_request+0x86/0x140
> [150927.600745]  [<ffffffff8105f990>] ? try_to_wake_up+0x200/0x200
> [150927.600756]  [<ffffffffa0010419>] raid0_make_request+0x119/0x1c0 [raid0]

That's most likely a hardware or driver problem - the IO request queue
is full, which means that IO completions are not occurring or are being
delayed excessively. The problem is below the level of the filesystem....

> I am completely at a loss with all this... I've never seen a Unix/Linux
> system behave so unreliably.

If you are buying bottom-of-the-barrel hardware, then you get the
reliability that you pay for. Spend a few more dollars and buy something
that is properly engineered - you've wasted more money trying to
diagnose this problem than you would have saved by buying cheap
hardware....

> One of the company's directors has reminded me
> that we have a Windows storage server with 48 disks which has been running
> without incident for the last 3 or 4 years, and I don't have a good answer
> for that :-(

If you buy bottom-of-the-barrel hardware for Windows servers, then
you'll get similar results, only they'll be much harder to diagnose.
Software can't fix busted hardware...

Cheers,

Dave.
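A minimal sketch of the D-state scan quoted above, assuming a Linux host
with procps. It avoids the false positives of `grep " D"` (a literal " D"
can appear anywhere in a command line) by testing the state column itself:

```shell
# List tasks in uninterruptible sleep (state "D"), the ones that pile up
# when IO completions stall below the filesystem. The awk pattern matches
# "D" and variants like "D+" in the first (state) column only.
ps -eo state=,pid=,comm= | awk '$1 ~ /^D/ {print $2, $3}'
```

On the box above, this would have printed kswapd0, xfsaild/md126, the
bonnie++ processes, and the kworkers.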
--
Dave Chinner
david@xxxxxxxxxxxxx

_______________________________________________
xfs mailing list
xfs@xxxxxxxxxxx
http://oss.sgi.com/mailman/listinfo/xfs