On Thu, Oct 11, 2012 at 11:55:01AM +0800, 韩国中 wrote:
> Hello, everyone:
>
> Recently, a problem has troubled me for a long time.
>
> I created a 4*2T (sda, sdb, sdc, sdd) RAID5 with an XFS file system,
> a 128K chunk size and a stripe_cache_size of 2048. mdadm 3.2.2,
> kernel 2.6.38 and mkfs.xfs 3.1.1 were used. When the RAID5 was in
> recovery and had reached 47%, I/O errors occurred on sdb. The
> following was the output:
>
> ......
>
> ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
> ata2: status=0x41 { DriveReady Error }

Looks like you've had a drive fail during rebuild.

> Then, there were lots of error messages about the file system. The
> following was the output:
>
> ......
>
> INFO: task xfssyncd/md127:1058 blocked for more than 120 seconds.
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> xfssyncd/md127 D fffffff7000216d0 0 1058 2 0x00000000
> frame 0: 0xfffffff700020570 __switch_to+0x1b8/0x1c0 (sp 0xfffffe008d7ff900)
> frame 1: 0xfffffff7000216d0 schedule+0x918/0x1538 (sp 0xfffffe008d7ff9d0)
> frame 2: 0xfffffff700022a90 schedule_timeout+0x268/0x5b0 (sp 0xfffffe008d7ffd18)
> frame 3: 0xfffffff700024ee0 __down+0xd8/0x158 (sp 0xfffffe008d7ffda8)
> frame 4: 0xfffffff70085da78 down.cold+0x8/0x28 (sp 0xfffffe008d7ffe18)
> frame 5: 0xfffffff700750788 xfs_buf_lock+0xd0/0x120 (sp 0xfffffe008d7ffe38)
> frame 6: 0xfffffff700821b40 xfs_getsb+0x38/0x78 (sp 0xfffffe008d7ffe50)
> frame 7: 0xfffffff70077e230 xfs_trans_getsb+0xe0/0x100 (sp 0xfffffe008d7ffe68)
> frame 8: 0xfffffff7006babc0 xfs_mod_sb+0x88/0x198 (sp 0xfffffe008d7ffe88)
> frame 9: 0xfffffff7007a6480 xfs_fs_log_dummy+0x68/0xe0 (sp 0xfffffe008d7ffeb8)
> frame 10: 0xfffffff70079c6c0 xfs_sync_worker+0xe0/0xe8 (sp 0xfffffe008d7ffed8)
> frame 11: 0xfffffff700570a00 xfssyncd+0x240/0x328 (sp 0xfffffe008d7ffef0)
> frame 12: 0xfffffff7000f0530 kthread+0xe0/0xe8 (sp 0xfffffe008d7fff80)
> frame 13: 0xfffffff7000bab38 start_kernel_thread+0x18/0x20 (sp 0xfffffe008d7fffe8)

Which is basically saying that the superblock buffer is under IO -
that's the only reason it ever gets locked.

> The output said "INFO: task xfssyncd/md127:1058 blocked for more
> than 120 seconds". What did that mean? I used "cat /proc/mdstat" to
> see the state of the RAID5. The output was:
>
> Personalities : [raid0] [raid6] [raid5] [raid4]
> md127 : active raid5 sdd[3] sdc[2] sdb[1](F) sda[0]
>       5860540032 blocks super 1.2 level 5, 128k chunk, algorithm 2 [4/3] [U_UU]
>       resync=PENDING
>
> unused devices: <none>
>
> The state of the RAID5 was "PENDING". I had never seen such a RAID5
> state when I used ext4. After that, I wrote a program to access the
> RAID5; there was no response any more.

Waiting on IO to complete, but with the MD device down, it will never
complete.

> Then I used "ps aux | grep xfssyncd" to see the state of xfssyncd.
> Unfortunately, there was no response either. Then I tried "ps aux".
> There was output, but the program could only be exited with "Ctrl+d"
> or "Ctrl+z". And when I tested the write performance of the RAID5,
> I/O errors often occurred. I did not know why these I/O errors
> occurred so frequently.
>
> What was the problem? Can anyone help me?

Broken hardware causing MD to go into a bad state, which causes XFS
to stall because it can't make progress.

Bottom line: replace the broken disk, though given that MD was
already rebuilding a RAID5 when the disk died, you probably have lost
everything on the filesystem....
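For completeness, a rough sketch of the replacement, assuming the
array really is /dev/md127 and the dead drive is /dev/sdb as in your
mdstat output (check the device names first - they can change after a
drive swap):

  # confirm it's the drive itself, not cabling or the controller
  smartctl -a /dev/sdb

  # MD has already marked it failed - the (F) in /proc/mdstat -
  # so just pull it out of the array
  mdadm /dev/md127 --remove /dev/sdb

  # after physically swapping the drive, add the new one and
  # watch the rebuild progress
  mdadm /dev/md127 --add /dev/sdb
  cat /proc/mdstat

But as I said, with a second drive dying mid-rebuild, the rebuild
onto the new disk can't reconstruct the missing data, so plan on
restoring from backup.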
Cheers,

Dave.
--
Dave Chinner
david@xxxxxxxxxxxxx