From: Stan Hoeppner
Date: 2012-10-10 19:54
To: GuoZhong Han
CC: linux-raid
Subject: Re: task xfssyncd blocked while raid5 was in recovery

On 10/9/2012 10:14 PM, GuoZhong Han wrote:

> Recently, a problem has troubled me for a long time.
>
> I created a 4*2T (sda, sdb, sdc, sdd) raid5 with an XFS file system,
> a 128K chunk size and a stripe_cache_size of 2048. mdadm 3.2.2,
> kernel 2.6.38 and mkfs.xfs 3.1.1 were used. When the raid5 was in
> recovery and the rebuild reached 47%, I/O errors occurred on sdb.
> The following was the output:
>
> ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
> ata2: status=0x41 { DriveReady Error }
> ata2: error=0x04 { DriveStatusError }
> end_request: I/O error, dev sdb, sector 1867304064

>> Run smartctl and post this section:
>> "Vendor Specific SMART Attributes with Thresholds"
>>
>> The drive that is sdb may or may not be bad. smartctl may tell you
>> (us). If the drive is not bad you'll need to force relocation of
>> this bad sector to a spare. If you don't know how we can assist.

I did not save the output of smartctl, but I remember that the
RAW_VALUE of the "Current_Pending_Sector" attribute of sdb was 1. Did
that indicate that the drive was bad? If the drive was not bad, what
is the best way to relocate these bad sectors to spares? I have been
using the tool "HDD Regenerator" running on Windows, which is too
slow: each relocation took dozens of hours, since it takes a long time
to find the bad sector. If you have any better idea, please let me
know.

> INFO: task xfssyncd/md127:1058 blocked for more than 120 seconds.
>
> The output said "INFO: task xfssyncd/md127:1058 blocked for more
> than 120 seconds". What did that mean?

>> Precisely what it says. It doesn't tell you WHY it was blocked, as
>> it can't know. The fact that your md array was in recovery and
>> having problems with one of the member drives is a good reason for
>> xfssyncd to block.

> The state of the raid5 was "PENDING". I had never seen such a state
> of raid5 before. After that, I wrote a program to access the raid5;
> there was no response any more. Then I used "ps aux | grep xfssyncd"
> to see the state of xfssyncd. Unfortunately, there was no response
> either. Then I tried "ps aux". There were outputs, but the program
> could exit with "ctrl+d" or "ctrl+z". And when I tested the write
> performance of the raid5, I/O errors often occurred. I did not know
> why these I/O errors occurred so frequently.
>
> What was the problem? Can anyone help me?

>> It looks like drive sdb is bad or going bad. smartctl output or
>> additional testing should confirm this.
>>
>> Also, your "XFS...blocked for 120s" error reminds me there are some
>> known bugs in XFS kernel 2.6.38 which cause a similar error, but
>> are not the cause of your error. Yours is a drive problem.
>> Nonetheless, there have been dozens of XFS bugs fixed since 2.6.38
>> and I recommend you upgrade to kernel 3.2.31 or 3.4.13 if you roll
>> your own kernels. If you use distro kernels, get the latest 3.x
>> series in the repos.

Hm, I searched "XFS.. blocked for 120s" on Google. Someone had the
same problem as mine; the following is the link:
http://lists.debian.org/debian-kernel/2012/01/msg00274.html

I use the Tilera platform, so it is hard to upgrade to kernel 3.2.31
or 3.4.13.
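
For reference, the section Stan asked for ("Vendor Specific SMART
Attributes with Thresholds") comes from smartctl's attribute report. A
Current_Pending_Sector raw value of 1 means the drive has one sector
it could not read and is waiting to remap; by itself that does not
mean the drive is bad. The commands below assume the suspect drive is
still enumerated as sdb:

    smartctl -A /dev/sdb          # attribute table, incl. Current_Pending_Sector
    smartctl -x /dev/sdb          # full report, incl. the ATA error log
    smartctl -t long /dev/sdb     # start an offline surface scan
    smartctl -l selftest /dev/sdb # read the scan result when it finishes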
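
On relocating the bad sector: a whole-disk scan like HDD Regenerator's
should not be needed, because the kernel log above already names the
failing LBA (1867304064 in the end_request line). A drive remaps a
pending sector to a spare the next time that sector is written, so
writing that one LBA is enough. A minimal sketch, assuming 512-byte
logical sectors; the write destroys the old contents of that sector,
so let md rebuild the data afterwards:

    # confirm the sector really is unreadable
    hdparm --read-sector 1867304064 /dev/sdb

    # overwrite it so the firmware remaps it to a spare
    hdparm --write-sector 1867304064 --yes-i-know-what-i-am-doing /dev/sdb

    # dd equivalent; bs must match the drive's logical sector size
    dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=1867304064 oflag=direct

    # Current_Pending_Sector should return to 0, and
    # Reallocated_Sector_Ct may increase by one
    smartctl -A /dev/sdb | grep -iE 'pending|reallocat'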
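
The "PENDING" state can also be inspected from md itself; md127 below
is taken from the blocked task's name, xfssyncd/md127:

    cat /proc/mdstat               # per-array state and recovery progress
    mdadm --detail /dev/md127      # "State :" line, failed/spare device counts
    mdadm --examine /dev/sd[abcd]  # per-member superblock view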
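
As for the 120-second figure in the xfssyncd message, it is the
kernel's hung-task watchdog threshold and is tunable via procfs.
Changing it only affects the warning, not the underlying stall:

    cat /proc/sys/kernel/hung_task_timeout_secs       # default 120
    echo 0 > /proc/sys/kernel/hung_task_timeout_secs  # 0 disables the warning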