From: Stan Hoeppner
Date: 2012-10-10 19:54
To: GuoZhong Han
CC: linux-raid
Subject: Re: task xfssyncd blocked while raid5 was in recovery

On 10/9/2012 10:14 PM, GuoZhong Han wrote:

> Recently, a problem has troubled me for a long time.
>
> I created a 4*2T (sda, sdb, sdc, sdd) raid5 with an XFS file system,
> a 128K chunk size and a stripe_cache_size of 2048. mdadm 3.2.2,
> kernel 2.6.38 and mkfs.xfs 3.1.1 were used. When the raid5 was in
> recovery and the rebuild reached 47%, I/O errors occurred on sdb.
> The following was the output:
>
> ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
> ata2: status=0x41 { DriveReady Error }
> ata2: error=0x04 { DriveStatusError }
> end_request: I/O error, dev sdb, sector 1867304064

>> Run smartctl and post this section:
>> "Vendor Specific SMART Attributes with Thresholds"
>>
>> The drive that is sdb may or may not be bad. smartctl may tell you
>> (us). If the drive is not bad you'll need to force relocation of
>> this bad sector to a spare. If you don't know how we can assist.

I did not save the output of smartctl, but I remember that the
RAW_VALUE of the "Current_Pending_Sector" attribute of sdb was 1. Did
that indicate that the drive was bad? If the drive was not bad, what
is the best way to relocate these bad sectors to spares? I have been
using the tool "HDD Regenerator" running on Windows, which is too
slow: each relocation took dozens of hours, since it takes a long time
to find the bad sector. If you have any better idea, please let me
know.

> INFO: task xfssyncd/md127:1058 blocked for more than 120 seconds.
>
> The output said "INFO: task xfssyncd/md127:1058 blocked for more
> than 120 seconds". What did that mean?

>> Precisely what it says. It doesn't tell you WHY it was blocked, as
>> it can't know. The fact that your md array was in recovery and
>> having problems with one of the member drives is a good reason for
>> xfssyncd to block.

> The state of the raid5 was "PENDING". I had never seen such a state
> of raid5 before. After that, I wrote a program to access the raid5;
> there was no response any more. Then I used "ps aux | grep xfssyncd"
> to see the state of xfssyncd. Unfortunately, there was no response
> either. Then I tried "ps aux". There were outputs, but the program
> could exit with "ctrl+d" or "ctrl+z". And when I tested the write
> performance of the raid5, I/O errors often occurred. I did not know
> why these I/O errors occurred so frequently.
>
> What was the problem? Can anyone help me?

>> It looks like drive sdb is bad or going bad. smartctl output or
>> additional testing should confirm this.
>>
>> Also, your "XFS...blocked for 120s" error reminds me there are some
>> known bugs in XFS kernel 2.6.38 which cause a similar error, but
>> are not the cause of your error. Yours is a drive problem.
>> Nonetheless, there have been dozens of XFS bugs fixed since 2.6.38
>> and I recommend you upgrade to kernel 3.2.31 or 3.4.13 if you roll
>> your own kernels. If you use distro kernels, get the latest 3.x
>> series in the repos.

Hm, I searched "XFS.. blocked for 120s" on Google. Someone had the
same problem as mine; the following is the link:
http://lists.debian.org/debian-kernel/2012/01/msg00274.html

I use the Tilera platform, so it is hard to upgrade to kernel 3.2.31
or 3.4.13.
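
For reference, the section Stan asked for ("Vendor Specific SMART
Attributes with Thresholds") comes from smartctl's attribute report. A
Current_Pending_Sector raw value of 1 means the drive has one sector
it could not read and is waiting to remap; by itself that does not
mean the drive is bad. The commands below assume the suspect drive is
still enumerated as sdb:

    smartctl -A /dev/sdb          # attribute table, incl. Current_Pending_Sector
    smartctl -x /dev/sdb          # full report, incl. the ATA error log
    smartctl -t long /dev/sdb     # start an offline surface scan
    smartctl -l selftest /dev/sdb # read the scan result when it finishes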
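
On relocating the bad sector: a whole-disk scan like HDD Regenerator's
should not be needed, because the kernel log above already names the
failing LBA (1867304064 in the end_request line). A drive remaps a
pending sector to a spare the next time that sector is written, so
writing that one LBA is enough. A minimal sketch, assuming 512-byte
logical sectors; the write destroys the old contents of that sector,
so let md rebuild the data afterwards:

    # confirm the sector really is unreadable
    hdparm --read-sector 1867304064 /dev/sdb

    # overwrite it so the firmware remaps it to a spare
    hdparm --write-sector 1867304064 --yes-i-know-what-i-am-doing /dev/sdb

    # dd equivalent; bs must match the drive's logical sector size
    dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=1867304064 oflag=direct

    # Current_Pending_Sector should return to 0, and
    # Reallocated_Sector_Ct may increase by one
    smartctl -A /dev/sdb | grep -iE 'pending|reallocat'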
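
The "PENDING" state can also be inspected from md itself; md127 below
is taken from the blocked task's name, xfssyncd/md127:

    cat /proc/mdstat               # per-array state and recovery progress
    mdadm --detail /dev/md127      # "State :" line, failed/spare device counts
    mdadm --examine /dev/sd[abcd]  # per-member superblock view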
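
As for the 120-second figure in the xfssyncd message, it is the
kernel's hung-task watchdog threshold and is tunable via procfs.
Changing it only affects the warning, not the underlying stall:

    cat /proc/sys/kernel/hung_task_timeout_secs       # default 120
    echo 0 > /proc/sys/kernel/hung_task_timeout_secs  # 0 disables the warning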