On 10/9/2012 10:14 PM, GuoZhong Han wrote:
> Recently, a problem has troubled me for a long time.
>
> I created a 4*2T (sda, sdb, sdc, sdd) raid5 with an XFS file system,
> a 128K chunk size and a stripe_cache_size of 2048. mdadm 3.2.2,
> kernel 2.6.38 and mkfs.xfs 3.1.1 were used. When the raid5 was in
> recovery and had reached 47%, I/O errors occurred on sdb. The
> following was the output:
>
> ata2: translated ATA stat/err 0x41/04 to SCSI SK/ASC/ASCQ 0xb/00/00
> ata2: status=0x41 { DriveReady Error }
> ata2: error=0x04 { DriveStatusError }
<snip repeated log entries>
> end_request: I/O error, dev sdb, sector 1867304064

Run smartctl and post this section:

"Vendor Specific SMART Attributes with Thresholds"

(The P.S. below sketches the invocation I mean.)

The drive that is sdb may or may not be bad. smartctl may tell you
(us). If the drive is not bad you'll need to force relocation of this
bad sector to a spare. If you don't know how, we can assist (a rough
outline is in the P.P.S. below).

> INFO: task xfssyncd/md127:1058 blocked for more than 120 seconds.
>
> The output said “INFO: task xfssyncd/md127:1058 blocked for more
> than 120 seconds”. What did that mean?

Precisely what it says. It doesn't tell you WHY it was blocked, as it
can't know. The fact that your md array was in recovery and having
problems with one of the member drives is a good reason for xfssyncd
to block.

> The state of the raid5 was “PENDING”. I had never seen such a state
> of raid5 before. After that, I wrote a program to access the raid5,
> but there was no response any more. Then I used “ps aux | grep
> xfssyncd” to see the state of “xfssyncd”. Unfortunately, there was
> no response either. Then I tried “ps aux”. There was output, but the
> program could only be exited with “Ctrl+d” or “Ctrl+z”. And when I
> tested the write performance of the raid5, I/O errors often
> occurred. I did not know why these I/O errors occurred so
> frequently.
>
> What was the problem? Can anyone help me?

It looks like drive sdb is bad or going bad. smartctl output or
additional testing should confirm this.

Also, your "XFS...blocked for 120s" error reminds me that there are
some known bugs in XFS in kernel 2.6.38 which cause a similar error,
but they are not the cause of yours. Yours is a drive problem.
Nonetheless, there have been dozens of XFS bugs fixed since 2.6.38
and I recommend you upgrade to kernel 3.2.31 or 3.4.13 if you roll
your own kernels. If you use distro kernels, get the latest 3.x
series in the repos.

--
Stan
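
P.S. For reference, the smartctl invocation I have in mind is roughly
the following (an untested sketch; substitute the correct device node
if sdb has been renumbered since the errors):

  # Print the "Vendor Specific SMART Attributes with Thresholds" table.
  # Pay particular attention to Reallocated_Sector_Ct,
  # Current_Pending_Sector and Offline_Uncorrectable.
  smartctl -A /dev/sdb

  # Optionally start a long (surface) self-test; read the result later
  # with 'smartctl -l selftest /dev/sdb'.
  smartctl -t long /dev/sdb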
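
P.P.S. If the attributes show a pending (unreadable) sector rather
than a drive that is dying outright, the usual way to force
reallocation is to write directly to the bad LBA so the drive firmware
remaps it to a spare. This is only a rough sketch: it destroys the
contents of that one sector, it assumes 512-byte logical sectors, and
it should only be done with sdb failed/removed from the array.

  # Confirm the sector really is unreadable (this read should fail):
  hdparm --read-sector 1867304064 /dev/sdb

  # Overwrite the sector so the firmware reallocates it:
  hdparm --write-sector 1867304064 --yes-i-know-what-i-am-doing /dev/sdb

  # Or the same thing with dd (seek is in 512-byte units here):
  dd if=/dev/zero of=/dev/sdb bs=512 count=1 seek=1867304064 oflag=direct

Afterwards re-check Current_Pending_Sector and Reallocated_Sector_Ct
with smartctl -A before re-adding the drive and letting the array
rebuild.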