[Cc: Neil -- a few questions if you want to skip to the bottom]

I looked a little more into the RAID1 case and tried a few things.

First, to enable WRITE SAME on RAID1, I had to apply the patch at the
bottom of this mail.  With the patch in place, the component disk
limits propagate up to their MD:

  [2:0:4:0]  disk  SANBLAZE VLUN0001     0001  /dev/sds
  [3:0:2:0]  disk  SEAGATE  ST9146853SS  0004  /dev/sdv

  /sys/class/scsi_disk/2:0:4:0/max_write_same_blocks: 65535
  /sys/class/scsi_disk/3:0:2:0/max_write_same_blocks: 65535
  /sys/block/md125/queue/write_same_max_bytes: 33553920

I was interested in observing how a failed WRITE SAME would interact
with MD, the intent bitmap, and resyncing.

In my setup, I created a RAID1 out of a 4G partition from a SAS disk
(supporting WRITE SAME) and a SanBlaze VirtualLUN (claiming WRITE SAME
support, but returning [sense_key,asc,ascq]: [0x05,0x20,0x00] on the
first such command).  I also added an external write intent bitmap
file with a chunk size of 4KB to create a large, granular bitmap:

  mdadm --create /dev/md125 --raid-devices=2 --level=1 \
    --bitmap=/mnt/bitmap/file --bitmap-chunk=4K /dev/sds1 /dev/sdv1

After creating the RAID and letting the initial synchronization
finish, I filled the entire MD with random data.  I would use this
later to verify the resync driven by the write intent bitmap.

From previous tests, I knew that the first failed WRITE SAME to the
VirtualLUN would bounce that disk from the MD.  The current and
subsequent WRITE SAME commands would process just fine on the member
disk that actually supported the command.

To kick off WRITE SAME commands, I added a new ext4 filesystem to the
disk.  When mounting (no special options), this executes the following
call chain:

  ext4_lazyinit_thread
    ext4_init_inode_table
      sb_issue_zeroout
        blkdev_issue_zeroout
          blkdev_issue_write_same

When the first WRITE SAME hits the VirtualLUN, MD kicks it from the
RAID and degrades the array:

  EXT4-fs (md125): mounted filesystem with ordered data mode. Opts: (null)
  sd 2:0:6:0: [sds] CDB: Write same(16): 93 00 00 00 00 00 00 00 21 10 00 00 0f f8 00 00
  mpt2sas0: sas_address(0x500605b0006c0ae0), phy(3)
  mpt2sas0: handle(0x000b), ioc_status(scsi data underrun)(0x0045), smid(59)
  mpt2sas0: scsi_status(check condition)(0x02), scsi_state(autosense valid )(0x01)
  mpt2sas0: [sense_key,asc,ascq]: [0x05,0x20,0x00], count(96)
  md/raid1:md125: Disk failure on sds1, disabling device.
  md/raid1:md125: Operation continuing on 1 devices.

and the bitmap file starts recording dirty chunks:

  Sync Size : 4192192 (4.00 GiB 4.29 GB)
     Bitmap : 1048048 bits (chunks), 16409 dirty (1.6%)

The MD's write_same_max_bytes is left at 33553920 until the VirtualLUN
is failed/removed/re-added.  After the WRITE SAME failure, the
VirtualLUN's max_write_same_blocks has been set to 0.  When the disk
is re-added to the MD, this value is reconsidered in the MD's
write_same_max_bytes, which also gets set to zero.

This behavior seems okay: the remaining good disk fully supported
WRITE SAME while the RAID was degraded, and once the non-supporting
component disk was added back to the RAID1, WRITE SAME was disabled
for the MD:

  /sys/class/scsi_disk/2:0:4:0/max_write_same_blocks: 0
  /sys/class/scsi_disk/3:0:2:0/max_write_same_blocks: 65535
  /sys/block/md125/queue/write_same_max_bytes: 0

When the VirtualLUN was re-added to the RAID1, resync initiated.
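
(An aside on those write_same_max_bytes values: they are consistent
with the stacked MD queue simply taking the minimum of the component
write same limits as each rdev is added.  Below is a small userspace
sketch of that assumed min() behavior -- stack_write_same_sectors() is
a made-up helper for illustration, not a kernel function -- using the
sysfs numbers quoted above.)

#include <stdio.h>

/*
 * Userspace model (not kernel code) of how the MD queue limit is
 * assumed to be derived when limits are stacked: keep the minimum of
 * the current stacked value and each component's max_write_same_blocks.
 */
static unsigned long long stack_write_same_sectors(unsigned long long t,
						   unsigned long long b)
{
	return t < b ? t : b;
}

int main(void)
{
	unsigned long long sas  = 65535;	/* SEAGATE max_write_same_blocks   */
	unsigned long long vlun = 65535;	/* SANBLAZE, before the failure    */
	unsigned long long md   = ~0ULL;	/* start effectively unlimited     */

	md = stack_write_same_sectors(md, sas);
	md = stack_write_same_sectors(md, vlun);
	printf("both healthy: %llu bytes\n", md * 512);	/* 33553920, as observed */

	/* after the failed WRITE SAME the VLUN reports 0 and is re-added */
	md = stack_write_same_sectors(md, 0);
	printf("after re-add: %llu bytes\n", md * 512);	/* 0, as observed */

	return 0;
}
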
Recall that earlier I had dumped random bits on the entire MD device,
so the state of the disks should have looked like this:

  SAS  = init RAID sync + random bits + ext4 WRITE SAME 0's + ext4 misc
  VLUN = init RAID sync + random bits

and resync would need to consult the bitmap to repair the VLUN chunks
that WRITE SAME and whatever else ext4_lazyinit_thread laid down.

By setting the bitmap chunk size so small, the idea was to spread each
failed WRITE SAME across many tracking bits.  The CDB WRITE SAME
number of blocks was 0x0FF8 (4088), and 4088 x 512 (block size) ~= 2MB,
much greater than 4KB.  With a systemtap probe, I saw 32 WRITE SAME
commands (each about 4K blocks) emitted from the block layer via
ext4_lazyinit_thread.  So the estimated dirty bits for all 32 should
be somewhere around:

  32 * (2MB dirty per command / 4KB of disk per bit) = 16384 dirty bits

pretty close to the observed 16409 (the rest, I assume, were other
ext4 housekeeping).

At this point we know:

  - A failed WRITE SAME will kick the disk from an MD RAID1.
  - WRITE SAME is disabled if a non-supporting disk is added to the MD.
  - A failed WRITE SAME is properly handled by the bitmap, even when
    spanning bitmap bits.

A few outstanding questions that I have -- maybe Neil or someone more
familiar with the code could answer:

Q1 - Is mddev->chunk_sectors always zero for RAID1?

Q2 - I noticed handle_write_finished calls narrow_write_error to try
     and potentially avoid failing an entire device.  In my tests,
     narrow_write_error never succeeded, as rdev->badblocks.shift = -1.
     I think this is part of the bad block list code Neil has been
     working on.  Would this be the proper place for MD to reset
     write_same_max_bytes to disable future WRITE SAME and to handle
     the individual writes, instead of leaving that to the block layer?

Regards,

-- Joe

From b12c24ee0fce802f35263da65d236694b01c99cf Mon Sep 17 00:00:00 2001
From: Joe Lawrence <joe.lawrence@xxxxxxxxxxx>
Date: Fri, 7 Jun 2013 15:25:54 -0400
Subject: [PATCH] raid1: properly set blk_queue_max_write_same_sectors

MD RAID1 chunk_sectors will always be zero, unlike RAID0, so RAID1
does not need to worry about limiting the write same sectors in that
regard.  Let disk_stack_limits choose the minimum of the RAID1
components' write same values.

Signed-off-by: Joe Lawrence <joe.lawrence@xxxxxxxxxxx>
---
 drivers/md/raid1.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/md/raid1.c b/drivers/md/raid1.c
index fd86b37..3dc9ad6 100644
--- a/drivers/md/raid1.c
+++ b/drivers/md/raid1.c
@@ -2821,9 +2821,6 @@ static int run(struct mddev *mddev)
 	if (IS_ERR(conf))
 		return PTR_ERR(conf);
 
-	if (mddev->queue)
-		blk_queue_max_write_same_sectors(mddev->queue,
-						 mddev->chunk_sectors);
 	rdev_for_each(rdev, mddev) {
 		if (!mddev->gendisk)
 			continue;
-- 
1.8.1.4
--
To unsubscribe from this list: send the line "unsubscribe linux-raid" in
the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at  http://vger.kernel.org/majordomo-info.html