Re: RAID6 rebuild oddity

On 29/03/17 12:08, NeilBrown wrote:


> sdj is getting twice the utilization of the others but only about 10%
> more rKB/sec.  That suggests lots of seeking.

Yes, something is not entirely sequential.

Does "fuser /dev/sdj" report anything funny?

No. No output. As far as I can tell nothing should be touching the disks other than the md kernel thread.
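For completeness, a sketch of a couple of other ways to double-check that nothing has the device open (lsof and the sysfs holders listing are just suggestions along the same lines, not something taken from the trace):

fuser -v /dev/sdj              # verbose form: lists user/PID/command if anything has it open
lsof /dev/sdj                  # alternative view; no output means no userspace opens
ls /sys/block/sdj/holders/     # kernel-level claimants; should show only md0 here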

> Is there filesystem IO happening? If so, what filesystem?
> Have you told the filesystem about the RAID layout?
> Maybe the filesystem keeps reading some index blocks that are always on
> the same drive.

No. I probably wasn't as clear as I should have been in the initial post. There was nothing mounted at the time.

Right now the array contains one large LUKS container (dm-crypt). This was opened and a continuous dd run against the dm device to zero it out:

4111195+1 records in
4111195+1 records out
34487205507072 bytes (34 TB) copied, 57781.1 s, 597 MB/s

So there is no filesystem on the drive.
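For reference, the zeroing pass was nothing more than a plain dd against the opened dm device, roughly along these lines (the mapping name 'crypt0' is illustrative, and the 8M block size is inferred from the record count above):

cryptsetup luksOpen /dev/md0 crypt0             # 'crypt0' is an illustrative mapping name
dd if=/dev/zero of=/dev/mapper/crypt0 bs=8M     # block size inferred from the dd record count above
cryptsetup luksClose crypt0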

I failed and removed sdi, then zeroed its superblock and re-added it:

root@test:~# mdadm --fail /dev/md0 /dev/sdi
mdadm: set /dev/sdi faulty in /dev/md0
root@test:~# mdadm --remove /dev/md0 /dev/sdi
mdadm: hot removed /dev/sdi from /dev/md0
root@test:~# mdadm --zero-superblock /dev/sdi
root@test:~# mdadm --add /dev/md0 /dev/sdi
mdadm: added /dev/sdi

I rebooted the machine to clear any tweaks to things like stripe cache size, readahead, NCQ and anything else.
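For reference, these are the kinds of runtime knobs I mean, all of which revert to defaults on reboot (the paths are the standard sysfs ones; the values are only illustrative, not what had been set):

echo 8192 > /sys/block/md0/md/stripe_cache_size    # RAID5/6 stripe cache (default 256)
echo 4096 > /sys/block/md0/queue/read_ahead_kb     # readahead on the md device
echo 1 > /sys/block/sdb/device/queue_depth         # queue_depth=1 effectively disables NCQ on a member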

I opened the LUKS container, dd'd 1 MiB to the start to write to the array and kick off the resync, then closed the container again. At this point dm should no longer be touching md0, and I've verified the dm device is gone.
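In other words, something along these lines ('crypt0' again being an illustrative mapping name):

cryptsetup luksOpen /dev/md0 crypt0
dd if=/dev/zero of=/dev/mapper/crypt0 bs=1M count=1    # 1 MiB to the start of the container
cryptsetup luksClose crypt0
ls /dev/mapper/                                        # confirm the dm device is gone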

I then ran sync a couple of times and waited a couple of minutes until I was positive _nothing_ was touching md0, then ran:

blktrace -w 5 /dev/sd[bcdefgij] /dev/md0
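The per-device trace files can then be decoded with blkparse, e.g. (a sketch; sdg because that's the drive that stands out below):

blkparse -i sdg        # decode sdg.blktrace.* into a readable event stream
blkparse -i md0        # same for the array device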

> So the problem moves from drive to drive?  Strongly suggests filesystem
> metadata access to me.

Again, sorry for not being clear. The behaviour changes on a per-resync basis: in the reproduction I've just done I popped out sdi rather than sdb, and this time the bottleneck is sdg. If the exact circumstances are repeated, the same drive is the bottleneck each time.


> If you can capture several seconds of trace on all drives plus the
> array, compress it and host it somewhere, I can pick it up and have a
> look.

I've captured 5 seconds. I was overly optimistic initially and tried 20 seconds, but that resulted in 100M bzipped.

This one is 19M. The kernel is a clean 4.10.6 compiled half an hour ago, as I had forgotten to enable block tracing in the previous kernel.

http://www.fnarfbargle.com/private/170329-Resync/resync-blktrace.tar.bz2
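(A sketch of how such a capture gets bundled; the glob assumes blktrace's default <device>.blktrace.<cpu> file naming, so the actual tarball contents may differ slightly:)

tar cjf resync-blktrace.tar.bz2 *.blktrace.*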

root@test:~/bench# mdadm --detail /dev/md0
/dev/md0:
        Version : 1.2
  Creation Time : Wed Mar 22 14:01:41 2017
     Raid Level : raid6
     Array Size : 35162348160 (33533.43 GiB 36006.24 GB)
  Used Dev Size : 5860391360 (5588.90 GiB 6001.04 GB)
   Raid Devices : 8
  Total Devices : 8
    Persistence : Superblock is persistent

    Update Time : Wed Mar 29 16:08:20 2017
          State : clean, degraded, recovering
 Active Devices : 7
Working Devices : 8
 Failed Devices : 0
  Spare Devices : 1

         Layout : left-symmetric
     Chunk Size : 64K

 Rebuild Status : 2% complete

           Name : test:0  (local to host test)
           UUID : 93a09ba7:f159e9f5:7c478f16:6ca8858e
         Events : 715

    Number   Major   Minor   RaidDevice State
       8       8       16        0      active sync   /dev/sdb
       1       8       32        1      active sync   /dev/sdc
       2       8       48        2      active sync   /dev/sdd
       3       8       64        3      active sync   /dev/sde
       4       8       80        4      active sync   /dev/sdf
       5       8       96        5      active sync   /dev/sdg
       9       8      128        6      spare rebuilding   /dev/sdi
       7       8      144        7      active sync   /dev/sdj


avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.08    0.00    4.77    4.75    0.00   90.41

Device:        rrqm/s    wrqm/s     r/s     w/s     rkB/s     wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda              0.00      1.40    0.00    3.00      0.00     10.60     7.07     0.01    4.67    0.00    4.67   4.67   1.40
sdb          14432.20      0.00  120.20    0.00  57295.20      0.00   953.33     2.04   16.71   16.71    0.00   3.23  38.80
sdc          14432.20      0.00  120.20    0.00  57295.20      0.00   953.33     1.99   16.26   16.26    0.00   3.21  38.60
sdd          14432.20      0.00  120.80    0.00  57602.40      0.00   953.68     2.14   17.55   17.55    0.00   3.18  38.40
sde          14432.20      0.00  120.80    0.00  57602.40      0.00   953.68     2.02   16.57   16.57    0.00   3.20  38.60
sdf          14432.20      0.00  120.80    0.00  57602.40      0.00   953.68     2.08   17.04   17.04    0.00   3.25  39.20
sdg          16135.40      0.00  224.60    0.00  65811.20      0.00   586.03   126.46  575.26  575.26    0.00   4.45 100.00
sdh              0.00      1.40    0.00    3.00      0.00     10.60     7.07     0.02    6.67    0.00    6.67   6.67   2.00
sdi              0.00  14982.60    0.00  123.20      0.00  59801.60   970.81     4.52   35.57    0.00   35.57   3.25  40.00
sdj          14432.40      0.00  120.80    0.00  57603.20      0.00   953.70     1.84   15.07   15.07    0.00   3.06  37.00
md1              0.00      0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md2              0.00      0.00    0.00    2.20      0.00      8.80     8.00     0.00    0.00    0.00    0.00   0.00   0.00
md0              0.00      0.00    0.00    0.00      0.00      0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
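(Per-device stats in this form come from iostat's extended mode; a minimal sketch, with the 5-second interval being just an example rather than necessarily what was used:)

iostat -x 5        # extended stats every 5 seconds; the give-away columns are avgqu-sz, await and %util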


Thanks for having a look. It seems odd to me, but I can't figure it out.

Regards,
Brad


