On 29/03/17 12:08, NeilBrown wrote:
> sdj is getting twice the utilization of the others but only about 10%
> more rKB/sec. That suggests lots of seeking.
Yes, something is not entirely sequential.
Does "fuser /dev/sdj" report anything funny?
No. No output. As far as I can tell nothing should be touching the disks
other than the md kernel thread.
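For completeness, these are the sorts of checks I'd expect to turn
something up if anything in userspace had the devices open (fuser gave
no output here; the others should tell the same story):

fuser -v /dev/sd[bcdefgij]          # verbose; lists PID and command of anything holding the devices
lsof /dev/md0 /dev/sd[bcdefgij]     # same check via lsof
ps ax | grep 'md0_'                 # should show only the md kernel threads for the array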
> Is there filesystem IO happening? If so, what filesystem?
> Have you told the filesystem about the RAID layout?
> Maybe the filesystem keeps reading some index blocks that are always on
> the same drive.
No. I probably wasn't as clear as I should have been in the initial
post. There was nothing mounted at the time.
Right now the array contains one large LUKS container (dm-crypt). This
was opened and a continuous dd was done to the dm device to zero it out:
4111195+1 records in
4111195+1 records out
34487205507072 bytes (34 TB) copied, 57781.1 s, 597 MB/s
So there is no filesystem on the drive.
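If it helps, the lack of a filesystem is easy to confirm by probing md0
directly; these should show nothing but the LUKS signature:

blkid -p /dev/md0     # low-level probe; expect only TYPE="crypto_LUKS"
file -s /dev/md0      # reads the on-disk signature directly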
I failed and removed sdi:
root@test:~# mdadm --fail /dev/md0 /dev/sdi
mdadm: set /dev/sdi faulty in /dev/md0
root@test:~# mdadm --remove /dev/md0 /dev/sdi
mdadm: hot removed /dev/sdi from /dev/md0
root@test:~# mdadm --zero-superblock /dev/sdi
root@test:~# mdadm --add /dev/md0 /dev/sdi
mdadm: added /dev/sdi
I rebooted the machine to clear any tweaks to things like stripe cache
size, readahead, NCQ and anything else.
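To be concrete, these are the sorts of knobs I mean, which should all be
back at their defaults after the reboot (sdb shown as an example member;
the same applies to each drive):

cat /sys/block/md0/md/stripe_cache_size    # stripe cache size on the array
blockdev --getra /dev/md0                  # readahead on the array
cat /sys/block/sdb/device/queue_depth      # NCQ queue depth on a member drive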
I opened the LUKS container, dd'd a meg to the start to write to the
array and kick off the resync, then closed the LUKS container. At this
point dm should no longer be touching the drive and I've verified the
device has gone.
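Roughly this sequence, with the mapping name below just a stand-in for
the real one:

cryptsetup luksOpen /dev/md0 cryptvol                   # 'cryptvol' is a placeholder name
dd if=/dev/zero of=/dev/mapper/cryptvol bs=1M count=1   # write 1 MiB to the start
cryptsetup luksClose cryptvol
dmsetup ls                                              # confirm the dm mapping is gone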
I then ran sync a couple of times and waited a couple of minutes until I
was positive _nothing_ was touching md0, then ran:
blktrace -w 5 /dev/sd[bcdefgij] /dev/md0
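In case it saves you a step, the per-device files in the tarball should
parse straight back with blkparse (assuming the default blktrace output
naming):

blkparse -i sdg                 # reads sdg.blktrace.* from the current directory
blkparse -i md0 -o md0.txt      # or dump one device to a text file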
> So the problem moves from drive to drive? Strongly suggests filesystem
> metadata access to me.
Again, sorry for not being clear. The situation changes on a per-resync
basis. For example, in this reproduction I popped out sdi rather than
sdb, and now the bottleneck is sdg. If the exact circumstances stay the
same, the same drive is the bottleneck every time.
> If you can capture several seconds of trace on all drives plus the
> array, compress it and host it somewhere, I can pick it up and have a
> look.
I've captured 5 seconds. I was overly optimistic initially and tried 20
seconds, but that resulted in 100M bzipped. This one is 19M. The kernel
is a clean 4.10.6 compiled half an hour ago, as I had forgotten to
enable block tracing in the previous build.
http://www.fnarfbargle.com/private/170329-Resync/resync-blktrace.tar.bz2
root@test:~/bench# mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Wed Mar 22 14:01:41 2017
Raid Level : raid6
Array Size : 35162348160 (33533.43 GiB 36006.24 GB)
Used Dev Size : 5860391360 (5588.90 GiB 6001.04 GB)
Raid Devices : 8
Total Devices : 8
Persistence : Superblock is persistent
Update Time : Wed Mar 29 16:08:20 2017
State : clean, degraded, recovering
Active Devices : 7
Working Devices : 8
Failed Devices : 0
Spare Devices : 1
Layout : left-symmetric
Chunk Size : 64K
Rebuild Status : 2% complete
Name : test:0 (local to host test)
UUID : 93a09ba7:f159e9f5:7c478f16:6ca8858e
Events : 715
Number Major Minor RaidDevice State
8 8 16 0 active sync /dev/sdb
1 8 32 1 active sync /dev/sdc
2 8 48 2 active sync /dev/sdd
3 8 64 3 active sync /dev/sde
4 8 80 4 active sync /dev/sdf
5 8 96 5 active sync /dev/sdg
9 8 128 6 spare rebuilding /dev/sdi
7 8 144 7 active sync /dev/sdj
avg-cpu: %user %nice %system %iowait %steal %idle
0.08 0.00 4.77 4.75 0.00 90.41
Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     1.40    0.00    3.00     0.00    10.60     7.07     0.01    4.67    0.00    4.67   4.67   1.40
sdb           14432.20     0.00  120.20    0.00 57295.20     0.00   953.33     2.04   16.71   16.71    0.00   3.23  38.80
sdc           14432.20     0.00  120.20    0.00 57295.20     0.00   953.33     1.99   16.26   16.26    0.00   3.21  38.60
sdd           14432.20     0.00  120.80    0.00 57602.40     0.00   953.68     2.14   17.55   17.55    0.00   3.18  38.40
sde           14432.20     0.00  120.80    0.00 57602.40     0.00   953.68     2.02   16.57   16.57    0.00   3.20  38.60
sdf           14432.20     0.00  120.80    0.00 57602.40     0.00   953.68     2.08   17.04   17.04    0.00   3.25  39.20
sdg           16135.40     0.00  224.60    0.00 65811.20     0.00   586.03   126.46  575.26  575.26    0.00   4.45 100.00
sdh               0.00     1.40    0.00    3.00     0.00    10.60     7.07     0.02    6.67    0.00    6.67   6.67   2.00
sdi               0.00 14982.60    0.00  123.20     0.00 59801.60   970.81     4.52   35.57    0.00   35.57   3.25  40.00
sdj           14432.40     0.00  120.80    0.00 57603.20     0.00   953.70     1.84   15.07   15.07    0.00   3.06  37.00
md1               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
md2               0.00     0.00    0.00    2.20     0.00     8.80     8.00     0.00    0.00    0.00    0.00   0.00   0.00
md0               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
Thanks for having a look. It seems odd to me, but I can't figure it out.
Regards,
Brad