On Thursday October 22, gmsoft@xxxxxxxxxxxx wrote:
>
> Neil,
>
> While redoing the reboot test, I've also noticed this:
> When I first issue the --grow command, I see the following in dmesg:
> [192752.106467] md: reshape of RAID array md0
> [192752.106473] md: minimum _guaranteed_ speed: 200000 KB/sec/disk.
> [192752.106479] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.
>
> The minimum guaranteed speed should be 1000 KB/sec according to the
> entry in /proc/sys/dev/raid/speed_limit_min.

This is expected.  Each array can have a local setting in
/sys/block/mdX/md/sync_speed_min which overrides the global setting.

When a reshape does not change the size of the array, we need to
constantly create a backup of the few stripes 'currently' being
reshaped, so that in the event of an unclean shutdown (crash/power
failure) we can restart the reshape without data loss.

The process of reading data to make the backup looks like non-sync IO
to md, so it would normally slow down the resync process.  That is not
a good idea, so mdadm deliberately sets ..../md/sync_speed_min very
high to keep the resync moving.

>
> Also, the performance is not really good.  I get about 400K/sec according to /proc/mdstat.
>
> Now, if I stop the array and assemble it again, things are better.  The output in dmesg displays the correct value:
> [193138.646204] md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
> [193138.646210] md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for reshape.

This is different because when assembling the array, mdadm doesn't set
sync_speed_min until after the reshape has started.
I might try to get mdadm to set it before starting the reshape to avoid
confusion.

>
> And perf is much better: I now get ~1500K/s, which shrinks the time of the reshape from ~2 weeks to 'only' a few days.

That is surprising.  The speed of 1500K/sec seems more reasonable, but
the fact that it changed after you restarted does surprise me.

(goes off to experiment and explore the code)

Ahhh... bug.

In Grow_reshape we have this code:

	if (ndata == odata) {
		/* Make 'blocks' bigger for better throughput, but
		 * not so big that we reject it below.
		 */
		if (blocks * 32 < sra->component_size)
			blocks *= 16;
	} else

This is meant to do the backup in larger chunks in the case where the
array isn't changing size (where the array does change size, we only
do the backup for a fraction of a second, so it doesn't matter).

However sra->component_size is not initialised at this point, so it is
zero, and 'blocks' does not get changed.  (->component_size gets set a
little later, in "sra = sysfs_read(fd,.....)".)

So the reshape is being done with a very small buffer, and you get bad
performance.

The matching code in Grow_continue doesn't check component_size and so
doesn't suffer the same problem.

A bit of experimentation shows that you can increase the throughput
quite a bit more by changing the multiply factor to e.g. 64 and
increasing the stripe_cache_size (in /sys/.../md/).

I wonder how to pick an 'optimal' size....

Maybe I could get the backup process to occasionally look at
stripe_cache_size and adjust the backup size based on that.
Then the admin could try increasing the cache size to improve
throughput, but would have to be careful not to exhaust memory.

I'll have to think about it a bit.

Thanks for your feedback.
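For what it's worth, here is a rough standalone sketch of the
"watch stripe_cache_size" idea.  It is only an illustration, not mdadm
code: the helper names, the "use about half the cache" heuristic and
the cap are arbitrary choices I made up for the example.  The sysfs
path is real; stripe_cache_size counts stripe heads, each holding one
page per device, which is why the sketch converts pages to sectors.

#include <stdio.h>
#include <stdlib.h>

/* Read the current stripe_cache_size (in pages) for an md array,
 * or return -1 if it cannot be read (e.g. not a raid4/5/6 array). */
static long read_stripe_cache_size(const char *mdname)
{
	char path[256];
	FILE *f;
	long val = -1;

	snprintf(path, sizeof(path),
		 "/sys/block/%s/md/stripe_cache_size", mdname);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &val) != 1)
		val = -1;
	fclose(f);
	return val;
}

/* Pick a backup unit (in 512-byte sectors per device) from the cache
 * size.  One page is 8 sectors; aiming at roughly half the cache is a
 * made-up rule that leaves room for the normal reshape IO, and the
 * 4096-sector cap is equally arbitrary. */
static unsigned long backup_sectors_from_cache(long cache_pages)
{
	unsigned long sectors;

	if (cache_pages <= 0)
		return 64;		/* conservative fallback */
	sectors = (unsigned long)cache_pages / 2 * 8;
	if (sectors > 4096)
		sectors = 4096;
	return sectors;
}

int main(int argc, char **argv)
{
	const char *md = argc > 1 ? argv[1] : "md0";
	long cache = read_stripe_cache_size(md);

	printf("%s: stripe_cache_size=%ld -> backup unit %lu sectors\n",
	       md, cache, backup_sectors_from_cache(cache));
	return 0;
}

The point is just that if the backup loop re-did this calculation every
so often, the backup unit would grow automatically when the admin
raises the cache size, instead of being a fixed multiple chosen once at
the start of the reshape.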
NeilBrown