Hi,

Now that I've done some cleanup of the mdadm testing infrastructure, as
well as a lot of long-run testing and bug fixes, I'm much more confident
in the correctness of this series. The previous posting is at [1].

Patch 13 (the pivot patch, which drew much of the feedback on the
previous series) has been completely reworked. Rare bugs were found with
the original method: if the array changed shape at just the right time
while first_wrap was true, the algorithm would fail and blocks would not
be written correctly. To fix this, the new version uses a bitmap to
track which pages have been added to the stripe_head. This requires
limiting the size of the request, but to a size greater than the current
limit (which is based on the number of segments). I've also included
another patch to remove the limit on the number of segments (seeing it
is not needed); the limit on the number of sectors is higher and ends up
with less bio splitting and fewer bios that are unaligned with the chunk
size.

--

I've been doing some work trying to improve the bulk write performance
of raid5 on large systems with fast NVMe drives. The bottleneck appears
largely to be lock contention on the hash_lock and device_lock. This
series improves the situation slightly by addressing a couple of
low-hanging-fruit ways to take the lock fewer times in the request path.

Patch 10 adjusts how batching works by keeping a reference to the
previous stripe_head in raid5_make_request(). In most situations, this
removes the need to take the hash_lock in stripe_add_to_batch_list(),
which should reduce the number of times the lock is taken by a factor of
about 2.

Patch 13 pivots the way raid5_make_request() works. Before the patch,
the code must find the stripe_head for every 4KB page in the request, so
each stripe_head must be found once for every data disk.
The patch changes this so that all the data disks can be added to a
stripe_head at once, and the number of times the stripe_head must be
found (and thus the number of times the hash_lock is taken) should be
reduced by a factor roughly equal to the number of data disks.

Patch 15 increases the restriction on block layer IO size to reduce the
amount of bio splitting, which decreases the number of broken batches
that occur with large IOs due to the unnecessary splitting.

I've also included patch 14, which changes some debug prints to make
debugging a bit easier. The remaining patches are just cleanup and prep
patches for those two patches.

Doing apples-to-apples testing of this series on a small VM with 5 ram
disks, I saw a bandwidth increase of roughly 14%, and lock contentions
on the hash_lock (as reported by lock stat) were reduced by more than a
factor of 5 (though it is still significantly contended).

Testing on larger systems with NVMe drives saw similar small bandwidth
increases, from 3% to 20% depending on the parameters. Oddly, small
arrays had larger gains, likely due to them having lower starting
bandwidths; I would have expected larger gains with larger arrays
(seeing there should have been even fewer locks taken in
raid5_make_request()).

This series is based on the current md/md-next (facef3b96c5b9565).
A git branch is available here:

  https://github.com/sbates130272/linux-p2pmem raid5_lock_cont_v3

Logan

[1] https://lkml.kernel.org/r/20220420195425.34911-1-logang@xxxxxxxxxxxx

--

Changes since v2:
  - Rebased on current md-next branch (facef3b96c5b9565)
  - Reworked Pivot patch with bitmap due to unfixable bug
  - Changed to a ternary operator in ahead_of_reshape() helper
    (per Paul)
  - Separated out the functional change from the non-functional change
    in the first patch (per Paul)
  - Dropped an unnecessary hash argument in __find_stripe()
    (per Christoph)
  - Fixed some minor commit message and comment errors
  - Collected tags from Christoph and Guoqing

Changes since v1:
  - Rebased on current md-next branch (190a901246c69d79)
  - Added patch to create a helper for checking if a sector
    is ahead of the reshape (per Christoph)
  - Reworked the __find_stripe() patch to create a find_get_stripe()
    helper (per Christoph)
  - Added more patches to further refactor raid5_make_request() and
    pull most of the loop body into a helper function (per Christoph)
  - A few other minor cleanups (boolean return, dropping casting when
    printing sectors, commit message grammar) as suggested by Christoph
  - Fixed two uncommon but bad data corruption bugs that were found
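
As a rough illustration of the pivot described above (this is not code
from the series), the expected drop in stripe_head lookups can be
modelled in userspace C. The function names and the PAGE_SECTORS
constant are hypothetical, and the model assumes a request that is
aligned to, and a multiple of, a full stripe, so that every stripe_head
it touches receives one 4KB page per data disk.

```c
#include <stdint.h>

/*
 * Hypothetical userspace model, not kernel code: estimate how many times
 * a stripe_head must be looked up (and hence roughly how often the
 * hash_lock is taken) while handling one write request.
 */

#define PAGE_SECTORS 8u	/* assumed: one 4KB page = 8 x 512-byte sectors */

/* Before the pivot: one stripe_head lookup per 4KB page in the request. */
uint64_t lookups_per_page(uint64_t request_sectors)
{
	return (request_sectors + PAGE_SECTORS - 1) / PAGE_SECTORS;
}

/*
 * After the pivot: one lookup per stripe_head, with the pages for all
 * data disks added in the same pass. Only meaningful for the aligned,
 * full-stripe case assumed above; data_disks must be non-zero.
 */
uint64_t lookups_per_stripe(uint64_t request_sectors, unsigned int data_disks)
{
	uint64_t pages = lookups_per_page(request_sectors);

	return (pages + data_disks - 1) / data_disks;
}
```

For a 1MiB request (2048 sectors) on an array with 4 data disks, this
model gives 256 lookups before the pivot and 64 after, which matches the
"factor roughly equal to the number of data disks" estimate. Unaligned
or partial-stripe requests would see a smaller reduction.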
--

Logan Gunthorpe (15):
  md/raid5: Make logic blocking check consistent with logic that blocks
  md/raid5: Factor out ahead_of_reshape() function
  md/raid5: Refactor raid5_make_request loop
  md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
  md/raid5: Move common stripe get code into new find_get_stripe()
    helper
  md/raid5: Factor out helper from raid5_make_request() loop
  md/raid5: Drop the do_prepare flag in raid5_make_request()
  md/raid5: Move read_seqcount_begin() into make_stripe_request()
  md/raid5: Refactor for loop in raid5_make_request() into while loop
  md/raid5: Keep a reference to last stripe_head for batch
  md/raid5: Refactor add_stripe_bio()
  md/raid5: Check all disks in a stripe_head for reshape progress
  md/raid5: Pivot raid5_make_request()
  md/raid5: Improve debug prints
  md/raid5: Increase restriction on max segments per request

 drivers/md/raid5.c | 641 +++++++++++++++++++++++++++++----------------
 1 file changed, 418 insertions(+), 223 deletions(-)

base-commit: facef3b96c5b9565fa0416d7701ef990ef96e5a6
--
2.30.2