Re: [PATCH v2 00/12] Improve Raid5 Lock Contention

Logan Gunthorpe <logang@xxxxxxxxxxxx> · Mon, 25 Apr 2022 09:37:00 -0600

On 2022-04-24 01:53, Guoqing Jiang wrote:
> 
> 
> On 4/21/22 3:54 AM, Logan Gunthorpe wrote:
>> Hi,
>>
>> This is v2 of this series which addresses Christoph's feedback and
>> fixes some bugs. The first posting is at [1]. A git branch is
>> available at [2].
>>
>> --
>>
>> I've been doing some work trying to improve the bulk write performance
>> of raid5 on large systems with fast NVMe drives. The bottleneck appears
>> largely to be lock contention on the hash_lock and device_lock. This
>> series improves the situation slightly by addressing a couple of low
>> hanging fruit ways to take the lock fewer times in the request path.
>>
>> Patch 9 adjusts how batching works by keeping a reference to the
>> previous stripe_head in raid5_make_request(). Under most situtations,
>> this removes the need to take the hash_lock in stripe_add_to_batch_list()
>> which should reduce the number of times the lock is taken by a factor of
>> about 2.
>>
>> Patch 12 pivots the way raid5_make_request() works. Before the patch, the
>> code must find the stripe_head for every 4KB page in the request, so each
>> stripe head must be found once for every data disk. The patch changes this
>> so that all the data disks can be added to a stripe_head at once and the
>> number of times the stripe_head must be found (and thus the number of
>> times the hash_lock is taken) should be reduced by a factor roughly equal
>> to the number of data disks.
>>
>> The remaining patches are just cleanup and prep patches for those two
>> patches.
>>
>> Doing apples to apples testing this series on a small VM with 5 ram
>> disks, I saw a bandwidth increase of roughly 14% and lock contentions
>> on the hash_lock (as reported by lock stat) reduced by more than a factor
>> of 5 (though it is still significantly contended).
>>
>> Testing on larger systems with NVMe drives saw similar small bandwidth
>> increases from 3% to 20% depending on the parameters. Oddly small arrays
>> had larger gains, likely due to them having lower starting bandwidths; I
>> would have expected larger gains with larger arrays (seeing there
>> should have been even fewer locks taken in raid5_make_request()).
>>
>> Logan
>>
>> [1] https://lkml.kernel.org/r/20220407164511.8472-1-logang@xxxxxxxxxxxx
>> [2] https://github.com/sbates130272/linux-p2pmem raid5_lock_cont_v2
>>
>> --
>>
>> Changes since v1:
>>    - Rebased on current md-next branch (190a901246c69d79)
>>    - Added patch to create a helper for checking if a sector
>>      is ahead of the reshape (per Christoph)
>>    - Reworked the __find_stripe() patch to create a find_get_stripe()
>>      helper (per Christoph)
>>    - Added more patches to further refactor raid5_make_request() and
>>      pull most of the loop body into a helper function (per Christoph)
>>    - A few other minor cleanups (boolean return, droping casting when
>>      printing sectors, commit message grammar) as suggested by Christoph.
>>    - Fixed two uncommon but bad data corruption bugs in that were found.
>>
>> --
>>
>> Logan Gunthorpe (12):
>>    md/raid5: Factor out ahead_of_reshape() function
>>    md/raid5: Refactor raid5_make_request loop
>>    md/raid5: Move stripe_add_to_batch_list() call out of add_stripe_bio()
>>    md/raid5: Move common stripe count increment code into __find_stripe()
>>    md/raid5: Factor out helper from raid5_make_request() loop
>>    md/raid5: Drop the do_prepare flag in raid5_make_request()
>>    md/raid5: Move read_seqcount_begin() into make_stripe_request()
>>    md/raid5: Refactor for loop in raid5_make_request() into while loop
>>    md/raid5: Keep a reference to last stripe_head for batch
>>    md/raid5: Refactor add_stripe_bio()
>>    md/raid5: Check all disks in a stripe_head for reshape progress
>>    md/raid5: Pivot raid5_make_request()
> 
> Generally, I don't object the cleanup patches since the code looks more 
> cleaner.
> But my concern is that since some additional function calls are added to 
> hot path
> (raid5_make_request), could the performance be affected?

There's a bit of logic added to the raid5_make_requests but it is all
local and should be fast, and it reduces the amount of calls to the slow
contended locks.

> And I think patch 9 and patch 12 are helpful for performance 
> improvement,  did
> you measure the performance without those cleanup patches?

Yes, I compared performance with and without this entire series.

Logan