On 11/29/18 2:53 PM, Omar Sandoval wrote:
> On Thu, Nov 29, 2018 at 01:00:25PM -0700, Jens Axboe wrote:
>> sbitmap maintains a set of words that we use to set and clear bits, with
>> each bit representing a tag for blk-mq. Even though we spread the bits
>> out and maintain a hint cache, one particular bit allocated will end up
>> being cleared in the exact same spot.
>>
>> This introduces batched clearing of bits. Instead of clearing a given
>> bit, the same bit is set in a cleared/free mask instead. If we fail
>> allocating a bit from a given word, then we check the free mask, and
>> batch move those cleared bits at that time. This trades 64 atomic bitops
>> for 2 cmpxchg(). On top of that, we do those sequentially, hopefully
>> making that a bit cheaper as well.
>>
>> In a threaded poll test case, half the overhead of getting and clearing
>> tags is removed with this change. On another poll test case with a
>> single thread, performance is unchanged.
>>
>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>
>> ---
>>
>> This patch is on top of the round robin fix for sbitmap just posted.
>
> So I think this can lead to hangs due to interactions with wake_batch.
> Bear with me:
>
> Let's say the sbitmap_queue was created with depth=64 and shift=6,
> i.e., 64 bits all in one word. In this case, wake_batch=8, because 64
> bits clearing should be enough to wake up in 8 batches of 8.
>
> Let's say all 64 bits are allocated and there are 8 waiters. All 64
> bits are then freed (in the cleared word), so we wake up the 8
> waiters. Let's say they all attempt __sbitmap_get_word(), fail, and
> get to sbitmap_deferred_clear() at the same time. One of them does the
> cmpxchg() on cleared, setting it to zero, and then the rest see that
> cleared is zero, so they return because there don't appear to be any
> cleared bits. The first thread succeeds, and the rest go to sleep.
>
> Now, when the first thread clears the bit, it only subtracts one from
> the batch, which isn't enough to do a wakeup. Hang!
>
> This example is contrived, but in general I think that the window
> between the cleared mask being zeroed and the word being cleared is
> going to cause problems.

Honestly, I would have been surprised if the initial version was
bulletproof! I think it should be workable, though; we probably want to
clear in a different order. IOW, clear ->word first, and then update
->cleared, to avoid someone finding ->cleared == 0 and ending up
sleeping. Even now the race is tiny, especially considering we have
multiple maps. I'll take a look at clearing in the opposite order in a
bit.

>> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
>> index 804a50983ec5..cec685b89998 100644
>> --- a/include/linux/sbitmap.h
>> +++ b/include/linux/sbitmap.h
>> @@ -30,14 +30,19 @@ struct seq_file;
>>   */
>>  struct sbitmap_word {
>>  	/**
>> -	 * @word: The bitmap word itself.
>> +	 * @depth: Number of bits being used in @word/@cleared
>>  	 */
>> -	unsigned long word;
>> +	unsigned long depth;
>>
>>  	/**
>> -	 * @depth: Number of bits being used in @word.
>> +	 * @word: word holding free bits
>>  	 */
>> -	unsigned long depth;
>> +	unsigned long word ____cacheline_aligned_in_smp;
>
> Completely separate note (assuming that we can iron out the race above),
> this triples the size of each map. Does the word have to be in a
> separate cacheline from the depth? Ignoring resize, depth and word are
> always accessed together, so it seems to me that it would benefit from
> sharing a cacheline now that clearing happens elsewhere.

We can have depth and word share, that's not a concern. I'll make that
change.

--
Jens Axboe