On 11/29/18 2:53 PM, Omar Sandoval wrote:
> On Thu, Nov 29, 2018 at 01:00:25PM -0700, Jens Axboe wrote:
>> sbitmap maintains a set of words that we use to set and clear bits, with
>> each bit representing a tag for blk-mq. Even though we spread the bits
>> out and maintain a hint cache, one particular bit allocated will end up
>> being cleared in the exact same spot.
>>
>> This introduces batched clearing of bits. Instead of clearing a given
>> bit, the same bit is set in a cleared/free mask instead. If we fail
>> allocating a bit from a given word, then we check the free mask, and
>> batch move those cleared bits at that time. This trades 64 atomic bitops
>> for 2 cmpxchg(). On top of that, we do those sequentially, hopefully
>> making that a bit cheaper as well.
>>
>> In a threaded poll test case, half the overhead of getting and clearing
>> tags is removed with this change. On another poll test case with a
>> single thread, performance is unchanged.
>>
>> Signed-off-by: Jens Axboe <axboe@xxxxxxxxx>
>>
>> ---
>>
>> This patch is on top of the round robin fix for sbitmap just posted.
>
> So I think this can lead to hangs due to interactions with wake_batch.
> Bear with me:
>
> Let's say the sbitmap_queue was created with depth=64 and shift=6,
> i.e., 64 bits all in one word. In this case, wake_batch=8, because 64
> bits clearing should be enough to wake up in 8 batches of 8.
>
> Let's say all 64 bits are allocated and there are 8 waiters. All 64
> bits are then freed (in the cleared word), so we wake up the 8
> waiters. Let's say they all attempt __sbitmap_get_word(), fail, and
> get to sbitmap_deferred_clear() at the same time. One of them does the
> cmpxchg() on cleared, setting it to zero, and then the rest see that
> cleared is zero, so they return because there don't appear to be any
> cleared bits. The first thread succeeds, and the rest go to sleep.
>
> Now, when the first thread clears the bit, it only subtracts one from
> the batch, which isn't enough to do a wakeup. Hang!
>
> This example is contrived, but in general I think that the window
> between the cleared mask being zeroed and the word being cleared is
> going to cause problems.

Honestly, I would have been surprised if the initial version was
bulletproof! I think it should be workable, though; we probably want to
clear in a different order. IOW, clear ->word first, and then update
->cleared, to avoid someone finding ->cleared == 0 and ending up
sleeping. Even now the race is tiny, especially considering we have
multiple maps. I'll take a look at clearing in the opposite order in a
bit.

>> diff --git a/include/linux/sbitmap.h b/include/linux/sbitmap.h
>> index 804a50983ec5..cec685b89998 100644
>> --- a/include/linux/sbitmap.h
>> +++ b/include/linux/sbitmap.h
>> @@ -30,14 +30,19 @@ struct seq_file;
>>   */
>>  struct sbitmap_word {
>>  	/**
>> -	 * @word: The bitmap word itself.
>> +	 * @depth: Number of bits being used in @word/@cleared
>>  	 */
>> -	unsigned long word;
>> +	unsigned long depth;
>>
>>  	/**
>> -	 * @depth: Number of bits being used in @word.
>> +	 * @word: word holding free bits
>>  	 */
>> -	unsigned long depth;
>> +	unsigned long word ____cacheline_aligned_in_smp;
>
> Completely separate note (assuming that we can iron out the race above),
> this triples the size of each map. Does the word have to be in a
> separate cacheline from the depth? Ignoring resize, depth and word are
> always accessed together, so it seems to me that it would benefit from
> sharing a cacheline now that clearing happens elsewhere.

We can have depth and word share, that's not a concern. I'll make that
change.

--
Jens Axboe