On 11/5/22 5:10 PM, Gabriel Krisman Bertazi wrote:
> sbitmap suffers from code complexity, as demonstrated by recent fixes,
> and eventual lost wake ups on nested I/O completion. The latter happens,
> from what I understand, due to the non-atomic nature of the updates to
> wait_cnt, which needs to be subtracted and eventually reset when equal
> to zero. This two-step process can eventually miss an update when a
> nested completion happens to interrupt the CPU in between the wait_cnt
> updates. This is very hard to fix, as shown by the recent changes to
> this code.
>
> The code complexity arises mostly from the corner cases to avoid missed
> wakes in this scenario. In addition, the handling of wake_batch
> recalculation plus the synchronization with sbq_queue_wake_up is
> non-trivial.
>
> This patchset implements the idea originally proposed by Jan [1], which
> removes the need for the two-step updates of wait_cnt. This is done by
> tracking the number of completions and wakeups in always increasing,
> per-bitmap counters. Instead of having to reset the wait_cnt when it
> reaches zero, we simply keep counting, and attempt to wake up N threads
> in a single wait queue whenever there is enough space for a batch.
> Waking up fewer than wake_batch shouldn't be a problem, because we
> haven't changed the conditions for wake up, and the existing batch
> calculation guarantees at least enough remaining completions to wake up
> a batch for each queue at any time.
>
> Performance-wise, one should expect very similar performance to the
> original algorithm for the case where there is no queueing. In both the
> old algorithm and this implementation, the first thing is to check
> ws_active, which bails out if there is no queueing to be managed. In the
> new code, we took care to avoid accounting completions and wakeups when
> there is no queueing, to not pay the cost of atomic operations
> unnecessarily, since it doesn't skew the numbers.
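
For anyone following along, the counter scheme described in the quoted
text can be modeled roughly as below. This is only a userspace sketch
using C11 atomics, not the code from the patchset: struct wake_model and
model_complete() are hypothetical names, and the function just returns
how many waiters would be woken instead of touching a wait queue.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical userspace model; names are illustrative, not the kernel's. */
struct wake_model {
	atomic_int ws_active;        /* non-zero while someone is waiting */
	atomic_uint completion_cnt;  /* total completions, never reset */
	atomic_uint wakeup_cnt;      /* total wakeups handed out, never reset */
	unsigned int wake_batch;     /* waiters woken per batch */
};

/*
 * Account nr completions and return how many waiters the caller should
 * wake: either 0 or wake_batch.
 */
static unsigned int model_complete(struct wake_model *m, unsigned int nr)
{
	unsigned int wakeups;

	assert(m->wake_batch > 0);

	/* No queueing: skip the atomic accounting entirely. */
	if (!atomic_load(&m->ws_active))
		return 0;

	atomic_fetch_add(&m->completion_cnt, nr);

	wakeups = atomic_load(&m->wakeup_cnt);
	do {
		/* Unsigned subtraction stays correct across wraparound. */
		if (atomic_load(&m->completion_cnt) - wakeups < m->wake_batch)
			return 0;
		/* On failure, wakeups is reloaded and the check is redone. */
	} while (!atomic_compare_exchange_weak(&m->wakeup_cnt, &wakeups,
					       wakeups + m->wake_batch));

	return m->wake_batch;
}
```

The point of the model is that no counter is ever reset: a batch is
claimed with a single compare-and-exchange on wakeup_cnt, so a nested
completion that lands in between either sees the old value and retries,
or sees the new one and finds too few outstanding completions.
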
>
> For more interesting cases, where there is queueing, we need to take
> into account the cross-communication of the atomic operations. I've
> been benchmarking by running parallel fio jobs against a single hctx
> nullb in different hardware queue depth scenarios, and verifying both
> IOPS and queueing.
>
> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
> varying only the hardware queue length per test.
>
> queue size  2                 4                 8                 16                32                64
> 6.1-rc2     1681.1K (1.6K)    2633.0K (12.7K)   6940.8K (16.3K)   8172.3K (617.5K)  8391.7K (367.1K)  8606.1K (351.2K)
> patched     1721.8K (15.1K)   3016.7K (3.8K)    7543.0K (89.4K)   8132.5K (303.4K)  8324.2K (230.6K)  8401.8K (284.7K)
>
> The following is a similar experiment, run against a nullb with a single
> bitmap shared by 20 hctx spread across 2 NUMA nodes. This has 40
> parallel fio jobs operating on the same device.
>
> queue size  2                 4                 8                 16                32                64
> 6.1-rc2     1081.0K (2.3K)    957.2K (1.5K)     1699.1K (5.7K)    6178.2K (124.6K)  12227.9K (37.7K)  13286.6K (92.9K)
> patched     1081.8K (2.8K)    1316.5K (5.4K)    2364.4K (1.8K)    6151.4K (20.0K)   11893.6K (17.5K)  12385.6K (18.4K)

What's the queue depth of these devices? That's the interesting question
here, as it'll tell us if any of these are actually hitting the slower
path where you made changes. I suspect you are for the second set of
numbers, but not for the first one?

Anything that isn't hitting the wait path for tags isn't a very useful
test, as I would not expect any changes there.

-- 
Jens Axboe