Jens Axboe <axboe@xxxxxxxxx> writes:

> On 11/5/22 5:10 PM, Gabriel Krisman Bertazi wrote:
>> Performance-wise, one should expect very similar performance to the
>> original algorithm for the case where there is no queueing. In both
>> the old algorithm and this implementation, the first thing done is
>> to check ws_active, which bails out if there is no queueing to be
>> managed. In the new code, we took care to avoid accounting
>> completions and wakeups when there is no queueing, so as not to pay
>> the cost of atomic operations unnecessarily, since skipping them
>> doesn't skew the numbers.
>>
>> For the more interesting cases, where there is queueing, we need to
>> take into account the cross-communication of the atomic operations.
>> I've been benchmarking by running parallel fio jobs against a single
>> hctx nullb in different hardware queue depth scenarios, and
>> verifying both IOPS and queueing.
>>
>> Each experiment was repeated 5 times on a 20-CPU box, with 20
>> parallel jobs. fio was issuing fixed-size randwrites with qd=64
>> against nullb, varying only the hardware queue length per test.
>> Results below are IOPS, reported as mean (stddev) over the 5 runs.
>>
>> queue size   2                4                8                16                32                64
>> 6.1-rc2      1681.1K (1.6K)   2633.0K (12.7K)  6940.8K (16.3K)  8172.3K (617.5K)  8391.7K (367.1K)  8606.1K (351.2K)
>> patched      1721.8K (15.1K)  3016.7K (3.8K)   7543.0K (89.4K)  8132.5K (303.4K)  8324.2K (230.6K)  8401.8K (284.7K)
>>
>> The following is a similar experiment, run against a nullb with a
>> single bitmap shared by 20 hctx spread across 2 NUMA nodes. This had
>> 40 parallel fio jobs operating on the same device.
>>
>> queue size   2                4                8                16                32                 64
>> 6.1-rc2      1081.0K (2.3K)   957.2K (1.5K)    1699.1K (5.7K)   6178.2K (124.6K)  12227.9K (37.7K)   13286.6K (92.9K)
>> patched      1081.8K (2.8K)   1316.5K (5.4K)   2364.4K (1.8K)   6151.4K (20.0K)   11893.6K (17.5K)   12385.6K (18.4K)
>
> What's the queue depth of these devices? That's the interesting
> question here, as it'll tell us if any of these are actually hitting
> the slower path where you made changes.

Hi Jens,

The hardware queue depth is the parameter being varied in this
experiment: each column of the tables corresponds to a different queue
depth, whose value is given in the first row (queue size) of both
tables. For instance, looking at the first table, for a device with
hardware queue depth=2, 6.1-rc2 gave 1681.1K IOPS and the patched
version gave 1721.8K IOPS.

As mentioned, I monitored the size of the sbitmap wait queues during
the benchmark execution to confirm it was indeed hitting the slow path
and queueing. I observed less queueing at the higher QDs (16, 32) and
even less at QD=64. For QD<=8, there was extensive queueing present
throughout the execution. I should have provided the wait queue size
over time alongside the IOPS numbers; I'll collect that when I rerun
the benchmarks, which I already have to do to gather the information
Chaitanya requested.

> I suspect you are for the second set of numbers, but not for the
> first one?

No, both tables show some level of queueing. The shared bitmap in
table 2 surely sees far more intensive queueing, though.

> Anything that isn't hitting the wait path for tags isn't a very
> useful test, as I would not expect any changes there.

Even when there is little to no queueing (QD=64 in this data), we
still enter sbitmap_queue_wake_up and bail out on its very first
check, !ws_active. This is why I think it is important to include
QD=64 here: it is less interesting data, as I mentioned, but it shows
there is no regression of the fast path.
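For reference, here is a minimal sketch of the fast-path exit I am
describing. It is simplified, not the literal patch; the slow-path
helper name __sbitmap_queue_wake_up below just stands in for whatever
does the actual completion accounting and wakeups:

	/* Sketch only; assumes <linux/sbitmap.h> for struct sbitmap_queue. */
	void sbitmap_queue_wake_up(struct sbitmap_queue *sbq, int nr)
	{
		/*
		 * Fast path: no one is queued waiting for tags, so there
		 * is no wakeup work to do. This single atomic read is
		 * the only cost the QD=64 runs above pay.
		 */
		if (!atomic_read(&sbq->ws_active))
			return;

		/* Slow path: account completions and wake up waiters. */
		__sbitmap_queue_wake_up(sbq, nr);
	}

Thanks,

-- 
Gabriel Krisman Bertazi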