On 11/8/22 19:03, Gabriel Krisman Bertazi wrote:
> Chaitanya Kulkarni <chaitanyak@xxxxxxxxxx> writes:
>
>>> For more interesting cases, where there is queueing, we need to take
>>> into account the cross-communication of the atomic operations. I've
>>> been benchmarking by running parallel fio jobs against a single hctx
>>> nullb in different hardware queue depth scenarios, and verifying both
>>> IOPS and queueing.
>>>
>>> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
>>> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
>>> varying only the hardware queue length per test.
>>>
>>> queue size  2                4                8                16                32                64
>>> 6.1-rc2     1681.1K (1.6K)   2633.0K (12.7K)  6940.8K (16.3K)  8172.3K (617.5K)  8391.7K (367.1K)  8606.1K (351.2K)
>>> patched     1721.8K (15.1K)  3016.7K (3.8K)   7543.0K (89.4K)  8132.5K (303.4K)  8324.2K (230.6K)  8401.8K (284.7K)
>>
>
> Hi Chaitanya,
>
> Thanks for the feedback.
>
>> So if I understand correctly
>> QD 2,4,8 shows clear performance benefit from this patch whereas
>> QD 16, 32, 64 shows drop in performance, is that correct?
>>
>> If my observation is correct then applications with high QD will
>> observe drop in the performance?
>
> To be honest, I'm not sure. Given the overlap of the standard deviation
> (in parentheses) with the mean, I'm not sure the observed drop is
> statistically significant. In my prior analysis, I thought it wasn't.
>
> I don't see where a significant difference would come from, to be honest,
> because the higher the QD, the more likely it is to go through the
> not-contended path, where sbq->ws_active == 0. This hot path is
> identical to the existing implementation.
>

These numbers were taken on null_blk, so the drop I see here may look
different on real H/W. I cannot comment on that since we don't have
that data. Did you repeat the experiment on real H/W, such as an NVMe
SSD?

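FWIW, the significance question can be sanity-checked from the summary
numbers alone. Below is a rough sketch using Welch's t-test, assuming
the values in parentheses are sample standard deviations over the 5
runs (that is my reading, not something confirmed in the thread). The
helper name welch_t and the choice of a 5% threshold are mine, purely
for illustration:

```python
import math

# (mean, stddev) in K IOPS per queue size, copied from the table above.
# Assumption: parenthesized values are sample standard deviations, n = 5 runs.
BASELINE = {2: (1681.1, 1.6), 4: (2633.0, 12.7), 8: (6940.8, 16.3),
            16: (8172.3, 617.5), 32: (8391.7, 367.1), 64: (8606.1, 351.2)}
PATCHED = {2: (1721.8, 15.1), 4: (3016.7, 3.8), 8: (7543.0, 89.4),
           16: (8132.5, 303.4), 32: (8324.2, 230.6), 64: (8401.8, 284.7)}
N_RUNS = 5
# Two-sided 5% critical value of Student's t at df = 4. Welch's df is
# never below n - 1 = 4 here, so this is the most conservative cutoff.
T_CRIT = 2.776


def welch_t(m1, s1, m2, s2, n=N_RUNS):
    """Return (t, df) for Welch's unequal-variance t-test, equal sample sizes."""
    v1, v2 = s1 * s1 / n, s2 * s2 / n
    t = (m2 - m1) / math.sqrt(v1 + v2)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (v1 + v2) ** 2 / ((v1 * v1 + v2 * v2) / (n - 1))
    return t, df


for qd in sorted(BASELINE):
    t, df = welch_t(*BASELINE[qd], *PATCHED[qd])
    verdict = "significant" if abs(t) > T_CRIT else "not significant"
    print(f"qd={qd:<3} t={t:+7.2f} df={df:4.1f} {verdict}")
```

With the table's numbers, this flags the gains at QD 2, 4 and 8 as
significant and the drops at QD 16, 32 and 64 as not significant at the
5% level. With only 5 runs per point that is a weak test, so more
repetitions would still help.
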
>> Also, please share a table with block size/IOPS/BW/CPU (system/user)
>> /LAT/SLAT with % increase/decrease and document the raw numbers at the
>> end of the cover-letter for completeness along with fio job so others
>> can repeat the experiment...
>
> This was issued against the nullb and the IO size is fixed, matching the
> device's block size (512b), which is why I am not tracking BW, only
> IOPS. I'm not sure the BW is still relevant in this scenario.
>
> I'll definitely follow up with CPU time and latencies, and share the
> fio job. I'll also take another look at the significance of the
> measured values for high QD.
>

Yes, please. If CPU usage is way higher, then we need to know that the
above numbers come at the cost of higher CPU. In that case an IOPS per
core and B/W per core matrix could be very useful.

-ck
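
P.S. To be concrete about the per-core matrix, here is a minimal sketch
of the normalization I have in mind. The function name per_core and the
input values are placeholders of mine, not measured data; the CPU
percentages are assumed to be user and system utilization summed over
all jobs, so 100.0 means one fully busy core:

```python
def per_core(iops, bw_bytes, usr_pct, sys_pct):
    """Normalize throughput by the CPU consumed to achieve it.

    usr_pct and sys_pct are CPU utilization percentages summed over all
    jobs, so 100.0 corresponds to one fully busy core.
    """
    cores = (usr_pct + sys_pct) / 100.0
    if cores <= 0:
        raise ValueError("CPU utilization must be positive")
    return iops / cores, bw_bytes / cores


# Placeholder inputs, NOT measured data: 1.0M IOPS of 512-byte writes
# while keeping the equivalent of 8 cores busy.
iops_pc, bw_pc = per_core(1_000_000, 1_000_000 * 512,
                          usr_pct=200.0, sys_pct=600.0)
print(f"{iops_pc:.0f} IOPS/core, {bw_pc / 2**20:.1f} MiB/s/core")
```

Comparing these per-core values between 6.1-rc2 and the patched kernel
at each queue size would show whether any IOPS gain is paid for with
extra CPU.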