Thank you John and Damien for the extensive measurements. I don't have much to
add, as my measurements are probably just a subset of Damien's and repeat
their results.

One interesting thing I noticed when experimenting with this: we mostly talk
about average throughput, but sometimes it is interesting to look at the
instantaneous values (measured over small time slices). For example, for a 4K
block size, qd=1, 50/50 randrw job on a dm-crypt encrypted ramdisk with the
ecb(cipher_null) cipher, which I just run continuously in the terminal, the
instantaneous throughput has a somewhat bimodal distribution: it reliably
jumps between medians of ~120 MiB/s and ~80 MiB/s (the overall average
throughput being ~100 MiB/s, of course). This is for dm-crypt with
workqueues. If I disable the workqueues, the distribution of the
instantaneous throughput becomes "normal".

Without looking into it in much detail, I wonder if HT has some side effects
on dm-crypt processing (I have it enabled), because it seems not all "cores"
are equal for dm-crypt, even with the null cipher.

I might get my hands on an arm64 server soon, and I am curious to see how
dm-crypt and workqueues will compare there.

Regards,
Ignat

On Wed, Aug 19, 2020 at 8:10 AM Damien Le Moal <Damien.LeMoal@xxxxxxx> wrote:
>
> John,
>
> On 2020/08/19 13:25, John Dorminy wrote:
> > Your points are good. I don't know a good macrobenchmark at present,
> > but at least various latency numbers are easy to get out of fio.
> >
> > I ran a similar set of tests on an Optane 900P with results below.
> > 'clat' is, as fio reports, the completion latency, measured in usec.
> > 'configuration' is [block size], [iodepth], [jobs]; picked to be a
> > varied selection that obtained excellent throughput from the drive.
> > Table reports average, and 99th percentile, latency times as well as
> > throughput. It matches Ignat's report that large block sizes using the
> > new option can have worse latency and throughput on top-end drives,
> > although that result doesn't make any sense to me.
> >
> > Happy to run some more there or elsewhere if there are suggestions.
> >
> > devicetype     configuration  MB/s  clat mean  clat 99%
> > ------------------------------------------------------------------
> > nvme base      1m,32,4        2259  59280      67634
> > crypt default  1m,32,4        2267  59050      182000
> > crypt no_w_wq  1m,32,4        1758  73954.54   84411
> > nvme base      64k,1024,1     2273  29500.92   30540
> > crypt default  64k,1024,1     2167  29518.89   50594
> > crypt no_w_wq  64k,1024,1     2056  31090.23   31327
> > nvme base      4k,128,4       2159  924.57     1106
> > crypt default  4k,128,4       1256  1663.67    3294
> > crypt no_w_wq  4k,128,4       1703  1165.69    1319
>
> I have been doing a lot of testing recently on dm-crypt, mostly for zoned
> storage, that is, with the write workqueue disabled, but also with regular
> disks to have something to compare to and verify my results. I confirm that
> I see similar changes in throughput/latency in my tests: disabling workqueues
> improves throughput for small IO sizes thanks to the lower latency (removed
> context switch overhead), but the benefits of disabling the workqueues become
> dubious for large IO sizes and deep queue depths. See the heat-map attached
> for more results (nullblk device used for these measurements with 1 job per
> CPU).
>
> I also pushed things further in my tests, as I primarily focused on
> enterprise systems with a large number of storage devices attached to a
> single server. To flatten things out and avoid any performance limitations
> due to the storage devices, PCIe and/or HBA bus speed and memory bus speed,
> I ended up performing lots of tests using nullblk with different settings:
>
> 1) SSD like multiqueue setting with the "none" scheduler, with irq_mode=0
> (immediate completion in submission context) and irq_mode=1 for softirq
> completion (different completion context than submission).
> 2) HDD like single queue with mq-deadline as the scheduler, and the different
> irq_mode settings.
>
> I also played with CPU assignments for the fio jobs and tried various things.
>
> My observations are as follows, in no particular order:
>
> 1) Maximum throughput clearly depends directly on the number of CPUs involved
> in the crypto work. The crypto acceleration is limited per core, and so the
> number of issuer contexts (for writes) and/or completion contexts (for reads)
> almost directly determines maximum performance with workqueues disabled. I
> measured about 1.4GB/s at best on my system with a single writer 128KB/QD=4.
>
> 2) For a multi drive setup with IO issuers limited to a small set of CPUs,
> performance does not scale with the number of disks, as the crypto engine
> speed of the CPUs being used is the limiting factor: both write encryption
> and read decryption happen on that set of CPUs, regardless of the load on the
> other CPUs.
>
> 3) For single queue devices, write performance scales badly with the number
> of CPUs used for IO issuing: the one CPU that runs the device queue to
> dispatch commands ends up doing a lot of crypto work for requests queued
> through other CPUs too.
>
> 4) On a very busy system with a very large number of disks and CPUs used for
> IOs, the total throughput I get is very close for all settings with workqueues
> enabled and disabled, about 50GB/s total on my dual socket Xeon system. There
> was a small advantage for the none scheduler/multiqueue setting that gave up to
> 56GB/s with workqueues on and 47GB/s with workqueues off. The single
> queue/mq-deadline case gave 51 GB/s and 48 GB/s with workqueues on/off.
>
> 5) For the tests with the large number of drives and CPUs, things got
> interesting with the average latency: I saw about the same average with
> workqueues on and off. But the p99 latency was about 3 times lower with
> workqueues off than workqueues on. When all CPUs are busy, reducing overhead by
> avoiding additional context switches clearly helps.
>
> 6) With an arguably more realistic workload of 66% reads and 34% writes (read
> size is 64KB/1MB with a 60%/40% ratio and write size is fixed at 1MB), I ended
> up with higher total throughput with workqueues disabled (44GB/s) vs enabled
> (38GB/s). Average write latency was also 30% lower with workqueues disabled
> without any significant change to the average read latency.
>
> From all these tests, I am currently considering that for a large system with
> lots of devices, disabling workqueues is a win, as long as IO issuers are not
> limited to a small set of CPUs.
>
> The benefits of disabling workqueues for a desktop like system or a server
> system with one (or very few) super fast drives are much less clear in my
> opinion. Average and p99 latency are generally better with workqueues off, but
> total throughput may significantly suffer if only a small number of IO contexts
> are involved, that is, a small number of CPUs participate in the crypto
> processing. Then crypto hardware speed dictates the results and using workqueues
> to get parallelism between more CPU cores can give better throughput.
>
> That said, I am thinking that from all this, we can extract some hints to
> automate the decision for using workqueues or not:
> 1) Small IOs (e.g. 4K) would probably benefit from having workqueues disabled,
> especially for 4Kn storage devices, as such requests would be processed as a
> single block with a single crypto API call.
> 2) It may be good to process any BIO marked with REQ_HIPRI (polling BIO) without
> any workqueue, to reduce latency, as intended by the caller.
> 3) We may want to have read-ahead reads use workqueues, especially for single
> queue devices (HDDs), to avoid increasing latency for other reads completing
> together with these read-ahead requests.
>
> In the end, I am still scratching my head trying to figure out what the best
> default setup may be.
>
> Best regards.
>
>
> --
> Damien Le Moal
> Western Digital Research
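
For illustration only, the three hints at the end of Damien's message could be
sketched as a small decision function. This is a rough Python model, not kernel
code and not anything proposed in the thread: the Bio class, the use_workqueue()
name, the 4096-byte threshold, the order in which the hints are checked, and the
default of keeping workqueues for everything else are all assumptions made for
the sake of the example.

```python
from dataclasses import dataclass

# Assumed threshold, taken from the "e.g. 4K" wording of hint 1.
SMALL_IO_BYTES = 4096


@dataclass
class Bio:
    """Hypothetical stand-in for a block I/O request (not the kernel's struct bio)."""
    size: int             # request size in bytes
    hipri: bool = False   # models a polling request (REQ_HIPRI in the message)
    rahead: bool = False  # models a read-ahead read


def use_workqueue(bio: Bio) -> bool:
    """Return True if the crypto work should be offloaded to a workqueue.

    The precedence between the hints is an assumption of this sketch.
    """
    if bio.hipri:
        # Hint 2: polling requests are processed inline to reduce latency.
        return False
    if bio.rahead:
        # Hint 3: read-ahead reads go through workqueues so they do not
        # delay other reads completing alongside them.
        return True
    if bio.size <= SMALL_IO_BYTES:
        # Hint 1: small IOs are processed inline.
        return False
    # Assumed default: keep workqueues for larger requests.
    return True
```

With these assumptions, a plain 4K request and any polling request are handled
inline, while a 1MB request or a read-ahead read is offloaded. Which rule
should win when several apply (for example a 4K read-ahead) is exactly the kind
of open question the message ends on.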