With 48 threads I am getting 200 MB/s, and about the same with 48 separate
uring instances. With a single uring instance (or with a shared pool) I get
60 MB/s. The fs is ext4, the device is an SSD.

On Mon, 17 Aug 2020 at 17:29, Jens Axboe <axboe@xxxxxxxxx> wrote:
>
> On 8/17/20 4:46 AM, Dmitry Shulyak wrote:
> > Hi everyone,
> >
> > I noticed in iotop that all writes are executed by the same thread
> > (io_wqe_worker-0). This is a significant problem if I am using files
> > with the mentioned flags. That is not the case with reads, where requests
> > are multiplexed over many threads (note the different name,
> > io_wqe_worker-1). The problem is not specific to O_SYNC; in the
> > general case I can get higher throughput with a thread pool and regular
> > system calls, but specifically with O_SYNC the throughput is the same
> > as if I were using a single thread for writing.
> >
> > The setup is always the same: a ring per thread with a shared worker
> > pool (IORING_SETUP_ATTACH_WQ), and a high submission rate. It is
> > also possible to get around this performance issue by using separate
> > worker pools, but then I have to load balance the workload between many
> > rings for perf gains.
> >
> > I thought that it might have something to do with the IOSQE_ASYNC flag,
> > but setting it had no effect.
> >
> > Is this expected behavior? Are there any other solutions, except
> > creating many rings with isolated worker pools?
>
> This is done on purpose, as buffered writes end up being serialized
> on the inode mutex anyway. So if you spread the load over multiple
> workers, you generally just waste resources. In detail, writes to the
> same inode are serialized by io-wq; it doesn't attempt to run them
> in parallel.
>
> What kind of performance are you seeing with io_uring vs your own
> thread pool that doesn't serialize writes? On what fs and what kind
> of storage?
>
> --
> Jens Axboe
>
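
For reference, a minimal sketch of the shared-pool setup described above
(one ring per thread, all attaching to the first ring's io-wq via
IORING_SETUP_ATTACH_WQ), assuming liburing. NR_RINGS, QUEUE_DEPTH and the
trimmed error handling are illustrative, not the exact benchmark code:

        /* Minimal sketch, assuming liburing; values are illustrative. */
        #include <string.h>
        #include <liburing.h>

        #define NR_RINGS    48
        #define QUEUE_DEPTH 64

        static struct io_uring rings[NR_RINGS];

        static int setup_rings(void)
        {
                struct io_uring_params p;
                int i, ret;

                /* The first ring creates and owns the async worker pool (io-wq). */
                memset(&p, 0, sizeof(p));
                ret = io_uring_queue_init_params(QUEUE_DEPTH, &rings[0], &p);
                if (ret < 0)
                        return ret;

                /* The remaining rings attach to ring 0's workqueue instead of
                 * creating their own, so all async work is served by one pool. */
                for (i = 1; i < NR_RINGS; i++) {
                        memset(&p, 0, sizeof(p));
                        p.flags = IORING_SETUP_ATTACH_WQ;
                        p.wq_fd = rings[0].ring_fd;
                        ret = io_uring_queue_init_params(QUEUE_DEPTH, &rings[i], &p);
                        if (ret < 0)
                                return ret;
                }
                return 0;
        }

The "48 separate uring instances" variant is the same loop without
IORING_SETUP_ATTACH_WQ, so each ring gets its own isolated worker pool.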