With 48 threads I am getting 200 MB/s, and about the same with 48 separate
uring instances. With a single uring instance (or with a shared pool) I get
60 MB/s. The fs is ext4, the device is an SSD.

On Mon, 17 Aug 2020 at 17:29, Jens Axboe <axboe@xxxxxxxxx> wrote:
>
> On 8/17/20 4:46 AM, Dmitry Shulyak wrote:
> > Hi everyone,
> >
> > I noticed in iotop that all writes are executed by the same thread
> > (io_wqe_worker-0). This is a significant problem if I am using files
> > with the mentioned flags. That is not the case with reads, where requests
> > are multiplexed over many threads (note the different name,
> > io_wqe_worker-1). The problem is not specific to O_SYNC; in the
> > general case I can get higher throughput with a thread pool and regular
> > system calls, but specifically with O_SYNC the throughput is the same
> > as if I were using a single thread for writing.
> >
> > The setup is always the same: a ring per thread with a shared worker
> > pool (IORING_SETUP_ATTACH_WQ), and a high submission rate. It is
> > also possible to get around this performance issue by using separate
> > worker pools, but then I have to load balance the workload between many
> > rings for perf gains.
> >
> > I thought that it might have something to do with the IOSQE_ASYNC flag,
> > but setting it had no effect.
> >
> > Is this expected behavior? Are there any other solutions, except
> > creating many rings with isolated worker pools?
>
> This is done on purpose, as buffered writes end up being serialized
> on the inode mutex anyway. So if you spread the load over multiple
> workers, you generally just waste resources. In detail, writes to the
> same inode are serialized by io-wq; it doesn't attempt to run them
> in parallel.
>
> What kind of performance are you seeing with io_uring vs your own
> thread pool that doesn't serialize writes? On what fs and what kind
> of storage?
>
> --
> Jens Axboe
>
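
For reference, a minimal sketch of the shared-pool setup described above
(one ring per thread, all attaching to the first ring's io-wq via
IORING_SETUP_ATTACH_WQ), assuming liburing. NR_RINGS, QUEUE_DEPTH and the
trimmed error handling are illustrative, not the exact benchmark code:

        /* Minimal sketch, assuming liburing; values are illustrative. */
        #include <string.h>
        #include <liburing.h>

        #define NR_RINGS    48
        #define QUEUE_DEPTH 64

        static struct io_uring rings[NR_RINGS];

        static int setup_rings(void)
        {
                struct io_uring_params p;
                int i, ret;

                /* The first ring creates and owns the async worker pool (io-wq). */
                memset(&p, 0, sizeof(p));
                ret = io_uring_queue_init_params(QUEUE_DEPTH, &rings[0], &p);
                if (ret < 0)
                        return ret;

                /* The remaining rings attach to ring 0's workqueue instead of
                 * creating their own, so all async work is served by one pool. */
                for (i = 1; i < NR_RINGS; i++) {
                        memset(&p, 0, sizeof(p));
                        p.flags = IORING_SETUP_ATTACH_WQ;
                        p.wq_fd = rings[0].ring_fd;
                        ret = io_uring_queue_init_params(QUEUE_DEPTH, &rings[i], &p);
                        if (ret < 0)
                                return ret;
                }
                return 0;
        }

The "48 separate uring instances" variant is the same loop without
IORING_SETUP_ATTACH_WQ, so each ring gets its own isolated worker pool.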