On 5/28/24 12:07 PM, Jens Axboe wrote: > On 5/28/24 10:50 AM, Pavel Begunkov wrote: >> On 5/28/24 15:34, Jens Axboe wrote: >>> On 5/28/24 7:31 AM, Pavel Begunkov wrote: >>>> On 5/24/24 23:58, Jens Axboe wrote: >>>>> Hi, >>>>> >>>>> A ring setup with with IORING_SETUP_SINGLE_ISSUER, which is required to >>>> >>>> IORING_SETUP_SINGLE_ISSUER has nothing to do with it, it's >>>> specifically an IORING_SETUP_DEFER_TASKRUN optimisation. >>> >>> Right, I should change that in the commit message. It's task_complete >>> driving it, which is tied to DEFER_TASKRUN. >>> >>>>> use IORING_SETUP_DEFER_TASKRUN, will need two round trips through >>>>> generic task_work. This isn't ideal. This patchset attempts to rectify >>>>> that, taking a new approach rather than trying to use the io_uring >>>>> task_work infrastructure to handle it as in previous postings. >>>> >>>> Not sure why you'd want to piggyback onto overflows, it's not >>>> such a well made and reliable infra, whereas the DEFER_TASKRUN >>>> part of the task_work approach was fine. >>> >>> It's not right now, because it's really a "don't get into this >>> condition, if you do, things are slower". And this series doesn't really >>> change that, and honestly it doesn't even need to. It's still way better >>> than what we had before in terms of DEFER_TASKRUN and messages. >> >> Better than how it is now or comparing to the previous attempt? >> I think the one using io_uring's tw infra was better, which is >> where all wake ups and optimisations currently consolidate. > > Better than both - I haven't tested with the previous version, but I can > certainly do that. The reason why I think it'll be better is that it > avoids the double roundtrips. Yes v1 was using normal task_work which is > better, but it didn't solve what I think is the fundamental issue here. > > I'll forward port it and give it a spin, then we'll know. I suspect a bug in the previous patches, because this is what the forward port looks like. First, for reference, the current results: init_flags=3000 Wait on startup 3767: my fd 3, other 4 3768: my fd 4, other 3 Latencies for: Sender percentiles (nsec): | 1.0000th=[ 740], 5.0000th=[ 748], 10.0000th=[ 756], | 20.0000th=[ 764], 30.0000th=[ 764], 40.0000th=[ 772], | 50.0000th=[ 772], 60.0000th=[ 780], 70.0000th=[ 780], | 80.0000th=[ 860], 90.0000th=[ 892], 95.0000th=[ 900], | 99.0000th=[ 1224], 99.5000th=[ 1368], 99.9000th=[ 1656], | 99.9500th=[ 1976], 99.9900th=[ 3408] Latencies for: Receiver percentiles (nsec): | 1.0000th=[ 2736], 5.0000th=[ 2736], 10.0000th=[ 2768], | 20.0000th=[ 2800], 30.0000th=[ 2800], 40.0000th=[ 2800], | 50.0000th=[ 2832], 60.0000th=[ 2832], 70.0000th=[ 2896], | 80.0000th=[ 2928], 90.0000th=[ 3024], 95.0000th=[ 3120], | 99.0000th=[ 4080], 99.5000th=[15424], 99.9000th=[18560], | 99.9500th=[21632], 99.9900th=[58624] and here's with io_uring-msg_ring.1, which is just a straight forward forward port of the previous patches on the same base as v2: init_flags=3000 Wait on startup 4097: my fd 4, other 3 4096: my fd 3, other 4 Latencies for: Receiver percentiles (nsec): | 1.0000th=[ 5920], 5.0000th=[ 5920], 10.0000th=[ 5984], | 20.0000th=[ 5984], 30.0000th=[ 6048], 40.0000th=[ 6048], | 50.0000th=[ 6112], 60.0000th=[ 6304], 70.0000th=[ 6368], | 80.0000th=[ 6560], 90.0000th=[ 6880], 95.0000th=[ 7072], | 99.0000th=[ 7456], 99.5000th=[ 7712], 99.9000th=[ 8640], | 99.9500th=[10432], 99.9900th=[26240] Latencies for: Sender percentiles (nsec): | 1.0000th=[ 9536], 5.0000th=[ 9664], 10.0000th=[ 9664], | 20.0000th=[ 9920], 30.0000th=[ 9920], 40.0000th=[10048], | 50.0000th=[10176], 60.0000th=[10304], 70.0000th=[10432], | 80.0000th=[10688], 90.0000th=[10944], 95.0000th=[11328], | 99.0000th=[11840], 99.5000th=[12096], 99.9000th=[13888], | 99.9500th=[15424], 99.9900th=[34560] -- Jens Axboe