Re: [PATCH v2 0/5] fast poll multishot mode

Hao Xu <haoxu.linux@xxxxxxxxx> · Sun, 8 May 2022 00:01:31 +0800

在 2022/5/7 上午11:08, Jens Axboe 写道:
On 5/6/22 8:33 PM, Jens Axboe wrote:
On 5/6/22 5:26 PM, Jens Axboe wrote:
On 5/6/22 4:23 PM, Jens Axboe wrote:
On 5/6/22 1:00 AM, Hao Xu wrote:
Let multishot support multishot mode, currently only add accept as its
first comsumer.
theoretical analysis:
   1) when connections come in fast
     - singleshot:
               add accept sqe(userpsace) --> accept inline
                               ^                 |
                               |-----------------|
     - multishot:
              add accept sqe(userspace) --> accept inline
                                               ^     |
                                               |--*--|

     we do accept repeatedly in * place until get EAGAIN

   2) when connections come in at a low pressure
     similar thing like 1), we reduce a lot of userspace-kernel context
     switch and useless vfs_poll()

tests:
Did some tests, which goes in this way:

   server    client(multiple)
   accept    connect
   read      write
   write     read
   close     close

Basically, raise up a number of clients(on same machine with server) to
connect to the server, and then write some data to it, the server will
write those data back to the client after it receives them, and then
close the connection after write return. Then the client will read the
data and then close the connection. Here I test 10000 clients connect
one server, data size 128 bytes. And each client has a go routine for
it, so they come to the server in short time.
test 20 times before/after this patchset, time spent:(unit cycle, which
is the return value of clock())
before:
   1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
   +1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
   +1934226+1914385)/20.0 = 1927633.75
after:
   1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
   +1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
   +1871324+1940803)/20.0 = 1894750.45

(1927633.75 - 1894750.45) / 1927633.75 = 1.65%

A liburing test is here:
https://github.com/HowHsu/liburing/blob/multishot_accept/test/accept.c

Wish I had seen that, I wrote my own! But maybe that's good, you tend to
find other issues through that.

Anyway, works for me in testing, and I can see this being a nice win for
accept intensive workloads. I pushed a bunch of cleanup patches that
should just get folded in. Can you fold them into your patches and
address the other feedback, and post a v3? I pushed the test branch
here:

https://git.kernel.dk/cgit/linux-block/log/?h=fastpoll-mshot

Quick benchmark here, accepting 10k connections:

Stock kernel
real	0m0.728s
user	0m0.009s
sys	0m0.192s

Patched
real	0m0.684s
user	0m0.018s
sys	0m0.102s

Looks like a nice win for a highly synthetic benchmark. Nothing
scientific, was just curious.

One more thought on this - how is it supposed to work with
accept-direct? One idea would be to make it incrementally increasing.
But we need a good story for that, if it's exclusive to non-direct
files, then it's a lot less interesting as the latter is really nice win
for lots of files. If we can combine the two, even better.

Running some quick testing, on an actual test box (previous numbers were
from a vm on my laptop):

Testing singleshot, normal files
Did 10000 accepts

________________________________________________________
Executed in  216.10 millis    fish           external
    usr time    9.32 millis  150.00 micros    9.17 millis
    sys time  110.06 millis   67.00 micros  109.99 millis

Testing multishot, fixed files
Did 10000 accepts

________________________________________________________
Executed in  189.04 millis    fish           external
    usr time   11.86 millis  159.00 micros   11.71 millis
    sys time   93.71 millis   70.00 micros   93.64 millis

That's about ~19 usec to accept a connection, pretty decent. Using
singleshot and with fixed files, it shaves about ~8% off, ends at around
200msec.

I think we can get away with using fixed files and multishot, attaching
I'm not following, do you mean we shouldn't do the multishot+fixed file
or we should use multishot+fixed to make the result better?
the quick patch I did below to test it. We need something better than
Sorry Jens, I didn't see the quick patch, is there anything I misunderstand?
this, otherwise once the space fills up, we'll likely end up with a
sparse space and the naive approach of just incrementing the next slot
won't work at all.