On 5/6/22 8:33 PM, Jens Axboe wrote:
> On 5/6/22 5:26 PM, Jens Axboe wrote:
>> On 5/6/22 4:23 PM, Jens Axboe wrote:
>>> On 5/6/22 1:00 AM, Hao Xu wrote:
>>>> Let fast poll support multishot mode; currently accept is the only
>>>> consumer added.
>>>>
>>>> Theoretical analysis:
>>>> 1) when connections come in fast
>>>>    - singleshot:
>>>>
>>>>      add accept sqe(userspace) --> accept inline
>>>>                  ^                       |
>>>>                  |-----------------------|
>>>>
>>>>    - multishot:
>>>>
>>>>      add accept sqe(userspace) --> accept inline
>>>>                                     ^      |
>>>>                                     |--*---|
>>>>
>>>>    we accept repeatedly at * until we get EAGAIN
>>>>
>>>> 2) when connections come in at low pressure
>>>>    similar to 1): we avoid a lot of userspace-kernel context
>>>>    switches and useless vfs_poll() calls
>>>>
>>>> Tests:
>>>> I did some tests, which go like this:
>>>>
>>>>    server          client(multiple)
>>>>    accept          connect
>>>>    read            write
>>>>    write           read
>>>>    close           close
>>>>
>>>> Basically, spin up a number of clients (on the same machine as the
>>>> server) to connect to the server and write some data to it; the
>>>> server writes the data back to the client after it receives it, and
>>>> closes the connection after the write returns. The client then reads
>>>> the data and closes the connection. Here I test 10000 clients
>>>> connecting to one server, with a data size of 128 bytes. Each client
>>>> runs in its own goroutine, so they hit the server within a short
>>>> window. I ran the test 20 times before/after this patchset; time
>>>> spent (unit: cycles, the return value of clock()):
>>>> before:
>>>> (1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
>>>> +1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
>>>> +1934226+1914385)/20.0 = 1927633.75
>>>> after:
>>>> (1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
>>>> +1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
>>>> +1871324+1940803)/20.0 = 1894750.45
>>>>
>>>> (1927633.75 - 1894750.45) / 1927633.75 = 1.71%
>>>>
>>>> A liburing test is here:
>>>> https://github.com/HowHsu/liburing/blob/multishot_accept/test/accept.c
>>>
>>> Wish I had seen that, I wrote my own! But maybe that's good, you tend
>>> to find other issues that way.
>>>
>>> Anyway, works for me in testing, and I can see this being a nice win
>>> for accept-intensive workloads. I pushed a bunch of cleanup patches
>>> that should just get folded in. Can you fold them into your patches,
>>> address the other feedback, and post a v3? I pushed the test branch
>>> here:
>>>
>>> https://git.kernel.dk/cgit/linux-block/log/?h=fastpoll-mshot
>>
>> Quick benchmark here, accepting 10k connections:
>>
>> Stock kernel:
>> real	0m0.728s
>> user	0m0.009s
>> sys	0m0.192s
>>
>> Patched:
>> real	0m0.684s
>> user	0m0.018s
>> sys	0m0.102s
>>
>> Looks like a nice win for a highly synthetic benchmark. Nothing
>> scientific, I was just curious.
>
> One more thought on this - how is it supposed to work with
> accept-direct? One idea would be to make the fixed-file slot
> incrementally increasing. But we need a good story for that; if
> multishot is exclusive to non-direct files, it's a lot less
> interesting, as direct files are a really nice win when dealing with
> lots of files. If we can combine the two, even better.
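For anyone following along, the consumer side looks roughly like the
below. This is a minimal sketch, assuming a liburing helper named
io_uring_prep_multishot_accept() (as in Hao's test above) and the
IORING_CQE_F_MORE convention from multishot poll; handle_conn() is just
a placeholder and the final API may still change:

#include <liburing.h>
#include <stdio.h>

/* placeholder for whatever the server does with the new connection */
extern void handle_conn(int fd);

static int accept_loop(struct io_uring *ring, int listen_fd)
{
	struct io_uring_sqe *sqe;
	struct io_uring_cqe *cqe;
	int ret, more;

	/* Arm a single multishot accept; it stays active and posts one
	 * CQE per accepted connection until canceled or errored. */
	sqe = io_uring_get_sqe(ring);
	io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
	io_uring_submit(ring);

	for (;;) {
		ret = io_uring_wait_cqe(ring, &cqe);
		if (ret < 0)
			return ret;
		if (cqe->res < 0) {
			fprintf(stderr, "accept failed: %d\n", cqe->res);
			io_uring_cqe_seen(ring, cqe);
			return cqe->res;
		}
		handle_conn(cqe->res);	/* res is the new fd */
		/* no IORING_CQE_F_MORE means the request terminated
		 * and would have to be re-armed */
		more = cqe->flags & IORING_CQE_F_MORE;
		io_uring_cqe_seen(ring, cqe);
		if (!more)
			break;
	}
	return 0;
}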
Running some quick testing on an actual test box (the previous numbers
were from a vm on my laptop):

Testing singleshot, normal files
Did 10000 accepts

________________________________________________________
Executed in  216.10 millis    fish           external
   usr time    9.32 millis  150.00 micros    9.17 millis
   sys time  110.06 millis   67.00 micros  109.99 millis

Testing multishot, fixed files
Did 10000 accepts

________________________________________________________
Executed in  189.04 millis    fish           external
   usr time   11.86 millis  159.00 micros   11.71 millis
   sys time   93.71 millis   70.00 micros   93.64 millis

That's about ~19 usec per accepted connection, pretty decent. Using
singleshot with fixed files shaves about 8% off, ending up at around
200 msec. I think we can get away with using fixed files and multishot;
the quick patch I did to test it is attached below. We need something
better than this, though: once the slot space fills up and connections
close, we'll likely end up with a sparse space, and the naive approach
of just incrementing to the next slot won't work at all.

--
Jens Axboe
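To illustrate the slot problem above (this is not the attached patch;
all names here are hypothetical): once fixed-file slots are freed out
of order, allocation has to search for a free slot rather than just
increment an index, e.g. with a simple bitmap:

#include <limits.h>

#define NR_FILE_SLOTS	4096
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

static unsigned long slot_map[NR_FILE_SLOTS / BITS_PER_WORD];

/* Find and claim the first free fixed-file slot; -1 if table is full. */
static int alloc_file_slot(void)
{
	unsigned int i;

	for (i = 0; i < NR_FILE_SLOTS; i++) {
		unsigned long *word = &slot_map[i / BITS_PER_WORD];
		unsigned long bit = 1UL << (i % BITS_PER_WORD);

		if (!(*word & bit)) {
			*word |= bit;	/* mark slot in use */
			return i;
		}
	}
	return -1;
}

/* Release a slot when the connection is closed. */
static void free_file_slot(unsigned int slot)
{
	slot_map[slot / BITS_PER_WORD] &= ~(1UL << (slot % BITS_PER_WORD));
}

A cached next-free hint or a word-at-a-time scan would make this
cheaper than the linear walk shown, but the point is the same: the
allocator, not the submitter, has to own slot selection.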