Hi Hielke,

> If you want max performance, what you generally will see in non-blocking servers is one event loop per core/thread.
> This means one ring per core/thread. Of course there is no simple answer to this.
> See how thread-based servers work vs non-blocking servers. E.g. Apache vs Nginx or Tomcat vs Netty.

I think a lot depends on the internal uring implementation: to what degree the kernel is able to handle multiple urings independently, without many contention points (like updates of the same memory locations from multiple threads), and thus take advantage of one ring per CPU core. For example, if the tasks from multiple rings are later funneled into a single input queue inside the kernel (effectively forming a contention point), I see no reason to use an exclusive ring per core in user space. [BTW, in Windows, IOCP is always one input+output queue for all (active) threads.]

Also, we could pop multiple completion events from a single CQ at once and spread the handling across core-bound threads.

I thought about one uring per core at first, but now I'm not sure - maybe the kernel devs have something to add to the discussion?

P.S. uring is the main reason I'm switching from Windows to Linux dev for a client-server app, so I want to extract the max performance possible out of this new exciting uring stuff. :)

Thanks,
Dmitry
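To make the two layouts I'm comparing concrete, here are rough sketches assuming liburing; error handling, CPU pinning, socket setup and the actual SQE preparation are omitted, and any helper names in comments are hypothetical placeholders.

First, the "one ring per core/thread" layout - each thread owns a private ring and runs its own event loop:

#include <liburing.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define QUEUE_DEPTH 256
#define MAX_THREADS 64

/* One event loop per thread, each thread owning a private ring.
 * Pinning each thread to a core (CPU affinity) is left out here. */
static void *event_loop(void *arg)
{
    struct io_uring ring;
    struct io_uring_cqe *cqe;
    (void)arg;

    if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) {
        perror("io_uring_queue_init");
        return NULL;
    }

    for (;;) {
        /* ... prepare SQEs for the sockets owned by this thread ... */
        io_uring_submit(&ring);

        if (io_uring_wait_cqe(&ring, &cqe) < 0)
            break;

        /* handle_completion(cqe);  -- application specific */
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    return NULL;
}

int main(void)
{
    long ncpu = sysconf(_SC_NPROCESSORS_ONLN);
    pthread_t tids[MAX_THREADS];
    long i;

    /* one event-loop thread (and thus one ring) per online CPU */
    for (i = 0; i < ncpu && i < MAX_THREADS; i++)
        pthread_create(&tids[i], NULL, event_loop, NULL);
    for (i = 0; i < ncpu && i < MAX_THREADS; i++)
        pthread_join(tids[i], NULL);
    return 0;
}

And the alternative I mentioned: a single shared ring whose CQ is drained in batches and fanned out to core-bound worker threads. io_uring_peek_batch_cqe() and io_uring_cq_advance() are the liburing helpers I have in mind for consuming several CQEs in one pass; dispatch_to_worker() is just a placeholder for the hand-off:

#include <liburing.h>

#define CQE_BATCH 64

/* Drain up to CQE_BATCH completions from one shared ring in a single
 * pass, then mark them all as seen with one CQ-ring update. */
static void drain_completions(struct io_uring *ring)
{
    struct io_uring_cqe *cqes[CQE_BATCH];
    unsigned n, i;

    n = io_uring_peek_batch_cqe(ring, cqes, CQE_BATCH);
    for (i = 0; i < n; i++) {
        /* copy user_data and res out before advancing, since the CQEs
         * become invalid afterwards:
         * dispatch_to_worker(io_uring_cqe_get_data(cqes[i]), cqes[i]->res);
         */
    }

    io_uring_cq_advance(ring, n);
}

Whether the second variant is worth it presumably comes down to exactly the question above - how much the rings contend inside the kernel versus how much the single user-space CQ contends across threads.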