Hello,

I'm writing a small web + embedded database application that takes advantage of the multicore performance of the latest AMD Epyc CPUs (up to 128 threads per CPU).

Is there any performance advantage to using per-thread io_uring setups, i.e. every thread owning its own SQ+CQ pair? My feeling is that there are no gains, since internally, in the Linux kernel, io_uring is served by a single queue-pickup thread anyway(?), so sharing one SQ+CQ pair across all threads (guarded by an exclusive lock) would be enough to achieve maximum throughput.

I want to squeeze the maximum performance out of io_uring in a multithreaded clients <-> server environment, where the number of threads is always bounded by the number of CPU cores.

Regards,
Dmitry