Forgot some data: with the second test above, CPU was 48% user, 18% sys, 35% idle. CPU utilization increased from 46% in the first test to 65%. The corresponding throughput increase was not as large, but that is expected on an 8-threads-per-core server, since memory bandwidth and cache resources, at a minimum, are shared, and only trivial tasks can scale at 100%.
-----------------
Now, with 0ms delay, no threading change:
Throughput is 136000/min @184 users, response time 13ms. Response time has not jumped too drastically yet, but linear performance scaling stopped at about 130 users. ProcArrayLock is busy, very busy. CPU: 35% user, 11% system, 54% idle.
With 0ms delay and lock modification 2 (wake some, but not all):
Throughput is 161000/min @328 users, response time 28ms. At 184 users, the same load as before the change, throughput is 147000/min with response time 0.12ms. Performance scales linearly up to 144 users, then slows, increasing only slightly with additional concurrency after that.
The throughput increase is between 15% and 25%, depending on the point of comparison.
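For anyone following along, here is a rough sketch of what a "wake some, but not all" release policy can look like. This is purely illustrative, not the actual patch (whose exact wakeup rule isn't shown here), and it models the lock's wait queue as a toy linked list; the names (waiter, lw_mode, wake_leading_batch) are invented. The policy shown wakes the contiguous run of shared waiters at the head of the queue, or just the head waiter if it is exclusive:

    /* Illustrative sketch only -- not the actual patch. */
    #include <stdio.h>

    typedef enum { LW_SHARED, LW_EXCLUSIVE } lw_mode;

    typedef struct waiter
    {
        int            pid;      /* stand-in for the waiting backend */
        lw_mode        mode;
        struct waiter *next;
    } waiter;

    /* "Wake some, but not all": detach and return the contiguous run
     * of shared waiters at the head of the queue (or just the head
     * waiter if it is exclusive).  Everything behind the first
     * exclusive waiter stays queued for a later release. */
    static waiter *
    wake_leading_batch(waiter **queue)
    {
        waiter *head = *queue;
        waiter *last;

        if (head == NULL)
            return NULL;

        if (head->mode == LW_EXCLUSIVE)
        {
            *queue = head->next;     /* wake the lone exclusive waiter */
            head->next = NULL;
            return head;
        }

        last = head;                 /* wake the leading shared run */
        while (last->next != NULL && last->next->mode == LW_SHARED)
            last = last->next;
        *queue = last->next;         /* first exclusive (or NULL) stays */
        last->next = NULL;
        return head;
    }

    int
    main(void)
    {
        /* Queue front-to-back: S1, S2, X3, S4 */
        waiter w4 = {4, LW_SHARED,    NULL};
        waiter w3 = {3, LW_EXCLUSIVE, &w4};
        waiter w2 = {2, LW_SHARED,    &w3};
        waiter w1 = {1, LW_SHARED,    &w2};
        waiter *queue = &w1;
        waiter *w;

        for (w = wake_leading_batch(&queue); w; w = w->next)
            printf("woken: %d\n", w->pid);     /* 1, 2 */
        for (w = queue; w; w = w->next)
            printf("queued: %d\n", w->pid);    /* 3, 4 */
        return 0;
    }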
Based on the above, I would guess that attaining closer to 100% utilization (it's hard to get past 90% with that many cores no matter what) will probably give another 10 to 15% improvement at most, to maybe 180000/min throughput.
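As a quick sanity check on that arithmetic: another 10% over 161000/min is about 177000/min, and another 15% is about 185000/min, so ~180000/min sits right in that range.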
It's also rather interesting that the 2000-connection case with wait times gets 170000/min throughput and beats the 328-users-with-0ms-delay result above. I suspect the 'wake all' version is just faster. I would love to see a 'wake all shared, leave exclusives at the front of the queue' version, since that would not allow lock starvation.
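Roughly, the release policy I have in mind would look like the sketch below, reusing the same toy linked-list queue as above (again, invented names, not PostgreSQL's actual LWLock code): on release, every shared waiter anywhere in the queue is detached and woken, while exclusive waiters keep their positions at the front of the remaining queue, so a steady stream of shared acquirers can never starve them.

    /* Toy sketch of "wake all shared, leave exclusives at the front of
     * the queue" -- invented names, not PostgreSQL's actual LWLock
     * code.  Shared waiters are all detached and woken; exclusive
     * waiters keep their queue order, so they cannot be starved. */
    #include <stdio.h>

    typedef enum { LW_SHARED, LW_EXCLUSIVE } lw_mode;

    typedef struct waiter
    {
        int            pid;
        lw_mode        mode;
        struct waiter *next;
    } waiter;

    static waiter *
    wake_all_shared(waiter **queue)
    {
        waiter  *wake_head = NULL;
        waiter **wake_tail = &wake_head;
        waiter **pos = queue;

        while (*pos != NULL)
        {
            if ((*pos)->mode == LW_SHARED)
            {
                waiter *w = *pos;

                *pos = w->next;       /* unlink from the wait queue */
                w->next = NULL;
                *wake_tail = w;       /* append to the wake list */
                wake_tail = &w->next;
            }
            else
                pos = &(*pos)->next;  /* exclusive waiter stays put */
        }
        return wake_head;             /* caller wakes each of these */
    }

    int
    main(void)
    {
        /* Queue front-to-back: X1, S2, S3, X4, S5 */
        waiter w5 = {5, LW_SHARED,    NULL};
        waiter w4 = {4, LW_EXCLUSIVE, &w5};
        waiter w3 = {3, LW_SHARED,    &w4};
        waiter w2 = {2, LW_SHARED,    &w3};
        waiter w1 = {1, LW_EXCLUSIVE, &w2};
        waiter *queue = &w1;
        waiter *w;

        for (w = wake_all_shared(&queue); w; w = w->next)
            printf("woken: %d\n", w->pid);     /* 2, 3, 5 */
        for (w = queue; w; w = w->next)
            printf("queued: %d\n", w->pid);    /* 1, 4 */
        return 0;
    }

Note this does let shared waiters that arrived behind an exclusive waiter jump ahead of it, which is exactly the 'wake all shared' semantics; the starvation protection is only that exclusives keep their place and get woken by later releases.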