On Tue, Mar 04, 2014 at 11:40:43PM +0100, Peter Zijlstra wrote:
> On Tue, Mar 04, 2014 at 12:48:26PM -0500, Waiman Long wrote:
> > Peter,
> >
> > I was trying to implement the generic queue code exchange code using
> > cmpxchg as suggested by you. However, when I gathered the performance
> > data, the code performed worse than I expected at a higher contention
> > level. Below were the execution time of the benchmark tool that I sent
> > you:
>
> I'm just not seeing that; with test-4 modified to take the AMD compute
> units into account:

OK; I tried on a few larger machines and I can indeed see it there.

That said, our code doesn't differ that much.

I see why you're not doing too well on the 2-CPU contention. You've got
one atomic op too many in that path. But given you see benefit even with
2 atomic ops (I had mixed results on that), we can do the pending/waiter
thing unconditionally for NR_CPUS>16k.

I also think I can do your full xchg thing without allowing lock steals.

I'll try and do a full series tomorrow that starts with simple code and
builds on that, doing each optimization one by one.
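For readers following the cmpxchg-versus-xchg point above, here is a minimal C11 sketch of the two ways a queued lock can publish its new tail code. It is not the code from either patch set. The lock-word layout (locked/pending state in bits 0-15, tail code in bits 16-31), `TAIL_SHIFT`, `TAIL_MASK`, `struct split_lock`, `xchg_tail_cmpxchg()` and `xchg_tail_xchg()` are all hypothetical names chosen for illustration. It also does not model the pending bit or lock stealing. The cmpxchg loop can retry whenever another CPU changes the low bits. A single xchg needs the tail to sit in its own halfword, which only works while the encoded tail fits in 16 bits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical layout (illustration only): bits 0-15 hold the
 * locked/pending state, bits 16-31 hold the encoded queue tail,
 * where 0 means "no queued waiters". */
#define TAIL_SHIFT 16u
#define TAIL_MASK  0xffff0000u

/* Variant A: publish a new tail with a cmpxchg loop on the whole
 * 32-bit word. The low bits are preserved, but the loop retries if
 * another CPU modifies them between the load and the cmpxchg. */
static uint32_t xchg_tail_cmpxchg(_Atomic uint32_t *lock, uint32_t tail)
{
    uint32_t old = atomic_load_explicit(lock, memory_order_relaxed);
    uint32_t new;

    assert(tail <= 0xffffu);  /* tail must fit in the 16-bit field */
    do {
        new = (old & ~TAIL_MASK) | (tail << TAIL_SHIFT);
        /* On failure, 'old' is refreshed with the current value. */
    } while (!atomic_compare_exchange_weak_explicit(
                 lock, &old, new, memory_order_acq_rel, memory_order_relaxed));
    return (old & TAIL_MASK) >> TAIL_SHIFT;  /* previous tail code */
}

/* Variant B: if the tail lives in its own 16-bit field, one
 * unconditional xchg publishes it and returns the previous tail,
 * with no retry loop. */
struct split_lock {
    _Atomic uint16_t locked_pending;
    _Atomic uint16_t tail;
};

static uint16_t xchg_tail_xchg(struct split_lock *lock, uint16_t tail)
{
    return atomic_exchange_explicit(&lock->tail, tail, memory_order_acq_rel);
}
```

Both variants return the previous tail. A queuing CPU would use that value to find its predecessor and link itself behind it.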