Batched ww-mutexes, wound-wait vs wait-die etc.

Hi!

Thinking of also adding ww-mutexes for reservation of vmwgfx resources (like surfaces), I became a bit worried that doubling the number of locks taken during command submission wouldn't be a good thing. Particularly on ESX servers, where a high number of virtual machines running graphics on a multi-core processor would generate a very high number of processor-locked cycles. The method we use today is to reserve all those resources under a single mutex. Buffer objects still use reservation objects and hence ww-mutexes, though.

So I figured a "middle way" would be to add batched ww-mutexes, where the ww-mutex locking state, instead of being manipulated atomically, is manipulated under a single lock-class-global spinlock. We could then condense the sometimes 200+ locking operations per command submission into two: one for lock and one for unlock. Obvious drawbacks are that taking the spinlock is slightly more expensive than an atomic operation, and that we could introduce contention for the spinlock where there is no contention for an atomic operation.
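To make the batching idea concrete, here is a minimal user-space model of it. This is not the code in the test module below, all names are made up, and the acquire-context / deadlock-avoidance handling is left out. The point is just that the per-object lock state of a whole lock class lives under one class spinlock, so locking or unlocking N objects costs a single spinlock round-trip instead of N atomic operations:

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* One spinlock protects the lock state of every object in the class. */
struct batch_class {
        pthread_spinlock_t lock;
};

/* Per-object state; owner/ctx tracking and waiter handling elided. */
struct batch_obj {
        bool held;
};

static void batch_class_init(struct batch_class *c)
{
        pthread_spin_init(&c->lock, PTHREAD_PROCESS_PRIVATE);
}

/* Try to lock all objects in one critical section; a single
 * spinlock round-trip replaces n per-object lock operations. */
static bool batch_lock_all(struct batch_class *c,
                           struct batch_obj **objs, size_t n)
{
        size_t i;
        bool ok = true;

        pthread_spin_lock(&c->lock);
        for (i = 0; i < n; i++) {
                if (objs[i]->held) {
                        ok = false;     /* contended: caller backs off and retries */
                        break;
                }
        }
        if (ok)
                for (i = 0; i < n; i++)
                        objs[i]->held = true;
        pthread_spin_unlock(&c->lock);
        return ok;
}

/* Unlock the whole set, again in a single critical section. */
static void batch_unlock_all(struct batch_class *c,
                             struct batch_obj **objs, size_t n)
{
        size_t i;

        pthread_spin_lock(&c->lock);
        for (i = 0; i < n; i++)
                objs[i]->held = false;
        pthread_spin_unlock(&c->lock);
}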

So I set out to test this in practice. After reading up a bit on the theory, it turned out that the current in-kernel wound-wait implementation, like TTM once did (unknowingly), actually implements not wound-wait but wait-die. The correct name would thus be "wait-die mutexes". Some sources on the net claim that wound-wait is the better algorithm due to a reduced number of backoffs:

http://www.mathcs.emory.edu/~cheung/Courses/554/Syllabus/8-recv+serial/deadlock-compare.html
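For reference, the difference between the two algorithms boils down to what happens when a transaction requests a lock held by another transaction, decided by relative age (in ww-mutex terms, the acquire-context stamp, lower == older). A rough sketch of just that decision rule, with illustrative names only:

#include <stdbool.h>

typedef unsigned long stamp_t;  /* lower value == older transaction */

/* Wait-Die: a younger requester backs off ("dies") and restarts;
 * an older requester is allowed to wait for the lock. */
static bool wait_die_requester_backs_off(stamp_t requester, stamp_t holder)
{
        return requester > holder;
}

/* Wound-Wait: an older requester "wounds" the younger holder, which
 * must eventually roll back; a younger requester simply waits. */
static bool wound_wait_holder_is_wounded(stamp_t requester, stamp_t holder)
{
        return requester < holder;
}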

So I implemented both algorithms in a standalone testing module:

git+ssh://people.freedesktop.org/~thomash/ww_mutex_test

Some preliminary test trends:

1) Testing uncontended sequential command submissions: Batching ww-mutexes seems to be between 50% and 100% faster than the current kernel implementation. Still, the kernel implementation performs much better than I thought.

2) Testing uncontended parallel command submission: Batching ww-mutexes is slower (by up to 50%) than the current kernel implementation, since the in-kernel implementation can make use of multi-core parallelism where the batching implementation sees spinlock contention. This effect should, however, probably be relaxed with a longer command submission time, which reduces the spinlock contention.

3) Testing contended parallel command submission: Batching is generally superior, usually by around 50% and sometimes by up to 100%. One of the reasons could be that batching appears to result in a significantly lower number of rollbacks.

4) Taking batching locks without actually batching can result in poor performance.

5) Wound-Wait vs Wait-Die: As predicted, particularly with a low number of parallel CS threads, Wound-Wait appears to give a lower number of rollbacks, but there seem to be no overall locking-time benefits. On the contrary, as the number of threads exceeds the number of cores, Wound-Wait appears to become increasingly more time-consuming than Wait-Die. One reason for this might be that Wound-Wait may see an increased number of unlocks per rollback. Another is that it is not trivial to find a good lock to wait on with Wound-Wait. With Wait-Die, the thread rolling back simply waits on the contended lock. With Wound-Wait, the wounded thread is preempted, and in my implementation I chose to lazily preempt at the next blocking lock, so that we at least have a lock to wait on, even if it's not a lock relevant to triggering the rollback (see the sketch below).
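To illustrate the lazy preemption mentioned in 5), here is a rough sketch of the idea (illustrative names only, not the test-module code): the wound is only acted upon the next time the wounded transaction is about to block on a contended lock, so there is always a concrete lock to wait on before retrying.

#include <stdatomic.h>
#include <stdbool.h>

/* Per-transaction acquire context; "wounded" is set by an older
 * transaction that ran into a lock this transaction holds. */
struct tx_ctx {
        atomic_bool wounded;
};

/* Minimal per-object lock state for the sketch. */
struct ww_obj {
        struct tx_ctx *holder;  /* NULL when unlocked */
};

/* Slow path entered when obj is already held by someone else.
 * Return 0 to keep blocking on obj, -1 to tell the caller to
 * roll back: release all held locks, wait for obj, then retry. */
static int ww_block_slowpath(struct tx_ctx *tx, struct ww_obj *obj)
{
        if (atomic_load(&tx->wounded))
                return -1;      /* lazily honour the wound here */

        /* ... otherwise sleep until obj->holder drops the lock ... */
        return 0;
}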

So this raises a couple of questions:

1) Should we implement an upstream version of batching locks, perhaps as a choice on a per-lock-class basis?

2) Should we add a *real* wound-wait choice to our wound-wait mutexes? Otherwise, perhaps rename them or document that they're actually doing wait-die.

/Thomas



