On Tuesday, 16 April 2024 11:19:17 CDT Peter Zijlstra wrote:
> On Tue, Apr 16, 2024 at 05:53:45PM +0200, Peter Zijlstra wrote:
> > On Tue, Apr 16, 2024 at 05:50:14PM +0200, Peter Zijlstra wrote:
> > > On Tue, Apr 16, 2024 at 10:14:21AM +0200, Peter Zijlstra wrote:
> > > > > Some aspects of the implementation may deserve particular comment:
> > > > >
> > > > > * In the interest of performance, each object is governed only by
> > > > >   a single spinlock. However, NTSYNC_IOC_WAIT_ALL requires that
> > > > >   the state of multiple objects be changed as a single atomic
> > > > >   operation. In order to achieve this, we first take a device-wide
> > > > >   lock ("wait_all_lock") any time we are going to lock more than
> > > > >   one object at a time.
> > > > >
> > > > >   The maximum number of objects that can be used in a vectored
> > > > >   wait, and therefore the maximum that can be locked
> > > > >   simultaneously, is 64. This number is NT's own limit.
> > >
> > > AFAICT:
> > >
> > >     spin_lock(&dev->wait_all_lock);
> > >       list_for_each_entry(entry, &obj->all_waiters, node)
> > >         for (i=0; i<count; i++)
> > >           spin_lock_nest_lock(q->entries[i].obj->lock,
> > >                               &dev->wait_all_lock);
> > >
> > > Where @count <= NTSYNC_MAX_WAIT_COUNT.
> > >
> > > So while this nests at most 65 spinlocks, there is no actual bound on
> > > the amount of nested lock sections in total. That is, all_waiters list
> > > can be grown without limits.
> > >
> > > Can we pretty please make wait_all_lock a mutex ?

That should be fine, at least.

> > Hurmph, it's worse, you do that list walk while holding some obj->lock
> > spinlock too. Still need to figure out how all that works....
>
> So the point of having that other lock around is so that things like:
>
>     try_wake_all_obj(dev, sem)
>     try_wake_any_sem(sem)
>
> are done under the same lock?

The point of having the other lock around is that try_wake_all() needs to lock
multiple objects at the same time.
It's a way of avoiding lock inversion. Suppose task A does a wait-for-all on
objects X, Y, Z. Then task B signals Y, so we do try_wake_all_obj() on Y, which
does try_wake_all() on A's queue entry; that needs to check X and Z and consume
the state of all three objects atomically. Another task could be trying to
signal Z at the same time and could hit a task waiting on Z, Y, X, and that
causes inversion.

The simple and easy way to implement everything is just to have a global lock
on the whole device, but this is kind of known to be a performance bottleneck
(this was NT's BKL, and they ditched it starting with Vista or 7 or something).

Instead we use a lock per object, and normally in the wait-for-any case we only
ever need to grab one lock at a time, but when we need to do a wait-for-all we
need to lock multiple objects at once, and we grab the outer lock to avoid
potential lock inversion.

> Where I seem to note that both those functions do that same list
> iteration.

Over different lists. I don't know if there's a better way to name things to
make that clearer.

There's the "any" wait queue, which tasks that do a wait-for-any add
themselves to, and the "all" wait queue, which tasks that do a wait-for-all add
themselves to. Signaling an object could potentially wake up either one, but
checking whether a task is eligible is a different process.
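
To make the per-object lock plus outer lock scheme concrete, here is a rough
userspace sketch using pthread mutexes. It is not the driver code: struct obj,
struct dev, try_consume_any() and try_consume_all() are hypothetical names made
up for illustration, the objects are treated as plain counting semaphores, and
the wait queues and wakeups are left out entirely. It only shows why taking the
outer lock before locking several objects makes the X, Y, Z versus Z, Y, X case
safe.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

#define MAX_WAIT_COUNT 64	/* mirrors the 64-object limit quoted above */

/* Hypothetical object: a counting semaphore guarded by its own lock. */
struct obj {
	pthread_mutex_t lock;
	int count;		/* > 0 means signaled */
};

/* Hypothetical device: only carries the outer lock. */
struct dev {
	pthread_mutex_t wait_all_lock;
};

/* wait-for-any style path: exactly one object lock, no outer lock. */
static bool try_consume_any(struct obj *o)
{
	bool ok;

	pthread_mutex_lock(&o->lock);
	ok = o->count > 0;
	if (ok)
		o->count--;
	pthread_mutex_unlock(&o->lock);
	return ok;
}

/*
 * wait-for-all style path: take the outer lock first, then every object
 * lock in whatever order the caller listed them. Two callers passing
 * {X, Y, Z} and {Z, Y, X} cannot deadlock, because only the holder of
 * wait_all_lock ever holds more than one object lock at a time.
 */
static bool try_consume_all(struct dev *d, struct obj **objs, int n)
{
	bool ok = true;
	int i;

	assert(n >= 0 && n <= MAX_WAIT_COUNT);

	pthread_mutex_lock(&d->wait_all_lock);
	for (i = 0; i < n; i++)
		pthread_mutex_lock(&objs[i]->lock);

	/* Check every object, and consume only if all are signaled. */
	for (i = 0; i < n; i++)
		if (objs[i]->count <= 0)
			ok = false;
	if (ok)
		for (i = 0; i < n; i++)
			objs[i]->count--;

	for (i = n - 1; i >= 0; i--)
		pthread_mutex_unlock(&objs[i]->lock);
	pthread_mutex_unlock(&d->wait_all_lock);
	return ok;
}
```

The single-object path never holds one object lock while waiting for another,
so it cannot take part in a lock cycle, and it stays off the outer lock, which
is the point of not using one global lock for everything.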