On Mon, Oct 25, 2021 at 05:22:18PM +0200, Sebastian Andrzej Siewior wrote: > The MNT_WRITE_HOLD flag is used to hold back any new writers while the > mount point is about to be made read-only. __mnt_want_write() then loops > with disabled preemption until this flag disappears. Callers of > mnt_hold_writers() (which sets the flag) hold the spinlock_t of > mount_lock (seqlock_t) which disables preemption on !PREEMPT_RT and > ensures the task is not scheduled away so that the spinning side spins > for a long time. > > On PREEMPT_RT the spinlock_t does not disable preemption and so it is > possible that the task setting MNT_WRITE_HOLD is preempted by task with > higher priority which then spins infinitely waiting for MNT_WRITE_HOLD > to get removed. > > Acquire mount_lock::lock which is held by setter of MNT_WRITE_HOLD. This > will PI-boost the owner and wait until the lock is dropped and which > means that MNT_WRITE_HOLD is cleared again. > > Signed-off-by: Sebastian Andrzej Siewior <bigeasy@xxxxxxxxxxxxx> > --- I'm not an expert in real-time locking but this solution is way less intrusive and easier to explain and understand than the first version. Based on how I understand priority inheritance this solution seems good. The scenario that we seem to mostly worry is: Let's assume a low-priority task A is holding lock_mount_hash() and writes MNT_WRITE_HOLD to mnt->mnt_flags. Another high-priority task B is spinning waiting for MNT_WRITE_HOLD to be cleared. On rt kernels task A could be scheduled away causing the high-priority task B to spin for a long time. However, if we force the high-priority task B to grab lock_mount_hash() we thereby priority boost low-priorty task A causing it to relinquish the lock quickly preventing task B from spinning for a long time. Under the assumption I got this right: Acked-by: Christian Brauner <christian.brauner@xxxxxxxxxx> > fs/namespace.c | 20 ++++++++++++++++++-- > 1 file changed, 18 insertions(+), 2 deletions(-) > > diff --git a/fs/namespace.c b/fs/namespace.c > index 659a8f39c61af..3ab45b47b2860 100644 > --- a/fs/namespace.c > +++ b/fs/namespace.c > @@ -343,8 +343,24 @@ int __mnt_want_write(struct vfsmount *m) > * incremented count after it has set MNT_WRITE_HOLD. > */ > smp_mb(); > - while (READ_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD) > - cpu_relax(); > + might_lock(&mount_lock.lock); > + while (READ_ONCE(mnt->mnt.mnt_flags) & MNT_WRITE_HOLD) { > + if (!IS_ENABLED(CONFIG_PREEMPT_RT)) { > + cpu_relax(); IS_ENABLED will have the same effect as using ifdef, i.e. compiling the irrelevant branch out, I hope. > + } else { > + /* > + * This prevents priority inversion, if the task > + * setting MNT_WRITE_HOLD got preempted on a remote > + * CPU, and it prevents life lock if the task setting > + * MNT_WRITE_HOLD has a lower priority and is bound to > + * the same CPU as the task that is spinning here. > + */ > + preempt_enable(); > + lock_mount_hash(); > + unlock_mount_hash(); > + preempt_disable(); > + } > + }