Re: [PATCH review 11/11] mnt: Honor MNT_LOCKED when detaching mounts

Al Viro <viro@xxxxxxxxxxxxxxxxxx> writes:

> On Thu, Jan 08, 2015 at 10:32:13PM +0000, Al Viro wrote:
>> 	a) easily expressed with that scheme (behaviour is almost identical
>> to what your series does, except that we'd empty mnt_child right in
>> mntput_no_expire(), rather than keeping it until cleanup_mnt()) and
>> 	b) easily generalized to MNT_DETACH with lazy dissolving; in fact,
>> the only difference would be to treat *all* submounts as "don't put on
>> global list, don't remove from child list", not just the MNT_LOCKED ones.
>> 
>> I'll play around with that today and tomorrow; hopefully, I'll have a postable
>> variant by the weekend...
>
> Hmm...  Linus, what do you think of the following?

You are just about defining a mutex here.  The one difference is the
single-use property, which is what makes this a good cleanup primitive.

WQ_FLAG_EXCLUSIVE appears wrong.  We want to remove all of the waiters
from the list and wake them all up.  Otherwise, if you have 3 different
kernel threads all trying to clean up the same structure at the same
time (say remount, mntput and acct_off), one of them will be left
waiting forever.
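
Roughly (a sketch only, reusing your struct foo, untested): if the
waiter is made non-exclusive, the existing wake_up_locked() on the
cleanup side already wakes every waiter on the list:

	/* waiter side: a non-exclusive entry instead of WQ_FLAG_EXCLUSIVE */
	wait_queue_t wait;

	wait.flags = 0;
	wait.private = current;
	wait.func = autoremove_wake_function;

	/* cleanup side, once the work is finished */
	spin_lock_irq(&p->wait.lock);
	p->done = 1;
	wake_up_locked(&p->wait);	/* wakes all non-exclusive waiters */
	spin_unlock_irq(&p->wait.lock);

(If I am reading wait.h right, keeping WQ_FLAG_EXCLUSIVE and calling
wake_up_all_locked() instead would also do.)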

I don't believe the rcu calls in this function itself buy you anything,
but structuring this primitive so that it can be called from within an
rcu list traversal seems interesting.

It might be simpler to have a single fs_pin_mutex that guards all of
the paths to an fs_pin, so we don't have to be this clever.  But I
expect there are weird lock ordering issues that being this smart
allows us to avoid.  Certainly a primitive that allows us to clean
things up without any locks held is easier to maintain.
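
For comparison, the heavy-handed version would be something like this
(fs_pin_mutex is hypothetical, and this ignores the rcu entry
convention of your version; untested):

	static DEFINE_MUTEX(fs_pin_mutex);

	void kill_foo(struct foo *p)
	{
		mutex_lock(&fs_pin_mutex);
		if (!p->done) {
			p->done = 1;
			/* do the cleanup, remove references to *p */
		}
		mutex_unlock(&fs_pin_mutex);
	}

Every lookup that can lead to a kill_foo() would have to take the same
mutex, which is exactly where I expect the ordering problems to show up.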

> struct foo {
> 	int done;
> 	wait_queue_head_t wait;
> 	...
> };
>
> void kill_foo(struct foo *p)
> {
>         wait_queue_t wait;
> 	wait.flags = WQ_FLAG_EXCLUSIVE;
> 	wait.private = current;
> 	wait.func = autoremove_wake_function;
> 	spin_lock_irq(&p->wait.lock);
> 	if (likely(!p->done)) {
> 		p->done = -1;
> 		spin_unlock_irq(&p->wait.lock);
> 		rcu_read_unlock();
> 		/* do cleanup */
> 		...
> 		/* remove references to *p */
> 		...
> 		spin_lock_irq(&p->wait.lock);
> 		p->done = 1;	
> 		wake_up_locked(&p->wait);
> 		spin_unlock_irq(&p->wait.lock);
> 		/* RCU-schedule freeing of p */
> 		...
> 		return;
> 	}
> 	if (p->done > 0) {
> 		spin_unlock_irq(&p->wait.lock);
> 		rcu_read_unlock();
> 		return;
> 	}
> 	__add_wait_queue_tail(&p->wait, &wait);
> 	while (1) {
> 		set_current_state(TASK_UNINTERRUPTIBLE);
> 		spin_unlock_irq(&p->wait.lock);
> 		rcu_read_unlock();
> 		schedule();
> 		rcu_read_lock();
> 		if (likely(list_empty(&wait.task_list)))
> 			break;
> 		/* OK, we know p couldn't have been freed yet */
> 		spin_lock_irq(&p->wait.lock);
> 		if (p->done > 0) {
> 			spin_unlock_irq(&p->wait.lock);
> 			break;
> 		}
> 	}
> 	rcu_read_unlock();
> }
>
> AFAICS, that ought to be safe - the first caller makes sure that everybody
> else will wait for it to finish, everybody else (ones who come via references
> before the first one has removed them) will be waiting for the first one
> to do wakeup *and* will take care not to touch the victim unless they know
> it's still there.
>
> Do you see any problems with the above?  I wonder if I ended up open-coding
> something already existing in there...
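
On open-coding something existing: the done/wait pair is awfully close
to a struct completion with a claim flag in front of it.  Something
like the sketch below (untested, and note that wait_for_completion()
can't be called under rcu_read_lock()) gets the wake-everybody
semantics for free, but it loses your "my wait entry is still on the
queue, so p is still alive" trick, so something else has to keep p
allocated across the wait:

	struct foo {
		atomic_t killed;	/* claim flag: first caller does the cleanup */
		struct completion done;
	};

	void kill_foo(struct foo *p)
	{
		/* caller must guarantee p stays allocated across this call */
		rcu_read_unlock();	/* we are about to block */
		if (atomic_xchg(&p->killed, 1) == 0) {
			/* first caller: do the cleanup, remove references to *p */
			complete_all(&p->done);
			/* RCU-schedule freeing of p */
			return;
		}
		wait_for_completion(&p->done);
	}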