On Thu, Apr 19, 2012 at 03:57:28PM -0400, Mimi Zohar wrote: > Has the discussion here moved from deferring the __fput() for the > mmap_sem/i_mutex lockdep side case, to taking the i_mutex in __fput() in > general? Lockdep has not reported any problems, other than for the > mmap_sem/i_mutex scenario. > > > This > > is not a way to go; that kind of kludges leads to locking code that is > > impossible to reason about. > > Are you referring to defering the __fput() or taking the i_mutex in > __fput() in general? I'm referring to having "are we holding ->mmap_sem" logics anywhere in fput(). Note that your "lockdep has not reported..." is a symptom of the same problem - it's *NOT* enough to show the lack of deadlocks from the change of locking rules. And after that change we'll get even worse situation, since proving the safety will become harder (and even more so if we end up adding other cases when we need to defer). Summary for those who'd been spared the earlier fun: IMA stuff wants to grab ->i_mutex on the final fput() in some cases. That violates the locking order in at least one way - final fput() may come under ->mmap_sem, in which case grabbing ->i_mutex would be a Bloody Bad Idea(tm). "Solution" proposed: a bit of spectaculary ugly logics checking in final fput() whether we have ->mmap_sem locked. If we do, said final fput() becomes non-final and is done asynchronously. This is fundamentally flawed, of course, since "is ->mmap_sem unlocked" is used as "is it safe to have ->i_mutex grabbed", and it's both unproven and brittle as hell even if it happens to be true right now (and I wouldn't bet on that, TBH). If it had been IMA alone, I would've cheerfully told them where to stuff the whole thing. Unfortunately, fput() *is* lock-heavy even without them. Note that it can easily end up triggering such things as final deactivate_super(). Now, ->mmap_sem is irrelevant here - after all, any inodes involved won't be mmapped, or deactivate_super() wouldn't be final. However, fput() is called e.g. under rtnl_lock() and I'm not at all sure that something like NFS won't try to grab it from its ->kill_sb(). So the question is, do we have any reasonable way to get to the situation where the __fput() would only be done without any locks held? It might be worth trying. What we CAN'T do: * defer all __fput() (i.e. always do final fput() async). Obviously no-go, for performance reasons alone. * check some predicate about the set of locks held and defer if it's false. That's what IMA folks have tried to do; no-go for the reasons described above. * add a new helper (fput_carefully() or something like that) that would defer final __fput(), leaving fput() with the current behaviour. And convert the potentially unsafe callers to it (starting with everything that holds ->mmap_sem). No-go since it's not maintainable - a change pretty far away from a function that does (currently safe) fput() can end up requiring the conversion to fput_carefully(). Too easy to miss, so this will decay and it won't be easy to verify correctness several years down the road. However, there's an approach that might be feasible. Most of the time the final fput() *is* done without any locks held and there's a very large subclass of those call sites - those that come via fput_light(). What we could do, and what might be maintainable is: * prohibit fput_light() with locks held. Right now we are very close to that (or already there - I haven't finished checking). * convert low-hanging fget/fput in syscalls to fget_light/fput_light. Makes sense anyway. * split off fput_nodefer() (identical to what fput() is right now), make fput_light() call it instead of fput(). Switch path_lookupat() and path_openat() to fput_nodefer() as well. Ditto for sys_socketpair() and sys_accept4(). * make fput() (now almost never leading to __fput()) defer the sucker in very rare cases when it ends up dropping the last reference. At that point we would have __fput() always done without any locks held, which would remove all restrictions on the locks that can be taken from it. Comments? Disclaimer: I have not finished going through the call sites (note that there are wrappers - e.g. sockfd_put() *and* sockfd_lookup()), so there might be obstacles. In particular, I'm still not sure about SCM_RIGHTS datagrams handling - IIRC, we had an interesting kludge in there, with a kinda-sorta deferral/batching. BTW, I wonder about deadlocks around that one - drivers/vhost/net.c is doing ->recvmsg() while holding vq->mutex and if an SCM_RIGHTS datagram comes in, we'll get a bunch of fput() done. Which can lead to final umount of a network filesystem, etc. If that thing can lead to handle_rx() on the same sucker, we have a deadlock... -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html