ebiederm@xxxxxxxxxxxx (Eric W. Biederman) writes:

> There are a lot of ways we could approach this, and I sat down and wrote
> the simplest variant I could think of, so hopefully there are no silly
> bugs that get overlooked.
>
> The code moves all cleanup from mntput that would do filesystem I/O into
> work queues.
>
> The code waits in mntput for that I/O to complete in the cases where we
> wait today.
>
> The code has been tested and it works, and it succeeds in running
> deactivate_super from a stack with just short of 7500 bytes free.
> 7500 bytes should be enough for anybody!
>
> I want to double check that I am using work queues correctly, they used
> to be deadlock prone, and sleep on everything before I commit to a final
> version but this version is probably good enough.

I have confirmed by empirical testing, by reading the code, and with
lockdep that moving the work of mntput onto the system_wq does not
introduce deadlocks.

I have read through all of fsnotify_vfsmount_delete looking for anything
that might be affected by moving fsnotify_vfsmount_delete into another
context, and thankfully there is nothing.  At most audit will log the
removal of a rule, and inotify will log a removal.  But in neither case
is any information taken from the context of the current process.

> I may be going overboard in the case where we auto close BSD accounting
> files, but at this point I figure better safe than sorry.

After sleeping on the code I realized that putting a union in mount.h to
keep the size the same was a bit too clever.  Otherwise I am satisfied
with the code and have tested it every way I can think of.

I have looked a bit more at the issue of removing the blocking from
mntput, and it looks like this set of changes can easily serve as an
intermediate step to get there.  Certainly the suggested
flush_pending_mntput(fs_type), which is probably better called
flush_filesystem(fs_type), is straightforward to implement.
The code just needs to walk fs_type->fs_supers and each super_block's
s_mounts list, force a lazy unmount on every mount of the filesystem,
and then wait for fs_type->fs_supers to become an empty list.

A few stray thoughts on modules, filesystems, and error handling.

- kern_mount from a module is nasty, because rmmod is only supported if
  (struct file_system_type).owner == NULL.  So unlike other filesystems,
  which we know cannot be in use during module unload, there are no
  programmatic guarantees about a kernel mount.

- Today a module that calls register_filesystem or kern_mount and then
  does anything afterward that might fail is problematic.  We have
  nothing that would flush or otherwise force a registered filesystem to
  be unmounted.  Ouch.

  kern_mount for anything except debugging is in a similar situation.
  Presumably the filesystem initialization will make something available
  to the kernel at large that will have access to the mount point.  At
  that point we can have filesystem reference counts that we cannot
  statically account for, and we don't have the code to flush them.

Which is a long way of saying it would probably be a good idea to write
and use flush_filesystem even if we never move to a mntput that doesn't
sleep.

Eric
--
To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
in the body of a message to majordomo@xxxxxxxxxxxxxxx
More majordomo info at http://vger.kernel.org/majordomo-info.html