2012/11/2, Al Viro <viro@xxxxxxxxxxxxxxxxxx>: > On Fri, Nov 02, 2012 at 12:51:36AM +0900, Namjae Jeon wrote: >> When a user is monitoring an FS_UMOUNT watch using the inotify framework, >> there can be a potential race condition between the umount path & >> inotify_rm_watch. This scenario can be like- >> ================================================================= >> user does the following calls- >> fd = inotify_init(); >> inotify_add_watch(path, IN_UNMOUNT); /* added a watch to path*/ >> read(fd); /* wait for the event */ >> inotify_rm_watch(); /* as soon as event came, remove the >> watch tag from the selected path */ >> >> Now as we trigger the umount command on the above mentioned path, >> it will trigger a fsnotification for all the waiting inotify watches, >> and then userspace will find that event came, it does the required >> action and remove the watch. >> Now while watch is being removed, there is possibility that umount >> process is still under execution & in not complete and on the other >> path inotify_rm_watch() will call an iput() on the watched inode, >> then there can be a race whether the inode's superblock is valid >> at that moment or its gone. >> So some time we end up in this race very badly and we try to access >> the super_block which now not in a valid memory and kernel dies. Trace >> is like shown below >> >> 138.892000] [<c02f1050>] (do_raw_spin_trylock+0x0/0x54) from >> [<c043e590>] (__raw_spin_lock+0x34/0x90) >> [ 138.900000] [<c043e55c>] (__raw_spin_lock+0x0/0x90) from >> [<c043e604>] (_raw_spin_lock+0x18/0x1c) >> [ 138.908000] r5:000000bc r4:e3db94f0 >> [ 138.912000] [<c043e5ec>] (_raw_spin_lock+0x0/0x1c) from >> [<c01e3c94>] (fat_detach+0x3c/0x80) >> [ 138.920000] [<c01e3c58>] (fat_detach+0x0/0x80) from [<c01e3d40>] >> (fat_evict_inode+0x68/0x6c) >> [ 138.932000] r5:c0459cc0 r4:e3db94f0 >> [ 138.932000] [<c01e3cd8>] (fat_evict_inode+0x0/0x6c) from >> [<c0154f00>] (evict+0x94/0x154) >> [ 138.940000] r4:e3db94f0 r3:c01e3cd8 >> [ 138.944000] [<c0154e6c>] (evict+0x0/0x154) from [<c0155184>] >> (iput+0x17c/0x18c) >> [ 138.952000] r5:e3db9504 r4:e3db94f0 >> [ 138.956000] [<c0155008>] (iput+0x0/0x18c) from [<c0173ae4>] >> (fsnotify_destroy_mark+0x15c/0x19c) >> [ 138.964000] r6:e3db94f0 r5:e3017540 r4:e41882a0 r3:271aed08 >> [ 138.972000] [<c0173988>] (fsnotify_destroy_mark+0x0/0x19c) from >> [<c0175890>] (sys_inotify_rm_watch+0x88/0xc0) >> [ 138.980000] r8:c004aba8 r7:e3017540 r6:e41882a0 r5:00000000 >> r4:e3017300 >> [ 138.988000] r3:00000000 >> [ 138.988000] [<c0175808>] (sys_inotify_rm_watch+0x0/0xc0) from >> [<c004a920>] (ret_fast_syscall+0x0/0x30 >> >> So we can see inside the fat_detach function we are accessing illegal >> inode->i_sb->s_fs_info and we end up in above crash. >> ==================================================================== >> To solve this race, we must have some sort of serialized access to >> superblock structure in umount path and fsnotification path. Now since >> the umount path takes an exclusive lock on s_umount of superblock. >> So if umount is in progress, this lock will not be free. >> >> Hence we may try to take a shared access to super block's s_umount lock >> inside the inotify_rm_watch() & if lock is free, means no one is doing >> write operation on this superblock. So we can then just go ahead and >> then before calling iput on the concerned inode, first we should check >> whether this inode's superblock is still valid( e.g s_count is >= 1) or >> not. >> So based on this condition we can choose our action and prevent the race. > Hi. Al. > NAK. The bug is real, but proposed fix is broken. AFAICS, your variant > is going to deadlock if watch removal request comes in the middle of > umount, > when ->s_umount has already been taken. I did not understand, how there would be dead lock between umount and inotify_rm_watch(). lets us suppose umount is inside the generic_shutdown_super() with holding s_count, and if in between inotify_rm_watch() comes, it will try to do a down_read(s_umount) and will be blocked until generic_shutdown_super exists and while exiting it will do up_write() and inotify_rm_watch() will get the lock. Please let me know if i am missing some thing here. > > Moreover, ->s_count doesn't do what you seem to think it does. If you ever > see a superblock with ->s_count being 0, you've already lost. It might've > been dereferencing already kfree()'d memory, for all you know. And yes, there might be a situation,that we might end up in dereferencing the kfree()'s memory, which is not good. Let me think more.. Thanks for review!. > -- To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html