On Mon, Dec 07, 2015 at 10:49:05AM -0800, Linus Torvalds wrote: > On Mon, Dec 7, 2015 at 10:05 AM, Suzuki K. Poulose > <suzuki.poulose@xxxxxxx> wrote: > > blkdev_open() doesn't release the bdev, it attached to a given > > inode, if blkdev_get() fails (e.g, due to absence of a device). > > This can cause kernel crashes when the original filesystem > > tries to flush the data during evict_inode. > > Ugh. This code is a mess. Al, can you please comment? > > So what happens is that when "blkdev_get()" fails, it will do a > bdput() on the bdev. Yes. > But blkdev_open() hasn't done a bdget(). It's done a bd_acquire(). > Which will do the whole "add inodes to bd_inodes". Yes. > And yes, > bd_forget() will undo that. It would, but there's no reason to drop the cached pointer to bdev. > IOW, the path looks simple and apparently fixes an oops, but I'd like > much more of an explanation for what happens, because it all feels > wrong to me. Why doesn't the bdput() end up undoing the bd_acquire() > properly? Because it doesn't work that way. ->i_bdev is just a cached result of lookup by device number. Once found, it stays there for as long as neither the struct inode nor struct block_device are freed. It does *NOT* pin struct block_device. Note that we have two kinds of block device inodes - ones coallocated with struct block_device (those are unique per major/minor, live on bdevfs and can't be seen directly) and ones aliasing the first kind. Those live on normal filesystems. Pagecache lives in ->i_data of the bdevfs inode; aliasing ones have their ->i_data empty and ->i_mapping pointing to ->i_data of the bdevfs inode. That guarantees the cache coherency between those guys. Now, simply having ->i_bdev point to struct block_device does not affect the lifetime of the latter in any way. All aliases are dissociated from block_device when bdevfs inode is evicted; block_device is dissociated from aliasing inode when that aliasing inode is evicted. bdev_lock provides the atomicity there. _Opening_ an alias (any of them) does pin block_device down. So when an aliasing inode is open, we can safely use its ->i_mapping in normal pagecache-related code and have everything work correctly. Accessing ->i_mapping when inode isn't open is valid only if filesystem code is sure it's pointing to its own ->i_data (and pointless in any case). And that's what 9p ->evict_inode() is doing - it's trying to evict not the pages in its ->i_data (which would be empty for block device), but the pages in its ->i_mapping. IOW, the pagecache shared by all aliasing inodes. Which is obviously bogus, regardless of lifetime rules violation - mknod /tmp/foo b 8 1 && dd count=1 </tmp/foo >/dev/null && rm /tmp/foo should not blow the cache of /dev/sda1, no matter which fs type we happen to use for /tmp. And 9p will try to do just that. Fortunately, no other ->evict_inode() instance is doing anything of that sort, so we just need to fix that bogosity in 9p one. As for the bdev eviction, bdput() acts exactly like iput(). In fact, it is iput() of the coallocated bdevfs inode. It can stay around with zero refcount; same as any other inode, memory pressure would eventually push them out. It does *NOT* pin the driver when not opened, BTW. Anyway, the fix for 9p bogosity follows; it definitely fixes a bug there, and I'm fairly sure that it fixes the bug that had been reported. A confirmation would be nice, of course... Signed-off-by: Al Viro <viro@xxxxxxxxxxxxxxxxxx> --- diff --git a/fs/9p/vfs_inode.c b/fs/9p/vfs_inode.c index 699941e..5110785 100644 --- a/fs/9p/vfs_inode.c +++ b/fs/9p/vfs_inode.c @@ -451,9 +451,9 @@ void v9fs_evict_inode(struct inode *inode) { struct v9fs_inode *v9inode = V9FS_I(inode); - truncate_inode_pages_final(inode->i_mapping); + truncate_inode_pages_final(&inode->i_data); clear_inode(inode); - filemap_fdatawrite(inode->i_mapping); + filemap_fdatawrite(&inode->i_data); v9fs_cache_inode_put_cookie(inode); /* clunk the fid stashed in writeback_fid */ -- To unsubscribe from this list: send the line "unsubscribe stable" in the body of a message to majordomo@xxxxxxxxxxxxxxx More majordomo info at http://vger.kernel.org/majordomo-info.html